The perf tool can reveal that your system is spending an inordinate amount of time dealing with TLB shootdowns. A shootdown happens when one core changes a memory mapping and must interrupt other CPUs so that they drop their cached translations for it, so a high rate is a symptom of inter-core communication overhead in memory management.
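
As a quick cross-check alongside perf, x86 Linux kernels report a per-CPU shootdown counter on the TLB row of /proc/interrupts. The following is a minimal sketch under that assumption; the function names tlb_total and tlb_rate are invented for illustration and are not part of perf or any standard tool.

```shell
#!/bin/sh
# Sum the per-CPU counters on the "TLB:" row of a /proc/interrupts-style file.
# Assumed x86 row format: " TLB:   1234   5678   TLB shootdowns"
tlb_total() {
    awk '$1 == "TLB:" {
        sum = 0
        # Fields after the label are per-CPU counts, followed by a text description.
        for (i = 2; i <= NF; i++) {
            if ($i ~ /^[0-9]+$/) sum += $i; else break
        }
        printf "%.0f\n", sum
        found = 1
    }
    END { if (!found) print 0 }' "${1:-/proc/interrupts}"
}

# Print the average number of shootdowns per second over an interval
# (hypothetical helper; the default interval is 5 seconds).
tlb_rate() {
    interval="${1:-5}"
    before=$(tlb_total)
    sleep "$interval"
    after=$(tlb_total)
    echo "$(( (after - before) / interval )) TLB shootdowns/s"
}
```

Sourcing the file and running tlb_rate 10 while the workload is active gives a rough system-wide rate. It does not say which process is responsible, which is what the perf call stacks are for.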

Common Causes and Fixes for TLB Shootdowns

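  1. Frequent Large Page Unmaps/Maps: When a multi-threaded application frequently unmaps or remaps memory (e.g., 2MB or 1GB pages, although the same applies to ordinary 4KB pages), the kernel needs to invalidate the Translation Lookaside Buffer (TLB) entries on all cores that might have cached translations for those pages. It does so by sending inter-processor interrupts to those cores. This invalidation process, known as a TLB shootdown, can be costly because every targeted core has to stop and service the interrupt.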

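    • Diagnosis: Use perf to check for a high rate of TLB flush events. On x86 kernels the tracepoint is tlb:tlb_flush; run perf list 'tlb:*' to confirm it is available on your system. Look for patterns where these events correlate with your application’s memory allocation/deallocation patterns.
      perf record -e tlb:tlb_flush -a -g -- sleep 10
      perf report
      
      Examine the tlb:tlb_flush event count and the reason field of the events (remote shootdowns are the expensive kind). If the count is high and correlates with application activity, this is a likely culprit.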
    • Fix: If possible, modify your application to avoid frequent unmapping/mapping of large pages. Instead, try to reuse memory regions or manage smaller chunks within a larger mapped area that stays mapped. If your application must unmap, consider batching the work into fewer calls, and weigh any change in page size against the increased TLB misses that smaller pages bring.
    • Why it works: Reducing the frequency of these operations directly reduces the number of TLB shootdown requests initiated by the kernel.
  2. NUMA Rebalancing Overhead: On Non-Uniform Memory Access (NUMA) systems, the kernel’s automatic NUMA balancing periodically marks ranges of a task’s memory inaccessible so that the next access faults and reveals which node is using it, and then migrates pages toward that node. Both steps change page table entries, so they can trigger TLB shootdowns on the CPUs running that process.

    • Diagnosis: Record tlb:tlb_flush events and cross-reference them with the page migration tracepoint, migrate:mm_migrate_pages.
      perf record -e tlb:tlb_flush,migrate:mm_migrate_pages -a -- sleep 10
      perf report
      
      If mm_migrate_pages events are frequent and coincide with tlb_flush events, NUMA rebalancing is a strong candidate. Steadily growing numa_hint_faults and numa_pages_migrated counters in /proc/vmstat point the same way.
    • Fix: Pin critical application threads and their memory to specific NUMA nodes using numactl or by setting process affinity. For example, to bind a process to node 0:
      numactl --cpunodebind=0 --membind=0 /path/to/your/application
      
      Alternatively, set kernel.numa_balancing to 0 to disable automatic NUMA balancing. This is a system-wide change that might have other performance implications, and sysctl -w does not persist across reboots.
      sysctl -w kernel.numa_balancing=0
      
    • Why it works: By preventing pages from being migrated between NUMA nodes, you eliminate the need for the kernel to invalidate TLBs across cores on different nodes.
  3. CPU Hotplug Events: When CPUs are added or removed from the system (CPU hotplug), the kernel needs to synchronize memory management structures, which can involve TLB shootdowns.

    • Diagnosis: Use perf to record tlb:tlb_flush together with the cpuhp:cpuhp_enter and cpuhp:cpuhp_exit tracepoints, which fire as a CPU steps through its hotplug states.
      perf record -e tlb:tlb_flush,cpuhp:cpuhp_enter,cpuhp:cpuhp_exit -a -- sleep 60
      perf report
      
      Observe whether tlb_flush spikes occur around the time of the cpuhp events.
    • Fix: Avoid frequent CPU hotplugging. If it’s unavoidable, ensure your application is designed to be resilient to temporary CPU availability changes and consider reducing the frequency of these events if they are not essential.
    • Why it works: Fewer CPU hotplug events mean fewer kernel-level synchronizations that trigger TLB shootdowns.
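  4. Memory Protection Changes (mprotect): When memory protection attributes (read, write, execute) are changed for a memory region using mprotect, the kernel must ensure that the TLBs on every core that may hold translations for that region reflect the new protections, leading to shootdowns. This matters most when permissions are tightened, since a stale entry would still allow the old access.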

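    • Diagnosis: Record tlb:tlb_flush events with call stacks and correlate them with calls to mprotect in your application’s profiling. Adding the syscalls:sys_enter_mprotect tracepoint (available when the kernel is built with syscall tracepoints) puts the calls themselves in the same profile.
      perf record -e tlb:tlb_flush,syscalls:sys_enter_mprotect -g -- your_application
      perf report
      # Then analyze the call stacks to identify mprotect calls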
      
    • Fix: Minimize the use of mprotect for frequent, small regions. If possible, group mprotect calls for adjacent regions or change protections less often. Consider if the granularity of protection can be relaxed.
    • Why it works: Reducing the number of mprotect calls directly reduces the kernel’s need to broadcast TLB invalidation messages.
  5. Huge Pages Configuration/Management: While using huge pages can reduce TLB misses, frequent changes to the number of huge pages available, or transparent huge page (THP) activity such as collapsing small pages into huge ones and compacting memory, rewrite page table entries and can indirectly lead to shootdown overhead.

    • Diagnosis: Observe tlb:tlb_flush events. If you are using huge pages, check the system’s huge page usage in /proc/meminfo (HugePages_Total, HugePages_Free, HugePages_Rsvd, HugePages_Surp, and AnonHugePages for THP) and look for values that change while the workload runs.
    • Fix: Reserve a static, sufficient number of huge pages at boot time or during system initialization (for example, with the hugepages= kernel parameter or the vm.nr_hugepages sysctl) rather than dynamically resizing the pool. Transparent huge pages are a separate mechanism: on systems with transparent_hugepage enabled, ensure it’s configured appropriately, or disable it if it is causing issues.
      echo 'never' > /sys/kernel/mm/transparent_hugepage/enabled
      
      And for defrag:
      echo 'never' > /sys/kernel/mm/transparent_hugepage/defrag
      
    • Why it works: A static configuration avoids kernel overhead associated with dynamic huge page allocation and deallocation, which can sometimes trigger shootdowns.
  6. Kernel Bug or Specific Workload Pattern: In rare cases, a specific workload pattern or a kernel bug might trigger excessive TLB shootdowns.

    • Diagnosis: If none of the above causes yield a clear solution, investigate kernel versions and known issues related to TLB management. Tools like ftrace can provide deeper insights into kernel behavior during shootdowns.
    • Fix: Update to the latest stable kernel version. If a specific bug is identified, consider backporting the fix or reporting it to your distribution’s kernel maintainers.
    • Why it works: Patches often fix inefficiencies or race conditions in the kernel’s TLB management and shootdown handling.
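
Several of the causes above leave a trail in /proc/vmstat, so comparing two snapshots of a few counters can help narrow things down before a longer perf session. This is a rough sketch, not a definitive tool: the helper names vmstat_pick and vmstat_delta are hypothetical, and the counters chosen (NUMA hinting faults, page migrations, THP allocations and collapses) are an assumption about what is useful to watch. Counters your kernel does not export are skipped.

```shell
#!/bin/sh
# Print selected counters from a /proc/vmstat-style file as "name value" lines.
# Counters that are absent from the file are simply not printed.
vmstat_pick() {
    awk '
        $1 == "numa_hint_faults"    ||
        $1 == "numa_pages_migrated" ||
        $1 == "pgmigrate_success"   ||
        $1 == "thp_fault_alloc"     ||
        $1 == "thp_collapse_alloc"  { print $1, $2 }
    ' "${1:-/proc/vmstat}"
}

# Given two snapshot files (before, after), print how much each counter grew.
vmstat_delta() {
    awk '
        NR == FNR { before[$1] = $2; next }   # first file: remember the values
        ($1 in before) { printf "%s %.0f\n", $1, $2 - before[$1] }
    ' "$1" "$2"
}

# Example usage while the workload is running:
#   vmstat_pick > /tmp/vm.before; sleep 10; vmstat_pick > /tmp/vm.after
#   vmstat_delta /tmp/vm.before /tmp/vm.after
```

Large deltas in the numa_* counters point toward cause 2, and large thp_* deltas toward cause 5. Counters near zero suggest looking at the application’s own munmap and mprotect behavior instead.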

The next thing you’ll likely notice after resolving TLB shootdown issues, if your application was heavily impacted, is a significant increase in CPU utilization. This is not a new fault: your application is finally able to perform its work without the overhead of inter-core TLB synchronization.
