When perf reports lock contention on mutexes, threads are waiting to acquire a lock that another thread already holds, and the kernel’s lock tracepoints are recording those waits.

Common Causes and Fixes

  1. Excessive Contention on a Single Mutex: The most common scenario is a single mutex that many threads contend for. This often happens when most workers access the same shared resource (a global counter, a cache, or a shared data structure).

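    • Diagnosis: Record contention with perf lock record -- sleep 10 and summarize it with perf lock report (newer perf builds also offer perf lock contention). For call stacks, use perf record -a -g -e lock:contention_begin -- sleep 10 and then perf report. The lock:contention_begin tracepoint exists on kernel 5.19 and later; older kernels need to be built with CONFIG_LOCK_STAT to expose the lock:lock_contended family instead. If a single lock or a single call stack dominates the report, that’s your prime suspect. These tracepoints cover kernel locks; a contended user-space pthread mutex typically shows up as futex system calls, which perf trace -e futex can surface.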
    • Fix: Refactor the code to reduce the scope of the lock. Can the shared resource be sharded (e.g., multiple counters instead of one)? Can operations be batched outside the lock? Can read operations be separated from write operations, allowing readers to proceed concurrently? For example, if you have a global counter g_count protected by g_count_mutex, and many threads do lock(&g_count_mutex); g_count++; unlock(&g_count_mutex);, consider using per-CPU counters or atomic operations if appropriate.
    • Why it works: Contention drops when fewer threads need the same lock, or when the lock is replaced by a mechanism that lets threads proceed concurrently (read-write locks for readers, atomic operations for simple updates).
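
The sharding idea from the fix above can be sketched in C as follows. This is a minimal illustration, not a drop-in replacement: the names (counter_shard, counter_add, counter_total, run_counter_demo), the shard count, and the 64-byte cache line size are all assumptions made for the example. Each writer updates its own shard with a relaxed atomic add, and a reader sums the shards.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_SHARDS   8       /* hypothetical: roughly the expected number of writers */
#define NUM_THREADS  4       /* hypothetical worker count for the demo */
#define INCREMENTS   100000  /* increments performed by each worker */
#define CACHE_LINE   64      /* assumption: 64-byte cache lines */

/* One counter per shard, each aligned to its own cache line. */
struct counter_shard {
    _Alignas(CACHE_LINE) atomic_long value;
};

static struct counter_shard g_shards[NUM_SHARDS];

/* Writers touch only their own shard, so they never wait on a shared lock. */
static void counter_add(size_t shard, long delta)
{
    assert(shard < NUM_SHARDS);
    atomic_fetch_add_explicit(&g_shards[shard].value, delta,
                              memory_order_relaxed);
}

/* Readers pay the cost instead: the total is a sum over all shards. */
static long counter_total(void)
{
    long total = 0;
    for (size_t i = 0; i < NUM_SHARDS; i++)
        total += atomic_load_explicit(&g_shards[i].value,
                                      memory_order_relaxed);
    return total;
}

static void *counter_worker(void *arg)
{
    size_t shard = (size_t)(uintptr_t)arg % NUM_SHARDS;
    for (int i = 0; i < INCREMENTS; i++)
        counter_add(shard, 1);
    return NULL;
}

/* Starts the workers, waits for them, and returns the summed counter. */
static long run_counter_demo(void)
{
    pthread_t threads[NUM_THREADS];
    size_t started = 0;
    while (started < NUM_THREADS &&
           pthread_create(&threads[started], NULL, counter_worker,
                          (void *)(uintptr_t)started) == 0)
        started++;
    for (size_t i = 0; i < started; i++)
        pthread_join(threads[i], NULL);
    return counter_total();
}
```

The trade-off is that reads become more expensive and only approximately consistent while writers are running, which is usually acceptable for statistics-style counters.
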
  2. Inefficient Locking Strategy (e.g., Locking Too Broadly): A lock might be held for too long, encompassing operations that don’t actually need protection.

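    • Diagnosis: In perf report, examine the call stacks recorded at the contention events, then read the code between the matching lock and unlock calls. If it includes significant work that doesn’t touch the protected data, the lock is too broad. Long average wait times for a lock in perf lock report, combined with relatively few contended acquisitions, point the same way.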
    • Fix: Narrow the critical section. Move any operations that don’t require exclusive access outside the lock. For instance, if you’re processing data from a queue, acquire the lock, dequeue an item, release the lock, process the item, and then loop. Don’t hold the lock while processing.
    • Why it works: Shorter lock hold times mean the lock is available for other threads sooner, reducing the chance they’ll have to wait.
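
A minimal C sketch of the dequeue-then-process pattern described in the fix above. The work_queue type, its fixed capacity, and process_item are hypothetical stand-ins chosen for the example; the point is only that the mutex is held for the dequeue and released before the per-item work runs.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define QUEUE_CAP 1024   /* hypothetical fixed capacity for the sketch */

/* A simple linear (non-wrapping) queue guarded by one mutex. */
struct work_queue {
    pthread_mutex_t lock;
    int items[QUEUE_CAP];
    size_t head;   /* next index to dequeue */
    size_t tail;   /* next index to enqueue */
};

static void queue_init(struct work_queue *q)
{
    pthread_mutex_init(&q->lock, NULL);
    q->head = 0;
    q->tail = 0;
}

/* Returns 1 on success, 0 if the queue is full. */
static int queue_push(struct work_queue *q, int item)
{
    int ok = 0;
    pthread_mutex_lock(&q->lock);
    if (q->tail < QUEUE_CAP) {
        q->items[q->tail++] = item;
        ok = 1;
    }
    pthread_mutex_unlock(&q->lock);
    return ok;
}

/* Returns 1 and stores an item in *out, or 0 if the queue is empty. */
static int queue_pop(struct work_queue *q, int *out)
{
    int ok = 0;
    pthread_mutex_lock(&q->lock);
    if (q->head < q->tail) {
        *out = q->items[q->head++];
        ok = 1;
    }
    pthread_mutex_unlock(&q->lock);
    return ok;
}

/* Stand-in for expensive per-item work (hypothetical). */
static long process_item(int item)
{
    return (long)item * item;
}

/* Drains the queue; the lock is held only inside queue_pop. */
static long drain_queue(struct work_queue *q)
{
    long sum = 0;
    int item;
    while (queue_pop(q, &item))
        sum += process_item(item);   /* runs with the lock released */
    return sum;
}
```

Several threads can call drain_queue on the same queue; they serialize only on the short pop, not on process_item.
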
  3. False Sharing (Indirectly via Mutexes): False sharing is not a mutex problem in itself, but it can make lock contention worse. When two unrelated variables sit on the same cache line and are modified by different CPUs, the cache coherency protocol bounces that line between cores. If those variables live in structures protected by locks, critical sections take longer and the locks look more contended.

    • Diagnosis: This is hard to spot with perf lock alone. Look for high cache miss rates with perf stat -e cache-misses, then use perf c2c record and perf c2c report to find cache lines that multiple CPUs write to. If perf lock shows contention on a mutex protecting a structure that contains seemingly unrelated data accessed by different threads, consider false sharing.
    • Fix: Pad data structures to ensure that frequently modified, independent variables don’t share a cache line. For example, ensure a mutex object itself doesn’t share a cache line with data it protects if that data is heavily written by different CPUs.
    • Why it works: Once independent data sits on separate cache lines, a write by one CPU no longer invalidates the line another CPU is using. Critical sections that touch that data get shorter, which can reduce the lock contention perf observes.
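
The padding fix above could look like this in C11. The struct names and the 64-byte line size are assumptions for illustration; check the actual line size of your target CPU before copying the constant.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64   /* assumption: 64-byte cache lines */

/* Both counters are adjacent, so they normally share one cache line. */
struct thread_stats_packed {
    long rx_count;   /* written by one thread */
    long tx_count;   /* written by a different thread */
};

/* Each counter is forced onto its own cache line. */
struct thread_stats_padded {
    alignas(CACHE_LINE) long rx_count;
    alignas(CACHE_LINE) long tx_count;
};

/* Compile-time checks that the layout is what we intended. */
static_assert(offsetof(struct thread_stats_padded, tx_count) == CACHE_LINE,
              "tx_count should start on the second cache line");
static_assert(sizeof(struct thread_stats_padded) == 2 * CACHE_LINE,
              "padded struct should span exactly two cache lines");

/* Returns nonzero if both addresses fall within the same cache line. */
static int same_cache_line(const void *a, const void *b)
{
    return (uintptr_t)a / CACHE_LINE == (uintptr_t)b / CACHE_LINE;
}
```

The padded layout costs memory (128 bytes instead of 16 here), so it is worth applying only to fields that are written frequently by different CPUs.
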
  4. Spinning on Trylock: If a thread repeatedly calls mutex_trylock and fails, it will spin uselessly, consuming CPU cycles and potentially causing other threads to miss their deadlines.

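    • Diagnosis: A failed trylock returns immediately instead of waiting, so it may not register as a contention event in perf lock. Profile CPU time instead with perf record -a -g -- sleep 10 followed by perf report, and look for hot call stacks in which the same loop repeatedly calls mutex_trylock (or pthread_mutex_trylock in user space).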
    • Fix: Replace mutex_trylock loops with a blocking mutex_lock or a more sophisticated wait strategy (e.g., exponential backoff or yielding). The trylock pattern is usually only appropriate when immediate failure is acceptable and retrying later is managed carefully.
    • Why it works: mutex_lock is designed to efficiently block the thread until the lock is available, preventing busy-waiting and allowing the CPU to do other work.
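
For user-space code, one way to replace an unbounded trylock loop is sketched below with POSIX threads. The helper name lock_with_fallback and the retry limit are hypothetical; the idea is to make a few opportunistic attempts, yield the CPU between them, and then fall back to a blocking pthread_mutex_lock rather than spinning indefinitely.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>

#define MAX_TRIES 4   /* hypothetical bound on opportunistic attempts */

/*
 * Acquires the mutex and returns the number of failed trylock attempts.
 * After MAX_TRIES failures the thread blocks instead of busy-waiting.
 */
static int lock_with_fallback(pthread_mutex_t *m)
{
    for (int attempt = 0; attempt < MAX_TRIES; attempt++) {
        if (pthread_mutex_trylock(m) == 0)
            return attempt;          /* got the lock without blocking */
        sched_yield();               /* let another thread run */
    }
    int rc = pthread_mutex_lock(m);  /* block until the lock is free */
    assert(rc == 0);
    (void)rc;
    return MAX_TRIES;
}
```

In many cases calling pthread_mutex_lock directly is simpler and just as fast; the bounded retry is only worth keeping when the caller has useful work it could do after a failed attempt.
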
  5. Kernel-Level Mutexes in Drivers or Subsystems: Sometimes, the contention isn’t in your application code but within a kernel driver or a core subsystem (e.g., networking, block I/O).

    • Diagnosis: Record system-wide with perf record -a -g -e lock:contention_begin -- sleep 10 (kernel 5.19 or later) and analyze the call stacks in perf report. If the top call stacks point into kernel functions (e.g., tcp_sendmsg, blk_mq_dispatch_rq_list, vfs_write), the problem is likely in the kernel.
    • Fix: This is usually harder. It might involve tuning kernel parameters (e.g., net.core.somaxconn, I/O scheduler settings), updating kernel versions, or identifying specific driver bugs. If it’s a common subsystem, search for known issues and patches.
    • Why it works: Addressing the specific kernel component’s bottleneck (e.g., by increasing queue depths, optimizing data paths, or fixing race conditions) removes the kernel-level contention.
  6. High Thread Count Leading to Scheduling Latency: A very large number of runnable threads increases scheduling overhead and latency. When threads are constantly switched in and out, a thread trying to acquire a lock might be descheduled just before it gets it, or the thread holding the lock might be descheduled while holding it, which lengthens the wait for everyone else.

    • Diagnosis: Use perf stat -e context-switches,migrations and perf sched record / perf sched script. High context switch counts and migrations, especially if correlated with mutex events, can indicate this.
    • Fix: Reduce the number of active threads. Consider using a thread pool with a fixed, reasonable number of threads. Re-evaluate the workload to see if fewer threads can achieve the same throughput.
    • Why it works: Fewer threads mean less contention for CPU time and the scheduler’s attention, leading to more predictable execution and faster lock acquisition.
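
A fixed-size pool, as suggested in the fix above, can be sketched like this. It is a simplified, one-shot pool: the names (pool_job, run_task, run_pool) and the MAX_WORKERS cap are hypothetical, and sizing the pool from sysconf(_SC_NPROCESSORS_ONLN) assumes a Linux/glibc-style system. Workers claim task indices from a shared atomic counter, so the number of threads stays bounded regardless of how many tasks there are.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <unistd.h>

#define MAX_WORKERS 64   /* hypothetical upper bound on pool size */

struct pool_job {
    atomic_int next;      /* next task index to claim */
    int total;            /* number of tasks in this batch */
    atomic_long result;   /* accumulated output of all tasks */
};

/* Stand-in for real work (hypothetical): task i simply yields i. */
static long run_task(int index)
{
    return index;
}

/* Each worker claims indices until the batch is exhausted. */
static void *pool_worker(void *arg)
{
    struct pool_job *job = arg;
    for (;;) {
        int i = atomic_fetch_add(&job->next, 1);
        if (i >= job->total)
            break;
        atomic_fetch_add(&job->result, run_task(i));
    }
    return NULL;
}

/* Runs total_tasks tasks on at most one worker per online CPU. */
static long run_pool(int total_tasks)
{
    long ncpu = sysconf(_SC_NPROCESSORS_ONLN);
    int nworkers = 1;
    if (ncpu > 1)
        nworkers = ncpu > MAX_WORKERS ? MAX_WORKERS : (int)ncpu;

    struct pool_job job;
    atomic_init(&job.next, 0);
    atomic_init(&job.result, 0);
    job.total = total_tasks;

    pthread_t threads[MAX_WORKERS];
    int started = 0;
    while (started < nworkers &&
           pthread_create(&threads[started], NULL, pool_worker, &job) == 0)
        started++;
    if (started == 0)
        pool_worker(&job);           /* no threads available: run inline */
    for (int i = 0; i < started; i++)
        pthread_join(threads[i], NULL);
    return atomic_load(&job.result);
}
```

A long-lived service would keep the workers alive and feed them through a queue, but the sizing principle is the same: tie the thread count to the available CPUs rather than to the number of tasks.
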

After fixing mutex contention, the next bottleneck you’re likely to hit is CPU cache thrashing or increased scheduling latency: threads that no longer wait on the lock get more CPU time and start competing for other limited resources.
