Linux perf can profile scheduler activity, giving you a window into how your system’s CPU time is being allocated.

Let’s see perf in action. Imagine you’re running a multi-threaded application, and you suspect it’s spending too much time waiting for the CPU. You can use perf to pinpoint this.

First, let’s capture context switch events. A context switch happens when the Linux scheduler decides to stop running one process or thread and start running another. This involves saving the state of the current task and loading the state of the next.

sudo perf record -e cs -a -- sleep 10

Here, perf record captures events into a file. -e cs selects the context-switches software event (cs is its alias). -a profiles every CPU system-wide rather than a single process. -- sleep 10 gives perf a command to run, so recording lasts 10 seconds. By default perf samples these events rather than recording each one; add -c 1 if you want a sample for every switch.

After it runs, you’ll have a perf.data file. Now, let’s analyze it:

sudo perf report

perf report shows which tasks the context-switch samples were attributed to. By default the entries are grouped by command, shared object, and symbol, each with its share of the samples.
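perf report is interactive by default. For a quick text tally, one option is to dump the recorded samples with perf script and count them per command name. The sketch below assumes a perf.data file recorded as above; top_switchers is a made-up helper name, and the numbers are sample counts, which match the true switch counts only if the recording used a sample period of 1 (-c 1).

```shell
# Tally perf samples per command name.
# Reads one command name per line on stdin (for example from
# `perf script -F comm`) and prints "count name", highest first.
# top_switchers is an illustrative helper name, not a perf feature.
top_switchers() {
    awk '{ $1 = $1; if ($0 != "") count[$0]++ }
         END { for (name in count) print count[name], name }' |
        sort -k1,1nr
}

# Usage (needs an existing perf.data in the current directory):
#   sudo perf script -F comm | top_switchers | head -n 10
```

Trimming whitespace in awk matters here because perf script pads the command column.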

What does this tell us? If you see a high number of context switches for a specific application, it means that application is either being preempted frequently (another task is getting CPU time) or it’s voluntarily yielding the CPU. This can be a sign of:

  • CPU Contention: Too many active tasks trying to share a limited number of CPU cores.
  • I/O Bound Tasks: A task waiting for I/O might be scheduled out, and when the I/O completes, it gets scheduled back in, causing a context switch.
  • Throttling: If a process is hitting CPU limits (e.g., in a container), it might be frequently scheduled out.
  • Frequent Task Creation/Destruction: A highly dynamic workload can lead to many context switches.
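To check the throttling case directly, the kernel reports how often a cgroup ran out of CPU quota. The sketch below assumes cgroup v2 with the cpu controller enabled, where each group has a cpu.stat file containing nr_periods, nr_throttled, and throttled_usec; throttle_summary is a hypothetical helper name and the group path is left for you to fill in.

```shell
# Summarize CPU-quota throttling for one cgroup v2 group.
# Argument: path to a cpu.stat file, e.g. /sys/fs/cgroup/<group>/cpu.stat
# throttle_summary is an illustrative helper name.
throttle_summary() {
    awk '$1 == "nr_periods"     { periods = $2 }
         $1 == "nr_throttled"   { throttled = $2 }
         $1 == "throttled_usec" { usec = $2 }
         END {
             if (periods > 0)
                 printf "throttled in %d of %d periods (%.1f%%), %d ms total\n",
                        throttled, periods, 100 * throttled / periods, usec / 1000
             else
                 print "no CPU quota periods recorded"
         }' "$1"
}

# Usage: throttle_summary /sys/fs/cgroup/<group>/cpu.stat
```

A high throttled share suggests the switches come from the quota rather than from contention with other tasks.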

Let’s say perf report shows a process named my_app has a disproportionately high number of context switches. You might then investigate further.

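Note that cs counts every context switch, voluntary or involuntary. perf has no separate vcs event, so a command like perf record -e cs,vcs is rejected as an unknown event. The distinction still matters. A voluntary context switch happens when a task gives up the CPU itself, usually by blocking on an event (like I/O or a mutex). An involuntary context switch happens when the scheduler preempts the task, usually because its time slice has expired or a higher-priority task needs to run.

To see the split, record the scheduler’s switch tracepoint, which logs the state of the outgoing task:

sudo perf record -e sched:sched_switch -a -- sleep 10
sudo perf script

This gives you a more granular view. Each sched_switch line in the perf script output carries a prev_state field. A task switched out while still runnable (prev_state R or R+) was switched out involuntarily; a task switched out in a sleeping state such as S or D gave up the CPU voluntarily. If my_app is mostly switched out in S or D, it is spending a lot of time waiting. If it is mostly switched out in R or R+, it’s being interrupted often.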
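The kernel also keeps running totals of both kinds of switch for every task in /proc/<pid>/status, which is a cheap cross-check that needs no recording. A minimal sketch, assuming a Linux /proc filesystem; ctxt_counts is a made-up helper name, and the totals cover only the task whose status file you pass (per-thread files live under /proc/<pid>/task/<tid>/status).

```shell
# Print voluntary and involuntary context-switch totals for one task.
# Argument: a /proc/<pid>/status file. The kernel labels involuntary
# switches "nonvoluntary_ctxt_switches".
# ctxt_counts is an illustrative helper name.
ctxt_counts() {
    awk '$1 == "voluntary_ctxt_switches:"    { vol = $2 }
         $1 == "nonvoluntary_ctxt_switches:" { invol = $2 }
         END { printf "voluntary=%d involuntary=%d\n", vol, invol }' "$1"
}

# Usage: ctxt_counts "/proc/$(pgrep -o my_app)/status"
```

Running it twice a few seconds apart and comparing the totals shows which kind of switch is growing.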

Consider the scenario where my_app shows a lot of involuntary context switches. Look at which tasks run in its place. swapper is the idle task, so a switch to swapper means the CPU went idle, not that something took it away. If the next task is another busy process or a kernel thread such as ksoftirqd, this often points to CPU starvation: your application isn’t getting enough CPU time because other processes or kernel activities are consuming it.

To fix this, you’d typically look at:

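  1. Resource Limits: Check cgroups for CPU limits if you’re in a containerized environment, since a task that exhausts its quota is switched out until the next period. Also check ulimit -u, the per-user limit on processes and threads. Hitting that limit makes thread creation fail rather than causing context switches directly, so treat it as something to rule out. If my_app is hitting it, you might need to increase it. For example, on some systems, you can edit /etc/security/limits.conf to add: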

    * soft nproc 65536
    * hard nproc 65536
    

    This increases the maximum number of processes/threads a user can run. The change takes effect on new login sessions. The kernel enforces these limits, so raising them lets my_app create the threads it needs without being artificially constrained; it does not reduce context switches by itself.

  2. CPU Affinity: If you have multiple CPU cores, ensure your application’s threads are distributed efficiently. You can use taskset to bind a process or its threads to specific CPU cores. For example, to run my_app on cores 0 and 1:

    taskset -c 0,1 ./my_app
    

    This helps when the scheduler keeps migrating my_app’s threads between cores, because each migration leaves warm cache state behind. Pinning gives the threads predictable access to those cores, but it also stops the scheduler from moving them to idle cores elsewhere, so measure before and after.

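  3. Kernel Scheduling Parameters: For advanced tuning, you can adjust kernel scheduler parameters such as sched_migration_cost_ns or sched_latency_ns, but this is highly system-dependent and can have unintended consequences. Where they live also depends on the kernel: before 5.13 they are sysctls under kernel., newer kernels moved them to debugfs under /sys/kernel/debug/sched/, and kernels from 6.6 on, which use the EEVDF scheduler, no longer have a latency tunable. A common adjustment might be to slightly increase kernel.sched_latency_ns to give tasks longer time slices, reducing the frequency of involuntary context switches due to preemption. On a kernel that still exposes the sysctl, you’d apply this with: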

    sudo sysctl -w kernel.sched_latency_ns=30000000 # Example: 30ms
    

    This tells the scheduler to aim for longer runtimes before considering preemption, giving tasks more contiguous CPU time.

  4. Identify Other Resource Hogs: Use top or htop to identify other processes consuming significant CPU. If ksoftirqd is high, it might indicate heavy interrupt load from network or disk I/O. You may need to investigate the hardware or drivers causing this.

  5. Application Logic: If my_app has many voluntary context switches, examine its internal threading model. Is it creating and destroying threads excessively? Is it blocking on I/O unnecessarily? Optimizing the application’s code to reduce blocking or frequent thread management is often the most effective solution.

  6. NUMA (Non-Uniform Memory Access): On NUMA systems, a process’s memory access patterns can influence scheduling decisions. If a process frequently accesses memory on a different NUMA node than its current CPU, the scheduler might try to migrate it, leading to context switches. Tools like numactl can help manage NUMA affinity.
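For the scheduler parameters in item 3, the file that holds a given tunable depends on the kernel version, so it is worth checking what your kernel exposes before writing to anything. The sketch below is a hedged helper, not a definitive list: it assumes the two locations I know of (the older sysctl path and the newer debugfs path), find_sched_tunable is a hypothetical name, and the optional second argument exists only so the lookup can be exercised against a scratch directory.

```shell
# Locate a scheduler tunable, which lives in different places
# depending on kernel version. Prints the first path that exists.
# Arguments: short name (e.g. latency_ns) and an optional root prefix
# (for testing only; defaults to the real filesystem).
# Run as root: debugfs is normally not readable by other users.
# find_sched_tunable is an illustrative helper name.
find_sched_tunable() {
    name=$1
    root=${2:-}
    for path in "$root/proc/sys/kernel/sched_$name" \
                "$root/sys/kernel/debug/sched/$name"; do
        if [ -e "$path" ]; then
            printf '%s\n' "$path"
            return 0
        fi
    done
    echo "sched tunable '$name' not found on this kernel" >&2
    return 1
}

# Usage: sudo sh -c '. ./sched.sh; cat "$(find_sched_tunable migration_cost_ns)"'
# (sched.sh is a hypothetical file containing the function above)
```

If the lookup fails, the kernel most likely does not offer that tunable, and the sysctl command in item 3 will fail too.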

The one thing that often trips people up is attributing all context switches to application misbehavior. Kernel threads, interrupt handlers, and even system daemons can be significant contributors. perf report is excellent at showing the source of the switch, but understanding the reason requires correlating that with system load and the behavior of other processes. For instance, a high number of context switches attributed to kworker threads might indicate heavy I/O operations or other kernel tasks that are indirectly affecting your application’s scheduling.
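One way to put a per-task number in context is to compare it with the system-wide switch rate. /proc/stat has a ctxt line holding the total number of context switches since boot, so sampling it twice gives a rate. A minimal sketch, assuming a Linux /proc filesystem; ctxt_total and ctxt_rate are made-up helper names.

```shell
# Read the system-wide context-switch counter (switches since boot).
# Argument: optional stat file, defaults to /proc/stat.
ctxt_total() {
    awk '$1 == "ctxt" { print $2 }' "${1:-/proc/stat}"
}

# Print system-wide context switches per second, averaged over an
# interval in whole seconds (default 5).
ctxt_rate() {
    interval=${1:-5}
    before=$(ctxt_total)
    sleep "$interval"
    after=$(ctxt_total)
    echo $(( (after - before) / interval ))
}

# Usage: ctxt_rate 10
```

If my_app accounts for only a small part of that rate, the pressure is probably coming from elsewhere on the system.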

Once you’ve optimized your application and system configuration to reduce context switches, the next thing you’ll likely encounter is profiling specific CPU cycles lost to cache misses.
