Linux perf can profile scheduler activity, giving you a window into how your system’s CPU time is being allocated.
Let’s see perf in action. Imagine you’re running a multi-threaded application, and you suspect it’s spending too much time waiting for the CPU. You can use perf to pinpoint this.
First, let’s capture context switch events. A context switch happens when the Linux scheduler decides to stop running one process or thread and start running another. This involves saving the state of the current task and loading the state of the next.
sudo perf record -e cs -a -- sleep 10
In this command, perf record captures events and writes them to perf.data. -e cs selects context-switch events (cs is perf’s alias for the context-switches software event). -a profiles every CPU system-wide rather than a single process. -- sleep 10 gives perf a command to run, which keeps the recording going for 10 seconds.
After it runs, you’ll have a perf.data file. Now, let’s analyze it:
sudo perf report
perf report then shows where the context switches occurred. You’ll see a list of entries broken down by command (the process or thread name), shared object, and symbol, each with its share of the recorded context-switch samples.
What does this tell us? If you see a high number of context switches for a specific application, it means that application is either being preempted frequently (another task is getting CPU time) or it’s voluntarily yielding the CPU. This can be a sign of:
- CPU Contention: Too many active tasks trying to share a limited number of CPU cores.
- I/O Bound Tasks: A task waiting for I/O might be scheduled out, and when the I/O completes, it gets scheduled back in, causing a context switch.
- Throttling: If a process is hitting CPU limits (e.g., in a container), it might be frequently scheduled out.
- Frequent Task Creation/Destruction: A highly dynamic workload can lead to many context switches.
Let’s say perf report shows a process named my_app has a disproportionately high number of context switches. You might then investigate further.
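Before drilling into my_app, it can help to know the system-wide switch rate as a baseline. The following is a minimal sketch, assuming a Linux /proc filesystem: the ctxt line in /proc/stat holds the total number of context switches since boot, so sampling it twice gives a rate. The function names read_ctxt and ctxt_rate are made up for this example.

```shell
# Print the cumulative context-switch count from a /proc/stat-style file.
# The "ctxt" line holds the total since boot, summed over all CPUs.
read_ctxt() {
    awk '$1 == "ctxt" { print $2 }' "${1:-/proc/stat}"
}

# Sample the counter twice and print context switches per second.
# $1: interval in whole seconds (default 5)
# $2: stat file to read (default /proc/stat)
ctxt_rate() {
    interval="${1:-5}"
    stat_file="${2:-/proc/stat}"
    before=$(read_ctxt "$stat_file")
    sleep "$interval"
    after=$(read_ctxt "$stat_file")
    echo $(( (after - before) / interval ))
}

# Example on a Linux host: ctxt_rate 5
```

What counts as a high rate depends on core count and workload, so compare the number against the same machine when it is behaving well rather than against a fixed threshold.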
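It also helps to separate voluntary from involuntary context switches. A voluntary context switch happens when a task gives up the CPU itself, usually by blocking on an event (I/O, a mutex, a sleep). An involuntary context switch happens when the scheduler preempts the task, usually because its time slice has expired or a higher-priority task needs to run. perf’s cs event counts both kinds together, and perf has no separate vcs event, so record the sched:sched_switch tracepoint instead. It reports the state of the task being switched out:
sudo perf record -e sched:sched_switch -a -- sleep 10
sudo perf script
This gives you a more granular view. In the perf script output, a prev_state of R (shown as R+ on some kernels) means the outgoing task was still runnable and was preempted, which is an involuntary switch. Any other state, such as S or D, means the task blocked, which is a voluntary switch. If my_app mostly leaves the CPU in state S or D, it might be spending a lot of time waiting. If it mostly leaves in state R, it’s being interrupted often.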
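The kernel also keeps per-task counters for both kinds of switch in /proc/<pid>/status, which can be read without perf. The sketch below assumes a Linux /proc filesystem; ctxt_split is a hypothetical helper name, and the counters it reads (voluntary_ctxt_switches and nonvoluntary_ctxt_switches) are cumulative since the task started.

```shell
# Print voluntary and involuntary context-switch counts for one task.
# $1: a status file such as /proc/<pid>/status
#     (defaults to the status file of the current shell).
ctxt_split() {
    awk '
        $1 == "voluntary_ctxt_switches:"    { vol = $2 }
        $1 == "nonvoluntary_ctxt_switches:" { invol = $2 }
        END { printf "voluntary=%d involuntary=%d\n", vol, invol }
    ' "${1:-/proc/$$/status}"
}

# Example on a Linux host, using the oldest process named my_app:
#   ctxt_split "/proc/$(pgrep -o my_app)/status"
```

As far as I know, /proc/<pid>/status reports the counters of the main thread only. For a multi-threaded program, the same function can be pointed at each /proc/<pid>/task/<tid>/status file.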
Consider the scenario where your application shows a lot of involuntary context switches, and perf report attributes much of the switching to kernel threads such as ksoftirqd. (Entries for swapper are the idle task, so they mark a CPU going idle rather than competing work.) This often points to CPU starvation: your application isn’t getting enough CPU time because other processes or kernel activities are consuming it.
To fix this, you’d typically look at:
- Resource Limits: Check ulimit -u for the per-user process/thread limit, and check cgroups for CPU limits if you’re in a containerized environment. If my_app is hitting a thread limit, you might need to increase it. For example, on some systems, you can edit /etc/security/limits.conf to add:
    * soft nproc 65536
    * hard nproc 65536
  This increases the maximum number of processes/threads a user can run. The change takes effect on new login sessions. It works because the kernel enforces these limits: by raising them, you allow my_app to create more threads if needed, so that resource exhaustion does not add to the context switching.
- CPU Affinity: If you have multiple CPU cores, ensure your application’s threads are distributed efficiently. You can use taskset to bind a process or its threads to specific CPU cores. For example, to run my_app on cores 0 and 1:
    taskset -c 0,1 ./my_app
  This can reduce contention if my_app’s threads are constantly being migrated between cores by the scheduler, which itself can cause context switches. By pinning them, you provide predictable access to those cores.
- Kernel Scheduling Parameters: For advanced tuning, you can adjust kernel scheduler parameters (e.g., using sysctl for kernel.sched_migration_cost_ns or kernel.sched_latency_ns), but this is highly system-dependent and can have unintended consequences. Availability also depends on the kernel version: on newer kernels (roughly 5.13 onward) these knobs are no longer sysctls and live under /sys/kernel/debug/sched/ instead, and the most recent schedulers may not offer sched_latency_ns at all. Where it exists, a common adjustment is to slightly increase kernel.sched_latency_ns to give tasks longer time slices, reducing the frequency of involuntary context switches due to preemption. You’d apply this with:
    sudo sysctl -w kernel.sched_latency_ns=30000000 # Example: 30ms
  This tells the scheduler to aim for longer runtimes before considering preemption, giving tasks more contiguous CPU time.
- Identify Other Resource Hogs: Use top or htop to identify other processes consuming significant CPU. If ksoftirqd is high, it might indicate heavy interrupt load from network or disk I/O. You may need to investigate the hardware or drivers causing this.
- Application Logic: If my_app has many voluntary context switches, examine its internal threading model. Is it creating and destroying threads excessively? Is it blocking on I/O unnecessarily? Optimizing the application’s code to reduce blocking or frequent thread management is often the most effective solution.
- NUMA (Non-Uniform Memory Access): On NUMA systems, a process’s memory access patterns can influence scheduling decisions. If a process frequently accesses memory on a different NUMA node than its current CPU, the scheduler might try to migrate it, leading to context switches. Tools like numactl can help manage NUMA affinity.
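For the cgroup CPU limit check mentioned in the list above, throttling can be confirmed from the cgroup’s own statistics. This is a sketch under assumptions: it expects a cgroup v2 cpu.stat file with the nr_periods, nr_throttled, and throttled_usec fields, which are present when the cpu controller is enabled for that cgroup. The function name throttle_summary is hypothetical, and the path to cpu.stat varies by system, so it is passed in as an argument.

```shell
# Summarize CPU throttling from a cgroup v2 cpu.stat file.
# $1: path to the cpu.stat file of the cgroup that runs the workload.
throttle_summary() {
    awk '
        $1 == "nr_periods"     { periods = $2 }
        $1 == "nr_throttled"   { throttled = $2 }
        $1 == "throttled_usec" { usec = $2 }
        END {
            # Share of enforcement periods in which the cgroup hit its quota.
            pct = (periods > 0) ? 100 * throttled / periods : 0
            printf "throttled_periods=%d/%d (%.1f%%) throttled_ms=%d\n",
                   throttled, periods, pct, usec / 1000
        }
    ' "$1"
}

# Example (inside a container the file is often /sys/fs/cgroup/cpu.stat):
#   throttle_summary /sys/fs/cgroup/cpu.stat
```

A throttled share that keeps growing while my_app is busy suggests the CPU limit, rather than competition from other tasks, is what keeps scheduling it out.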
The one thing that often trips people up is attributing all context switches to application misbehavior. Kernel threads, interrupt handlers, and even system daemons can be significant contributors. perf report is excellent at showing the source of the switch, but understanding the reason requires correlating that with system load and the behavior of other processes. For instance, a high number of context switches attributed to kworker threads might indicate heavy I/O operations or other kernel tasks that are indirectly affecting your application’s scheduling.
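To see whether kernel threads or daemons are among the heaviest switchers, one option is to rank every task by its cumulative counters. A rough sketch, assuming a Linux /proc filesystem; top_switchers is a made-up helper, and the totals cover each task’s whole lifetime, so long-running tasks naturally rank higher.

```shell
# List the tasks with the most context switches since they started.
# Output columns: total voluntary involuntary name
# $1: number of rows to print (default 10)
# $2: proc root to scan (default /proc)
top_switchers() {
    rows="${1:-10}"
    root="${2:-/proc}"
    for status in "$root"/[0-9]*/status; do
        # Tasks can exit while we scan; skip anything unreadable.
        [ -r "$status" ] || continue
        awk '
            /^Name:/ { sub(/^Name:[ \t]*/, ""); name = $0; next }
            $1 == "voluntary_ctxt_switches:"    { vol = $2 }
            $1 == "nonvoluntary_ctxt_switches:" { invol = $2 }
            END { if (name != "") printf "%d %d %d %s\n", vol + invol, vol, invol, name }
        ' "$status" 2>/dev/null
    done | sort -rn | head -n "$rows"
}

# Example on a Linux host: top_switchers 10
```

Names such as kworker/... or ksoftirqd/... near the top of the list are kernel threads, which is a hint to look at I/O and interrupt load before changing the application.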
Once you’ve optimized your application and system configuration to reduce context switches, the next thing you’ll likely want to profile is CPU cycles lost to cache misses.