perf’s scheduler tracepoints reveal when and why your tasks are switched onto and off the CPU, which is crucial for understanding performance bottlenecks.

Let’s see perf in action. Imagine we have a simple C program that spins in a tight loop, and we want to see how often it gets preempted.

// busy_loop.c: spin forever so the scheduler has something to preempt.
#include <stdio.h>

int main(void) {
    long long count = 0;
    printf("Starting busy loop...\n");
    while (1) {
        count++;
        // Report progress once per billion iterations.
        if (count % 1000000000 == 0) {
            printf("Count: %lld\n", count);
        }
    }
    return 0;  // not reached
}

We can compile and run this:

gcc busy_loop.c -o busy_loop
./busy_loop &

Now, let’s use perf to record scheduler events. We’re interested in sched:sched_switch, which fires every time the scheduler decides to stop one task and start another.

sudo perf record -e 'sched:sched_switch' -g -p $(pgrep busy_loop) -- sleep 10

This command does a few things:

  • sudo perf record: Starts recording with root privileges, which tracepoint events normally require.
  • -e 'sched:sched_switch': Selects the event to record, here the scheduler’s context-switch tracepoint.
  • -g: Captures a call graph with each event, showing the code path that led into the scheduler (for example a system call or an interrupt).
  • -p $(pgrep busy_loop): Attaches perf to the busy_loop process; pgrep busy_loop looks up its process ID.
  • -- sleep 10: Runs sleep 10 as a timer. perf stops recording when it exits, so we capture 10 seconds of busy_loop’s events.

After running, you’ll have a perf.data file. Let’s analyze it:

sudo perf script

This dumps a lot of output. To summarize it, keep only the sched:sched_switch lines and count them by the process name in the first column:

sudo perf script | grep 'sched:sched_switch' | awk '{print $1}' | sort | uniq -c | sort -nr

This pipeline prints how many sched:sched_switch events were recorded for each process name. Because we attached with -p, busy_loop should dominate the list; to compare processes across the whole system, record with -a instead of -p.

The mental model for scheduler events revolves around the scheduler’s job: efficiently allocating CPU time to all runnable tasks. When a task runs, it eventually stops. This stop can be voluntary (e.g., waiting for I/O) or involuntary (preemption). sched:sched_switch captures all these transitions.
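
Before reaching for perf, the kernel’s own per-task counters give a quick cross-check of the voluntary/involuntary split. The sketch below reads the voluntary_ctxt_switches and nonvoluntary_ctxt_switches lines from a /proc/<pid>/status file; the helper name ctxt_counts is made up for this example.

```shell
# ctxt_counts (hypothetical helper): print the voluntary and involuntary
# context-switch counters found in a /proc/<pid>/status file.
#   $1 = path to the status file
ctxt_counts() {
    awk '/^(non)?voluntary_ctxt_switches:/ { print $1, $2 }' "$1"
}

# Usage, assuming busy_loop is still running:
#   ctxt_counts "/proc/$(pgrep -x busy_loop)/status"
```

For a pure spin loop, the nonvoluntary counter should grow much faster than the voluntary one.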

The key events you’ll see are:

  • sched:sched_switch: The core event. It tells you when a CPU switched from prev_pid (the outgoing task) to next_pid (the incoming task). The important fields are prev_comm, prev_pid, prev_prio, prev_state, next_comm, next_pid, and next_prio; perf script prints the CPU number alongside each event.
  • sched:sched_wakeup: Fired when a task is woken up and becomes runnable. This often precedes a sched_switch where the woken task is chosen to run.
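  • sched_yield(): A task can voluntarily give up the CPU by calling sched_yield(). Mainline kernels do not provide a sched:sched_yield tracepoint, so trace the system call with syscalls:sys_enter_sched_yield instead; a yield may then be followed by a sched_switch in which the task is still runnable.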
  • sched:sched_process_fork and sched:sched_process_exec: Show process creation and execution, which can impact scheduling.
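
When several of these events are recorded together (perf record accepts more than one -e option), a per-event tally is a useful first look. This is a minimal sketch that reads perf script output on stdin and counts lines by event name; count_events is a hypothetical helper, and it assumes the event name appears as its own whitespace-separated field ending in a colon.

```shell
# count_events (hypothetical helper): tally perf script lines by sched event.
# Scans each line for a field such as "sched:sched_switch:" so that process
# names containing spaces do not shift the column being matched.
count_events() {
    awk '{
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^sched:[a-z_]+:$/) {
                e = $i
                sub(/:$/, "", e)   # drop the trailing colon
                n[e]++
                break
            }
        }
    }
    END { for (e in n) print n[e], e }' | sort -nr
}

# Usage: sudo perf script | count_events
```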

Understanding the call graph (-g) attached to each sched:sched_switch is vital. If it passes through a system call such as read() or write(), the task gave up the CPU voluntarily because it blocked waiting for I/O. If it passes through interrupt handling (functions like irq_enter or irq_exit), an interrupt occurred, potentially waking a higher-priority task or causing the scheduler to re-evaluate. For our busy_loop, you’ll mostly see preemption: its time slice expired or a higher-priority task (like perf or a kernel thread) needed the CPU.
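
Reading hundreds of stacks by eye gets tedious, so a rough classifier can help. The sketch below is a heuristic, not a definitive tool: it assumes perf script separates samples with blank lines when call graphs are recorded, and it matches symbol names (do_syscall_64, entry_SYSCALL, irqentry_exit, irq_exit, asm_sysvec_) that are typical of recent x86-64 kernels and will differ elsewhere. The name classify_switch_stacks is made up for this example.

```shell
# classify_switch_stacks (hypothetical helper): bucket each sched_switch
# sample by the kind of kernel entry path found in its call graph.
# Assumption: samples are separated by blank lines, so awk paragraph mode
# (RS = "") sees one event plus its stack as a single record.
classify_switch_stacks() {
    awk 'BEGIN { RS = "" }
    /sched:sched_switch:/ {
        if ($0 ~ /asm_sysvec_|irqentry_exit|irq_exit/) kind = "interrupt"
        else if ($0 ~ /do_syscall_64|entry_SYSCALL/)   kind = "syscall"
        else                                           kind = "other"
        n[kind]++
    }
    END { for (k in n) print n[k], k }' | sort -nr
}

# Usage: sudo perf script | classify_switch_stacks
```

A stack that reaches the scheduler on the way out of a system call is counted as "syscall" even if the task was preempted rather than blocked, so treat the buckets as a starting point for inspection.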

The scheduler’s primary goal is to ensure fairness and responsiveness. The Completely Fair Scheduler (CFS) tracks a "virtual runtime" for each task, which grows as the task runs, more slowly for higher-priority tasks. The task with the lowest virtual runtime has had the least CPU and is picked next. (Since Linux 6.6 the default scheduler is EEVDF, which keeps the virtual-runtime accounting but chooses tasks by virtual deadline.) sched:sched_switch shows the direct outcome of these decisions.
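
To make the virtual-runtime idea concrete, here is a toy model, not the kernel’s actual algorithm. Each tick it runs the task with the lowest virtual runtime and then advances that task’s virtual runtime by 1024 / weight, so heavier tasks accumulate it more slowly. The weight 1024 matches the kernel’s weight for nice 0; the task names, the other weights, and the helper name simulate_vruntime are invented for illustration.

```shell
# simulate_vruntime (hypothetical helper): toy lowest-vruntime-first model.
#   $1    = number of ticks to simulate
#   stdin = one "name weight" pair per line (weights must be positive)
# Prints how many ticks each task received.
simulate_vruntime() {
    awk -v ticks="$1" '
    { name[NR] = $1; weight[NR] = $2; vr[NR] = 0; ran[NR] = 0 }
    END {
        if (NR == 0) exit
        for (t = 0; t < ticks; t++) {
            pick = 1                          # ties go to the earliest task
            for (i = 2; i <= NR; i++)
                if (vr[i] < vr[pick]) pick = i
            ran[pick]++
            vr[pick] += 1024 / weight[pick]   # heavier weight, slower growth
        }
        for (i = 1; i <= NR; i++) print name[i], ran[i]
    }'
}

# Usage: printf 'A 1024\nB 2048\n' | simulate_vruntime 30
```

With those two tasks and 30 ticks, A receives 10 ticks and B receives 20: doubling the weight doubles the share of CPU time.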

When analyzing perf script output for sched:sched_switch, pay attention to:

  • High frequency of switches for a single task: This could indicate the task is constantly being preempted, possibly due to high system load or a high-priority task hogging the CPU.
  • Frequent switches between specific pairs of tasks: This might reveal inter-process communication patterns or contention.
  • Unexpected tasks appearing as next_comm: If a kernel thread or an unrelated user process is consistently taking the CPU, it can point to resource contention or unexpected system behavior.
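
The pair and next_comm patterns above can be counted mechanically. This sketch reads perf script output on stdin and tallies each "prev_comm -> next_comm" pair. It assumes the key=value layout (prev_comm=... prev_pid=... ==> next_comm=... next_pid=...); some perf builds print a more compact form, in which case the sed pattern needs adjusting. count_switch_pairs is a hypothetical helper name.

```shell
# count_switch_pairs (hypothetical helper): count switches per task pair.
# The sed pattern captures everything between "prev_comm=" and " prev_pid="
# (and likewise for next_comm), so names containing spaces survive intact.
count_switch_pairs() {
    sed -n 's/.*prev_comm=\(.*\) prev_pid=.* next_comm=\(.*\) next_pid=.*/\1 -> \2/p' |
        awk '{ n[$0]++ } END { for (p in n) print n[p], p }' |
        sort -nr
}

# Usage: sudo perf script | count_switch_pairs
```

A recording attached with -p mostly contains switches away from that one process; a system-wide recording (-a) gives a fuller picture of which pairs alternate.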

A lesser-known aspect is the prev_state field of sched:sched_switch. This field, printed on each sched_switch line in the default perf script output, indicates why the previous task stopped running. Common values include R (still runnable, meaning the task was preempted or yielded; a preemption is often shown as R+), S (interruptible sleep, waiting for an event), D (uninterruptible sleep, typically waiting for I/O), and I (idle, used by idle kernel threads). For our busy_loop, you’ll predominantly see R, since it never waits for anything; S appears only if it blocks, for example when printf’s write to the terminal has to wait.
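
A quick way to see that distribution is to tally prev_state across the recording. The sketch below assumes the prev_state=VALUE form appears on each sched_switch line; count_prev_state is a made-up helper name.

```shell
# count_prev_state (hypothetical helper): tally sched_switch lines by the
# state the outgoing task was left in (R, R+, S, D, ...).
count_prev_state() {
    sed -n 's/.*prev_state=\([^ ]*\) .*/\1/p' |
        awk '{ n[$1]++ } END { for (s in n) print n[s], s }' |
        sort -nr
}

# Usage: sudo perf script | count_prev_state
```

For busy_loop, nearly every line should land in the runnable bucket; a noticeable share of S or D would mean the program is blocking somewhere.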

The next step after understanding task switching is often to investigate what each task is doing when it gets the CPU, which leads to profiling with events like CPU cycles (cycles:pp) or instruction execution (instructions:pp).
