The Linux perf tool can detect cross-CPU scheduling events, revealing when a process or thread is moved between different CPU cores by the operating system’s scheduler.

Let’s see what this looks like in practice. Imagine you have a multi-threaded application where each thread is supposed to stay on a specific CPU core for cache efficiency. If the scheduler keeps migrating these threads, performance tanks.

Here’s a simple C program that creates a few threads, each intended to run on a particular CPU.

#define _GNU_SOURCE  // needed for CPU_SET, sched_setaffinity, and sched_getcpu
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <unistd.h>
#include <sched.h>

#define NUM_THREADS 4

void *worker(void *arg) {
    long id = (long)arg;
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(id, &cpuset); // Assign thread to CPU 'id'

    if (sched_setaffinity(0, sizeof(cpu_set_t), &cpuset) == -1) {
        perror("sched_setaffinity");
    }

    printf("Thread %ld running on CPU %d\n", id, sched_getcpu());

    // Busy loop to keep the thread active
    volatile unsigned long counter = 0; // unsigned, so wraparound is well defined
    while (1) {
        counter++;
        if (counter % 100000000 == 0) {
            printf("Thread %ld alive, on CPU %d\n", id, sched_getcpu());
        }
    }
    return NULL;
}

int main() {
    pthread_t threads[NUM_THREADS];
    for (long i = 0; i < NUM_THREADS; ++i) {
        if (pthread_create(&threads[i], NULL, worker, (void *)i) != 0) {
            perror("pthread_create");
            return 1;
        }
    }

    for (int i = 0; i < NUM_THREADS; ++i) {
        pthread_join(threads[i], NULL);
    }

    return 0;
}

Compile this with gcc -o affinity_test affinity_test.c -pthread. If you run it on a system with at least 4 cores, you’ll see output like:

Thread 0 running on CPU 0
Thread 1 running on CPU 1
Thread 2 running on CPU 2
Thread 3 running on CPU 3
Thread 0 alive, on CPU 0
Thread 1 alive, on CPU 1
Thread 2 alive, on CPU 2
Thread 3 alive, on CPU 3
Thread 0 alive, on CPU 0
Thread 1 alive, on CPU 1
Thread 2 alive, on CPU 2
Thread 3 alive, on CPU 3
...

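This looks good, but the program only reports its CPU at the moments it prints. It says nothing about where a thread ran before its sched_setaffinity call took effect, and the same spot check would miss short-lived migrations in a program whose affinity mask covers several CPUs or is not set at all. To watch the scheduler itself, we use perf.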

The key event we’re interested in is sched:sched_switch. This event fires every time the scheduler deschedules one task and schedules another. We want to filter this event to see when a task is rescheduled onto a different CPU than it was previously running on.

The perf command to capture this is:

sudo perf record -e 'sched:sched_switch' -g -o perf.data -- ./affinity_test

This records sched_switch events, captures call graphs (-g), and saves them to perf.data. Run the affinity_test program for a while. You’ll see the output from the program itself, and perf will be silently collecting data.

After a minute or two, stop the affinity_test program with Ctrl+C (or let it run until you’re satisfied). Then, analyze the perf.data file:

sudo perf script -i perf.data -F comm,tid,cpu,time,event | grep affinity_test

This command chain does the following:

  1. perf script: Dumps the raw perf.data into a human-readable text format. (sudo is needed because perf.data was written by root.)
  2. -F comm,tid,cpu,time,event: Limits each line to the task name, thread ID, CPU (printed in brackets, such as [002]), timestamp, and event name. Because no call-chain fields are requested, the stacks captured by -g are left out.
  3. grep affinity_test: Keeps only the events where one of our threads was the task being switched out.

Each remaining line gives the CPU a thread was running on when it was descheduled. If the same thread ID shows up with different CPU numbers over time, that thread was migrated.

A more direct way to spot cross-CPU migrations is to let awk summarize which CPUs each thread was seen on:

sudo perf record -e 'sched:sched_switch' -g -o perf.data -- bash -c './affinity_test & sleep 30 && killall affinity_test'
sudo perf script -i perf.data -F comm,tid,cpu,event | awk '$1 == "affinity_test" && /sched_switch/ {
    seen[$2 " " $3]++   # $2 is the thread ID, $3 is the CPU in brackets, e.g. [002]
}
END {
    for (key in seen) print "Thread/CPU " key ": " seen[key] " switches"
}'

This awk script counts sched_switch events per (thread ID, CPU) pair. perf script prints the CPU on which each event fired in brackets, so a thread that stayed put appears with a single CPU, while a thread that migrated appears with two or more. The field positions assume the -F list shown above and a task name without spaces.

What these events bring to light is a trade-off inside the Linux scheduler. It does try to keep a task on the CPU it last ran on, since the caches there are likely still warm, but it also has to keep every CPU busy and react to system events (I/O completion, new tasks arriving, tasks yielding). When a CPU goes idle or the load becomes uneven, the load balancer may move a task to another CPU, as long as that CPU is in the task’s affinity mask.

The sched:sched_switch event fires when the scheduler decides to stop running one task and start running another. The event payload includes the command name, PID, priority, and state of the task being switched out, and the command name, PID, and priority of the task being switched in. The payload does not contain a CPU number; perf records the CPU on which the event fired as part of each sample and prints it in brackets. By following one task across several events, we can see whether a task that was running on CPU 0 is now running on CPU 5, for example.
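In perf script text, the CPU appears in the bracketed column near the start of each line, while the task IDs appear as prev_pid= and next_pid= pairs. The sketch below pulls those three values out of each sched_switch line. The function name switch_fields is made up for this example, and the line layout it expects is an assumption about the default perf script format, which can differ between perf versions.

```shell
# switch_fields: read perf script lines on stdin and print
# "cpu=<n> out=<tid> in=<tid>" for every sched_switch event.
# Hypothetical helper; assumes the default perf script layout.
switch_fields() {
    awk '
    /sched:sched_switch:/ {
        cpu = ""; out = ""; into = ""; have_cpu = 0
        for (i = 1; i <= NF; i++) {
            if (!have_cpu && $i ~ /^\[[0-9]+\]$/) {
                cpu = $i
                gsub(/\[|\]/, "", cpu)   # "[002]" -> "002"
                cpu += 0                  # "002"   -> 2
                have_cpu = 1
            } else if ($i ~ /^prev_pid=/) {
                out = substr($i, 10)      # text after "prev_pid="
            } else if ($i ~ /^next_pid=/) {
                into = substr($i, 10)     # text after "next_pid="
            }
        }
        print "cpu=" cpu " out=" out " in=" into
    }'
}
```

It could be fed with something like sudo perf script -i perf.data | switch_fields. Scanning for key=value tokens, instead of relying on fixed field positions, keeps the helper working when a task name contains spaces.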

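The sched_setaffinity system call restricts a thread to a specific set of CPUs. On Linux, passing 0 as the PID applies the mask to the calling thread, which is why each worker in the example pins only itself. The mask is a hard constraint, not a hint: the scheduler will not run the thread on a CPU outside it. The thread can still be preempted, and if the mask contains more than one CPU, the scheduler remains free to move the thread among those CPUs for load balancing or responsiveness. A thread can also run briefly on another CPU before its affinity call executes.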
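One way to check which CPUs the kernel currently allows for each thread is to read the Cpus_allowed_list line from /proc/<pid>/task/<tid>/status. The sketch below wraps that in a shell function. The name allowed_cpus and its optional second argument (an alternative proc root, handy for trying the function against fake data) are inventions for this example.

```shell
# allowed_cpus PID [PROC_ROOT]
# Print "<tid> <cpu list>" for every thread of PID, using the
# Cpus_allowed_list field of <PROC_ROOT>/<pid>/task/<tid>/status.
# Hypothetical helper; PROC_ROOT defaults to /proc.
allowed_cpus() {
    pid=$1
    root=${2:-/proc}
    for status in "$root/$pid"/task/*/status; do
        [ -r "$status" ] || continue
        tid=${status%/status}   # strip the trailing "/status"
        tid=${tid##*/}          # keep the last path component (the TID)
        # $2 is the CPU list, for example "0-3" or "2".
        awk -v tid="$tid" '/^Cpus_allowed_list:/ { print tid, $2 }' "$status"
    done
}
```

While the test program is running, allowed_cpus "$(pgrep -o affinity_test)" should list each worker thread with the CPUs it is permitted to use. A worker whose list is wider than expected points to an affinity call that failed or never ran.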

The perf script output will show lines like this (the exact spacing and fields vary with the perf and kernel version):

affinity_test 12345 [002]   100.000000: sched:sched_switch: prev_comm=affinity_test prev_pid=12345 prev_prio=120 prev_state=R ==> next_comm=kworker/2:1 next_pid=87 next_prio=120

This indicates that task affinity_test (PID 12345) was switched out on CPU 2, the bracketed value, and task kworker/2:1 (PID 87) was switched in on that same CPU. A single sched_switch event always describes one CPU, so on its own it cannot show a migration. We need to look at the same task over multiple sched_switch events.

A more advanced perf script analysis could group events by PID and track the bracketed CPU column. If a specific PID appears with different CPU values across successive switches, that’s your migration.

sudo perf record -e 'sched:sched_switch' -o perf.data -- ./affinity_test &
sleep 30
sudo pkill affinity_test
wait  # let perf finish writing perf.data

# Report every change of CPU for each affinity_test thread.
# The thread ID that perf prints is the same number the kernel calls "pid"
# in the sched_switch payload.
sudo perf script -i perf.data -F comm,tid,cpu,time,event | awk '
# Field layout with the -F list above (may vary with the perf version):
#   $1 = comm, $2 = thread ID, $3 = [CPU], $4 = timestamp, $5 = event name
$1 == "affinity_test" && /sched_switch/ {
    pid = $2
    cpu = $3
    gsub(/\[|\]/, "", cpu)   # "[002]" -> "002"
    cpu += 0                 # "002"   -> 2

    # The sample CPU is where the task was running when it was switched out.
    # If we saw this task on another CPU earlier, it migrated in between.
    if ((pid in last_cpu) && last_cpu[pid] != cpu) {
        print "Migration detected: PID " pid " moved from CPU " last_cpu[pid] " to CPU " cpu
    }
    last_cpu[pid] = cpu   # remember for the next switch of this task
}
'

The output might show lines like:

Migration detected: PID 12345 moved from CPU 0 to CPU 2
Migration detected: PID 12346 moved from CPU 1 to CPU 3

If you see many such "Migration detected" lines for your application’s PIDs, the scheduler is indeed moving your threads between cores, which hurts cache performance. For a thread pinned to a single CPU, any migration reported after startup suggests the affinity call failed or was applied to a different thread than intended.
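As a cross-check, the kernel also provides a sched:sched_migrate_task tracepoint, which fires when a task is moved to another CPU and carries orig_cpu and dest_cpu in its payload. The sketch below tallies those events per task from perf script text on stdin. The function name count_migrations is made up for this example, and the payload layout it parses (comm=... pid=... prio=... orig_cpu=N dest_cpu=M) is an assumption that may vary between kernel versions.

```shell
# count_migrations: read perf script output for sched:sched_migrate_task
# on stdin and print "<pid> <migration count>", sorted by PID.
# Hypothetical helper; assumes pid=, orig_cpu= and dest_cpu= key=value pairs.
count_migrations() {
    awk '
    /sched:sched_migrate_task:/ {
        pid = ""; orig = ""; dest = ""
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^pid=/)           pid  = substr($i, 5)
            else if ($i ~ /^orig_cpu=/) orig = substr($i, 10)
            else if ($i ~ /^dest_cpu=/) dest = substr($i, 10)
        }
        # Count only events that actually change the CPU.
        if (pid != "" && orig != dest) moves[pid]++
    }
    END { for (p in moves) print p, moves[p] }' | sort -n
}
```

A possible workflow is to record with sudo perf record -e sched:sched_migrate_task -o migrate.data -- ./affinity_test and then run sudo perf script -i migrate.data | count_migrations. The helper reads the pid= value from the payload instead of the task named at the start of the line, because the migration is often carried out by a different task than the one being moved.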

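The fix often involves:

  1. Tightening CPU affinity: Check that every affinity call succeeds and that each mask contains only the CPUs you intend. A mask with a single CPU rules out migration for that thread entirely.
  2. System Load: Reduce overall system load. The scheduler never places a thread outside its affinity mask, but on a busy system unpinned or loosely pinned threads are rebalanced more often, and pinned threads must share their CPU with whatever else runs there.
  3. Scheduler Tuning: For very specific workloads, you might explore advanced scheduler tunables (/proc/sys/kernel/sched_*, some of which have moved under /sys/kernel/debug/sched/ on newer kernels), but this is complex and usually not the first step.
  4. Application Design: Re-architecting to be less sensitive to CPU locality or to utilize CPU sets more effectively.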

The next error you’ll hit is perf: command not found if perf isn’t installed, or empty output from the awk scripts if the field positions don’t match your perf version’s output format.
