Tracepoints are the Linux kernel’s built-in hooks for observing specific events, allowing tools like perf to gain deep visibility into subsystem behavior.

Let’s use perf to record kernel scheduler events.

sudo perf record -e 'sched:sched_switch' -a -- sleep 5
sudo perf script

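The first command records every scheduler context switch on all CPUs for 5 seconds and writes the result to perf.data; the second prints those events in a human-readable form. The listing below is a simplified illustration of the old task, the new task, and their PIDs. Actual perf script lines spell these out as named fields such as prev_comm, prev_pid, prev_state, next_comm, and next_pid: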

...
        0.000012:    0.000000:    0.000000:   <idle> (0) -> swapper/0 (0)
        0.000056:    0.000000:    0.000000:   swapper/0 (0) -> systemd (1)
        0.000089:    0.000000:    0.000000:   systemd (1) -> (null) (0)
        0.000123:    0.000000:    0.000000:   (null) (0) -> kworker/0:1H-123 (123)
...

The sched:sched_switch tracepoint is fundamental to understanding how the kernel manages CPU time. When you see sched_switch, it means the scheduler has decided to stop running one task (identified by prev_comm and prev_pid) and start running another (next_comm and next_pid) on a CPU. The prev_state field records the state in which the outgoing task left the CPU, for example still runnable or sleeping. This is the heartbeat of concurrency on Linux.
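To get a quick feel for which tasks are being scheduled most often, the recorded events can be summarized by the PID being switched in. The helper below is a minimal sketch: count_switch_ins is a hypothetical name, and it assumes each sched_switch line printed by perf script contains a next_pid=<number> field.

```shell
# Hypothetical helper: read `perf script` text on stdin and print
# "<count> <pid>" for every PID that was switched in, busiest first.
# Assumes each sched_switch line carries a next_pid=<number> field.
count_switch_ins() {
    grep -o 'next_pid=[0-9]*' |   # keep only the next_pid field
        cut -d= -f2 |             # drop the "next_pid=" prefix
        sort -n | uniq -c |       # count occurrences per PID
        sort -rn                  # highest count first
}

# Usage (after the perf record step above):
#   sudo perf script | count_switch_ins | head
```

PID 0 is the idle task, so a high count there usually just means the CPUs were mostly idle during the recording.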

The power of tracepoints lies in their granularity. perf can tap into thousands of these predefined points across various kernel subsystems: scheduling, memory management, network stack, file system operations, and more. Each tracepoint represents a specific, observable moment in the kernel’s execution.

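To find available tracepoints, you can explore the tracing directory in debugfs (on recent kernels the same tree is also available through tracefs at /sys/kernel/tracing):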

sudo find /sys/kernel/debug/tracing/events/ -type d -print | sort

This lists one directory per kernel subsystem and, inside each, one directory per tracepoint. Every tracepoint directory holds control files such as enable, filter, and format. For example, to list the events exposed by the block layer:

sudo find /sys/kernel/debug/tracing/events/block/ -mindepth 1 -maxdepth 1 -type d -print
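perf names tracepoints as subsystem:event, so it can be handy to print the directory tree in that form. The following is a small sketch, not part of perf: list_tracepoints is a hypothetical helper, and its default path assumes debugfs is mounted at /sys/kernel/debug. Running perf list tracepoint gives a similar listing without touching debugfs directly.

```shell
# Hypothetical helper: print every tracepoint under an events directory
# as subsystem:event, the form that `perf record -e` expects.
# The default path assumes debugfs is mounted at /sys/kernel/debug.
list_tracepoints() {
    root=${1:-/sys/kernel/debug/tracing/events}
    # Depth 2 is <subsystem>/<event>; depth 1 would be subsystems only.
    (cd "$root" && find . -mindepth 2 -maxdepth 2 -type d) |
        sed -e 's|^\./||' -e 's|/|:|' |
        sort
}

# Usage, from a root shell (the tree is normally readable by root only):
#   list_tracepoints | grep '^block:'
```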

You can enable or disable a tracepoint directly by writing 1 or 0 to its enable file in debugfs, though perf abstracts this for you. For instance, to manually enable sched_switch:

echo 1 | sudo tee /sys/kernel/debug/tracing/events/sched/sched_switch/enable
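A tracepoint enabled this way stays on until something writes 0 back, so it is worth pairing every enable with a disable. The sketch below wraps both directions in one function; set_tracepoint and the TRACING_DIR override are hypothetical names introduced for illustration, and the default path assumes the debugfs location used above.

```shell
# Hypothetical helper: switch one tracepoint on or off through its
# `enable` file. Run from a root shell. TRACING_DIR is an assumed
# override; it defaults to the debugfs path used above.
set_tracepoint() {
    event_path=$1   # e.g. sched/sched_switch
    value=$2        # 1 to enable, 0 to disable
    dir=${TRACING_DIR:-/sys/kernel/debug/tracing}
    echo "$value" > "$dir/events/$event_path/enable"
}

# Usage, as root:
#   set_tracepoint sched/sched_switch 1
#   head -n 20 /sys/kernel/debug/tracing/trace_pipe   # stream a few events
#   set_tracepoint sched/sched_switch 0
```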

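The perf tool then acts as a collector and analyzer: it attaches to the tracepoints you ask for and records each occurrence into perf.data. The -e flag selects the events to record, -a collects system-wide across all CPUs, and -- sleep 5 gives perf a command to run; recording lasts as long as that command does, here 5 seconds. Keep the window short for busy events such as sched_switch, since they can generate a lot of data quickly.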

The perf script command converts the raw perf.data recording into a more readable format, showing timestamps, CPU, PID, and event-specific details. Understanding the output requires familiarity with the subsystem being traced; for sched_switch, knowing about process states and scheduling priorities is key.
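Each tracepoint directory contains a format file that describes the event's fields, which is a good place to start when the output is unfamiliar. The helper below is a rough sketch that pulls just the field names out of such a file; tracepoint_fields is a hypothetical name, and the parsing assumes the usual "field:<type> <name>; offset:...; size:...;" line layout.

```shell
# Hypothetical helper: print the field names declared in a tracepoint
# `format` file. Assumes field lines look like
#   field:<type> <name>[<len>]; offset:<n>; size:<n>; signed:<n>;
tracepoint_fields() {
    # Keep the "<type> <name>" part of each field line ...
    sed -n 's/^[[:space:]]*field:\([^;]*\);.*/\1/p' "$1" |
        # ... then take the last word and strip any array suffix.
        awk '{ name = $NF; sub(/\[.*/, "", name); print name }'
}

# Usage, as root:
#   tracepoint_fields /sys/kernel/debug/tracing/events/sched/sched_switch/format
```

Fields whose names start with common_ are shared by every tracepoint; the remaining ones are specific to the event.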

Tracepoints are static instrumentation compiled into the kernel source. They are not dynamic probes such as kprobes; each one is declared with the TRACE_EVENT() macro and invoked through a generated trace_<event>() call, for example trace_sched_switch(), that kernel developers place at critical junctures. A disabled tracepoint costs almost nothing, because its call site is guarded by a static branch that behaves as a no-op until a tracer attaches. The kernel pays for the tracepoint only while a tracing subsystem is registered and listening.

When you see a tracepoint like syscalls:sys_enter_read, it signifies that a user-space process has just entered the read() system call. The arguments to that system call (file descriptor, buffer, count) are recorded as part of the tracepoint’s data, giving you a detailed view of how applications interact with the kernel.

The next step in your exploration of kernel observability will likely involve correlating events across different subsystems. For example, you might want to see which network events (net:netif_rx) occur immediately before a sched_switch event, or how file system operations (ext4:ext4_sync_fs) impact scheduler latency.
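One simple way to start correlating is to record several tracepoints in a single session, so that perf script shows them together in one timeline. The wrapper below is a sketch under that assumption; record_events is a hypothetical helper that only builds a perf record command with one -e per event.

```shell
# Hypothetical helper: record several tracepoints system-wide in one
# perf session. First argument is the duration in seconds, the rest
# are event names in subsystem:event form.
record_events() {
    duration=$1
    shift
    args=""
    for ev in "$@"; do
        args="$args -e $ev"
    done
    # $args is left unquoted on purpose so each -e/event pair
    # becomes a separate word on the perf command line.
    sudo perf record $args -a -- sleep "$duration"
}

# Usage:
#   record_events 5 net:netif_rx sched:sched_switch
#   sudo perf script | less
```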
