Linux perf’s kprobes let you instrument the kernel on the fly, without recompiling or rebooting, by dynamically inserting probes into kernel functions.

Let’s see perf in action. Imagine we want to know how many times the sys_read system call is being invoked on our system.

First, we need to find the exact name of the kernel function that handles read. The name varies with kernel version and architecture, so check /proc/kallsyms (or grep the kernel source, if available) rather than guessing. On recent x86_64 kernels the syscall entry point is usually __x64_sys_read, which calls ksys_read. This example uses __sys_read as a stand-in; substitute whichever name your kernel actually reports.
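The lookup described above can be sketched as a small shell helper. This is a minimal sketch, not a definitive tool: the function name find_kernel_syms is hypothetical, and it assumes the usual /proc/kallsyms layout of address, type, and symbol name, keeping only text (code) symbols of type t or T.

```shell
#!/bin/sh
# Hypothetical helper: list kernel text symbols whose name matches a regex.
# Assumes the kallsyms layout "<address> <type> <name> [module]".
find_kernel_syms() {
    pattern=$1
    symfile=${2:-/proc/kallsyms}
    # Types "t" and "T" mark functions in the text section; skip data symbols.
    awk -v pat="$pattern" \
        '($2 == "t" || $2 == "T") && $3 ~ pat { print $3 }' "$symfile"
}
```

Calling find_kernel_syms 'sys_read$' prints every function whose name ends in sys_read on the running kernel. Any of the names it prints is a candidate for the probes below; which one is the best target depends on your kernel.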

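Now, we tell perf to record an event each time this function is entered. perf creates kprobe events through its probe subcommand, so we first define the probe and then record it:

sudo perf probe --add __sys_read
sudo perf record -e probe:__sys_read -a -- sleep 10

These commands do a few things:

  • sudo perf probe --add __sys_read: This is the core. It inserts a kprobe at the entry of the specified kernel function and exposes it as a new event, which perf names probe:__sys_read by default.
  • sudo perf record: We’re using perf to record events.
  • -e probe:__sys_read: We’re specifying the event (-e) to record, here the probe we just defined.
  • -a: This collects system-wide, across all CPUs.
  • -- sleep 10: perf runs the sleep 10 command and records for as long as it runs, so recording stops after 10 seconds.

The probe stays registered after recording ends; remove it with sudo perf probe --del __sys_read when you are done.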

After 10 seconds, perf will have recorded information about every time __sys_read was entered. You’ll see a perf.data file appear in your current directory.

To see the results, we use perf report:

sudo perf report

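This will show you a TUI (Text User Interface) where you can navigate. You’ll see __sys_read listed, and the count associated with it. A kprobe fires only on entry. To record the function’s return instead, define a return probe (a kretprobe) by appending %return to the function name. perf prints the name of the event it creates, typically probe:__sys_read__return on recent versions:

sudo perf probe --add '__sys_read%return'
sudo perf record -e probe:__sys_read__return -a -- sleep 10
sudo perf report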
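perf’s own interface for defining kprobes is the perf probe subcommand, and when all you want is a count, perf stat is a lighter fit than record plus report. The sketch below strings the whole lifecycle together: add the probe, count hits, remove the probe. The helper names run and count_probe_hits and the DRY_RUN switch are assumptions made for illustration; the perf subcommands and flags are standard ones.

```shell
#!/bin/sh
# Sketch: count hits on a kernel function with a temporary perf probe.
# Set DRY_RUN=1 to print the commands instead of executing them.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: count_probe_hits <function> [seconds]
count_probe_hits() {
    func=$1
    secs=${2:-10}
    # Define a kprobe at function entry; perf names the event probe:<function>.
    run sudo perf probe --add "$func" || return 1
    # Count hits system-wide while "sleep" runs.
    run sudo perf stat -e "probe:$func" -a -- sleep "$secs"
    status=$?
    # Always remove the probe so it does not linger in the kernel.
    run sudo perf probe --del "probe:$func"
    return "$status"
}
```

For example, count_probe_hits ksys_read 10 would print a perf stat summary with the hit count for a 10-second window, assuming ksys_read is the right symbol on your kernel.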

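This is powerful because you can attach probes to almost any kernel function listed in /proc/kallsyms, whether or not it is exported to modules. The main exceptions are functions the compiler has inlined and a small set the kernel blacklists because probing them is unsafe. To count calls, you don’t need to know the function signature or what arguments it takes. The kernel’s kprobes subsystem handles the low-level mechanics of dynamically modifying kernel code in memory so that it calls into the tracing infrastructure.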

The problem perf kprobes solve is the need for deep, dynamic kernel introspection without being limited to the static tracepoints that kernel developers have already defined, and without the disruption of recompiling the kernel. Imagine debugging a race condition in a storage driver. You can’t just add printk statements everywhere; it would flood your logs and drastically alter the timing, potentially hiding the bug. With kprobes, you can attach a probe to a specific function within that driver, record its entry and exit, and even inspect its arguments.

Let’s say you want to see the arguments passed to sys_read. This requires a bit more scripting. We can use bpftrace which leverages kprobes (and other tracing mechanisms) under the hood.

sudo bpftrace -e 'kprobe:__sys_read { printf("read called with fd: %d, buf: %p, count: %d\n", arg0, arg1, arg2); }'

Here:

  • bpftrace -e '...': We’re running a bpftrace script.
  • kprobe:__sys_read: We’re hooking into the entry of __sys_read.
  • { ... }: This is the action to take when the probe hits.
  • printf(...): We’re printing formatted output.
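  • arg0, arg1, arg2: These are special bpftrace variables that represent the first, second, and third arguments to the probed kernel function. The kernel’s read syscall typically takes file descriptor, buffer pointer, and count. This mapping holds when the probed function receives those arguments directly, as ksys_read does; a syscall wrapper such as __x64_sys_read instead receives a single pointer to the saved registers.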

Running this would show you, in real-time, every read call and its arguments. This is invaluable for understanding data flow and identifying unexpected behavior.
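On a busy system, printing every call produces a flood of output. A common alternative is to aggregate inside the kernel and print a summary at the end. The sketch below generates a small bpftrace program that counts calls per process name and exits after a fixed window; the helper name reads_by_comm_program and the map name @reads are illustrative choices, not anything bpftrace requires.

```shell
#!/bin/sh
# Hypothetical helper: emit a bpftrace program that counts calls to a
# kernel function per process name and exits after a time window.
# Usage: reads_by_comm_program <function> [seconds]
reads_by_comm_program() {
    func=$1
    secs=${2:-10}
    # count() aggregates in kernel space; bpftrace prints maps on exit.
    cat <<EOF
kprobe:$func { @reads[comm] = count(); }
interval:s:$secs { exit(); }
EOF
}
```

You could run it with sudo bpftrace -e "$(reads_by_comm_program ksys_read)", substituting the symbol your kernel reports. When the interval fires, bpftrace exits and prints the @reads map with one count per process name.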

The real magic happens in the kernel’s kprobes subsystem, which both perf and bpftrace drive. When a probe is registered, the kernel saves a copy of the instruction at the probe address and overwrites it with a breakpoint instruction (int3 on x86); where it is safe to do so, the kernel may later optimize this into a jump. When execution hits the probe, the kprobes handler runs the attached tracing code, then executes the saved copy of the original instruction and resumes the function as if nothing happened, albeit with a slight performance cost. This mechanism is what allows dynamic instrumentation without a kernel reboot.
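You can observe the probes the kernel currently has registered. When debugfs is mounted at /sys/kernel/debug (an assumption; the mount point can differ), the file kprobes/list shows one probe per line: the address, the type (k for kprobe, r for kretprobe), the symbol plus offset, and optional flags such as [OPTIMIZED] or [DISABLED]. The helper below is a hypothetical sketch that summarizes such a listing.

```shell
#!/bin/sh
# Hypothetical helper: summarize a kprobes listing read from a file
# or, with no argument, from standard input.
# Expected line format: "<address>  <k|r>  <symbol+offset>  [FLAGS]"
summarize_kprobes() {
    awk '
        $2 == "k"       { entry++ }   # probes at an instruction address
        $2 == "r"       { ret++ }     # return probes
        /\[OPTIMIZED\]/ { opt++ }     # breakpoint replaced by a jump
        END {
            printf "kprobes=%d kretprobes=%d optimized=%d\n", entry, ret, opt
        }
    ' "${1:--}"
}
```

Because the listing is readable only by root, one way to use it is sudo cat /sys/kernel/debug/kprobes/list | summarize_kprobes while a perf or bpftrace probe is active.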

What many users miss is that kprobes are not limited to function entry. You can also place kretprobes at the return of a function. This is crucial for observing the return value or understanding what happened after a function executed its core logic. For instance, to see the return value of __sys_read (which is the number of bytes read or an error code):

sudo bpftrace -e 'kretprobe:__sys_read /pid == 1234/ { printf("read returned: %d\n", retval); }'

Here, retval is a special bpftrace variable for the return value. The /pid == 1234/ is a filter, only showing read calls from process ID 1234.
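Entry and return probes become more useful in pairs. A widely used bpftrace pattern stores a timestamp at entry, keyed by thread ID, and computes the elapsed time at return. The sketch below generates such a program for a function of your choice; the helper name latency_program and the map names @start and @ns are illustrative.

```shell
#!/bin/sh
# Hypothetical helper: emit a bpftrace program that builds a latency
# histogram (in nanoseconds) for one kernel function.
# Usage: latency_program <function>
latency_program() {
    func=$1
    # Entry: remember when this thread entered the function.
    # Return: if we saw the entry, record the elapsed time and clean up.
    cat <<EOF
kprobe:$func { @start[tid] = nsecs; }
kretprobe:$func /@start[tid]/ {
    @ns = hist(nsecs - @start[tid]);
    delete(@start[tid]);
}
EOF
}
```

Running sudo bpftrace -e "$(latency_program ksys_read)" and pressing Ctrl-C prints a power-of-two histogram of call durations. The /@start[tid]/ filter skips returns whose entry happened before tracing started.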

The next concept you’ll want to explore is uprobes, which are the user-space equivalent of kprobes, allowing you to dynamically instrument user-space applications without modifying their source code.
