ARM’s Statistical Profiling Extension (SPE) is a hardware feature that lets you sample program execution at a very fine-grained level without significantly impacting performance, giving you a remarkably accurate picture of where your CPU is spending its time.

Let’s see it in action. Imagine you have a simple C program that does some heavy computation:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

long long sum_of_squares(int n) {
    long long sum = 0;
    for (int i = 0; i < n; ++i) {
        sum += (long long)i * i;
    }
    return sum;
}

int main() {
    int iterations = 100000000;
    clock_t start = clock();
    long long result = sum_of_squares(iterations);
    clock_t end = clock();
    double cpu_time_used = ((double) (end - start)) / CLOCKS_PER_SEC;

    printf("Sum of squares for %d iterations: %lld\n", iterations, result);
    printf("CPU time used: %f seconds\n", cpu_time_used);

    return 0;
}

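To profile this with perf and ARM SPE, you first need to ensure that your CPU implements SPE, that the kernel has registered the SPE PMU (it shows up under /sys/bus/event_source/devices/, typically as arm_spe_0), and that perf is built with SPE support. On a compatible system, you’d typically run something like this: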

# Compile the program with debug info so perf can map samples to source lines (assuming gcc)
gcc -g -o compute compute.c

# Run perf to collect SPE data
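sudo perf record -e arm_spe_0// -- /path/to/compute

The -e arm_spe_0// option tells perf to use the SPE PMU as its event source. The trailing number is the PMU instance the kernel registered, and recent perf versions also accept arm_spe// to mean whichever SPE PMU is present. This will run your compute program while the CPU’s SPE hardware picks operations at a regular interval and records the program counter (PC) of each one, along with details such as the data address, latency, and events like cache misses. It’s "statistical" because it doesn’t capture every instruction; it samples one operation per interval, which you can set with perf’s -c option, and that keeps the overhead low.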
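To make the "statistical" part concrete, here is a minimal sketch in C that simulates interval sampling over a synthetic stream of operations. It is a toy model, not SPE itself: the two regions, the 9-in-10 split, the helper names, and the jitter range are all assumptions made up for illustration.

```c
#include <stdint.h>

/* Hypothetical model: every operation belongs to one of two code regions. */
enum { REGION_HOT = 0, REGION_OTHER = 1, REGION_COUNT = 2 };

/* Small deterministic generator so repeated runs give the same result. */
static uint32_t lcg_next(uint32_t *state) {
    *state = *state * 1664525u + 1013904223u;
    return *state;
}

/* 9 out of every 10 operations fall in the hot loop (an assumed workload). */
static int region_of(uint64_t op_index) {
    return (op_index % 10 == 9) ? REGION_OTHER : REGION_HOT;
}

/* Walk total_ops operations and sample roughly one per `interval`.
 * With use_jitter set, 0..15 extra operations are added to each interval
 * so the sampler does not lock onto the period of the workload.
 * Fills counts[] with samples per region and returns the sample total. */
uint64_t sample_stream(uint64_t total_ops, uint32_t interval, int use_jitter,
                       uint64_t counts[REGION_COUNT]) {
    uint32_t rng = 12345u;
    uint64_t samples = 0;
    counts[REGION_HOT] = 0;
    counts[REGION_OTHER] = 0;
    if (interval == 0) {
        return 0; /* an interval of zero would never advance */
    }
    for (uint64_t next = interval; next < total_ops; ) {
        counts[region_of(next)]++;
        samples++;
        uint32_t jitter = use_jitter ? (lcg_next(&rng) >> 24) % 16 : 0;
        next += interval + jitter;
    }
    return samples;
}
```

With ten million operations and an interval of 1000, the sampler looks at only about one operation in a thousand, yet the hot region’s share of samples comes out close to its true 90% share. Turning jitter off in this toy shows why SPE offers a jitter option: every sample lands on a multiple of 1000, so the second region is never seen at all.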

After perf record finishes, you’ll have a perf.data file. You can then analyze this data with perf report:

sudo perf report

This will bring up an interactive TUI. You’ll see a list of functions and the percentage of samples attributed to them. For our compute program, you’d expect to see sum_of_squares at the top, indicating that the majority of the CPU’s time (as captured by SPE samples) was spent within that function.

The fundamental problem SPE solves is the cost of exhaustive profiling. If you try to record every single instruction or event (as instrumentation-based profilers and full instruction tracing do), the profiling overhead itself becomes so large that it distorts the program’s execution, making the profile unreliable. SPE sidesteps this by using dedicated hardware to sample one operation out of every N, where N is a configurable interval. This low overhead means you can profile your application in a production-like environment without the profiler itself slowing things down dramatically.

Internally, SPE writes its sample records into a memory buffer that the kernel sets aside for it (perf’s AUX buffer). When the buffer fills up, the hardware raises an interrupt so the kernel can hand the data over to perf. The hardware can also filter what it records, for example only loads, only stores, only branches, or only operations slower than a minimum latency, but the most common usage is unfiltered sampling for general instruction flow analysis. The perf tool then decodes these records and correlates each PC back to your program’s symbols (functions, and source lines when debug information is available).
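The last step, turning raw PCs into per-function counts, can be sketched in a few lines of C. This is only an illustration of the idea: the record and symbol structures below are hypothetical, and real SPE records are variable-length packets that perf decodes for you.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified sample record holding only the sampled PC. */
struct sample_rec {
    uint64_t pc;
};

/* Hypothetical symbol table entry covering [start, start + size). */
struct sym {
    uint64_t start;
    uint64_t size;
    const char *name;
    uint64_t hits;
};

/* Binary search over a table sorted by start address.
 * Returns the symbol containing pc, or NULL if none does. */
struct sym *resolve_pc(struct sym *tab, size_t n, uint64_t pc) {
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (pc < tab[mid].start) {
            hi = mid;
        } else if (pc >= tab[mid].start + tab[mid].size) {
            lo = mid + 1;
        } else {
            return &tab[mid];
        }
    }
    return NULL;
}

/* Process one drained batch of samples, charging each to its symbol.
 * Returns how many samples could not be attributed to any symbol. */
size_t drain_samples(const struct sample_rec *buf, size_t count,
                     struct sym *tab, size_t n) {
    size_t unresolved = 0;
    for (size_t i = 0; i < count; i++) {
        struct sym *s = resolve_pc(tab, n, buf[i].pc);
        if (s != NULL) {
            s->hits++;
        } else {
            unresolved++;
        }
    }
    return unresolved;
}
```

Dividing each symbol’s hits by the total number of samples gives the kind of percentage column a report view shows.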

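The key levers you control are the sampling interval and the filters that decide which operations are recorded. The interval is set with perf’s -c option and is counted in operations between samples; if you leave it out, perf uses the smallest interval the hardware supports, so it is usually worth choosing a larger value. Filters are passed as parameters of the SPE event in the usual PMU form, for example arm_spe_0/load_filter=1,min_latency=50/ to keep only loads that took at least 50 cycles; store_filter and branch_filter work the same way. Understanding that sampling happens at a rate is crucial. It’s not about if a sample is taken, but when and how often.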

The most surprising thing about SPE is how it handles cache misses. SPE chooses which operation to sample by counting operations, not cycles. A load that stalls for hundreds of cycles on a cache miss still counts as one operation, so it is no more likely to be sampled than an add that completes in a single cycle. If you read raw sample counts as "time spent", the profile will under-represent memory-bound loops. Each SPE record does carry the sampled operation’s latency and event flags such as cache or TLB misses, so you can weight samples by latency, or filter for slow operations, to see where the stalls are. For true "time spent" profiling, you often want to pair SPE with a cycle-based event such as perf’s cycles, if your platform supports it.
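A small worked example shows how far apart the two views can be. The sketch below is an illustration only: the struct is hypothetical rather than the SPE record layout, and the latencies in the usage note are invented numbers.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sample: the region the sampled operation belonged to
 * and how many cycles that operation took. */
struct op_sample {
    int region;
    uint32_t latency_cycles;
};

/* Fraction of samples that landed in `region`.
 * This is what a plain sample-count profile reports. */
double share_by_count(const struct op_sample *s, size_t n, int region) {
    size_t hits = 0;
    for (size_t i = 0; i < n; i++) {
        if (s[i].region == region) {
            hits++;
        }
    }
    return n ? (double)hits / (double)n : 0.0;
}

/* Fraction of all sampled latency that belongs to `region`.
 * This is closer to where the time went. */
double share_by_latency(const struct op_sample *s, size_t n, int region) {
    uint64_t total = 0, in_region = 0;
    for (size_t i = 0; i < n; i++) {
        total += s[i].latency_cycles;
        if (s[i].region == region) {
            in_region += s[i].latency_cycles;
        }
    }
    return total ? (double)in_region / (double)total : 0.0;
}
```

Suppose ten samples come back: eight from a compute loop at 1 cycle each and two from a memory-bound loop at 100 cycles each. By count, the memory-bound loop gets 2 of 10 samples, or 20%. By latency, it accounts for 200 of 208 cycles, roughly 96%.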

Once you’ve got your SPE samples, the next logical step is to dive into call graphs to understand the hierarchical relationships between functions and identify the true bottlenecks across your entire application stack.
