perf is a Linux profiling utility, not a Python one. When pointed at a Python process, it doesn’t sample Python bytecode execution by default. Instead, it samples the native code running inside the process: the C functions of the Python interpreter and of any C extensions.

Let’s see what that looks like in practice. Imagine you have a simple Python script, slow_script.py:

import time

def busy_loop():
    x = 0
    for i in range(1000000):
        x += i

def main():
    print("Starting...")
    busy_loop()
    time.sleep(1)
    print("Done.")

if __name__ == "__main__":
    main()

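To profile a running process with perf, you’d first find its PID, for example with pgrep -f slow_script.py. Let’s say it’s 12345. Keep in mind that this script finishes within a second or two, so attaching by hand only works if you raise the loop count enough to keep it running.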

Then, you’d run perf top -p 12345.
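Because the script above exits so quickly, a longer-running variant is easier to attach to. The following is a minimal sketch under stated assumptions: the run_for helper and the 30-second duration are hypothetical additions for illustration, not part of the original script. It prints its own PID so the value can be passed to perf top -p.

```python
import os
import time


def busy_loop(n=1_000_000):
    """Same pure-Python loop as in slow_script.py, returning the sum."""
    x = 0
    for i in range(n):
        x += i
    return x


def run_for(seconds, n=1_000_000):
    """Repeat busy_loop until `seconds` of wall-clock time have passed.

    Hypothetical helper: returns how many times the loop ran.
    """
    deadline = time.monotonic() + seconds
    rounds = 0
    while time.monotonic() < deadline:
        busy_loop(n)
        rounds += 1
    return rounds


if __name__ == "__main__":
    # Print the PID so it can be given to `perf top -p <PID>`.
    print(f"PID: {os.getpid()}", flush=True)
    rounds = run_for(30)
    print(f"Completed {rounds} rounds.")
```

Run it in one terminal, then attach perf from another using the printed PID.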

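You won’t see busy_loop or main in the output. Instead, you’ll see time spent in interpreter internals, along the lines of the following (illustrative; the exact symbols and percentages depend on the Python version and build):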

Overhead  Command  Shared Object      Symbol
--------------------------------------------------------------------------------
 30.10%   python   python             PyObject_GenericGetAttrWithDict
 25.00%   python   python             _PyObject_MakeTpCall
 15.50%   python   python             dictobject.c:dict_get_item_common
 10.20%   python   libc-2.31.so       __GI___nanosleep
  5.10%   python   python             frameobject.c:frame_dealloc
  ...

This output shows that the bulk of the CPU time is consumed by Python’s internal C functions for attribute access (PyObject_GenericGetAttrWithDict), function calls (_PyObject_MakeTpCall), and dictionary lookups (dict_get_item_common). The __GI___nanosleep entry from libc comes from time.sleep, but it reflects only the CPU spent entering and leaving the sleep call, not the second spent asleep: perf samples on-CPU time by default, and a sleeping process is off-CPU. The busy_loop function itself, as pure Python code, is invisible to perf at this level.

The problem perf solves here is understanding the overhead of running Python code. When you see PyObject_GenericGetAttrWithDict taking up a significant portion of CPU, it tells you that the Python interpreter is spending a lot of time doing the work of looking up attributes and executing your Python functions at the C level. This is the fundamental cost of Python’s dynamic nature.
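To see why the interpreter does so much C-level work for so little Python, it can help to look at the bytecode. The sketch below uses the standard dis module on the same busy_loop; the opnames helper is a hypothetical name introduced here. Each instruction it lists is one step dispatched by the interpreter’s C evaluation loop, and the instruction names vary between Python versions.

```python
import dis


def busy_loop():
    x = 0
    for i in range(1000000):
        x += i


def opnames(func):
    """Return the bytecode instruction names for a function, in order."""
    return [ins.opname for ins in dis.get_instructions(func)]


if __name__ == "__main__":
    # Each line printed here is one step the interpreter's C loop dispatches.
    dis.dis(busy_loop)
```

The handful of instructions inside the loop body run a million times, which is where the C-level samples come from.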

The key levers you control are how you structure your Python code and which libraries you use. If perf shows heavy time in PyObject_GenericGetAttrWithDict, it might indicate frequent attribute lookups or function calls that could potentially be optimized by reducing indirection, using local variables where possible, or employing libraries that perform critical operations in C. Similarly, high time in dictobject.c might suggest a hot loop involving heavy dictionary manipulation.
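As a rough illustration of the "local variables" lever, the sketch below compares a loop that looks up math.sqrt on every iteration with one that binds it to a local name first. The function names are hypothetical, and the size of the difference is an assumption to verify on your own interpreter: recent CPython versions have narrowed this gap considerably.

```python
import math
import timeit


def with_attribute_lookup(n=100_000):
    """Looks up math.sqrt on every iteration (global + attribute lookup)."""
    total = 0.0
    for i in range(n):
        total += math.sqrt(i)
    return total


def with_local_alias(n=100_000):
    """Binds math.sqrt to a local name once, outside the loop."""
    sqrt = math.sqrt
    total = 0.0
    for i in range(n):
        total += sqrt(i)
    return total


if __name__ == "__main__":
    for fn in (with_attribute_lookup, with_local_alias):
        seconds = timeit.timeit(fn, number=20)
        print(f"{fn.__name__}: {seconds:.3f}s for 20 runs")
```

Both functions compute the same result; only the number of lookups per iteration differs.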

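The perf record -p 12345 --call-graph dwarf command is your friend for getting more context. The --call-graph option enables call graph recording (it implies -g, so passing both is redundant), and dwarf tells perf to unwind the stack using DWARF debug information. This shows you which C functions called the hot ones, typically leading back to the interpreter’s evaluation loop (_PyEval_EvalFrameDefault). Python function names do not appear in these stacks by default; CPython 3.12 and later can expose them to perf when the interpreter is started with python -X perf.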
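If you want to script the recording step, one possible approach is to build the command in Python and run it with subprocess. This is only a sketch: the helper names, the 10-second duration, and the perf.data output path are assumptions for illustration. The trailing sleep command is a common way to bound how long perf stays attached to the PID.

```python
import shutil
import subprocess


def perf_record_cmd(pid, seconds=10, output="perf.data"):
    """Build a `perf record` command that samples `pid` for `seconds`.

    `--call-graph dwarf` already enables call-graph recording, and the
    trailing `sleep` command bounds how long perf stays attached.
    """
    return [
        "perf", "record",
        "-p", str(pid),
        "--call-graph", "dwarf",
        "-o", output,
        "--", "sleep", str(seconds),
    ]


def record(pid, seconds=10):
    """Run perf if it is installed; return its exit code, or None if absent."""
    if shutil.which("perf") is None:
        return None
    return subprocess.run(perf_record_cmd(pid, seconds)).returncode


if __name__ == "__main__":
    # 12345 is the placeholder PID used throughout this article.
    print(" ".join(perf_record_cmd(12345)))
```

Attaching to another process usually requires matching user permissions or a permissive perf_event_paranoid setting.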

With perf report -g, you can then navigate these call graphs. You’ll see that PyObject_GenericGetAttrWithDict is often called by internal Python C functions that manage attribute access, which in turn are called by the interpreter’s execution loop when it encounters an attribute access in your Python code. The time.sleep(1) in our example contributes very few samples, because a sleeping process is off-CPU; whatever samples do land in __GI___nanosleep can be traced back to the interpreter’s sleep implementation by expanding the call graph.
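On CPython 3.12 and later, the interpreter has optional perf support that makes Python function names appear in these call graphs. It can be switched on at startup with python -X perf or the PYTHONPERFSUPPORT=1 environment variable, or at runtime through the sys module. The sketch below takes the runtime route and assumes a Linux build with perf support; the two wrapper function names are hypothetical, and stack quality may depend on how the interpreter was compiled.

```python
import sys


def enable_perf_trampoline():
    """Try to turn on CPython's perf trampoline; return True on success.

    Assumes CPython 3.12+ on Linux. On older versions or unsupported
    platforms this returns False instead of raising.
    """
    activate = getattr(sys, "activate_stack_trampoline", None)
    if activate is None:
        return False  # Python older than 3.12
    try:
        activate("perf")
    except (ValueError, OSError, RuntimeError):
        return False  # build or platform without usable perf support
    return True


def disable_perf_trampoline():
    """Turn the trampoline off again if the running Python supports it."""
    deactivate = getattr(sys, "deactivate_stack_trampoline", None)
    if deactivate is not None:
        deactivate()


if __name__ == "__main__":
    print("perf trampoline enabled:", enable_perf_trampoline())
```

Enabling it only around the code of interest keeps the extra overhead out of the rest of the program.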

The one thing most people don’t realize is that perf is primarily measuring the cost of the interpreter, not the cost of your algorithm as written in Python. A computation that looks heavy may account for few samples when it runs in efficient C (e.g., a NumPy operation), and those samples appear under the extension’s own symbols. A simple Python loop, by contrast, can consume significant CPU because every operation (addition, assignment, loop iteration) incurs interpreter overhead.
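A small experiment makes this concrete without needing NumPy. The sketch below computes the same sum twice: once with a pure-Python loop, and once by handing the whole loop to the built-in sum, which is implemented in C. The function names and run counts are illustrative choices, and actual timings depend on your machine and Python version.

```python
import timeit


def python_loop(n=1_000_000):
    """Adds the numbers one bytecode step at a time."""
    x = 0
    for i in range(n):
        x += i
    return x


def builtin_sum(n=1_000_000):
    """Delegates the whole loop to the C implementation of sum()."""
    return sum(range(n))


if __name__ == "__main__":
    for fn in (python_loop, builtin_sum):
        seconds = timeit.timeit(fn, number=10)
        print(f"{fn.__name__}: {seconds:.3f}s for 10 runs")
```

The algorithm is identical in both cases; what changes is how many trips through the interpreter loop it takes.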

The next step after understanding Python interpreter overhead is to investigate how to profile specific Python functions more directly, often using Python-native profilers like cProfile or specialized tools like py-spy.
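For comparison, here is a minimal sketch of the cProfile route applied to the same busy_loop. Unlike perf, it reports time per Python function. The profile_report helper is a hypothetical wrapper around the standard cProfile and pstats modules.

```python
import cProfile
import io
import pstats


def busy_loop():
    x = 0
    for i in range(1000000):
        x += i
    return x


def profile_report(func, limit=5):
    """Run func under cProfile and return the top entries as text."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(limit)
    return buffer.getvalue()


if __name__ == "__main__":
    print(profile_report(busy_loop))
```

Here busy_loop appears by name in the report, which is exactly what the perf output above could not show.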
