Linux’s perf is surprisingly powerful for profiling Go programs, but most people treat it like a black box and miss its deepest insights.

Let’s see it in action. Imagine a simple Go web server:

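package main

import (
	"fmt"
	"net/http"
	"time"
)

// handler serves a fixed greeting for every request.
func handler(w http.ResponseWriter, r *http.Request) {
	fmt.Fprint(w, "Hello, world!")
}

// busyWork burns CPU so the profile has something to show.
// The noinline directive keeps it as a separate symbol in perf report.
//
//go:noinline
func busyWork() {
	for i := 0; i < 1000000; i++ {
		_ = i * i
	}
}

func main() {
	http.HandleFunc("/", handler)
	// Background goroutine: a burst of CPU work every 10ms.
	go func() {
		for {
			busyWork()
			time.Sleep(10 * time.Millisecond)
		}
	}()
	fmt.Println("Server starting on :8080")
	// ListenAndServe only returns on failure, so report the error.
	if err := http.ListenAndServe(":8080", nil); err != nil {
		fmt.Println("server error:", err)
	}
}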

We want to profile this with perf. First, we need to make sure perf is allowed to sample our process. This usually means ensuring kernel.perf_event_paranoid is set to 1 or 0. Check with cat /proc/sys/kernel/perf_event_paranoid. If it’s 2 or higher, you’ll need root privileges or to lower it: sudo sysctl kernel.perf_event_paranoid=1.
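The same check can be scripted in Go, for example as a preflight step before a profiling run. This is a minimal sketch: the helper names parseParanoid and unprivilegedOK are hypothetical, and the threshold simply mirrors the rule of thumb above (0 or 1 is fine, anything higher needs root or a sysctl change).

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// paranoidPath is the procfs file that exposes the setting on Linux.
const paranoidPath = "/proc/sys/kernel/perf_event_paranoid"

// parseParanoid converts the raw file content (e.g. "2\n") into an integer level.
func parseParanoid(raw string) (int, error) {
	return strconv.Atoi(strings.TrimSpace(raw))
}

// unprivilegedOK applies the rule of thumb from the text:
// levels of 1 or lower are fine, higher levels need root or a sysctl change.
func unprivilegedOK(level int) bool {
	return level <= 1
}

func main() {
	raw, err := os.ReadFile(paranoidPath)
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot read setting:", err)
		os.Exit(1)
	}
	level, err := parseParanoid(string(raw))
	if err != nil {
		fmt.Fprintln(os.Stderr, "unexpected content:", err)
		os.Exit(1)
	}
	if unprivilegedOK(level) {
		fmt.Printf("perf_event_paranoid=%d: no change needed\n", level)
	} else {
		fmt.Printf("perf_event_paranoid=%d: run perf with sudo or lower the setting\n", level)
	}
}
```

The program only reads the setting; changing it still requires sysctl with root privileges.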

Now, let’s start perf to record events. We’ll focus on CPU cycles (cycles) and context switches (context-switches) as they’re fundamental.

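Build the binary and run it directly rather than through go run: pgrep -f "go run main.go" would match the go tool itself, which only waits on the compiled program it launched as a child. The binary name server below is just an example.

go build -o server main.go
./server &
sudo perf record -F 999 -e cycles,context-switches --call-graph dwarf -p $(pgrep -x server)
  • -F 999: Samples at 999 Hz, a common frequency that balances detail and overhead.
  • -e cycles,context-switches: The events we’re interested in.
  • --call-graph dwarf: Enables call graph recording (a separate -g would be redundant) and unwinds stacks using DWARF debugging information, which Go produces by default. Call graphs are crucial for understanding why something is happening. Go also maintains frame pointers on amd64 and arm64, so --call-graph fp is a cheaper alternative if DWARF unwinding gives truncated stacks.
  • -p $(pgrep -x server): Targets the running server process by exact name.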

Let it run for a bit, then stop with Ctrl+C. Now, analyze the data:

sudo perf report

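This opens an interactive TUI. Navigate with arrow keys. Look for functions consuming high percentages of cycles. In a memory-intensive application you’ll likely see runtime.gcBgMarkWorker (background garbage-collection marking) or functions such as runtime.duffcopy/runtime.duffzero (copying and zeroing memory). In this example, the busy loop should stand out instead, showing up as main.busyWork or, if the compiler inlines it, as the goroutine closure main.main.func1.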

The real power comes from digging into the call graph. When you select a hot function in perf report, press Enter to see its callers and callees. This is where you’ll connect perf’s low-level observations to your Go code. In our example you should see busyWork called from main.main.func1, the anonymous goroutine started in main, rather than from main.main directly.

The mental model: perf operates at the kernel level, observing hardware performance counters and software events. It doesn’t inherently understand Go’s runtime specifics. However, Go binaries carry a symbol table and DWARF debugging information by default, and the runtime uses predictable structures. perf uses these to map raw sample addresses back to Go function names and to unwind call stacks. So, cycles in runtime.gcBgMarkWorker means the Go garbage collector is busy; cycles in main.handler means your request handler is taking CPU time. context-switches counts OS thread switches, not goroutine switches, so a high count often points to runtime threads parking and waking around scheduling or I/O blocking.
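The names perf shows are the symbol names the Go toolchain assigns, and the runtime can do the same address-to-name lookup from inside the process. The sketch below prints those names for a named function and for a closure; funcName is a hypothetical helper, and the closure's name is typically of the form main.main.func1, which is how anonymous goroutines tend to appear in perf report.

```go
package main

import (
	"fmt"
	"reflect"
	"runtime"
)

// busyWork stands in for the CPU-bound function from the example server.
//
//go:noinline
func busyWork() {
	for i := 0; i < 1000000; i++ {
		_ = i * i
	}
}

// funcName resolves a function value to its symbol name, which is the
// same kind of address-to-name lookup perf performs on sample addresses.
// fn must be a function value; reflect panics on other kinds.
func funcName(fn any) string {
	pc := reflect.ValueOf(fn).Pointer()
	f := runtime.FuncForPC(pc)
	if f == nil {
		return ""
	}
	return f.Name()
}

func main() {
	// A closure declared inside main gets a compiler-generated name.
	worker := func() { busyWork() }

	fmt.Println("named function:", funcName(busyWork))
	fmt.Println("closure:       ", funcName(worker))
}
```

Knowing the naming scheme makes it easier to match an entry in perf report to the function literal it came from.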

The surprising part is how perf can reveal issues outside your explicit Go code. For instance, you might see a significant amount of time spent in sched_yield or futex calls. These are kernel-level operations related to thread scheduling and synchronization. In a Go context, excessive futex waits often indicate that the runtime is parking and waking OS threads because goroutines are blocking on channel operations or mutexes, and sched_yield can signal runtime threads spinning under contention. This allows you to diagnose issues that aren’t just "my Go code is slow" but "the Go runtime, interacting with the OS, is experiencing contention."

Once you’ve identified a hot function in perf report, like busyWork in our example, you can examine its call graph to see what is calling it. If busyWork is consuming 30% of CPU cycles, and perf report shows it’s called from the goroutine closure main.main.func1, you know where to look in your Go source. You can then correlate this with Go’s built-in profiling tools (pprof) for a more granular view of heap allocations or goroutine states.
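For the pprof side of that comparison, a CPU profile can be captured with the standard runtime/pprof package. This is a minimal sketch under stated assumptions: captureCPUProfile is a hypothetical helper, the output file name cpu.pprof is arbitrary, and busyWork here is a standalone stand-in for the function in the server.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"runtime/pprof"
)

// busyWork is a stand-in for the CPU-bound function from the example.
func busyWork() int {
	sum := 0
	for i := 0; i < 1000000; i++ {
		sum += i * i
	}
	return sum
}

// captureCPUProfile runs work while the Go CPU profiler is active and
// returns the encoded profile.
func captureCPUProfile(work func()) ([]byte, error) {
	var buf bytes.Buffer
	if err := pprof.StartCPUProfile(&buf); err != nil {
		return nil, err
	}
	work()
	pprof.StopCPUProfile() // flushes the profile into buf
	return buf.Bytes(), nil
}

func main() {
	data, err := captureCPUProfile(func() {
		for i := 0; i < 200; i++ {
			busyWork()
		}
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, "profiling failed:", err)
		os.Exit(1)
	}
	if err := os.WriteFile("cpu.pprof", data, 0o644); err != nil {
		fmt.Fprintln(os.Stderr, "write failed:", err)
		os.Exit(1)
	}
	fmt.Printf("wrote %d bytes to cpu.pprof\n", len(data))
}
```

The resulting file can be inspected with go tool pprof cpu.pprof, and its top functions compared against what perf report shows for the same workload.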

The next thing you’ll likely encounter is needing to profile specific goroutines, which perf doesn’t directly do, pushing you towards pprof for that level of detail.
