Benchmarking performance is less about finding the absolute fastest speed and more about understanding how your system behaves under specific, reproducible loads.
Let’s see what this looks like in practice. Imagine we’re benchmarking a simple API endpoint that fetches user data. We’ll use wrk, a modern HTTP benchmarking tool.
wrk -t4 -c100 -d30s --latency http://localhost:8080/users/123
Here’s what’s happening:
- -t4: We’re using 4 threads to generate requests. More threads let wrk generate more load, but they also consume CPU on the machine running wrk, which can itself become the bottleneck.
- -c100: We’re keeping 100 connections open to the server in total, shared across the threads. This simulates concurrent users.
- -d30s: The benchmark will run for 30 seconds. Long enough to smooth out startup effects but short enough to be practical.
- --latency: We’re asking wrk to record detailed latency percentiles.
- http://localhost:8080/users/123: This is the target URL.
The output will look something like this:
Running 30s test @ http://localhost:8080/users/123
4 threads and 100 connections
Thread Stats Avg Stdev Max Median 99% 99.9% 99.99% 100% Sock
1 err 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0
2 err 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0
3 err 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0
4 err 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0
Total:
Thread Stats Avg Stdev Max Median 99% 99.9% 99.99% 100% Sock
4 err 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0
Latency Distribution
50% 0.15ms
75% 0.21ms
90% 0.30ms
99% 0.75ms
99.9% 1.50ms
99.99% 3.00ms
100% 5.00ms
Requests/sec: 45000.50
This output tells us a lot:
- Requests per second (RPS): 45,000.50. This is the throughput.
- Latency: The 99% latency is 0.75ms. This means 99% of requests were served in under 0.75 milliseconds. This is critical for user experience. High 99.9% or 100% latencies (tail latency) often indicate problems.
- Errors: 0 errors. If this number is non-zero, it’s the first thing you fix.
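Reading these numbers by eye gets tedious once you compare several runs. The following is a minimal sketch, not an official wrk interface, of pulling the throughput and the latency percentiles out of the text output. The `SAMPLE` string and the `parse_wrk_output` name are illustrative assumptions, and the parser only handles the `us`, `ms`, and `s` unit suffixes.

```python
import re

# Illustrative sample in the shape shown above; real wrk output has more lines,
# which this parser simply ignores.
SAMPLE = """\
Latency Distribution
50% 0.15ms
75% 0.21ms
90% 0.30ms
99% 0.75ms
Requests/sec: 45000.50
"""

# Convert every latency to milliseconds. Assumption: only these three
# unit suffixes appear; longer units are not handled in this sketch.
_UNIT_TO_MS = {"us": 0.001, "ms": 1.0, "s": 1000.0}


def parse_wrk_output(text):
    """Return (requests_per_sec, {percentile: latency_ms}) from wrk text output."""
    rps = None
    percentiles = {}
    for raw in text.splitlines():
        line = raw.strip()
        m = re.match(r"^Requests/sec:\s+([\d.]+)$", line)
        if m:
            rps = float(m.group(1))
            continue
        # Lines such as "99% 0.75ms": percentile, then value with a unit suffix.
        m = re.match(r"^([\d.]+)%\s+([\d.]+)(us|ms|s)$", line)
        if m:
            value_ms = float(m.group(2)) * _UNIT_TO_MS[m.group(3)]
            percentiles[float(m.group(1))] = value_ms
    return rps, percentiles


if __name__ == "__main__":
    rps, pct = parse_wrk_output(SAMPLE)
    print(rps, pct)
```

Keeping the parsed values in a dictionary makes it easy to store each run and compare percentiles across runs later.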
Benchmarking exists to show how your application or service behaves under load before that behavior affects real users. It’s about identifying the breaking points and understanding the trade-offs. The goal isn’t just to achieve a high RPS, but to do so with predictable, low latency for the vast majority of requests.
The mental model here is about simulating realistic user traffic. You’re not just hitting a single endpoint once; you’re mimicking a crowd of users making requests concurrently. The key levers you control are the number of threads (how much CPU the load generator itself can use), the number of connections (how many requests are in flight at once, which exercises network/IO and server concurrency), and the duration (long enough for the numbers to stabilize).
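One way to explore these levers is to hold threads and duration fixed and step through several connection counts. Here is a minimal sketch that only builds the command lines, using the same flags as the example above. The helper names `wrk_command` and `connection_sweep` and the chosen connection levels are assumptions for illustration.

```python
def wrk_command(url, threads=4, connections=100, duration_s=30):
    """Build the argv list for one wrk run with the flags discussed above."""
    if connections < threads:
        # wrk needs at least one connection per thread.
        raise ValueError("connections must be >= threads")
    return [
        "wrk",
        f"-t{threads}",
        f"-c{connections}",
        f"-d{duration_s}s",
        "--latency",
        url,
    ]


def connection_sweep(url, levels=(25, 50, 100, 200), threads=4):
    """Return one wrk command per connection level, keeping other levers fixed."""
    return [wrk_command(url, threads=threads, connections=c) for c in levels]


if __name__ == "__main__":
    for cmd in connection_sweep("http://localhost:8080/users/123"):
        print(" ".join(cmd))
```

If wrk is installed, each list can be passed to `subprocess.run(cmd, capture_output=True, text=True)`. Changing one lever at a time keeps the runs comparable.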
Crucially, you must ensure your benchmark environment is isolated. Running benchmarks on your laptop while browsing the web will yield meaningless results. Dedicate a machine or a controlled environment. Also, ensure your application’s configuration is set to production-like values. For example, if your database connection pool is set to a maximum of 5 connections in development, your benchmark will hit that limit immediately and show artificially low throughput.
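The connection pool example can be made concrete with a back-of-the-envelope bound. If every request holds one pooled connection for some time, throughput cannot exceed the pool size divided by that hold time (an application of Little’s Law). The sketch below assumes a hypothetical hold time of 10 ms per request; that figure and the function name are illustrative, not measurements.

```python
def pool_throughput_ceiling(pool_size, hold_time_ms):
    """Upper bound on requests/sec when each request holds one pooled
    connection for hold_time_ms milliseconds."""
    if pool_size <= 0 or hold_time_ms <= 0:
        raise ValueError("pool_size and hold_time_ms must be positive")
    return pool_size * 1000.0 / hold_time_ms


if __name__ == "__main__":
    # Hypothetical 10 ms hold time per request.
    dev_ceiling = pool_throughput_ceiling(pool_size=5, hold_time_ms=10)
    prod_ceiling = pool_throughput_ceiling(pool_size=50, hold_time_ms=10)
    print(f"pool of 5:  at most {dev_ceiling:.0f} requests/sec")
    print(f"pool of 50: at most {prod_ceiling:.0f} requests/sec")
```

Under that assumption, the development-sized pool caps throughput at a tenth of what the larger pool allows, no matter how many connections the load generator opens.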
The most surprising thing about performance benchmarking is how often the bottleneck isn’t where you expect it. People often assume it’s the CPU or the network, but it’s frequently an overlooked piece of internal logic, like inefficient string concatenation in a hot loop, or a poorly optimized database query that only manifests under concurrent access. Another common culprit is excessive garbage collection pauses in managed languages, which can cause sporadic, high-latency spikes that are hard to pinpoint without detailed profiling alongside your benchmark.
When you’re looking at your latency percentiles, especially the 99.9% and above, and you see a sharp jump from the median or 99%, it’s a strong indicator of a resource contention issue. This could be lock contention in your code, a shared resource like a file handle being exhausted, or even the operating system’s scheduler struggling to keep up with many concurrent threads. You need to look at the shape of the latency distribution, not just the average.
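To put a number on that “sharp jump”, you can compare a tail percentile against the median. The following is a minimal sketch that assumes you have raw per-request latencies available (for example, from your own client-side logging); the nearest-rank method and the function names are choices made for this illustration, and what counts as an alarming ratio depends on your service.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def tail_ratio(samples, tail=99.9, base=50.0):
    """How many times slower the tail percentile is than the base percentile."""
    return percentile(samples, tail) / percentile(samples, base)


if __name__ == "__main__":
    # Synthetic latencies in ms: mostly fast, with a few slow outliers.
    latencies_ms = [1.0] * 990 + [20.0] * 10
    print("median:", percentile(latencies_ms, 50))
    print("p99.9 / median:", tail_ratio(latencies_ms))
```

A ratio that grows as you add connections suggests contention rather than uniformly slow code, which is a useful hint about where to point a profiler.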
The next step is usually to analyze the causes of high tail latency, often involving profiling tools like perf or language-specific profilers to dig into the code execution.