Benchmarking: Beyond the Numbers

Benchmarking is often treated as the ultimate arbiter of performance, but many benchmarks are flawed at the root: they measure the wrong thing.

Let’s say you’re trying to benchmark a new web server. You set up two servers, one with the old version, one with the new. You hit them with ab (ApacheBench) for 30 seconds, requesting /index.html. You see the new server handles 10% more requests per second. Great, right? Not necessarily. What if /index.html is a static file that both servers hand straight off the filesystem, while your real traffic triggers a complex database query and cache invalidation on nearly every request? You just measured a synthetic, unrealistic workload that doesn’t reflect your actual application’s behavior.

The problem is that "performance" isn’t a single, abstract metric. It’s a complex interplay of factors that directly impact user experience and business outcomes. A benchmark should isolate and measure the performance of a specific, representative operation within your system, under conditions that mimic production as closely as possible.

Consider a simple API endpoint that retrieves user profile data. In production, this might involve:

Receiving an HTTP request at /api/users/{id}.
Authenticating the user (e.g., checking a JWT).
Querying a database (e.g., PostgreSQL) for user details.
Potentially fetching related data (e.g., recent posts from another service).
Serializing the response to JSON.
Returning the HTTP response.

A good benchmark wouldn’t just request the same /api/users/{id} URL over and over. It would exercise the entire flow, including authentication, and use realistic data distributions.

Here’s how you might approach this using wrk, a modern HTTP benchmarking tool, with a Lua script to simulate a realistic request:

-- wrk Lua script for a realistic user profile fetch
-- Save this as `profile_benchmark.lua`

-- Simulate a realistic distribution of user IDs
local function get_user_id()
    -- Example: 80% of requests are for frequently accessed users (IDs 1-100)
    -- 20% for less frequent users (IDs 101-1000)
    local r = math.random()
    if r < 0.8 then
        return math.random(1, 100)
    else
        return math.random(101, 1000)
    end
end

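wrk.method = "GET"
-- Placeholder token: substitute one your server accepts, or you will benchmark the rejection path
wrk.headers["Authorization"] = "Bearer fake_jwt_token_for_testing"

-- wrk looks up the global request() function to build each request
function request()
    local user_id = get_user_id()
    -- wrk.format() fills in wrk.headers, so the Authorization header is sent
    return wrk.format("GET", "/api/users/" .. user_id)
end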

Then, you’d run it against your API server:

# Assuming your API server is running on localhost:8080
# and the wrk script is saved as profile_benchmark.lua
wrk -t4 -c100 -d30s --script=profile_benchmark.lua --latency http://localhost:8080/api/users/

Here, -t4 uses 4 threads, -c100 keeps 100 concurrent connections open, and -d30s runs the benchmark for 30 seconds. The --script argument is key, telling wrk to use our Lua script for request generation. The path in the URL is only a default; the script builds its own path for every request.

The output will show Requests/sec, Transfer/sec, and latency statistics (average, standard deviation, maximum). Because we passed --latency, wrk also prints a latency distribution with the 50%, 75%, 90%, and 99% percentiles. This is where the real story is. A high request rate or a low average latency means little if the 99th percentile latency is measured in seconds: roughly one request in a hundred is that slow, and a user who makes many requests will almost certainly hit one.
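To see why the tail matters more than the average, here is a minimal Python sketch. The latency values are made up for illustration, and the `percentile` helper is a simple nearest-rank implementation written for this example, not the method wrk uses internally.

```python
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    ordered = sorted(samples)
    # Ceiling of pct * n / 100, done in integer arithmetic to avoid float rounding.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


# Made-up latencies in milliseconds: 98 fast requests and 2 very slow ones.
latencies_ms = [10.0] * 98 + [2000.0] * 2

mean_ms = statistics.mean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms:.1f}ms p50={p50_ms:.1f}ms p99={p99_ms:.1f}ms")
```

With this sample the mean is just under 50ms and the median is 10ms, yet the 99th percentile is 2000ms. Neither of the first two numbers hints at the two-second requests.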

End-to-end latency is built from the latencies of the components on the request path. For calls made one after another, the latencies add: if your API calls a microservice that spends 200ms on its own work and then waits 100ms on a database, the API call takes at least 300ms before any application logic or network overhead. Calls issued in parallel are instead bounded by the slowest one. Benchmarking should reflect these dependencies.
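The arithmetic can be sketched in a few lines of Python. The function names are hypothetical, and the model is a deliberately simplified lower bound that ignores network overhead and queueing: calls made one after another add up, while calls issued in parallel cost as much as the slowest one.

```python
def sequential_latency_ms(steps_ms):
    """Lower bound when each step waits for the previous one to finish."""
    return sum(steps_ms)


def parallel_latency_ms(branches_ms):
    """Lower bound when all branches start together and we wait for every one."""
    return max(branches_ms)


# The figures from the text: 200ms in the microservice, 100ms in the database.
service_ms = 200
database_ms = 100

chained = sequential_latency_ms([service_ms, database_ms])
fanned_out = parallel_latency_ms([service_ms, database_ms])

print(f"one after another: at least {chained}ms")
print(f"in parallel: at least {fanned_out}ms")
```

Knowing which model applies to each dependency tells you where a regression in one component will show up in the end-to-end number.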

The most important aspect of a benchmark is fidelity to production. This means:

Workload: Use realistic request patterns, data distributions, and traffic volumes. Don’t benchmark /. Benchmark your most frequent, most critical, or most resource-intensive endpoints.
Environment: Run benchmarks on hardware and network configurations as close to production as possible. Even differences in CPU cache sizes or network card drivers can matter.
Data: Use realistic datasets. If your database has millions of rows, don’t benchmark against an empty table.
Dependencies: If your service depends on other services or databases, ensure those dependencies are available and performing adequately during the benchmark.

Many people miss that the state of the system matters as much as the requests. If you benchmark a database after it’s been idle for hours, you’ll get different results than benchmarking it during peak load when its caches are warm and its buffers are full. You should pre-warm caches, run a warm-up phase before collecting metrics, and ensure data is in a representative state.
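A warm-up phase can be sketched as a small Python harness. Everything here is illustrative: `run_benchmark` and `CachedLookup` are hypothetical names, and the 10ms sleep stands in for a cold cache miss. The point is only that the first iterations run but are not timed.

```python
import time


def run_benchmark(operation, warmup_iterations, measured_iterations):
    """Call `operation` repeatedly, timing only the calls after the warm-up."""
    for _ in range(warmup_iterations):
        operation()  # Result discarded: these calls only populate caches.

    samples_ms = []
    for _ in range(measured_iterations):
        start = time.perf_counter()
        operation()
        samples_ms.append((time.perf_counter() - start) * 1000.0)
    return samples_ms


class CachedLookup:
    """Stand-in for a service whose first call is slow and later calls hit a cache."""

    def __init__(self):
        self.cache = {}
        self.misses = 0

    def __call__(self):
        if "profile" not in self.cache:
            self.misses += 1
            time.sleep(0.01)  # Simulated cold path.
            self.cache["profile"] = {"id": 42}
        return self.cache["profile"]


lookup = CachedLookup()
samples = run_benchmark(lookup, warmup_iterations=5, measured_iterations=100)
print(f"cold misses: {lookup.misses}, measured samples: {len(samples)}")
```

The single cold miss happens during warm-up, so it never reaches the measured samples. With wrk, a comparable effect comes from running a short throwaway pass before the run whose numbers you keep.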

The next step after establishing a solid benchmarking methodology is understanding how to interpret performance regressions and identify the specific component that has degraded.