Benchmark Rust Code with Criterion: Accurate Measurements (2026)

Criterion is a benchmarking framework for Rust that gives you statistically sound measurements of your code’s performance. What sets it apart is that it actively counters the natural tendency of benchmarks to mislead: rather than reporting a single raw timing, it collects many samples and analyzes them statistically before drawing a conclusion.

Let’s see it in action. Imagine we have a simple function to calculate the sum of a range of numbers:

fn sum_range(n: u64) -> u64 {
    (1..=n).sum()
}

fn sum_range_optimized(n: u64) -> u64 {
    n * (n + 1) / 2
}
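Before timing the two versions, it is worth confirming that they compute the same thing, since a benchmark of a wrong answer is meaningless. The following is a minimal sketch of such a check, using only the two functions above; the chosen inputs are arbitrary.

```rust
fn sum_range(n: u64) -> u64 {
    (1..=n).sum()
}

fn sum_range_optimized(n: u64) -> u64 {
    n * (n + 1) / 2
}

fn main() {
    // Both implementations must agree on every input we plan to benchmark.
    for n in [0u64, 1, 100, 1_000_000] {
        assert_eq!(sum_range(n), sum_range_optimized(n), "mismatch at n = {n}");
    }
    println!("sum of 1..=100 = {}", sum_range_optimized(100)); // prints 5050
}
```

Note that the closed form multiplies before dividing, so n * (n + 1) overflows u64 for very large n even when the final sum would still fit. The inputs used in this article are far below that range.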

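To benchmark these, we’ll create a benches directory at the root of our crate and put a file, say sum_benchmark.rs, inside it. We’ll need to add criterion to our dev-dependencies in Cargo.toml and register the benchmark with harness = false, so that Criterion’s own main function is used instead of the default test harness: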

[dev-dependencies]
criterion = "0.5"

[[bench]]
name = "sum_benchmark"
harness = false

Now, for sum_benchmark.rs:

use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn sum_range(n: u64) -> u64 {
    (1..=n).sum()
}

fn sum_range_optimized(n: u64) -> u64 {
    n * (n + 1) / 2
}

fn benchmark_sums(c: &mut Criterion) {
    let mut group = c.benchmark_group("Summation");

    let n_small = 100;
    group.bench_function(
        format!("sum_range small ({})", n_small),
        |b| b.iter(|| sum_range(black_box(n_small))),
    );
    group.bench_function(
        format!("sum_range_optimized small ({})", n_small),
        |b| b.iter(|| sum_range_optimized(black_box(n_small))),
    );

    let n_large = 1_000_000;
    group.bench_with_input(
        format!("sum_range large ({})", n_large),
        &n_large,
        |b, &n| b.iter(|| sum_range(black_box(n))),
    );
    group.bench_with_input(
        format!("sum_range_optimized large ({})", n_large),
        &n_large,
        |b, &n| b.iter(|| sum_range_optimized(black_box(n))),
    );

    group.finish();
}

criterion_group!(benches, benchmark_sums);
criterion_main!(benches);

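To run this, execute cargo bench. Criterion compiles the benchmark with optimizations, warms up, and then runs your code many times while collecting samples. You’ll see output roughly like this (abridged; the exact numbers depend on your machine, and the change: lines appear only when results from an earlier run exist):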

     Running benches/sum_benchmark.rs (target/release/deps/sum_benchmark-...)
Benchmarking Summation/sum_range small (100)
Benchmarking Summation/sum_range_optimized small (100)
Benchmarking Summation/sum_range large (1000000)
Benchmarking Summation/sum_range_optimized large (1000000)
Summation/sum_range small (100)
                        time:   [1.5889 ns 1.5939 ns 1.5991 ns]
                        change: [-1.1737% -0.8536% -0.5283%] (p = 0.00 < 0.05)
Summation/sum_range_optimized small (100)
                        time:   [1.5865 ns 1.5907 ns 1.5953 ns]
                        change: [-0.6345% -0.3159% +0.0050%] (p = 0.00 < 0.05)
Summation/sum_range large (1000000)
                        time:   [11.667 ns 11.716 ns 11.770 ns]
                        change: [-0.6143% -0.2445% +0.1218%] (p = 0.00 < 0.05)
Summation/sum_range_optimized large (1000000)
                        time:   [1.5909 ns 1.5960 ns 1.6014 ns]
                        change: [-0.5373% -0.1712% +0.1943%] (p = 0.00 < 0.05)

Here, black_box is crucial. It acts as an opaque barrier to the optimizer: the compiler has to assume that the value passing through it is unknown, so it cannot evaluate sum_range(100) at compile time or drop the call because its result is never used. That keeps the work you intend to measure in the compiled benchmark. Criterion itself handles the statistical analysis: it runs your code many times, measures the execution time, and uses techniques like regression analysis to estimate the true performance, accounting for noise from the operating system, CPU frequency scaling, and other factors. It reports confidence intervals and p-values, so you can tell whether an observed difference is statistically significant or just noise.
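To see what Criterion automates, here is a rough hand-rolled timing loop using only the standard library. It is a sketch for illustration, not a substitute for Criterion: time_batch and collect_samples are hypothetical helper names, the sample and iteration counts are arbitrary, and std::hint::black_box (the standard library counterpart of the black_box imported from criterion, available in recent Rust versions) plays the role described above.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

fn sum_range(n: u64) -> u64 {
    (1..=n).sum()
}

/// Hypothetical helper: times `iters` calls of `f` and returns the total elapsed time.
fn time_batch<F: FnMut() -> u64>(iters: u32, mut f: F) -> Duration {
    let start = Instant::now();
    for _ in 0..iters {
        // Passing the result through black_box keeps the call from being
        // discarded as dead code.
        black_box(f());
    }
    start.elapsed()
}

/// Hypothetical helper: collects `samples` timings and returns them sorted,
/// expressed in nanoseconds per call.
fn collect_samples(samples: usize, iters: u32) -> Vec<f64> {
    let mut per_call: Vec<f64> = (0..samples)
        .map(|_| {
            // black_box on the input hides the constant from the optimizer.
            let elapsed = time_batch(iters, || sum_range(black_box(1_000)));
            elapsed.as_nanos() as f64 / f64::from(iters)
        })
        .collect();
    per_call.sort_by(|a, b| a.total_cmp(b));
    per_call
}

fn main() {
    let samples = collect_samples(20, 10_000);
    let median = samples[samples.len() / 2];
    println!(
        "min {:.2} ns, median {:.2} ns, max {:.2} ns",
        samples[0],
        median,
        samples[samples.len() - 1]
    );
}
```

Running this a few times typically shows the minimum, median, and maximum drifting from run to run. That spread is the noise Criterion’s warm-up, sampling, and statistical analysis are designed to handle.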

The benchmark_group helps organize related benchmarks. bench_function is for cases where the input is fixed or captured by the closure, as with n_small, while bench_with_input is for scenarios where you want to pass the input explicitly, as we did with n_large; it is the natural choice when you run the same benchmark over several input values. Criterion automatically determines how many iterations to run for each benchmark so that it gathers enough data within its measurement time.

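One of the most useful features, often overlooked, is Criterion’s ability to detect performance regressions automatically. Each cargo bench run is compared against the results of the previous run, which Criterion stores under target/criterion/. To compare against a fixed reference point instead, save a named baseline with cargo bench -- --save-baseline <name> and later run cargo bench -- --baseline <name>. Criterion reports a regression in its output when the change is statistically significant and larger than its noise threshold, but it does not fail the build by default; to keep slow code from being merged, your CI has to inspect the results itself or rely on additional tooling.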

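The output shows you the estimated time per iteration, along with a confidence interval and a change percentage compared to the previous results. The three values in each pair of brackets are the lower bound, the point estimate, and the upper bound of the confidence interval. A p-value less than 0.05 typically indicates a statistically significant difference, but a significant difference can still be too small to matter in practice.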
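To make the reading of a change: line concrete, here is a minimal sketch of the decision it encodes. It is an approximation of the logic, not Criterion’s actual code: classify_change and Verdict are hypothetical names, and the significance level of 0.05 and noise threshold of 1% are assumed to match Criterion’s defaults.

```rust
/// Possible readings of a `change:` line.
#[derive(Debug, PartialEq)]
enum Verdict {
    NoChange,
    WithinNoise,
    Improved,
    Regressed,
}

/// Hypothetical helper. `lower` and `upper` are the bounds of the relative
/// change as fractions (-0.0117 means -1.17%); `p` is the reported p-value.
fn classify_change(lower: f64, upper: f64, p: f64, significance: f64, noise: f64) -> Verdict {
    if p >= significance {
        // The difference is not statistically significant.
        Verdict::NoChange
    } else if upper < -noise {
        // The whole interval lies below the negative noise threshold: faster.
        Verdict::Improved
    } else if lower > noise {
        // The whole interval lies above the noise threshold: slower.
        Verdict::Regressed
    } else {
        // Significant, but too small to distinguish from run-to-run noise.
        Verdict::WithinNoise
    }
}

fn main() {
    // Assumed defaults: significance level 0.05, noise threshold 1%.
    let (significance, noise) = (0.05, 0.01);

    // First benchmark from the output above: [-1.1737% -0.8536% -0.5283%], p = 0.00.
    let verdict = classify_change(-0.011737, -0.005283, 0.00, significance, noise);
    println!("{:?}", verdict); // prints WithinNoise
}
```

The example shows why a small p-value alone is not a reason to celebrate or worry: the change is statistically detectable, yet the interval does not clear the 1% noise band.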

Beyond basic timing, Criterion lets you tune parameters such as sample size, warm-up time, and measurement time, customize how measurements are taken, and generate detailed HTML reports with plots for deeper analysis (written under target/criterion/report/).

Understanding how Criterion’s statistical analysis works, particularly its use of regressions and confidence intervals, is key to interpreting benchmark results correctly and avoiding common pitfalls like microbenchmarking on noisy systems.
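The regression idea can be illustrated in a few lines. When samples are taken with increasing iteration counts, the total time of each sample should grow linearly with the number of iterations, and the slope of the fitted line is the time per iteration. The sketch below fits a least-squares line through the origin; slope_through_origin is a hypothetical helper, the sample data is made up, and Criterion’s real analysis goes further (for example, it uses bootstrap resampling to derive the confidence intervals).

```rust
/// Hypothetical helper: least-squares slope of a line through the origin,
/// computed as sum(x * y) / sum(x * x). Returns None for unusable input.
fn slope_through_origin(iters: &[f64], total_ns: &[f64]) -> Option<f64> {
    if iters.len() != total_ns.len() || iters.is_empty() {
        return None;
    }
    let xy: f64 = iters.iter().zip(total_ns).map(|(x, y)| x * y).sum();
    let xx: f64 = iters.iter().map(|x| x * x).sum();
    if xx == 0.0 {
        None
    } else {
        Some(xy / xx)
    }
}

fn main() {
    // Made-up samples: iteration counts and the total time of each sample in
    // nanoseconds, roughly 2 ns per iteration plus a little noise.
    let iters = [100.0, 200.0, 300.0, 400.0];
    let total_ns = [205.0, 398.0, 603.0, 796.0];

    match slope_through_origin(&iters, &total_ns) {
        Some(ns_per_iter) => println!("estimated time per iteration: {ns_per_iter:.3} ns"),
        None => println!("not enough data"),
    }
}
```

Fitting a slope across many samples is less sensitive to a single noisy measurement than dividing one total time by one iteration count, which is part of why the reported estimate tends to be stable.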

The next step is often to explore how to benchmark specific allocations or memory usage with Criterion.