The surprising truth about lock-free data structures is that they don’t eliminate contention; they just push it down to the atomic hardware primitives.

Let’s see what this looks like in practice. Imagine you have a shared counter that multiple threads need to increment.

use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

fn main() {
    let counter = Arc::new(AtomicUsize::new(0));
    let mut handles = vec![];

    for _ in 0..10 {
        let counter_clone = Arc::clone(&counter);
        let handle = thread::spawn(move || {
            for _ in 0..1000 {
                // This is where the magic happens (and contention)
                counter_clone.fetch_add(1, Ordering::SeqCst);
            }
        });
        handles.push(handle);
    }

    for handle in handles {
        handle.join().unwrap();
    }

    println!("Result: {}", counter.load(Ordering::SeqCst));
}

This code uses AtomicUsize to create a shared counter. Each thread tries to increment it 1000 times. Without locks, how does it avoid a race condition where two threads read the same value, increment it, and then one write overwrites the other’s update? The answer lies in fetch_add and Ordering::SeqCst.

The core problem these structures solve is avoiding the performance bottlenecks and deadlocks inherent in traditional mutex-based concurrency. When threads need to access shared mutable state, locks (like std::sync::Mutex) serialize access: only one thread can hold the lock at a time. This is safe but can be slow if threads frequently contend for the lock, leading to context switches and cache invalidation. Lock-free structures aim for better scalability by allowing multiple threads to attempt access concurrently, relying on atomic operations to ensure correctness.

The mental model for lock-free programming revolves around atomic operations. These are operations guaranteed by the hardware to be indivisible – they either complete entirely or not at all, without any intermediate state being visible to other threads. Common atomic operations include load (read), store (write), swap (read, write a new value, return old value), and read-modify-write operations like fetch_add, fetch_sub, fetch_and, fetch_or, and fetch_xor. These read-modify-write operations are crucial because they perform the entire sequence of reading the current value, performing a calculation, and writing the new value back in a single, uninterruptible step.

The Ordering parameter in atomic operations is where things get subtle. It dictates the memory visibility guarantees between threads.

  • Ordering::Relaxed: The weakest ordering. Guarantees atomicity but no specific ordering guarantees with respect to other memory operations. Useful for simple counters where the exact timing of updates doesn’t matter, only the final result.
  • Ordering::Acquire: Ensures that memory operations after this atomic operation in program order are visible to other threads that perform a Release operation on the same atomic variable. Think of it as a "consumer" barrier.
  • Ordering::Release: Ensures that memory operations before this atomic operation in program order are visible to other threads that perform an Acquire operation on the same atomic variable. Think of it as a "producer" barrier.
  • Ordering::AcqRel: Combines both Acquire and Release semantics. Used for read-modify-write operations that need to both acquire visibility of previous writes and release visibility of new writes.
  • Ordering::SeqCst (Sequentially Consistent): The strongest ordering. Guarantees that all threads see atomic operations in the same global order. This is the easiest to reason about but can be the most expensive. It imposes a global total order on all SeqCst operations across all threads.

In our counter example, fetch_add(1, Ordering::SeqCst) ensures that each increment is atomic. If two threads try to increment counter when it’s 5, one thread will perform the fetch_add and see 5 as the old value, writing 6. The other thread will then perform its fetch_add, see 6 as the old value, and write 7. The final result will be 7, not 6, which would happen with a non-atomic increment. SeqCst ensures that even with multiple threads, the sequence of operations appears consistent to all of them.

The "magic" that prevents the data race is the Compare-and-Swap (CAS) instruction that fetch_add (and other read-modify-write atomics) often compiles down to. A CAS operation takes a memory location, an expected old value, and a new value. It atomically checks if the current value at the memory location matches the expected old value. If it does, it writes the new value and returns success. If it doesn’t match, it does nothing and returns failure. fetch_add internally uses a loop: it loads the current value, calculates the new value, and then tries to CAS the memory location from the old value to the new value. If the CAS fails (because another thread modified the value in the meantime), it retries the entire load-calculate-CAS loop.

The most commonly misunderstood aspect of lock-free programming is the performance implications of Ordering::SeqCst. While it provides the strongest guarantees and is often the default choice for simplicity, SeqCst operations can be significantly more expensive than weaker orderings. This is because the CPU must ensure a total global order of all SeqCst operations, which might involve expensive memory fence instructions that stall the pipeline. In many scenarios, weaker orderings like Acquire/Release or even Relaxed can achieve the same logical outcome with better performance. For instance, if you’re just counting events and don’t care about the precise ordering of when those events happened relative to other shared state, Ordering::Relaxed might suffice for the counter itself.

The next frontier after mastering atomic operations is understanding how to build more complex lock-free data structures like queues, stacks, or lists, which often involve more intricate CAS loops and careful handling of memory reclamation.

Want structured learning?

Take the full Rust course →