Lock contention is often the bottleneck that prevents a concurrent program from scaling: threads spend more time waiting for locks than doing actual work.

Let’s see what this looks like in practice. Imagine a simple counter that multiple threads need to increment.

package main

import (
	"fmt"
	"sync"
	"time"
)

var counter int
var mu sync.Mutex

func main() {
	numGoroutines := 1000
	incrementsPerGoroutine := 1000

	start := time.Now()

	var wg sync.WaitGroup
	wg.Add(numGoroutines)

	for i := 0; i < numGoroutines; i++ {
		go func() {
			defer wg.Done()
			for j := 0; j < incrementsPerGoroutine; j++ {
				// Critical section: only one goroutine at a time runs the increment.
				mu.Lock()
				counter++
				mu.Unlock()
			}
		}()
	}

	wg.Wait()

	duration := time.Since(start)
	fmt.Printf("Final counter: %d\n", counter)
	fmt.Printf("Execution time: %s\n", duration)
}

If you run this with go run main.go, you’ll see a final counter value of 1,000,000 (as expected) and an execution time. Now, let’s try to increase the number of goroutines and increments to stress the lock.

If we increase numGoroutines to 10,000 and incrementsPerGoroutine to 10,000, the execution time will jump dramatically. Part of that is simply 100 times more increments, but the increment itself is trivial. Most of the elapsed time goes to goroutines stuck waiting for mu.Lock() to return.

The problem a mutex solves is the race condition: multiple goroutines modifying shared data at the same time, leading to unpredictable and incorrect results. A mutex (mutual exclusion lock) is the most basic tool to prevent this. Only one goroutine can hold the lock at a time, so the critical section (where counter++ happens) is never executed by two goroutines at once. Lock contention is the price of that safety: the time goroutines spend waiting for the lock instead of working.

Internally, when a goroutine calls mu.Lock() and the lock is already held, the Go runtime may let it spin briefly and then parks it, meaning the goroutine is put to sleep and taken off the run queue. It won't run again until another goroutine calls mu.Unlock() and the runtime wakes it. This sleeping and waking, along with the scheduler's overhead of managing the blocked goroutines, is what causes the performance degradation. The more goroutines contend for the same lock, the more time is spent in this cycle of context switching and waiting.
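To make the waiting visible, here is a rough sketch that counts how many lock acquisitions found the mutex already held. It assumes Go 1.18 or later for sync.Mutex.TryLock, and countContended is a hypothetical helper name, not something from the program above. TryLock is used here purely as a measuring device; it is rarely the right tool in production code.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// countContended is a hypothetical helper for illustration. It runs `workers`
// goroutines that each increment a shared counter `perWorker` times, and
// reports how many acquisitions found the mutex already held.
func countContended(workers, perWorker int) (total int, contended int64) {
	var (
		mu      sync.Mutex
		wg      sync.WaitGroup
		counter int
	)
	wg.Add(workers)
	for i := 0; i < workers; i++ {
		go func() {
			defer wg.Done()
			for j := 0; j < perWorker; j++ {
				// TryLock (Go 1.18+) returns false instead of blocking.
				if !mu.TryLock() {
					atomic.AddInt64(&contended, 1)
					mu.Lock() // fall back to the normal blocking path
				}
				counter++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return counter, contended
}

func main() {
	for _, workers := range []int{1, 8, 64} {
		total, contended := countContended(workers, 10000)
		fmt.Printf("workers=%d total=%d contended=%d\n", workers, total, contended)
	}
}
```

With a single worker the contended count is always zero. With more workers the number varies from run to run and machine to machine, so treat it as a rough indicator rather than a benchmark.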

The core levers you control are:

  1. The scope of the lock: How much code is inside the mu.Lock() and mu.Unlock() calls. Shorter critical sections are better.
  2. The number of contending goroutines: More goroutines fighting for the same lock means more contention.
  3. The frequency of lock acquisition: How often goroutines need to acquire and release the lock.
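The first lever, lock scope, can be sketched as follows. The registry type, the normalize function, and both method names are hypothetical, invented only for this illustration. The point is that work which touches no shared state can move outside the Lock/Unlock pair.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// registry is a hypothetical shared structure guarded by a mutex.
type registry struct {
	mu    sync.Mutex
	names map[string]string
}

// normalize stands in for work that needs no shared state.
func normalize(s string) string {
	return strings.ToLower(strings.TrimSpace(s))
}

// addWide holds the lock while normalizing, so other goroutines wait
// for work that never touches the map.
func (r *registry) addWide(raw string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	key := normalize(raw)
	r.names[key] = raw
}

// addNarrow normalizes first and holds the lock only for the map write.
func (r *registry) addNarrow(raw string) {
	key := normalize(raw)
	r.mu.Lock()
	r.names[key] = raw
	r.mu.Unlock()
}

func main() {
	r := &registry{names: make(map[string]string)}
	var wg sync.WaitGroup
	for _, raw := range []string{" Alice ", "BOB", "carol "} {
		wg.Add(1)
		go func(s string) {
			defer wg.Done()
			r.addNarrow(s)
		}(raw)
	}
	wg.Wait()
	fmt.Println(len(r.names))
}
```

Both methods produce the same map contents. The narrow version simply gives waiting goroutines less to wait for, which matters more as the work outside the map write gets heavier.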

Consider this alternative: instead of each goroutine incrementing a single shared counter, each goroutine maintains its own local counter. Only at the very end, after all goroutines have finished their work, do we sum up these local counts.

package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	numGoroutines := 10000
	incrementsPerGoroutine := 10000

	start := time.Now()

	var wg sync.WaitGroup
	wg.Add(numGoroutines)

	// Use a slice to store local counts for each goroutine
	localCounts := make([]int, numGoroutines)

	for i := 0; i < numGoroutines; i++ {
		go func(idx int) { // Pass index to the goroutine
			defer wg.Done()
			for j := 0; j < incrementsPerGoroutine; j++ {
				localCounts[idx]++ // Each goroutine increments its own local count
			}
		}(i) // Pass the loop variable i to the goroutine
	}

	wg.Wait()

	// Sum up local counts at the end
	var finalCounter int
	for _, count := range localCounts {
		finalCounter += count
	}

	duration := time.Since(start)
	fmt.Printf("Final counter: %d\n", finalCounter)
	fmt.Printf("Execution time: %s\n", duration)
}

Running this revised code with the same high numbers will show a vastly reduced execution time. Why? Because the increment no longer needs a mutex at all, so there is no critical section to wait for. Each goroutine writes only to its own element (localCounts[idx]), and no two goroutines modify the same memory location. The only shared resource is the localCounts slice itself, but because each goroutine touches only its own index, there is no data race. The final aggregation is done sequentially, after wg.Wait() has confirmed that every goroutine is finished.

The truly surprising thing about lock contention, especially in systems with many cores, is how quickly the overhead of waiting and context switching can outweigh the actual computation. A lock that is acquired and released millions of times per second, even if the critical section is just a few instructions, can serialize your entire application. The system might have plenty of CPU power, but it’s all being spent on managing blocked threads instead of doing useful work. This is why reducing the frequency and duration of lock holding is paramount, often more so than the complexity of the algorithm within the critical section.
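When a shared total does need to stay roughly current while the work is running, one middle ground is to batch updates: each goroutine counts locally and takes the lock only once per batch. The sketch below assumes a hypothetical helper countBatched and a batchSize parameter that does not appear in the programs above.

```go
package main

import (
	"fmt"
	"sync"
)

// countBatched is a hypothetical helper for illustration. Each goroutine
// accumulates increments in a local variable and flushes them to the shared
// counter once per batchSize increments, so the mutex is acquired roughly
// perWorker/batchSize times per goroutine instead of perWorker times.
// If batchSize is less than 1, each goroutine flushes only once, at the end.
func countBatched(workers, perWorker, batchSize int) int {
	var (
		mu      sync.Mutex
		wg      sync.WaitGroup
		counter int
	)
	wg.Add(workers)
	for i := 0; i < workers; i++ {
		go func() {
			defer wg.Done()
			pending := 0
			flush := func() {
				mu.Lock()
				counter += pending
				mu.Unlock()
				pending = 0
			}
			for j := 0; j < perWorker; j++ {
				pending++
				if pending == batchSize {
					flush()
				}
			}
			if pending > 0 {
				flush() // push whatever is left over from the last partial batch
			}
		}()
	}
	wg.Wait()
	return counter
}

func main() {
	fmt.Println(countBatched(1000, 1000, 100))
}
```

The trade-off is staleness: between flushes, the shared counter lags behind the true total by up to batchSize minus one increments per goroutine.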

If your concurrent code is still slow after you eliminate the obvious lock contention, the bottleneck has often moved elsewhere: to another form of synchronization overhead such as channel blocking, or to garbage collection pressure from allocating many small objects that need to be tracked.
