Gradient accumulation lets you train models with effectively larger batch sizes than your GPU memory can hold, by accumulating gradients over several smaller forward/backward passes before performing an optimizer step.
Let’s see it in action. Imagine we have a simple model and want to train it with a batch size of 64, but our GPU can only handle a batch size of 8.
import torch
import torch.nn as nn
import torch.optim as optim
# Model and data setup
model = nn.Linear(10, 2)
optimizer = optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()
# Simulate data
batch_size_per_step = 8
effective_batch_size = 64
accumulation_steps = effective_batch_size // batch_size_per_step
# Dummy input and target
input_data = torch.randn(effective_batch_size, 10)
target_data = torch.randn(effective_batch_size, 2)
# Gradient accumulation loop
model.train()
optimizer.zero_grad() # Zero gradients once at the start
for i in range(accumulation_steps):
    # Get a slice of data for the current mini-batch
    start_idx = i * batch_size_per_step
    end_idx = start_idx + batch_size_per_step
    inputs = input_data[start_idx:end_idx]
    targets = target_data[start_idx:end_idx]
    # Forward pass
    outputs = model(inputs)
    loss = criterion(outputs, targets)
    # Normalize loss to account for accumulation
    loss = loss / accumulation_steps
    # Backward pass
    loss.backward()  # Accumulates gradients
    # If it's the last accumulation step, then update weights
    if (i + 1) == accumulation_steps:
        optimizer.step()  # Perform optimizer step
        optimizer.zero_grad()  # Zero gradients for the next batch
This setup allows us to simulate training with effective_batch_size = 64 by performing 8 forward/backward passes, each with batch_size_per_step = 8, and accumulating the gradients before updating the model’s weights.
The core problem gradient accumulation solves is memory-bound training. Modern deep learning models are often massive, and the activations for even a single batch can consume gigabytes of GPU VRAM. When your desired batch size exceeds available memory, you hit an OutOfMemoryError. Gradient accumulation avoids this by splitting the large effective batch into smaller mini-batches that each fit in memory. Each mini-batch computes gradients, but those gradients are not immediately used to update the model. Instead, they are summed (accumulated) across mini-batches, and the optimizer takes a step only once enough mini-batches have been processed to make up the desired effective batch size. The update then matches the one a single large batch would have produced, which gives you the less noisy gradient estimates of a large batch without requiring more VRAM. The cost is time, because the mini-batches run one after another. The match is also exact only for layers that treat each sample independently; layers such as batch normalization still compute their statistics per mini-batch.
The key mechanical insight is how loss.backward() works. When you call backward() multiple times without calling optimizer.zero_grad() in between, PyTorch adds the new gradients to the ones already stored in the .grad attributes of the parameters. This is precisely the accumulation we need. The other crucial step is dividing the loss by accumulation_steps before calling backward(). A criterion such as nn.MSELoss averages over the samples in a mini-batch by default, so each mini-batch already yields a mean gradient. Summing accumulation_steps of those means without scaling would give a gradient accumulation_steps times larger than the full-batch mean, leading to excessively large weight updates and unstable training. Dividing each loss by accumulation_steps turns the sum into an average of mini-batch means, which equals the gradient of a single full-sized batch when the mini-batches are the same size.
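The scaling argument can be checked numerically. The following is a minimal sketch, assuming a mean-reduction loss and equal-sized mini-batches; the variable names (full_batch_grad, accumulated_grad, unscaled_grad) are illustrative and not part of any API.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Linear(10, 2)
criterion = nn.MSELoss()  # default reduction is "mean"

inputs = torch.randn(64, 10)
targets = torch.randn(64, 2)
accumulation_steps = 8

# Reference: gradient from one full batch of 64 samples.
model.zero_grad()
criterion(model(inputs), targets).backward()
full_batch_grad = model.weight.grad.clone()

# Accumulated: eight mini-batches of 8, each loss divided by accumulation_steps.
model.zero_grad()
for x, y in zip(inputs.chunk(accumulation_steps), targets.chunk(accumulation_steps)):
    loss = criterion(model(x), y) / accumulation_steps
    loss.backward()
accumulated_grad = model.weight.grad.clone()

# Same loop without the division, to show the over-sized gradient.
model.zero_grad()
for x, y in zip(inputs.chunk(accumulation_steps), targets.chunk(accumulation_steps)):
    criterion(model(x), y).backward()
unscaled_grad = model.weight.grad.clone()

# Scaled accumulation matches the full batch up to floating-point error.
grads_match = torch.allclose(full_batch_grad, accumulated_grad, atol=1e-6)
# Unscaled accumulation is accumulation_steps times too large.
unscaled_is_n_times_larger = torch.allclose(
    unscaled_grad, accumulation_steps * full_batch_grad, atol=1e-5
)
print(grads_match, unscaled_is_n_times_larger)
```

With a sum-reduction loss, or with mini-batches of unequal size, the division by accumulation_steps would no longer be the right normalization.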
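The optimizer.zero_grad() call is just as critical. Gradients must be cleared before the first loss.backward() of each effective batch. In practice that means calling optimizer.zero_grad() once before the loop starts and then again right after every optimizer.step(), so that the reset at the end of one cycle serves as the clean start of the next. If you skip it, gradients from the previous effective batch carry over and contaminate the current accumulation cycle.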
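Over many effective batches, the same placement of zero_grad() looks like the sketch below. It assumes a hypothetical random dataset of 256 samples served by a DataLoader, and it uses a modulo check so that the optimizer steps after every accumulation_steps mini-batches; the optimizer_steps counter exists only for illustration.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

model = nn.Linear(10, 2)
optimizer = optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()

accumulation_steps = 8
# Hypothetical data: 256 samples -> 32 mini-batches of 8 -> 4 optimizer steps.
dataset = TensorDataset(torch.randn(256, 10), torch.randn(256, 2))
loader = DataLoader(dataset, batch_size=8, shuffle=True)

optimizer_steps = 0
model.train()
optimizer.zero_grad()  # clean start for the first accumulation cycle
for i, (inputs, targets) in enumerate(loader):
    loss = criterion(model(inputs), targets) / accumulation_steps
    loss.backward()  # adds to whatever is already stored in .grad

    # Step once every accumulation_steps mini-batches.
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()  # reset so the next cycle starts clean
        optimizer_steps += 1
```

If the number of mini-batches is not a multiple of accumulation_steps, this sketch leaves the final partial cycle unapplied; whether to step on it or drop it is a choice the training loop has to make explicitly.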
The model parameters are updated only once every accumulation_steps mini-batches, so a pass over the same data performs N times fewer optimizer steps (where N is accumulation_steps). Whether to change the learning rate depends on what you are comparing against. If the learning rate lr was tuned for the effective batch size B_eff = B * N, keep it unchanged: with the loss divided by N, the accumulated gradient is the same mean gradient a real batch of size B_eff would produce. This is the typical approach. If lr was instead tuned for the small batch size B, the larger effective batch changes the training dynamics, and some research suggests scaling the learning rate in proportion to the effective batch size (here, by N) to keep convergence behavior similar. Treat that as a starting point to validate, not as a rule.
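As a worked example of the proportional scaling mentioned above, here is a small sketch. The helper linearly_scaled_lr is hypothetical, not a PyTorch function, and the numbers reuse the batch sizes from the earlier example; it illustrates the arithmetic only and is not a recommendation to scale.

```python
def linearly_scaled_lr(base_lr: float, base_batch_size: int, effective_batch_size: int) -> float:
    """Scale a learning rate in proportion to the growth in batch size (a heuristic)."""
    return base_lr * effective_batch_size / base_batch_size


base_lr = 0.01            # assumed to have been tuned for a batch size of 8
batch_size_per_step = 8
accumulation_steps = 8
effective_batch_size = batch_size_per_step * accumulation_steps  # 64

scaled_lr = linearly_scaled_lr(base_lr, batch_size_per_step, effective_batch_size)
print(f"{scaled_lr:.2f}")  # eight times the base rate for these values
```

If the base rate was already tuned for a batch size of 64, the ratio is 1 and the learning rate stays as it is.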
One thing many people don’t realize is that loss.backward() never clears the .grad attributes of your model’s parameters; it adds to them. This is the fundamental mechanism that makes gradient accumulation possible. If backward() overwrote the gradients each time, you would have to save and sum them yourself. Because PyTorch accumulates by default, the implementation stays simple: you call zero_grad() only when you want to start accumulating for a new effective batch.
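The additive behavior is easy to observe on a single tensor. This is a minimal sketch with a hand-built tensor w instead of a model; resetting by assigning None to w.grad is one way to clear it, used here for brevity.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

# d/dw of sum(3 * w) is 3 for every element.
(w * 3.0).sum().backward()
after_first = w.grad.clone()   # tensor([3., 3.])

# A second backward() adds to .grad instead of replacing it.
(w * 3.0).sum().backward()
after_second = w.grad.clone()  # tensor([6., 6.])

# Clearing the gradient starts a fresh accumulation.
w.grad = None
(w * 3.0).sum().backward()
after_reset = w.grad.clone()   # tensor([3., 3.])
```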
The next concept you’ll likely encounter is learning rate scheduling with gradient accumulation, and how it interacts with the effective batch size.