PyTorch DeepSpeed: ZeRO Optimization for Large Models (2026)

DeepSpeed’s ZeRO (Zero Redundancy Optimizer) is a memory optimization technique that partitions your model’s training state (optimizer states, gradients, and optionally parameters) across multiple GPUs, allowing you to train models that wouldn’t otherwise fit into a single GPU’s memory.

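Let’s see it in action. Imagine you have a huge transformer model, say, 10 billion parameters. With mixed-precision Adam, the model states alone (fp16 weights and gradients plus fp32 optimizer states) take roughly 16 bytes per parameter, or about 160 GB, far more than an 80 GB GPU can hold under standard data-parallel training. Here’s how ZeRO changes that.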

First, you need to install DeepSpeed:

pip install deepspeed

Then, you’ll need a DeepSpeed configuration file, typically in JSON format. This is where you tell DeepSpeed how to partition your model.

{
  "train_batch_size": 32,
  "train_micro_batch_size_per_gpu": 8,
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
      "device": "cpu"
    }
  },
  "fp16": {
    "enabled": true
  },
  "gradient_accumulation_steps": 4
}

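In this config:

"train_batch_size" is the global batch size. It must equal train_micro_batch_size_per_gpu × gradient_accumulation_steps × the number of GPUs, so the values here (8 × 4 = 32) only add up on a single GPU. Scale it with your GPU count, or leave out one of the three settings and let DeepSpeed derive it.
"train_micro_batch_size_per_gpu" is the batch size processed by each GPU in one forward/backward pass.
"zero_optimization" is the core.
- "stage": 2 means we’re partitioning optimizer states and gradients. Stage 3 partitions parameters too.
- "offload_optimizer" moves optimizer states out of GPU VRAM. Here they go to CPU RAM ("device": "cpu"). NVMe offload needs "device": "nvme" together with an "nvme_path", and DeepSpeed only supports it with stage 3.
"fp16": {"enabled": true} enables mixed-precision training, which further reduces memory.
"gradient_accumulation_steps" is the number of micro-batches whose gradients are accumulated before each optimizer step, which lets you reach a large global batch size without raising per-GPU memory use.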

Your PyTorch training script will need a few modifications. You’ll typically define your model and optimizer as usual and then pass them to deepspeed.initialize, which wraps them for you.

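import torch
import deepspeed
from deepspeed.ops.adam import DeepSpeedCPUAdam

# Assume the model is defined as usual
model = YourLargeModel()
# With optimizer offload to CPU, DeepSpeed expects its own CPU Adam implementation
# (AdamW-style weight decay by default) instead of a plain torch.optim optimizer
optimizer = DeepSpeedCPUAdam(model.parameters(), lr=1e-5)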

# DeepSpeed config as a Python dict (same settings as the JSON file)
config_params = {
    "train_batch_size": 32,
    "train_micro_batch_size_per_gpu": 8,
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu"
        }
    },
    "fp16": {
        "enabled": True
    },
    "gradient_accumulation_steps": 4
}

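# Initialize DeepSpeed; returns (engine, optimizer, dataloader, lr_scheduler)
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    optimizer=optimizer,
    config=config_params  # "config" accepts a dict or a path to the JSON file
)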

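# Training loop (dataloader and criterion are assumed to be defined elsewhere)
for data, labels in dataloader:
    # Move the batch to this rank's GPU
    data = data.to(model_engine.device)
    labels = labels.to(model_engine.device)
    outputs = model_engine(data)
    loss = criterion(outputs, labels)
    model_engine.backward(loss)  # replaces loss.backward()
    model_engine.step()          # replaces optimizer.step() and zeroes the gradients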

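The deepspeed.initialize function returns a tuple of (engine, optimizer, dataloader, lr_scheduler). The model_engine wraps your original model and drives the optimizer, so you call model_engine.backward(loss) and model_engine.step() instead of loss.backward() and optimizer.step(). Launch the script with the deepspeed launcher (for example, deepspeed train.py, where train.py is a placeholder name) so that one process is started per GPU.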

ZeRO works by partitioning the optimizer states (like momentum and variance in Adam) and gradients across the available GPUs. In ZeRO-2, each GPU stores only the fraction of the optimizer states and gradients that corresponds to its assigned parameters, while still holding a full copy of the fp16 parameters. During the backward pass, gradients are reduced and scattered so that each rank keeps only its own slice. During the optimizer step, each GPU updates only its own partition of the parameters, and the updated partitions are then all-gathered so every rank starts the next forward pass with the full, current weights. This dramatically reduces the memory footprint per GPU.
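
To put numbers on the savings, here is a rough sketch of the per-GPU memory taken by model states at each ZeRO stage. It assumes mixed-precision Adam with the usual accounting of 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for fp32 optimizer states (master weights, momentum, variance). It ignores activations, temporary buffers, and fragmentation, and the function name and the choice of 8 GPUs are illustrative assumptions.

```python
def model_state_bytes_per_gpu(num_params: float, num_gpus: int, stage: int) -> float:
    """Approximate bytes of model state held by each GPU for a given ZeRO stage."""
    params = 2.0 * num_params   # fp16 parameters
    grads = 2.0 * num_params    # fp16 gradients
    optim = 12.0 * num_params   # fp32 master weights + Adam momentum + variance

    if stage >= 1:              # stage 1 partitions optimizer states
        optim /= num_gpus
    if stage >= 2:              # stage 2 also partitions gradients
        grads /= num_gpus
    if stage >= 3:              # stage 3 also partitions parameters
        params /= num_gpus
    return params + grads + optim


if __name__ == "__main__":
    num_params = 10e9  # the 10-billion-parameter example
    num_gpus = 8       # assumed cluster size, for illustration only
    for stage in range(4):
        gb = model_state_bytes_per_gpu(num_params, num_gpus, stage) / 1e9
        print(f"stage {stage}: {gb:.1f} GB per GPU")
```

Under these assumptions, a 10-billion-parameter model on 8 GPUs needs about 160 GB per GPU without ZeRO, 55 GB with stage 1, 37.5 GB with stage 2, and 20 GB with stage 3.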

The most surprising thing about ZeRO-2 is that partitioning the optimizer state and gradients costs no extra communication compared with standard data parallelism. Standard data parallelism all-reduces the gradients at every step. ZeRO-2 splits that all-reduce into its two halves: a reduce-scatter of gradients during the backward pass and an all-gather of updated parameters after the optimizer step. This means that while each GPU never holds the entire optimizer state or the full set of reduced gradients, it does receive the fully reduced gradient for the parameters it’s responsible for updating.
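
The following toy simulation illustrates why the partitioned update gives the same result as ordinary data parallelism. It is a single-process sketch, not DeepSpeed code: ranks are simulated with plain Python lists, plain SGD stands in for Adam to keep it short, and the function names are made up for this example.

```python
def all_reduce_sgd(params, grads_per_rank, lr):
    """Baseline data parallelism: average all gradients, update every parameter everywhere."""
    world = len(grads_per_rank)
    avg = [sum(g[i] for g in grads_per_rank) / world for i in range(len(params))]
    return [p - lr * g for p, g in zip(params, avg)]


def zero2_style_sgd(params, grads_per_rank, lr):
    """Partitioned update: reduce-scatter gradients, update own slice, all-gather parameters."""
    world = len(grads_per_rank)
    shard = len(params) // world  # assumes the parameter count divides evenly
    gathered = []
    for rank in range(world):
        lo, hi = rank * shard, (rank + 1) * shard
        # Reduce-scatter: this rank receives the averaged gradient for its slice only
        grad_slice = [sum(g[i] for g in grads_per_rank) / world for i in range(lo, hi)]
        # Local optimizer step on the slice this rank owns
        updated = [params[i] - lr * grad_slice[i - lo] for i in range(lo, hi)]
        # All-gather: every rank ends up with all updated slices
        gathered.extend(updated)
    return gathered


params = [1.0, 2.0, 3.0, 4.0]
grads_per_rank = [
    [0.1, 0.2, 0.3, 0.4],  # gradients computed on rank 0's micro-batch
    [0.3, 0.2, 0.1, 0.0],  # gradients computed on rank 1's micro-batch
]

baseline = all_reduce_sgd(params, grads_per_rank, lr=0.5)
partitioned = zero2_style_sgd(params, grads_per_rank, lr=0.5)
print(baseline == partitioned)
```

Both functions produce identical parameters. The difference is that in the partitioned version each simulated rank only ever needs the reduced gradient, and in a real optimizer the state, for its own slice.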

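When you use zero_optimization.offload_optimizer, each rank’s partition of the optimizer state is moved off the GPU to CPU RAM (or, with ZeRO-3, to NVMe). This is particularly beneficial because with mixed-precision Adam the optimizer states (fp32 master weights, momentum, and variance, about 12 bytes per parameter) are much larger than the fp16 parameters themselves (2 bytes per parameter). The CPU then performs the update for the parameters that belong to that partition and copies the updated values back to the GPU.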
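
As a sketch of how the CPU and NVMe variants of the offload section differ, the hypothetical helper below builds a "zero_optimization" dict for either device. The helper itself is not part of DeepSpeed. The keys ("stage", "offload_optimizer", "device", "nvme_path") are the ones used in the config earlier, and the stage-3 check reflects the restriction that NVMe offload is only available with ZeRO-3, as far as I understand the DeepSpeed documentation.

```python
from typing import Optional


def zero_section(stage: int, device: str, nvme_path: Optional[str] = None) -> dict:
    """Build a "zero_optimization" config section with optimizer offload (illustrative helper)."""
    if device not in ("cpu", "nvme"):
        raise ValueError(f"unsupported offload device: {device!r}")

    offload = {"device": device}
    if device == "nvme":
        # Assumption stated above: NVMe offload is restricted to stage 3
        if stage != 3:
            raise ValueError("NVMe offload is only available with ZeRO stage 3")
        if nvme_path is None:
            raise ValueError("nvme_path is required when device is 'nvme'")
        offload["nvme_path"] = nvme_path

    return {"stage": stage, "offload_optimizer": offload}


# Stage 2 with CPU offload, as in this article's config
cpu_cfg = zero_section(stage=2, device="cpu")
# Stage 3 with NVMe offload, reusing the path from the original example
nvme_cfg = zero_section(stage=3, device="nvme", nvme_path="/local_nvme/deepspeed")

print(cpu_cfg)
print(nvme_cfg)
```

Either dict can be dropped into the "zero_optimization" key of the full config. The rest of the config (batch sizes, fp16) stays the same.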

The next hurdle you’ll likely encounter is understanding the nuances of ZeRO-3, which partitions the parameters as well, and tuning train_micro_batch_size_per_gpu and gradient_accumulation_steps for optimal throughput.