PyTorch’s DataParallel (DP) and DistributedDataParallel (DDP) both aim to speed up training by utilizing multiple GPUs, but they go about it in fundamentally different ways, and one is almost always the better choice.

Let’s see DataParallel in action first. Imagine you have a simple model and some dummy data.

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

class DummyDataset(Dataset):
    def __init__(self, num_samples=1000, input_size=100):
        self.num_samples = num_samples
        self.input_size = input_size
        self.data = torch.randn(num_samples, input_size)
        self.targets = torch.randint(0, 10, (num_samples,))

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        return self.data[idx], self.targets[idx]

class SimpleModel(nn.Module):
    def __init__(self, input_size, num_classes):
        super().__init__()
        self.fc = nn.Linear(input_size, num_classes)

    def forward(self, x):
        return self.fc(x)

# Setup
input_size = 100
num_classes = 10
batch_size = 32
num_gpus = 2

# Create model and dataset
model = SimpleModel(input_size, num_classes)
dataset = DummyDataset(input_size=input_size)
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

# --- DataParallel Example ---
if torch.cuda.is_available() and num_gpus > 1:
    device = torch.device("cuda:0")
    model.to(device)
    # Wrap model with DataParallel
    model = nn.DataParallel(model)

    # Dummy training step
    for batch_idx, (data, target) in enumerate(dataloader):
        data, target = data.to(device), target.to(device)
        optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
        criterion = nn.CrossEntropyLoss()

        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()

        print(f"DP Batch {batch_idx}, Loss: {loss.item():.4f}")
        if batch_idx == 0: break # Just show one step
else:
    print("CUDA not available or not enough GPUs for DP example.")

DataParallel is deceptively simple: you wrap your model, specify the GPUs it should use (defaults to all available), and PyTorch handles the rest. It splits the input batch across your GPUs, runs the forward pass on each, gathers the outputs on the first GPU, calculates the loss there, and then performs the backward pass, scattering gradients back to the respective GPUs for parameter updates.

The problem is that all the computation for loss calculation and gradient reduction happens on GPU 0. This creates a significant bottleneck. GPU 0 does all the work: it receives the data, performs the forward pass, aggregates outputs, computes the loss, and then aggregates gradients. The other GPUs (GPU 1, GPU 2, etc.) only perform a portion of the forward pass and backward pass. This leads to uneven GPU utilization, where GPU 0 is often overloaded while others are underutilized, and the overall training speedup is much less than ideal, often hovering around 1.5x for 2 GPUs instead of a theoretical 2x. You’ll see GPU 0’s utilization maxed out while others are lower.

DistributedDataParallel (DDP) takes a completely different, and much more efficient, approach. Instead of one process controlling multiple GPUs, DDP launches a separate Python process for each GPU. Each process has a complete copy of your model, and they communicate with each other to synchronize gradients.

Here’s what a basic DDP setup looks like (note: this requires launching multiple processes, typically via torch.distributed.launch or torchrun):

import torch
import torch.nn as nn
import torch.optim as optim
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import Dataset, DataLoader
from torch.utils.data.distributed import DistributedSampler

# Assume DummyDataset and SimpleModel are defined as above

def setup(rank, world_size):
    # Initialize the process group
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

def cleanup():
    dist.destroy_process_group()

def train_process(rank, world_size, num_epochs=1):
    setup(rank, world_size)

    input_size = 100
    num_classes = 10
    batch_size_per_gpu = 32 # This is the effective batch size per GPU

    # Create model and move to GPU
    model = SimpleModel(input_size, num_classes).to(rank)
    # Wrap model with DDP
    model = DDP(model, device_ids=[rank])

    # Create dataset and sampler
    dataset = DummyDataset(num_samples=1000, input_size=input_size)
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank, shuffle=True)
    dataloader = DataLoader(dataset, batch_size=batch_size_per_gpu, sampler=sampler)

    optimizer = optim.SGD(model.parameters(), lr=0.01)
    criterion = nn.CrossEntropyLoss()

    for epoch in range(num_epochs):
        sampler.set_epoch(epoch) # Important for shuffling each epoch
        for batch_idx, (data, target) in enumerate(dataloader):
            data, target = data.to(rank), target.to(rank)

            optimizer.zero_grad()
            output = model(data)
            loss = criterion(output, target)
            loss.backward() # DDP handles gradient all-reduce automatically
            optimizer.step() # DDP ensures parameters are synchronized

            if rank == 0 and batch_idx % 10 == 0: # Print only from rank 0 to avoid clutter
                print(f"Rank {rank}, Epoch {epoch}, Batch {batch_idx}, Loss: {loss.item():.4f}")

    cleanup()

if __name__ == "__main__":
    world_size = 2 # Number of GPUs
    if torch.cuda.is_available() and world_size <= torch.cuda.device_count():
        mp.spawn(train_process,
                 args=(world_size,),
                 nprocs=world_size,
                 join=True)
    else:
        print("CUDA not available or not enough GPUs for DDP example.")

In DDP, each process is responsible for a slice of the data. When loss.backward() is called, PyTorch automatically performs an "all-reduce" operation. This means each GPU computes its local gradients, and then all GPUs communicate to average these gradients. The optimizer.step() then uses these averaged gradients to update the model parameters, ensuring all model replicas stay synchronized. The key is that the gradient reduction happens in parallel across all GPUs, rather than being funneled through a single GPU. This leads to much better GPU utilization and scaling, typically approaching the theoretical maximum speedup.

The most surprising thing about DataParallel is how it handles the backward pass. It doesn’t truly scatter the backward computation; instead, it gathers the gradients from all GPUs onto the primary GPU (GPU 0) and then performs the backward pass and gradient aggregation there. This is a major performance bottleneck.

The DistributedSampler is crucial for DDP. It ensures that each process receives a distinct subset of the dataset for each epoch, preventing data duplication and ensuring that each GPU is working on unique samples. Without it, all processes would iterate over the same data, and the effective batch size would be the per-GPU batch size, not the global batch size.

The torch.distributed.launch utility (or the newer torchrun) is your gateway to running DDP. You’ll typically launch your script like this: torchrun --nproc_per_node=2 your_script.py. This command starts two processes, each assigned to a GPU, and sets up the necessary environment variables for dist.init_process_group.

The most common pitfall with DDP is forgetting to call sampler.set_epoch(epoch) at the beginning of each epoch. If you don’t do this, the DistributedSampler will produce the same data split for every epoch, meaning your model sees the same shuffled order repeatedly, which is detrimental to training.

The next challenge you’ll face is efficiently handling large datasets that don’t fit into memory, which often involves custom Dataset implementations and efficient data loading strategies.

Want structured learning?

Take the full Pytorch course →