PyTorch is not releasing GPU memory, which leads to gradual or sudden CUDA out-of-memory errors: the process is holding on to tensors and intermediate computation results that are no longer needed.

1. Unreleased Tensors in Python Scope:

  • Diagnosis: Run torch.cuda.memory_allocated() and torch.cuda.max_memory_allocated() before and after suspected operations. If memory_allocated() doesn’t decrease after the operations complete, something still references the tensors. Call gc.collect() and check memory_allocated() again: if it drops, the tensors were kept alive only by reference cycles that Python’s cycle collector had not processed yet. Note that torch.cuda.empty_cache() lowers torch.cuda.memory_reserved() (and the figure nvidia-smi shows), not memory_allocated().
  • Cause: A tensor’s memory is freed only when its last reference goes away. Tensors stay alive while they are still bound to variables that are in scope but no longer used (including loop variables and entries in lists or dicts), or while they are part of reference cycles, which are freed only when the cycle collector runs.
  • Fix: Drop the references explicitly and run the collector:
    import gc
    import torch

    del my_tensor             # or my_tensor = None; either one removes this reference
    gc.collect()              # frees tensors kept alive only by reference cycles
    torch.cuda.empty_cache()  # returns unused cached blocks to the CUDA driver
    
    This removes the reference immediately instead of waiting for the variable to go out of scope.
  • Why it works: del (or rebinding the name to None) removes one reference to the tensor. Once no references remain, the tensor is destroyed and its memory goes back to PyTorch’s caching allocator, which keeps the block for reuse; gc.collect() covers the case where only a reference cycle kept the tensor alive. torch.cuda.empty_cache() then releases the unused cached blocks to the CUDA driver so that other processes can use them. It does not increase the memory available to PyTorch itself, since the allocator would have reused those blocks anyway.
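
The reference-cycle case can be reproduced in a few lines. The following is a minimal sketch rather than a diagnostic tool: Holder and allocated_mib are hypothetical names made up for this example, and the script falls back to the CPU when no GPU is present (the memory figures are then reported as 0.0).

```python
import gc
import weakref

import torch


class Holder:
    """Hypothetical container that accidentally references itself."""

    def __init__(self, tensor):
        self.tensor = tensor
        self.me = self  # reference cycle: the refcount never reaches zero


def allocated_mib():
    """Memory held by live CUDA tensors in MiB (0.0 when CUDA is unavailable)."""
    if not torch.cuda.is_available():
        return 0.0
    return torch.cuda.memory_allocated() / 2**20


device = "cuda" if torch.cuda.is_available() else "cpu"

gc.disable()  # keep the automatic collector out of the way for this demo

holder = Holder(torch.zeros(1024, 1024, device=device))  # about 4 MiB of float32
holder_ref = weakref.ref(holder)

del holder  # drops the name, but the cycle keeps the object (and tensor) alive
alive_after_del = holder_ref() is not None
mem_after_del = allocated_mib()

gc.collect()  # the cycle collector finds the cycle and frees it
alive_after_collect = holder_ref() is not None
mem_after_collect = allocated_mib()

gc.enable()

if torch.cuda.is_available():
    torch.cuda.empty_cache()  # hand unused cached blocks back to the driver

print("alive after del:", alive_after_del)
print("alive after gc.collect():", alive_after_collect)
print(f"allocated: {mem_after_del:.1f} MiB -> {mem_after_collect:.1f} MiB")
```

On a GPU, the allocated figure should drop by roughly the size of the tensor after gc.collect(), not after del.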

2. Inadvertent retain_graph=True in backward():

  • Diagnosis: If you’re calling loss.backward(retain_graph=True) unnecessarily, especially within a loop or multiple times for the same computation graph, you’re keeping the graph alive. Check the memory usage with nvidia-smi or torch.cuda.memory_allocated(). If it grows with each backward pass without a corresponding forward pass to reset it, this is a prime suspect.
  • Cause: retain_graph=True prevents the computation graph from being freed after backward(). This is useful for certain advanced use cases like training with reinforcement learning or meta-learning, but it causes memory to accumulate if used indiscriminately.
  • Fix: Remove retain_graph=True unless you have a specific, documented reason for it. If several losses share one graph, either sum them and call backward() once, or pass retain_graph=True on every call except the last so that the final call frees the graph.
    # Wasteful:
    # loss.backward(retain_graph=True)
    # loss.backward(retain_graph=True) # Graph buffers stay alive for as long as loss is referenced
    
    # Correct for typical training:
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    
  • Why it works: By default, backward() frees the computation graph immediately after computing gradients. Removing retain_graph=True allows this natural cleanup, releasing the memory associated with the graph’s intermediate activations.
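
When two losses really do share part of a graph, both approaches from the fix give the same gradients. This is a small sketch on toy tensors; w, x, and two_losses are illustrative names, not part of any real training script.

```python
import torch

torch.manual_seed(0)
w = torch.randn(3, requires_grad=True)
x = torch.randn(3)


def two_losses():
    """Build two losses that share the intermediate tensor h."""
    h = (w * x).tanh()
    return h.sum(), (h ** 2).sum()


# Option A: combine the losses and run a single backward pass.
loss_a, loss_b = two_losses()
(loss_a + loss_b).backward()  # the graph is freed right after this call
grad_combined = w.grad.clone()

w.grad = None  # start from clean gradients for the comparison

# Option B: separate backward passes; only the first one retains the graph.
loss_a, loss_b = two_losses()
loss_a.backward(retain_graph=True)  # keep the shared buffers for the next call
loss_b.backward()                   # last call: the graph is freed here
grad_sequential = w.grad.clone()

print(torch.allclose(grad_combined, grad_sequential))
```

Option A is usually the simpler choice, because no graph has to outlive a backward() call at all.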

3. Accumulating Gradients (No optimizer.zero_grad()):

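  • Diagnosis: Gradient magnitudes grow from step to step and are much larger than expected, and training becomes unstable or diverges. Memory is usually not the main symptom here: the .grad buffers are allocated once and then updated in place, so torch.cuda.memory_allocated() stays roughly flat.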
  • Cause: The optimizer.zero_grad() call is missing before each loss.backward() call. PyTorch accumulates gradients by default.
  • Fix: Ensure optimizer.zero_grad() is called at the beginning of each training iteration, before loss.backward():
    for epoch in range(num_epochs):
        for i, data in enumerate(trainloader):
            inputs, labels = data
            inputs, labels = inputs.to(device), labels.to(device)
    
            optimizer.zero_grad() # <--- This is crucial
    
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
    
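  • Why it works: optimizer.zero_grad() resets the .grad attribute of every parameter the optimizer manages (to None by default in recent PyTorch versions, to zeros in older ones). Without it, each backward() adds the new gradients into the existing .grad buffers, so the update uses the sum of all past gradients. The buffers keep the same size, so the damage is mainly incorrect training updates rather than extra memory.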
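
The accumulation behaviour is easy to observe on a toy model. This sketch assumes a throwaway nn.Linear layer and a constant input; it shows the gradient values doubling when zero_grad() is skipped, while the gradient buffer keeps the same shape.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.ones(2, 4)

# First backward pass: .grad is created.
model(x).sum().backward()
first = model.weight.grad.clone()

# Second backward pass without zero_grad(): new gradients are added in.
model(x).sum().backward()
accumulated = model.weight.grad.clone()

# With zero_grad() in between, the gradient matches the first pass again.
optimizer.zero_grad()
model(x).sum().backward()
fresh = model.weight.grad.clone()

print(first)        # gradient of one pass
print(accumulated)  # twice the values, same shape
print(fresh)        # back to the values of one pass
```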

4. Large Batch Sizes or Model Parameters:

  • Diagnosis: Consistent high memory usage, even with correct zero_grad() and no retain_graph=True. torch.cuda.memory_allocated() is high from the start.
  • Cause: The sheer size of the model’s parameters and the activations generated by a large batch size exceed available GPU memory.
  • Fix: Reduce batch_size or use gradient accumulation.
    • Reduce Batch Size:
      # Example: Change from batch_size=64 to batch_size=32
      train_dataset = ...
      trainloader = torch.utils.data.DataLoader(train_dataset, batch_size=32, shuffle=True)
      
    • Gradient Accumulation: Process smaller batches and accumulate gradients over several steps before calling optimizer.step().
      accumulation_steps = 4
      optimizer.zero_grad()
      for i, data in enumerate(trainloader):
          # ... forward pass that computes loss ...
          (loss / accumulation_steps).backward() # Scale so the accumulated gradients average
          if (i + 1) % accumulation_steps == 0:
              optimizer.step() # Update weights
              optimizer.zero_grad() # Reset gradients
      
  • Why it works: Smaller batches produce smaller activation tensors during the forward pass. Gradient accumulation allows you to effectively use a larger batch size for gradient calculation without the memory overhead of a single large batch, by averaging gradients over several smaller batches before an update.
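
The equivalence between one large batch and several scaled micro-batches can be checked numerically. The sketch below assumes a toy linear model, a mean-reduced MSE loss, and micro-batches of equal size; with unequal micro-batches or sum-reduced losses the scaling would need to change.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(8, 1)
criterion = torch.nn.MSELoss()  # mean reduction by default
inputs = torch.randn(16, 8)
targets = torch.randn(16, 1)

# Reference: gradient from a single batch of 16 samples.
model.zero_grad()
criterion(model(inputs), targets).backward()
full_grad = model.weight.grad.clone()

# Same gradient from 4 micro-batches of 4 samples each.
accumulation_steps = 4
model.zero_grad()
micro_batches = zip(inputs.chunk(accumulation_steps), targets.chunk(accumulation_steps))
for x, y in micro_batches:
    loss = criterion(model(x), y) / accumulation_steps  # scale to get an average
    loss.backward()                                     # adds into .grad
accum_grad = model.weight.grad.clone()

print(torch.allclose(full_grad, accum_grad, atol=1e-6))
```

Only one micro-batch of activations is alive at a time, which is where the memory saving comes from.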

5. Unnecessary Tensor Copies or Intermediate Variables:

  • Diagnosis: Memory grows unexpectedly during complex operations or within custom layers. torch.cuda.memory_summary() can sometimes reveal large, persistent allocations.
  • Cause: Creating multiple copies of tensors, or not reusing memory for intermediate results when possible. For example, x = x + y always allocates a new tensor for the result and rebinds the name x to it; the old tensor is freed only if nothing else references it (another variable, or the autograd graph).
  • Fix: Use in-place operations where appropriate (e.g., x.add_(y)) and be mindful of variable assignments. If you need to preserve the original tensor, make an explicit .clone() before modification. Be careful with in-place operations on tensors that autograd needs for the backward pass: they can make backward() fail with a RuntimeError, so they are safest in inference code or under torch.no_grad().
    # Extra allocation:
    # temp_tensor = model(input_tensor)
    # result = temp_tensor * 2 # Allocates a second tensor of the same size
    
    # Better, when temp_tensor is not needed afterwards:
    # temp_tensor = model(input_tensor)
    # temp_tensor.mul_(2) # In-place operation, reuses memory
    # result = temp_tensor # result and temp_tensor are now the same tensor
    
  • Why it works: In-place operations modify the tensor directly, often reusing its existing memory buffer. Explicitly cloning ensures that a new, independent copy is made, preventing unintended modifications to the original.
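
Whether an operation reuses a buffer can be checked with Tensor.data_ptr(), which returns the address of the tensor’s first element. A minimal sketch on tensors that do not require gradients:

```python
import torch

x = torch.ones(1024)
y = torch.ones(1024)
original_ptr = x.data_ptr()

# In-place: the result is written into x's existing buffer.
x.add_(y)
reused_buffer = x.data_ptr() == original_ptr

# Out-of-place: a new buffer is allocated for the result.
z = x + y
new_buffer = z.data_ptr() != original_ptr

# clone() gives an independent copy, so later in-place edits do not leak into it.
backup = x.clone()
x.mul_(2)
backup_unchanged = torch.equal(backup, torch.full((1024,), 2.0))

print(reused_buffer, new_buffer, backup_unchanged)
```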

6. Data Loading Issues (e.g., pin_memory=True with many workers):

  • Diagnosis: High CPU (host) RAM usage that scales with num_workers, while torch.cuda.memory_allocated() stays flat. When comparing tools, expect nvidia-smi to report more GPU memory than torch.cuda.memory_allocated(), because it also counts PyTorch’s cached blocks and the CUDA context.
  • Cause: With pin_memory=True, the DataLoader copies each batch into pinned (page-locked) host memory, which speeds up CPU-to-GPU transfers but cannot be swapped out. With many workers, several prefetched batches per worker are in flight at once, so host RAM usage can become significant. This is host memory pressure rather than a GPU leak, although it is easily mistaken for one when the process slows down or is killed.
  • Fix: Reduce num_workers in your DataLoader or set pin_memory=False if it’s not critical for performance. Ensure tensors are moved to the GPU and then released promptly.
    trainloader = torch.utils.data.DataLoader(train_dataset, batch_size=64, shuffle=True, num_workers=2, pin_memory=False)
    
  • Why it works: Reducing num_workers decreases the number of parallel data loading processes, and pin_memory=False avoids allocating page-locked host memory, both of which reduce the chances of pinned memory becoming a bottleneck or a source of leaks.
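
One way to apply this is to start from the most conservative loader settings and raise them only if data loading turns out to be the bottleneck. The sketch below uses a stand-in TensorDataset with made-up shapes and enables pinning only when a GPU is actually present.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

# Stand-in dataset (hypothetical): 256 samples with 8 features each.
train_dataset = TensorDataset(torch.randn(256, 8), torch.randint(0, 2, (256,)))

trainloader = DataLoader(
    train_dataset,
    batch_size=64,
    shuffle=True,
    num_workers=0,        # load in the main process; raise only if loading is slow
    pin_memory=use_cuda,  # pinned host memory only helps when copying to a GPU
)

batches_seen = 0
for inputs, labels in trainloader:
    # non_blocking=True can overlap the copy with compute when the source is pinned
    inputs = inputs.to(device, non_blocking=True)
    labels = labels.to(device, non_blocking=True)
    batches_seen += 1

print(batches_seen, tuple(inputs.shape))
```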

If you fix all of these, the next error you’ll likely hit is a plain CUDA out-of-memory error at load time, because your model is simply too large to fit in the GPU’s VRAM.
