The core issue is that your PyTorch model’s gradients are growing so large that training becomes numerically unstable. Once gradients, or the parameter updates computed from them, exceed the representable range of the floating-point format, they overflow to inf (infinity). Further arithmetic on inf produces NaN (Not a Number), which then spreads through the loss and the weights and breaks the training process.
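
As a minimal illustration of that overflow behavior, the sketch below pushes a float32 tensor past its maximum value and then performs an undefined operation on the result. The variable names are made up for the example; only standard torch calls are used.

```python
import torch

# Largest finite value float32 can represent (roughly 3.4e38)
max_f32 = torch.finfo(torch.float32).max

big = torch.tensor(3.0e38, dtype=torch.float32)
overflowed = big * 10                 # exceeds the float32 range, becomes inf
undefined = overflowed - overflowed   # inf - inf is undefined, becomes nan

print(max_f32, overflowed.item(), undefined.item())
```

Once a single inf or nan reaches the loss or a weight, every later computation that touches it is contaminated too, which is why the symptoms tend to appear suddenly.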

Here are the common culprits and how to fix them:

  1. Exploding Gradients Due to Deep Networks or Recurrent Connections:

    • Diagnosis: Observe NaN or inf values in your loss function or model parameters during training. This is the most direct symptom.
    • Fix: Implement gradient clipping.
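      import torch.nn.utils as utils
      
      # Assuming 'model' is your PyTorch model, 'optimizer' is your optimizer,
      # and 'loss' is the loss tensor from the current forward pass
      optimizer.zero_grad()
      loss.backward()
      
      # Clip gradients to a maximum total norm of 1.0 (after backward, before step)
      utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
      
      # Then perform the optimizer step
      optimizer.step()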
      
    • Why it works: clip_grad_norm_ computes the L2 norm of all gradients taken together, as if they were concatenated into a single vector. If that total norm exceeds max_norm, every gradient is rescaled by the same factor so that the total norm equals max_norm. The direction of the update is preserved while its size is bounded, which stabilizes the update step. The max_norm value is a hyperparameter; common starting points are 1.0, 5.0, or 10.0. You’ll need to tune this based on your specific model and data.
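
To make the clipping behavior concrete, here is a small sketch of one training step that compares the total gradient norm before and after clipping. The toy model, the random data, and the deliberately inflated inputs are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy model and data, only here to make the sketch runnable
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()
inputs = torch.randn(16, 10) * 100.0   # deliberately large inputs
targets = torch.randn(16, 1)

max_norm = 1.0

optimizer.zero_grad()
loss = loss_fn(model(inputs), targets)
loss.backward()

# clip_grad_norm_ returns the total norm measured before clipping
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm)

# Recompute the total norm from the gradients as they are now
grads = [p.grad.flatten() for p in model.parameters() if p.grad is not None]
norm_after = torch.cat(grads).norm(2)

optimizer.step()

print(f"before: {norm_before.item():.3f}, after: {norm_after.item():.3f}")
```

Logging the returned norm over time is also a cheap way to see how often clipping is active and whether max_norm is set sensibly.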
  2. Large Learning Rate:

    • Diagnosis: Similar to exploding gradients, but often the loss will jump erratically rather than consistently going to NaN. You might also see very large parameter updates.
    • Fix: Reduce the learning rate. For example, if you’re using lr=0.001, try lr=0.0001.
      # Example with Adam optimizer
      import torch
      optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)
      
    • Why it works: A high learning rate can cause the optimizer to overshoot the minimum of the loss function, especially in regions with steep gradients. Reducing it makes the steps smaller and less likely to diverge.
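
The overshooting effect can be sketched on a one-dimensional toy problem. Minimizing w² with plain SGD multiplies w by (1 - 2·lr) at every step, so the iterates shrink when that factor has magnitude below 1 and grow when it is above 1. The helper run_sgd and the two learning rates are illustrative choices, not values from a real model.

```python
import torch

def run_sgd(lr, steps=20):
    """Minimize f(w) = w^2 with plain SGD and return |w| after `steps` updates."""
    w = torch.tensor([1.0], requires_grad=True)
    opt = torch.optim.SGD([w], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = (w ** 2).sum()
        loss.backward()
        opt.step()
    return abs(w.item())

small_lr_result = run_sgd(lr=0.1)   # per-step factor |1 - 0.2| = 0.8, so w shrinks
large_lr_result = run_sgd(lr=1.1)   # per-step factor |1 - 2.2| = 1.2, so w grows

print(small_lr_result, large_lr_result)
```

Real loss surfaces are not quadratic, but the same mechanism applies locally: the steeper the curvature, the smaller the learning rate at which steps start to diverge.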
  3. Poor Initialization of Weights:

    • Diagnosis: The model might start diverging very early in training, even with a small learning rate and gradient clipping. The initial gradients might be disproportionately large.
    • Fix: Use a better weight initialization scheme. For most modern networks, Kaiming (He) initialization for ReLU activations or Xavier (Glorot) initialization for Tanh/Sigmoid are good choices.
      # Example for Kaiming initialization (common for ReLU)
      import torch.nn as nn
      for module in model.modules():
          # Linear and Conv2d layers get the same treatment
          if isinstance(module, (nn.Linear, nn.Conv2d)):
              nn.init.kaiming_normal_(module.weight, nonlinearity='relu')
      
    • Why it works: Proper initialization ensures that the variance of activations and gradients is maintained across layers, preventing them from vanishing or exploding in the initial stages of training.
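
The effect of initialization on activation scale can be sketched by pushing random inputs through a deep stack of Linear + ReLU layers and comparing a naive unit-variance initialization with Kaiming initialization. The helper output_std, the depth, and the width are arbitrary choices for this illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def output_std(init_fn, depth=20, width=256):
    """Build a deep Linear+ReLU stack, initialize it with init_fn, return output std."""
    layers = []
    for _ in range(depth):
        linear = nn.Linear(width, width, bias=False)
        init_fn(linear.weight)
        layers += [linear, nn.ReLU()]
    net = nn.Sequential(*layers)
    with torch.no_grad():
        return net(torch.randn(64, width)).std().item()

# Naive choice: weights drawn from N(0, 1), which amplifies activations layer by layer
naive_std = output_std(lambda w: nn.init.normal_(w, mean=0.0, std=1.0))

# Kaiming initialization scales the weights by the fan-in to keep the scale stable
kaiming_std = output_std(lambda w: nn.init.kaiming_normal_(w, nonlinearity='relu'))

print(f"naive: {naive_std:.3e}, kaiming: {kaiming_std:.3e}")
```

Activations that are this large in the forward pass translate into correspondingly large gradients in the backward pass, which is why divergence can show up in the very first steps.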
  4. Unstable Activation Functions (e.g., Sigmoid/Tanh in deep networks):

    • Diagnosis: The gradients might become very small (vanishing) or very large (exploding) as they propagate through many layers with these activations, particularly if the inputs push them into saturated regions.
    • Fix: Switch to more stable activation functions like ReLU or its variants (Leaky ReLU, GELU).
      # Assuming model.fc is a container (e.g. nn.Sequential) whose Sigmoid was registered as:
      # model.fc.add_module('activation', nn.Sigmoid())
      # Registering ReLU under the same name replaces it:
      model.fc.add_module('activation', nn.ReLU())
      
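    • Why it works: ReLU has a derivative of exactly 1 for positive inputs (and 0 otherwise), so an active unit passes the gradient through unscaled. Sigmoid, whose derivative never exceeds 0.25, and saturated tanh instead multiply the gradient by a small factor at every layer. ReLU also avoids the saturation problem for positive inputs. Its outputs are unbounded, however, so it works best combined with appropriate initialization or normalization.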
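
The difference in local gradients can be checked directly with autograd. This sketch evaluates the derivative of sigmoid at an unsaturated and a saturated input, then does the same for ReLU at two positive inputs; the specific input values are arbitrary.

```python
import torch

# Sigmoid: derivative is 0.25 at x = 0 and nearly zero once the unit saturates
x = torch.tensor([0.0, 10.0], requires_grad=True)
torch.sigmoid(x).sum().backward()
sigmoid_grad = x.grad.clone()

# ReLU: derivative is 1 for any positive input, however large
y = torch.tensor([0.5, 10.0], requires_grad=True)
torch.relu(y).sum().backward()
relu_grad = y.grad.clone()

print("sigmoid grads:", sigmoid_grad.tolist())
print("relu grads:   ", relu_grad.tolist())
```

Stacking many layers multiplies these local factors together, so per-layer factors well below 1 shrink the gradient geometrically with depth.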
  5. High-Magnitude Input Data:

    • Diagnosis: Inspect the range of your input features (min, max, mean, standard deviation). Features with very large values disproportionately influence the gradients of the first layer, which can lead to explosions.
    • Fix: Normalize or standardize your input data.
      # Example for standardization (mean=0, std=1)
      from sklearn.preprocessing import StandardScaler
      scaler = StandardScaler()
      # Fit the statistics on the training split only
      X_train_scaled = scaler.fit_transform(X_train)
      # Reuse the training statistics for the test split
      X_test_scaled = scaler.transform(X_test)
      
    • Why it works: Scaling down input features reduces their impact on the initial forward pass and subsequent gradient calculations, making the training process more stable.
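
If the data already lives in tensors, the same standardization can be sketched without scikit-learn. The synthetic X_train and X_test below are stand-ins with wildly different feature scales. Note that torch's std() defaults to the unbiased estimator, whereas StandardScaler uses the population standard deviation, so the results differ very slightly.

```python
import torch

torch.manual_seed(0)

# Hypothetical data: three features with very different scales and a large offset
scales = torch.tensor([1.0, 100.0, 10000.0])
X_train = torch.randn(1000, 3) * scales + 500.0
X_test = torch.randn(200, 3) * scales + 500.0

# Compute statistics on the training split only
mean = X_train.mean(dim=0, keepdim=True)
std = X_train.std(dim=0, keepdim=True).clamp_min(1e-8)  # guard against constant features

X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std   # reuse the training statistics

print(X_train_scaled.mean(dim=0), X_train_scaled.std(dim=0))
```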
  6. Issues with Specific Layers (e.g., large kernel sizes in CNNs, large embedding dimensions):

    • Diagnosis: If the problem persists after addressing the above, inspect layers with potentially large weight matrices or operations that can amplify values. For example, a large matrix multiplication or an operation that squares values.
    • Fix: Apply gradient clipping specifically to the parameters of problematic layers if you can isolate them, or consider architectural changes. Sometimes, reducing the size of a specific layer (e.g., fewer filters in a CNN, smaller embedding dimension) can help.
      # Example: Clipping gradients only for a specific layer's parameters
      # Assuming 'my_specific_layer' is the layer in question
      utils.clip_grad_norm_(my_specific_layer.parameters(), max_norm=0.5)
      
    • Why it works: This is a targeted application of gradient clipping, focusing on the most likely sources of large gradient magnitudes if the global clipping isn’t sufficient or if you want finer control.
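
To isolate a problematic layer in the first place, one option is to log the gradient norm of every parameter after the backward pass. The sketch below does this for a hypothetical toy model; in practice you would run the same loop on your own model inside the training step.

```python
import math

import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy model and loss, only here to make the sketch runnable
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
loss = model(torch.randn(32, 8)).pow(2).mean()
loss.backward()

# L2 norm of the gradient of each named parameter
grad_norms = {
    name: p.grad.norm(2).item()
    for name, p in model.named_parameters()
    if p.grad is not None
}

# Report the largest norms first and flag anything that is already inf or nan
for name, norm in sorted(grad_norms.items(), key=lambda kv: kv[1], reverse=True):
    flag = "" if math.isfinite(norm) else "  <-- non-finite"
    print(f"{name:12s} {norm:.4e}{flag}")

worst = max(grad_norms, key=grad_norms.get)
```

A parameter whose norm is consistently orders of magnitude above the rest is a reasonable first candidate for targeted clipping or an architectural change.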

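After implementing gradient clipping and ensuring your learning rate and initialization are reasonable, check the order of calls in your training step: optimizer.zero_grad(), the forward pass, loss.backward(), clip_grad_norm_, and only then optimizer.step(). Clipping before loss.backward() acts on stale or missing gradients, and clipping after optimizer.step() has no effect on the update that was just applied. A NameError such as "name 'optimizer' is not defined" is unrelated to this ordering; it means the optimizer was never created in the current scope.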
