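The default weight initialization for nn.Linear in PyTorch, a Kaiming uniform variant, is often too conservative for deep ReLU networks: it draws weights from a narrower range than the ReLU-tuned Kaiming formula, which can lead to slower convergence than you might expect.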

Let’s see Kaiming uniform in action. Imagine we have a simple feed-forward network with a few layers.

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

# Define a simple network
class SimpleNet(nn.Module):
    def __init__(self):
        super(SimpleNet, self).__init__()
        self.fc1 = nn.Linear(784, 128)
        self.relu1 = nn.ReLU()
        self.fc2 = nn.Linear(128, 64)
        self.relu2 = nn.ReLU()
        self.fc3 = nn.Linear(64, 10)

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu1(x)
        x = self.fc2(x)
        x = self.relu2(x)
        x = self.fc3(x)
        return x

# Instantiate the network
net = SimpleNet()

# Print the initial weights of the first layer
print("Initial weights of fc1 (Kaiming uniform):")
print(net.fc1.weight.data)

# Simulate a forward pass and then a backward pass to see gradients
# This doesn't train, just shows the mechanism
dummy_input = torch.randn(1, 784)
output = net(dummy_input)
loss = F.mse_loss(output, torch.zeros_like(output)) # Dummy loss
loss.backward()

print("\nGradients of fc1 after backward pass:")
# Gradients live on the parameter (weight.grad), not on the module itself
print(net.fc1.weight.grad)

This code snippet builds a SimpleNet from PyTorch’s nn.Linear layers, which are initialized with a Kaiming uniform variant by default. We then print the initial weights of fc1 and, after a dummy forward and backward pass, their gradients. Notice how the weights are centered around zero and confined to a bounded range. Kaiming initialization aims to keep the variance of activations and gradients roughly constant across layers, especially for ReLU activations, by scaling the weights according to the layer’s fan-in (or its fan-out, when mode='fan_out' is selected). With the ReLU gain of sqrt(2), the uniform bounds are -sqrt(6 / fan_in) and sqrt(6 / fan_in). The nn.Linear default is narrower: it calls kaiming_uniform_ with a = sqrt(5), which gives bounds of -1 / sqrt(fan_in) and 1 / sqrt(fan_in).

Careful initialization targets the vanishing and exploding gradient problems that plague deep neural networks. If weights are too small, gradients shrink exponentially as they propagate backward (vanishing), so early layers learn very slowly or not at all. If weights are too large, gradients grow exponentially (exploding), leading to unstable training and large, erratic weight updates. Proper initialization is a crucial first step in controlling gradient flow.
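To make the shrinking and growing concrete, here is a minimal sketch that measures the gradient norm at the first layer of a deep ReLU stack when every weight matrix is drawn with a standard deviation that is half, equal to, or double the ReLU-tuned value sqrt(2 / fan_in). The helper first_layer_grad_norm, the scale factors, and the depth and width are illustrative assumptions, and biases are dropped to keep the effect easy to read.

```python
import math
import torch
import torch.nn as nn

def first_layer_grad_norm(scale, depth=20, width=256, seed=0):
    """Hypothetical helper: gradient norm at layer 0 of a deep ReLU stack."""
    torch.manual_seed(seed)
    layers = []
    for _ in range(depth):
        linear = nn.Linear(width, width, bias=False)
        # ReLU-tuned std sqrt(2 / fan_in), multiplied by the illustrative `scale`
        nn.init.normal_(linear.weight, mean=0.0, std=scale * math.sqrt(2.0 / width))
        layers += [linear, nn.ReLU()]
    net = nn.Sequential(*layers)

    x = torch.randn(64, width)
    # A sum keeps the loss linear in the output, so only the weights drive the gradient scale
    net(x).sum().backward()
    return net[0].weight.grad.norm().item()

grad_norms = {scale: first_layer_grad_norm(scale) for scale in (0.5, 1.0, 2.0)}
for scale, norm in grad_norms.items():
    print(f"scale={scale}: first-layer grad norm = {norm:.3e}")
```

Halving or doubling the per-layer scale compounds across all 20 layers, so the three printed norms should differ by many orders of magnitude.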

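Internally, PyTorch’s nn.Linear layer initializes its weight in reset_parameters by calling kaiming_uniform_ from torch.nn.init. That function calculates std = sqrt(2.0 / ((1 + a*a) * fan_in)), where a is the negative slope of the leaky ReLU that follows the layer (0 for a plain ReLU). Then, it samples from a uniform distribution U(-bound, bound) where bound = sqrt(3.0) * std. The fan_in is the number of input features to the layer. Note that nn.Linear passes a = sqrt(5) rather than 0, so its default bound simplifies to 1 / sqrt(fan_in), noticeably smaller than the sqrt(6 / fan_in) you get with a = 0.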
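The formula can be checked against what PyTorch actually samples. The sketch below computes the bound by hand for two values of a and compares it with the largest absolute weight in a freshly created layer and in one re-initialized for plain ReLU. The helper kaiming_uniform_bound is a hypothetical name introduced here; a = sqrt(5) is the value nn.Linear passes in its reset_parameters method.

```python
import math
import torch
import torch.nn as nn

def kaiming_uniform_bound(fan_in, a=0.0):
    """Hypothetical helper: the bound formula from the text, computed by hand."""
    std = math.sqrt(2.0 / ((1 + a * a) * fan_in))
    return math.sqrt(3.0) * std

fan_in = 784
relu_bound = kaiming_uniform_bound(fan_in, a=0.0)              # equals sqrt(6 / fan_in)
default_bound = kaiming_uniform_bound(fan_in, a=math.sqrt(5))  # equals 1 / sqrt(fan_in)

torch.manual_seed(0)
layer = nn.Linear(fan_in, 128)
default_max = layer.weight.abs().max().item()   # largest weight under the default init

# Re-initialize the same layer with the ReLU-tuned variant (a = 0)
nn.init.kaiming_uniform_(layer.weight, a=0.0, nonlinearity='relu')
relu_max = layer.weight.abs().max().item()

print(f"default: bound={default_bound:.4f}, observed max={default_max:.4f}")
print(f"relu:    bound={relu_bound:.4f}, observed max={relu_max:.4f}")
```

In both cases the observed maximum should sit just under the hand-computed bound, and the default range should be clearly narrower than the ReLU-tuned one.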

PyTorch layers do not take initializer arguments in their constructors. The lever you control is calling torch.nn.init functions on a layer’s weight and bias after the layer is created. For example, to use Xavier uniform initialization for fc1:

net.fc1 = nn.Linear(784, 128)
nn.init.xavier_uniform_(net.fc1.weight, gain=nn.init.calculate_gain('relu'))

Here, xavier_uniform_ samples from a uniform distribution between -gain * sqrt(6 / (fan_in + fan_out)) and gain * sqrt(6 / (fan_in + fan_out)). The gain is often set to match the activation function’s properties; for ReLU, nn.init.calculate_gain('relu') is sqrt(2).
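As a quick sanity check of that range, the following sketch recomputes the Xavier bound from the layer’s shape and compares it with the weights that xavier_uniform_ produced. The standalone fc1 layer and the variable names are illustrative; only the torch.nn.init calls come from PyTorch.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
fc1 = nn.Linear(784, 128)

gain = nn.init.calculate_gain('relu')   # sqrt(2) for ReLU
nn.init.xavier_uniform_(fc1.weight, gain=gain)

# nn.Linear stores its weight as (out_features, in_features)
fan_out, fan_in = fc1.weight.shape
xavier_bound = gain * math.sqrt(6.0 / (fan_in + fan_out))
observed_max = fc1.weight.abs().max().item()

print(f"gain={gain:.4f}, bound={xavier_bound:.4f}, observed max={observed_max:.4f}")
```

Because the bound depends on fan_in + fan_out rather than fan_in alone, a layer that narrows sharply, like this one, ends up with a wider range than the Kaiming formula would give it.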

The most surprising truth is that, even among "good" initialization strategies like Kaiming or Xavier, the variance of the initial weights is often what matters most for ReLU-like activations. Kaiming init’s standard deviation of sqrt(2.0 / fan_in) (equivalently, a uniform bound of sqrt(6.0 / fan_in)) directly targets keeping the variance of the output of a ReLU layer roughly equal to the variance of its input, assuming the input is zero-mean. This is a critical property for stable gradient propagation.
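A small experiment illustrates the variance argument. The sketch below pushes random input through ten ReLU layers and records the variance of each layer’s pre-activations, once with kaiming_normal_ tuned for ReLU and once with the untouched nn.Linear default. The helper names, depth, width, and batch size are assumptions chosen for illustration, and biases are disabled so that only the weight variance is in play.

```python
import torch
import torch.nn as nn

def preactivation_variances(init_fn, depth=10, width=512, seed=0):
    """Hypothetical helper: variance of each layer's pre-activations in a ReLU stack."""
    torch.manual_seed(seed)
    x = torch.randn(1024, width)
    variances = []
    with torch.no_grad():
        for _ in range(depth):
            linear = nn.Linear(width, width, bias=False)
            if init_fn is not None:
                init_fn(linear.weight)   # otherwise keep the nn.Linear default
            pre = linear(x)
            variances.append(pre.var().item())
            x = torch.relu(pre)
    return variances

def he_normal(weight):
    # std = sqrt(2 / fan_in), the ReLU-tuned Kaiming setting
    nn.init.kaiming_normal_(weight, mode='fan_in', nonlinearity='relu')

he_vars = preactivation_variances(he_normal)
default_vars = preactivation_variances(None)

for i, (he, default) in enumerate(zip(he_vars, default_vars), start=1):
    print(f"layer {i:2d}: kaiming_normal_ var={he:.3e}   default var={default:.3e}")
```

With the ReLU-tuned setting the variance should hover around the same value at every depth, while under the default it should shrink layer after layer, which is the sense in which the default is conservative.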

The next concept you’ll grapple with is how batch normalization can significantly reduce the sensitivity of your network to the initial choice of weight initialization.
