Transfer learning with PyTorch is less about transferring knowledge and more about repurposing a model’s learned feature detectors.
Let’s look at a common scenario: classifying images of cats and dogs when you only have a few hundred examples. Training a deep convolutional neural network (CNN) from scratch on that little data would almost certainly overfit. Instead, we can leverage a model like ResNet50, pre-trained on ImageNet, a dataset of over a million labeled images spanning 1,000 classes.
Imagine ResNet50 as a highly sophisticated image understanding machine. Its early layers have learned to detect very general features like edges, corners, and color blobs. Its later layers have learned to combine these into more complex patterns, like textures, shapes, and eventually, object parts. For our cat/dog task, these general feature detectors are incredibly useful. We don’t need to re-teach the model how to see edges; we just need to teach it how to use those existing edge detectors to distinguish between cats and dogs.
Here’s how we do it in practice. We’ll start with a pre-trained ResNet50 model and modify its final classification layer.
import torch
import torchvision.models as models
import torch.nn as nn
import torch.optim as optim
# Load a pre-trained ResNet50 model
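# 'weights=...DEFAULT' downloads the best available ImageNet weights
# (torchvision >= 0.13; older releases used the now-deprecated 'pretrained=True')
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)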
# Freeze all the parameters in the pre-trained model
# With requires_grad = False no gradients are computed for them, so they are never updated
for param in model.parameters():
    param.requires_grad = False
# Get the number of input features for the final classification layer
# For ResNet50 this is 2048, the size of the flattened 'avgpool' output that feeds the fc layer
num_ftrs = model.fc.in_features
# Replace the final classification layer with a new one
# Our new layer will have 'num_ftrs' input features and 2 output features (for cats and dogs)
model.fc = nn.Linear(num_ftrs, 2)
# Now, we only need to train the parameters of the new 'fc' layer.
# We can create an optimizer that only updates these parameters.
optimizer = optim.Adam(model.fc.parameters(), lr=0.001)
# For demonstration, let's create some dummy data
# Batch size of 4, 3 color channels, 224x224 image size
dummy_input = torch.randn(4, 3, 224, 224)
dummy_labels = torch.randint(0, 2, (4,)) # 0 for cat, 1 for dog
# Define a loss function (CrossEntropyLoss is common for classification)
criterion = nn.CrossEntropyLoss()
# --- Training Step ---
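# Note: in training mode, BatchNorm layers in the frozen backbone still update their running statistics
model.train() # Set the model to training mode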
# Forward pass
outputs = model(dummy_input)
loss = criterion(outputs, dummy_labels)
# Backward pass and optimization
optimizer.zero_grad() # Clear previous gradients
loss.backward() # Compute gradients
optimizer.step() # Update weights
print(f"Loss: {loss.item()}")
In this code:
- We load resnet50 with pre-trained ImageNet weights (the pretrained=True argument, or the weights argument in torchvision 0.13 and later). The weights are downloaded automatically on first use.
- We iterate through model.parameters() and set requires_grad = False for all of them. This is crucial. It tells PyTorch not to calculate gradients for these layers, effectively freezing them.
- We inspect model.fc.in_features to know how many features the previous layer outputs.
- We replace model.fc with a new nn.Linear layer. This new layer has the same number of input features but outputs 2 values (one for cat, one for dog).
- When we define optimizer = optim.Adam(model.fc.parameters(), lr=0.001), we are explicitly telling the optimizer to only consider the parameters of this new model.fc layer.
The magic happens because the frozen layers already know how to extract a rich hierarchy of visual features. We’re just learning a new "head" on top of that feature extractor to map those features to our specific classes. This drastically reduces the amount of data and training time needed.
When you replace the final layer, the number of output classes must match your specific task. If you were classifying 10 different types of animals, you’d set nn.Linear(num_ftrs, 10). The learning rate matters too: 0.001 (Adam’s default) is a reasonable starting point here because only the new, randomly initialized head is being trained. The frozen pre-trained weights are not touched, so there is no risk of disturbing them.
One thing that is easy to miss is that the order of these steps matters. Freeze the pre-trained parameters before you replace model.fc; if you freeze afterwards, the loop also freezes the new layer and nothing trains. Create the optimizer only after the new layer is in place; an optimizer built earlier keeps references to the old layer’s parameters and never updates the new ones. Passing only model.fc.parameters() (or whatever your new final layer is) to the optimizer then makes the intent explicit.
After fine-tuning the final layer, you might then unfreeze some of the later layers of the pre-trained model and retrain with a very small learning rate to slightly adjust the higher-level feature detectors for your specific dataset.
The next concept to explore is how to handle datasets that are significantly different from the one the model was pre-trained on.