PyTorch’s torch.optim.lr_scheduler module is a powerful tool for dynamically adjusting learning rates during training, but its true power often lies in combining strategies, not just using them in isolation.
Let’s see this in action. Imagine training a large image classification model. We want to start with a small learning rate to avoid large updates that could destabilize training early on, then gradually increase it (warmup), and finally, as we approach convergence, decay it following a cosine curve.
Here’s how you’d set that up in PyTorch:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import _LRScheduler

# Dummy model and optimizer for demonstration
model = nn.Linear(10, 2)
start_lr = 0.001  # LR at the start of warmup
optimizer = optim.SGD(model.parameters(), lr=start_lr)

# Define warmup and cosine annealing parameters
num_epochs = 100
warmup_epochs = 10
max_lr = 0.01  # Peak LR reached at the end of warmup
min_lr = 0.0001

# Custom LR scheduler combining warmup and cosine annealing
class GradualWarmupScheduler(_LRScheduler):
    def __init__(self, optimizer, multiplier, total_epoch, after_scheduler=None):
        self.multiplier = multiplier
        self.total_epoch = total_epoch
        self.after_scheduler = after_scheduler
        # The base-class constructor calls step() once, which brings this to 0
        self.step_count = -1
        super().__init__(optimizer)
        if after_scheduler is not None:
            # Make the follow-up scheduler anneal from the peak LR
            after_scheduler.base_lrs = [lr * multiplier for lr in self.base_lrs]

    def get_lr(self):
        # Exponential ramp from base_lr to base_lr * multiplier
        progress = min(self.step_count, self.total_epoch) / self.total_epoch
        return [base_lr * self.multiplier ** progress for base_lr in self.base_lrs]

    def step(self, epoch=None):
        # epoch is kept for signature compatibility but is not used
        self.step_count += 1
        if self.step_count > self.total_epoch and self.after_scheduler is not None:
            # Warmup is over: delegate to the follow-up scheduler
            self.after_scheduler.step()
        else:
            super().step()

# Cosine annealing scheduler
cosine_scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs - warmup_epochs, eta_min=min_lr)

# Gradual warmup scheduler
warmup_scheduler = GradualWarmupScheduler(optimizer, multiplier=max_lr / optimizer.defaults['lr'], total_epoch=warmup_epochs, after_scheduler=cosine_scheduler)

# Training loop (simplified)
for epoch in range(num_epochs):
    # Print the LR used for this epoch
    current_lr = optimizer.param_groups[0]['lr']
    print(f"Epoch {epoch+1}/{num_epochs}, LR: {current_lr:.6f}")

    # Simulate training steps
    # ... your training code ...
    optimizer.step()  # the optimizer must step before the scheduler does

    # Adjust learning rate for the next epoch
    warmup_scheduler.step()
This setup first uses GradualWarmupScheduler to raise the learning rate from the optimizer's initial value to the target maximum, max_lr, over warmup_epochs. The ramp is exponential rather than linear, because get_lr() raises multiplier to a fractional power. Once the warmup is complete, it hands over control to CosineAnnealingLR, which then decays the learning rate following a cosine curve down to min_lr over the remaining epochs.
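The shape of the combined schedule can be checked without PyTorch. The sketch below is a plain-Python restatement of the two phases described above. The function name combined_lr and the start_lr value of 0.001 are illustrative assumptions, not part of any library.

```python
import math

def combined_lr(step, start_lr=0.001, max_lr=0.01, min_lr=0.0001,
                warmup_steps=10, total_steps=100):
    """Hypothetical helper: learning rate after `step` scheduler steps.

    Exponential warmup from start_lr to max_lr, then cosine decay to min_lr.
    """
    if step <= warmup_steps:
        # Geometric interpolation between start_lr and max_lr
        return start_lr * (max_lr / start_lr) ** (step / warmup_steps)
    # Cosine phase: progress runs from 0 to 1 over the remaining steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

schedule = [combined_lr(s) for s in range(101)]
print(schedule[0], schedule[10], schedule[100])
```

Plotting schedule is a quick way to confirm that the peak sits at the end of warmup and that the curve flattens out as it approaches min_lr.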
The problem this solves is that starting with a high learning rate can cause the model’s weights to fluctuate wildly, potentially leading to divergence or suboptimal convergence. A warmup phase allows the model to stabilize its initial learning, making it more robust to subsequent, larger updates. Cosine annealing, on the other hand, is a popular choice for the decay phase because it provides a smooth, gradual decrease in learning rate, allowing the model to settle into finer minima in the loss landscape without the abrupt drops that occur with step decay.
Internally, the _LRScheduler base class in PyTorch handles the core logic of updating optimizer.param_groups. Each scheduler needs to implement a get_lr() method that returns a list of learning rates for each parameter group. The step() method is then called after each epoch (or batch, depending on your setup) to trigger the learning rate update based on the scheduler’s internal state and the current epoch/step count. Our custom GradualWarmupScheduler extends this, managing its own step_count and conditionally calling the after_scheduler once the warmup period is over.
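To make the get_lr() and step() contract concrete, here is a minimal sketch of a custom scheduler. The class name HalveEveryStep is a hypothetical toy, chosen only to show how the base class turns the return value of get_lr() into updated param_groups.

```python
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import _LRScheduler

class HalveEveryStep(_LRScheduler):
    """Hypothetical toy scheduler: halves each group's base LR on every step."""

    def get_lr(self):
        # self.last_epoch is maintained by the base class: it is 0 after
        # construction and is incremented by each call to step().
        return [base_lr * 0.5 ** self.last_epoch for base_lr in self.base_lrs]

model = nn.Linear(10, 2)
optimizer = optim.SGD(model.parameters(), lr=0.1)
scheduler = HalveEveryStep(optimizer)

seen = [optimizer.param_groups[0]["lr"]]
for _ in range(3):
    optimizer.step()   # no gradients exist yet, so no weights change
    scheduler.step()   # base class calls get_lr() and writes to param_groups
    seen.append(optimizer.param_groups[0]["lr"])
print(seen)
```

Note that the scheduler never touches the optimizer directly in get_lr(); it only returns one value per parameter group and lets the inherited step() apply them.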
The multiplier in GradualWarmupScheduler determines where warmup ends: the ramp finishes at base_lr * multiplier. Here it is calculated as the ratio of the desired max_lr to optimizer.defaults['lr'] (the learning rate passed when the optimizer was created), so the ramp finishes exactly at max_lr. This only produces a ramp if the optimizer is created with a rate below max_lr; if the two are equal, multiplier is 1 and the warmup phase is flat. The total_epoch defines the duration of the warmup. T_max in CosineAnnealingLR is the number of epochs over which the cosine decay will occur, and eta_min is the minimum learning rate it will reach.
A common point of confusion is how after_scheduler interacts with the base learning rate. CosineAnnealingLR captures its base learning rates from the optimizer when it is constructed, which happens before any warmup has run. If warmup ends at a different rate, make sure the cosine scheduler's base_lrs reflect the peak rate, so the decay starts where warmup stopped. In our example, the warmup ends at max_lr, and CosineAnnealingLR then decays from that max_lr down to min_lr over T_max steps.
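The handover can be examined in isolation. The sketch below is an illustration under assumptions: the values are made up, and overwriting base_lrs is a convention used by community warmup wrappers rather than a documented PyTorch API. It simulates the end of warmup by setting the optimizer's rate by hand, then lets CosineAnnealingLR run for T_max steps.

```python
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import CosineAnnealingLR

# Illustrative values, not taken from any library default
start_lr, peak_lr, min_lr, decay_steps = 0.001, 0.01, 0.0001, 90

model = nn.Linear(10, 2)
optimizer = optim.SGD(model.parameters(), lr=start_lr)
cosine = CosineAnnealingLR(optimizer, T_max=decay_steps, eta_min=min_lr)
captured = list(cosine.base_lrs)  # recorded at construction time

# Simulate the end of warmup: the optimizer now sits at the peak rate.
for group in optimizer.param_groups:
    group["lr"] = peak_lr
# Assumption: align the scheduler's recorded base rate with the peak as well.
cosine.base_lrs = [peak_lr for _ in cosine.base_lrs]

lrs = []
for _ in range(decay_steps):
    optimizer.step()   # stands in for a real training step
    cosine.step()
    lrs.append(optimizer.param_groups[0]["lr"])
print(captured, lrs[0], lrs[-1])
```

The captured list holds the rate the optimizer had when the scheduler was built, which is why it needs to be reconciled with the rate at which warmup actually ends.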
The learning rate at any given step s during the warmup phase (where s is self.step_count) is calculated as base_lr * (self.multiplier ** (s / self.total_epoch)). This is an exponential ramp-up: the rate is multiplied by the same factor at every step. If you wanted a linear warmup, you would calculate it as base_lr + (max_lr - base_lr) * (s / self.total_epoch). Both ramps share the same endpoints, but the exponential version spends more of the warmup at small rates; which one trains more stably depends on the model and is worth checking empirically.
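The two warmup formulas are easy to compare side by side. This is a small plain-Python sketch; the function names and the values of base_lr, max_lr, and total are illustrative assumptions.

```python
def exponential_warmup(step, base_lr, max_lr, total):
    """Geometric ramp: equal ratio between consecutive steps."""
    return base_lr * (max_lr / base_lr) ** (step / total)

def linear_warmup(step, base_lr, max_lr, total):
    """Arithmetic ramp: equal difference between consecutive steps."""
    return base_lr + (max_lr - base_lr) * (step / total)

base_lr, max_lr, total = 0.001, 0.01, 10  # illustrative values
for step in range(total + 1):
    exp_lr = exponential_warmup(step, base_lr, max_lr, total)
    lin_lr = linear_warmup(step, base_lr, max_lr, total)
    print(f"step {step:2d}  exponential {exp_lr:.5f}  linear {lin_lr:.5f}")
```

Between the endpoints the exponential ramp always sits below the linear one, so it is the more conservative of the two early in training.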
The final thing most people miss concerns the optional epoch argument of step(). Passing it is deprecated in current PyTorch releases, and it is not how you switch to per-batch scheduling. If you’re stepping your scheduler per batch instead of per epoch, call warmup_scheduler.step() once per batch with no argument, and express total_epoch and T_max in batches rather than epochs. Our example uses per-epoch stepping for simplicity, so each call to warmup_scheduler.step() advances the schedule by one epoch.
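As a sketch of per-batch stepping, the example below swaps the custom class for PyTorch's built-in LambdaLR, which scales the optimizer's base rate by whatever factor a function returns for the current step. The steps_per_epoch value is a hypothetical stand-in for len(train_loader), and start_lr and lr_factor are illustrative names.

```python
import math
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import LambdaLR

num_epochs, warmup_epochs = 100, 10
steps_per_epoch = 5  # hypothetical: normally len(train_loader)
start_lr, max_lr, min_lr = 0.001, 0.01, 0.0001

# Durations are expressed in batches, not epochs
warmup_steps = warmup_epochs * steps_per_epoch
total_steps = num_epochs * steps_per_epoch

def lr_factor(step):
    """Factor applied to the optimizer's base LR (max_lr) at a global step."""
    if step < warmup_steps:
        lr = start_lr * (max_lr / start_lr) ** (step / warmup_steps)
    else:
        progress = (step - warmup_steps) / (total_steps - warmup_steps)
        lr = min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
    return lr / max_lr

model = nn.Linear(10, 2)
optimizer = optim.SGD(model.parameters(), lr=max_lr)
scheduler = LambdaLR(optimizer, lr_lambda=lr_factor)

history = [optimizer.param_groups[0]["lr"]]
for epoch in range(num_epochs):
    for batch in range(steps_per_epoch):
        optimizer.step()   # stands in for forward, backward, and update
        scheduler.step()   # one scheduler step per batch, no argument
        history.append(optimizer.param_groups[0]["lr"])
```

Because the whole schedule lives in one function of the global step, there is no handover between two schedulers to get wrong, at the cost of writing the cosine formula by hand.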
After successfully implementing this combined scheduler, the next logical step is to explore learning rate finders to automatically determine optimal max_lr and min_lr values.