Distillation is how you make a student model learn from a larger, more capable teacher model, not just from the raw data.
Let’s see how this plays out in practice. Imagine we have a powerful, proprietary model (our "teacher") that’s great at classifying customer sentiment, but it’s too large and slow for real-time mobile use. We want to create a smaller, faster "student" model that performs almost as well.
Instead of just training the student on labeled examples like "This review is positive," we train it on the outputs of the teacher model. If the teacher model, given a review, outputs a probability distribution like {"positive": 0.9, "neutral": 0.08, "negative": 0.02}, we train the student to mimic that distribution, not just the hard label "positive." This is often done by minimizing the Kullback-Leibler divergence between the teacher’s and student’s output probabilities.
Here’s a simplified Python snippet demonstrating the core idea. We’ll use dummy models and data for illustration.
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
# --- Dummy Teacher Model ---
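class TeacherModel(nn.Module):
    def __init__(self, num_classes=3):
        super().__init__()
        self.linear = nn.Linear(10, num_classes)
        # Stand-in for a pre-trained, complex model: fixed weights, never trained here.
        # The weights are random rather than constant, because identical weights
        # for every class would give identical logits and a uniform output.
        with torch.no_grad():
            self.linear.weight.normal_(0.0, 1.0)
            self.linear.bias.fill_(-0.1)

    def forward(self, x):
        logits = self.linear(x)
        return torch.softmax(logits, dim=-1)  # Output probabilities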
# --- Dummy Student Model ---
class StudentModel(nn.Module):
    def __init__(self, num_classes=3):
        super().__init__()
        # Same shape as the teacher to keep the demo short;
        # a real student would use a smaller architecture.
        self.linear = nn.Linear(10, num_classes)

    def forward(self, x):
        logits = self.linear(x)
        return torch.softmax(logits, dim=-1)  # Output probabilities
# --- Dummy Dataset ---
class DummyDataset(Dataset):
    def __init__(self, num_samples=1000):
        self.features = torch.randn(num_samples, 10)
        self.labels = torch.randint(0, 3, (num_samples,))  # Dummy hard labels

    def __len__(self):
        return len(self.features)

    def __getitem__(self, idx):
        return self.features[idx], self.labels[idx]
# --- Training Setup ---
teacher = TeacherModel()
student = StudentModel()
dataset = DummyDataset()
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
# Loss function for distillation (KL Divergence)
# We'll use a temperature parameter to soften the teacher's probabilities
criterion_distill = nn.KLDivLoss(reduction='batchmean')
# A standard cross-entropy loss for the hard labels (optional but common)
criterion_hard = nn.CrossEntropyLoss()
optimizer = optim.Adam(student.parameters(), lr=0.001)
# --- Distillation Training Loop ---
epochs = 5
temperature = 2.0 # Higher temperature softens probabilities more
alpha = 0.7 # Weight for distillation loss vs. hard label loss
print("Starting distillation training...")
for epoch in range(epochs):
    for inputs, hard_labels in dataloader:
        optimizer.zero_grad()

        # Get the teacher's "soft" targets. The temperature divides the logits
        # before the softmax; dividing probabilities would not give a distribution.
        with torch.no_grad():  # Ensure teacher gradients are not computed
            teacher_logits = teacher.linear(inputs)
            teacher_probs_soft = torch.softmax(teacher_logits / temperature, dim=-1)

        # Get the student's logits and its temperature-softened log-probabilities
        student_logits = student.linear(inputs)
        student_log_probs_soft = torch.log_softmax(student_logits / temperature, dim=-1)

        # Distillation loss: KLDivLoss expects log-probabilities as input and
        # probabilities as target. Scaling by temperature**2 is a common convention
        # that keeps this term's gradients comparable to the hard-label term.
        loss_distill = criterion_distill(student_log_probs_soft, teacher_probs_soft) * temperature ** 2

        # Hard-label loss (optional, but often helps). CrossEntropyLoss applies
        # log-softmax internally, so it takes raw logits, not probabilities.
        loss_hard = criterion_hard(student_logits, hard_labels)

        # Combine losses
        loss = alpha * loss_distill + (1 - alpha) * loss_hard
        loss.backward()
        optimizer.step()

    # Report the loss of the last batch in this epoch
    print(f'Epoch [{epoch+1}/{epochs}], Loss: {loss.item():.4f}')
print("Distillation training finished.")
The core problem distillation solves is transferring knowledge from a large, complex model to a smaller, more efficient one. Instead of just learning the correct classification from ground truth labels, the student model learns the nuances of the teacher’s decision-making process. The teacher’s probability distribution for a given input contains more information than a single hard label. For example, if the teacher is 90% confident it’s "positive" but also assigns 8% to "neutral," that subtle uncertainty is valuable information that helps the student generalize better.
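To make the "more information than a hard label" point concrete, here is a small sketch in plain Python. The two student distributions are hypothetical values chosen for illustration; only the teacher distribution comes from the sentiment example above. Both students put 0.90 on "positive", so a hard-label loss scores them identically, while the KL divergence to the teacher separates them.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Teacher distribution over (positive, neutral, negative) from the example above.
teacher = [0.90, 0.08, 0.02]

# Two hypothetical students with the same confidence in "positive".
student_a = [0.90, 0.07, 0.03]  # ranks neutral above negative, like the teacher
student_b = [0.90, 0.01, 0.09]  # ranks negative above neutral, unlike the teacher

# Cross-entropy against the hard label "positive" only looks at index 0.
hard_loss_a = -math.log(student_a[0])
hard_loss_b = -math.log(student_b[0])

# KL divergence to the teacher looks at the whole distribution.
kl_a = kl_divergence(teacher, student_a)
kl_b = kl_divergence(teacher, student_b)

print(f"hard-label loss: A={hard_loss_a:.4f}, B={hard_loss_b:.4f}")
print(f"KL to teacher:   A={kl_a:.4f}, B={kl_b:.4f}")
```

The hard-label losses are equal, but student B's KL divergence is much larger because it disagrees with the teacher about which wrong class is more plausible.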
Internally, this works by training the student model to minimize a loss function that combines two components:
- Distillation Loss: Typically the Kullback-Leibler (KL) divergence between the probability distributions predicted by the student and the teacher for the same input. This encourages the student’s output probabilities to match the teacher’s.
- Student Loss (Optional but common): A standard loss function (like cross-entropy) calculated using the student’s predictions and the true ground truth labels. This anchors the student to the original task.
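The two components above can be sketched as a single function. This is a minimal illustration, not a definitive implementation: the name `distillation_loss`, its default values, and the random inputs are assumptions made for this example. The `temperature ** 2` scaling on the soft term is a common convention rather than a requirement.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, hard_labels,
                      temperature=2.0, alpha=0.7):
    """Weighted sum of a soft-target KL term and a hard-label cross-entropy term."""
    # Soften both distributions with the same temperature.
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)

    # F.kl_div takes log-probabilities as input and probabilities as target.
    soft_loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
    soft_loss = soft_loss * temperature ** 2  # conventional gradient rescaling

    # F.cross_entropy applies log-softmax itself, so it takes raw logits.
    hard_loss = F.cross_entropy(student_logits, hard_labels)

    return alpha * soft_loss + (1 - alpha) * hard_loss

# Illustrative random inputs: a batch of 4 examples and 3 classes.
torch.manual_seed(0)
student_logits = torch.randn(4, 3, requires_grad=True)
teacher_logits = torch.randn(4, 3)
hard_labels = torch.tensor([0, 2, 1, 0])

loss = distillation_loss(student_logits, teacher_logits, hard_labels)
loss.backward()  # gradients flow only into the student logits
print(loss.item())
```

Setting `alpha=1.0` trains on the teacher's soft targets alone, and `alpha=0.0` reduces to ordinary supervised training on the hard labels.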
The temperature parameter is crucial. The logits are divided by the temperature before the softmax is applied, so a higher temperature "softens" the probability distribution, making it less peaked and revealing more about the relative probabilities of the incorrect classes. A temperature of 1.0 is the standard softmax. Temperature never changes the ranking of the classes, only how far apart they are: for an input where the teacher at temperature 1.0 gives [0.9, 0.05, 0.05], a temperature of 5.0 gives roughly [0.47, 0.26, 0.26]. The information carried by these non-target probabilities is the "dark knowledge" the student learns from. The same temperature is applied to both the teacher’s and the student’s logits in the distillation loss. The alpha parameter balances the influence of the distillation loss against the standard loss.
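The effect of temperature is easy to check by hand. The sketch below uses plain Python; the helper name `softmax_with_temperature` is an assumption for this example, and the logits are chosen so that the standard softmax reproduces a [0.9, 0.05, 0.05] distribution.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    scaled = [z / temperature for z in logits]
    largest = max(scaled)
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Logits whose standard softmax (temperature 1.0) is [0.90, 0.05, 0.05].
logits = [math.log(0.90), math.log(0.05), math.log(0.05)]

for temperature in (1.0, 2.0, 5.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

As the temperature rises, the top class falls from 0.90 to roughly 0.47 at a temperature of 5.0, yet it remains the most probable class: temperature flattens the distribution without reordering it.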
Distillation also isn’t limited to matching the final output probabilities. Techniques like "intermediate layer distillation" train the student to mimic the activations of intermediate layers in the teacher model. This can transfer richer feature representations than output matching alone. When the student’s layers differ in size from the teacher’s, a small learned projection is typically needed so that the two sets of activations can be compared. It’s like the student learning not just the final answer, but also how the teacher arrived at it.
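One possible way to set up an intermediate-layer loss is sketched below. Everything here is an assumption made for illustration: the `TinyTeacher` and `TinyStudent` classes, the layer sizes, the linear projection that maps the student's hidden width to the teacher's, and the choice of mean squared error as the feature-matching loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyTeacher(nn.Module):
    def __init__(self):
        super().__init__()
        self.hidden = nn.Linear(10, 64)
        self.out = nn.Linear(64, 3)

    def forward(self, x):
        h = torch.relu(self.hidden(x))
        return self.out(h), h  # logits and hidden activations

class TinyStudent(nn.Module):
    def __init__(self):
        super().__init__()
        self.hidden = nn.Linear(10, 16)
        self.out = nn.Linear(16, 3)
        # Learned projection from the student's width (16) to the teacher's (64),
        # used only to compare activations during training.
        self.proj = nn.Linear(16, 64)

    def forward(self, x):
        h = torch.relu(self.hidden(x))
        return self.out(h), self.proj(h)  # logits and projected hidden activations

torch.manual_seed(0)
teacher = TinyTeacher().eval()
student = TinyStudent()
x = torch.randn(8, 10)  # dummy batch of 8 inputs

with torch.no_grad():  # the teacher is frozen
    _, teacher_hidden = teacher(x)
student_logits, student_hidden = student(x)

# Feature-matching loss between projected student and teacher activations.
feature_loss = F.mse_loss(student_hidden, teacher_hidden)
feature_loss.backward()
print(feature_loss.item())
```

In practice this feature loss would be added, with its own weight, to the output-level distillation loss rather than used alone.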
Once you’ve distilled a model, you’ll next want to explore techniques for further model compression like quantization or pruning.