PyTorch Profiler: Find Training Bottlenecks (2026)

The PyTorch Profiler helps you pinpoint performance bottlenecks in your training code. Its real value is that it reveals how operations are actually scheduled and executed on the CPU and GPU, not just how long they take.

Let’s see it in action. Imagine you’re training a CNN and suspect your data loading is too slow. Here’s a snippet of how you might integrate the profiler:

import torch
import torch.nn as nn
import torch.optim as optim
from torch.profiler import profile, record_function, ProfilerActivity

# ... (your model definition, dataset, dataloader) ...

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = YourModel().to(device)  # keep the model on the same device as the inputs
optimizer = optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

# Only request CUDA activity when a GPU is present
activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

# Initialize the profiler
with profile(activities=activities, record_shapes=True, profile_memory=True) as prof:
    with record_function("model_train"): # Tag this block
        for epoch in range(num_epochs):
            for i, (inputs, labels) in enumerate(dataloader):
                # Move data to the same device as the model
                inputs, labels = inputs.to(device), labels.to(device)

                # Zero gradients, forward pass, backward pass, optimize
                optimizer.zero_grad()
                outputs = model(inputs)
                loss = criterion(outputs, labels)
                loss.backward()
                optimizer.step()

                # Record some operations explicitly for finer control
                with record_function("data_processing"):
                    # Simulate some data preprocessing if it's part of the loop
                    processed_inputs = inputs * 1.1 # Example operation
                
                if i % 10 == 0:
                    print(f'Epoch {epoch}, Batch {i}, Loss: {loss.item()}')

# Print the profiling results
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

This code sets up the profiler to capture both CPU and CUDA (GPU) activities. The record_shapes=True flag is crucial for understanding tensor dimensions, and profile_memory=True tracks memory allocations. We’ve also used record_function to explicitly name blocks of code, making them easier to identify in the output.
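The snippet above relies on placeholders (YourModel, dataloader, num_epochs). As a minimal sketch that runs as-is, the version below swaps in a tiny CNN and a random dataset; both are hypothetical stand-ins rather than anything from a real project, and the train_step label is an arbitrary name.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.profiler import profile, record_function, ProfilerActivity
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-ins: random "images" and a tiny CNN.
torch.manual_seed(0)
dataset = TensorDataset(torch.randn(64, 3, 16, 16), torch.randint(0, 10, (64,)))
dataloader = DataLoader(dataset, batch_size=16)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 10),
).to(device)
optimizer = optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

# Request CUDA activity only when a GPU is present.
activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities, record_shapes=True, profile_memory=True) as prof:
    for inputs, labels in dataloader:
        inputs, labels = inputs.to(device), labels.to(device)
        with record_function("train_step"):  # arbitrary label for one step
            optimizer.zero_grad()
            loss = criterion(model(inputs), labels)
            loss.backward()
            optimizer.step()

# Sorting by CPU time keeps the table meaningful on CPU-only machines.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```

Profiling a handful of batches like this is usually enough to see the pattern; profiling every epoch mostly adds overhead and a larger trace.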

The profiler works by instrumenting PyTorch itself. While it is active, PyTorch records the start and end times of operator calls, CUDA kernel launches and executions, and any record_function ranges, along with associated metadata (like tensor shapes and memory usage). Plain Python function calls are not recorded unless you wrap them in record_function. The profiler then aggregates this information to give you a comprehensive view of where time is being spent.
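To make the record-then-aggregate step concrete, here is a small CPU-only sketch that compares the raw event list from prof.events() with the grouped view from prof.key_averages(). The matmul_block label and the tensor size are arbitrary choices for illustration.

```python
import torch
from torch.profiler import profile, record_function, ProfilerActivity

x = torch.randn(128, 128)

with profile(activities=[ProfilerActivity.CPU]) as prof:
    for _ in range(3):
        with record_function("matmul_block"):  # arbitrary label
            y = x @ x

raw_events = prof.events()      # one entry per recorded call
averages = prof.key_averages()  # entries grouped by name

# Each pass through the labelled block is a separate raw event...
block_calls = [e for e in raw_events if e.name == "matmul_block"]
# ...while the aggregated view folds them into one row with a call count.
block_avg = next(a for a in averages if a.key == "matmul_block")

print("raw events for block:", len(block_calls))
print("aggregated call count:", block_avg.count)
print("aggregated CPU time (us):", block_avg.cpu_time_total)
```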

The core problem the profiler solves is that training deep learning models often involves a complex interplay between CPU (data loading, preprocessing, Python logic) and GPU (computation). A bottleneck can occur in either, or even in the communication between them. Simply looking at wall-clock time for your training loop won’t tell you why it’s slow. Is the GPU waiting for data? Is a specific CUDA kernel inefficient? Is Python overhead high? The profiler answers these questions.
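One way to answer the "is the model waiting for data?" question is to label batch fetching and computation separately. The sketch below is an illustration under stated assumptions: SlowDataset is a hypothetical dataset that sleeps to mimic expensive preprocessing, and the data_loading and forward_backward labels are arbitrary. It runs on the CPU only.

```python
import time
import torch
import torch.nn as nn
from torch.profiler import profile, record_function, ProfilerActivity
from torch.utils.data import DataLoader, Dataset


class SlowDataset(Dataset):
    """Hypothetical dataset that sleeps to mimic expensive preprocessing."""

    def __len__(self):
        return 16

    def __getitem__(self, idx):
        time.sleep(0.01)  # pretend decoding/augmentation is slow
        return torch.randn(32), idx % 2


loader = DataLoader(SlowDataset(), batch_size=4)
model = nn.Linear(32, 2)
criterion = nn.CrossEntropyLoss()

with profile(activities=[ProfilerActivity.CPU]) as prof:
    batches = iter(loader)
    while True:
        # Time spent waiting for the next batch.
        with record_function("data_loading"):
            batch = next(batches, None)
        if batch is None:
            break
        inputs, labels = batch
        # Time spent on the model itself.
        with record_function("forward_backward"):
            loss = criterion(model(inputs), labels)
            loss.backward()

# cpu_time_total is reported in microseconds.
totals = {a.key: a.cpu_time_total for a in prof.key_averages()}
print(f"data_loading:     {totals['data_loading'] / 1000:.1f} ms")
print(f"forward_backward: {totals['forward_backward'] / 1000:.1f} ms")
```

With this artificial delay, the data_loading total should dominate, which is the signature of an input pipeline that cannot keep up with the model.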

The prof.key_averages().table() output is your primary diagnostic tool. It prints a table summarizing the performance of each operation, aggregated over all of its calls. The names below are the sort_by keys and event attributes; the printed column headers are shorter (for example, Self CPU, CPU total, and # of Calls). Key columns include:

self_cpu_time_total: Time spent in the operation on the CPU, excluding time spent in its children.
self_cuda_time_total: Time spent in the operation on the GPU, excluding time spent in its children.
cpu_time_total: Total time spent in the operation and its children on the CPU.
cuda_time_total: Total time spent in the operation and its children on the GPU.
count: Number of times the operation was called (shown as # of Calls).

By examining this table, sorted by cuda_time_total or cpu_time_total, you can quickly identify the most time-consuming operations. For example, if you see a large self_cpu_time_total for a data loading function, you know that’s where to focus your optimization efforts. If self_cuda_time_total for your model’s forward pass is high, you might investigate kernel efficiency or model architecture.
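The same ranking can be done programmatically, which is handy for logging or automated checks. This is a minimal CPU-only sketch with a hypothetical two-layer model; it sorts the aggregated events by self CPU time and prints the top five alongside the built-in table.

```python
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

# Hypothetical small model and input, used only to generate some events.
model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 10))
inputs = torch.randn(64, 256)

with profile(activities=[ProfilerActivity.CPU]) as prof:
    for _ in range(5):
        model(inputs).sum().backward()

averages = prof.key_averages()

# Built-in table, sorted by time spent in each op itself.
print(averages.table(sort_by="self_cpu_time_total", row_limit=5))

# The same ranking done by hand from the event attributes.
top = sorted(averages, key=lambda a: a.self_cpu_time_total, reverse=True)[:5]
for a in top:
    print(f"{a.key:40s} self={a.self_cpu_time_total:>10.0f}us "
          f"total={a.cpu_time_total:>10.0f}us calls={a.count}")
```

Comparing the self and total columns is often the quickest way to tell a wrapper op (large total, small self) from the op that actually does the work.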

The record_shapes=True option records the input shapes of each operator. Calling prof.key_averages(group_by_input_shape=True) then splits each operator's row by input shape, so you can see how tensor dimensions affect performance (the trace can also be visualized with TensorBoard). Larger tensors often lead to longer computation times and increased memory bandwidth usage. This can highlight issues like unnecessarily large batch sizes or inefficient tensor operations that create large intermediate tensors.
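As a small illustration of shape-aware profiling, the sketch below runs the same operator on two matrix sizes and groups the results by input shape. The sizes are arbitrary, and the example is CPU-only.

```python
import torch
from torch.profiler import profile, ProfilerActivity

small = torch.randn(32, 32)
large = torch.randn(512, 512)

with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    for _ in range(10):
        torch.mm(small, small)
        torch.mm(large, large)

# One row per (operator, input shapes) pair instead of one row per operator.
by_shape = prof.key_averages(group_by_input_shape=True)
print(by_shape.table(sort_by="cpu_time_total", row_limit=5))

# Pull out the matrix-multiply rows to compare the two shapes directly.
mm_rows = [a for a in by_shape if a.key == "aten::mm"]
for a in mm_rows:
    print(a.input_shapes, "calls:", a.count, "cpu_time_total (us):", a.cpu_time_total)
```

Without group_by_input_shape=True, both sizes would be folded into a single aten::mm row and the difference between them would be hidden.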

You can also mark regions of your code for external tools. For instance, torch.cuda.nvtx.range_push() and torch.cuda.nvtx.range_pop() emit NVTX ranges, which are separate from the torch.profiler output and appear as named bars in GPU-specific visualization tools like Nsight Systems. This is invaluable for correlating your Python code structure with actual GPU execution.
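A minimal sketch of NVTX marking follows. The nvtx_range helper is hypothetical (not part of PyTorch); it wraps range_push and range_pop in a context manager and becomes a no-op when CUDA is unavailable, so the script still runs on a CPU-only machine.

```python
import contextlib
import torch


@contextlib.contextmanager
def nvtx_range(name):
    """Hypothetical helper: NVTX range on CUDA machines, no-op otherwise."""
    use_nvtx = torch.cuda.is_available()
    if use_nvtx:
        torch.cuda.nvtx.range_push(name)
    try:
        yield
    finally:
        if use_nvtx:
            torch.cuda.nvtx.range_pop()


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.nn.Linear(64, 64).to(device)
x = torch.randn(8, 64, device=device)

# Each range shows up as a named span in tools that read NVTX markers.
with nvtx_range("forward"):
    out = model(x)
with nvtx_range("backward"):
    out.sum().backward()

print(out.shape)
```

The ranges only become visible when the script runs under a tool that collects NVTX markers, such as Nsight Systems; on their own they produce no output.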

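One aspect many users overlook is the interaction between CPU and GPU. CUDA kernels are launched asynchronously, so for a GPU operation cpu_time_total mostly reflects launch and bookkeeping overhead, while cuda_time_total reflects how long the kernels actually run. If an operation has a high cpu_time_total but a low cuda_time_total, and it’s followed by a GPU operation, it often indicates the CPU is a bottleneck, and the GPU is waiting. Conversely, a high cuda_time_total with low cpu_time_total means the operation is GPU-bound, and the kernels themselves are what to optimize. Slow data loading looks different: it shows up as high CPU time in the data loading step while the GPU sits idle, which is how a starved GPU appears in the profile. The profiler’s detailed breakdown helps diagnose these interdependencies.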

The next step after identifying bottlenecks is often to look at the detailed CUDA API calls or delve into the specific PyTorch operations contributing the most to the time.
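For that closer look, the profiler can write a timeline trace with export_chrome_trace(). The sketch below is CPU-only and uses an arbitrary matmul_step label and a temporary directory; the resulting JSON file can be opened in a Chrome-trace viewer such as chrome://tracing or Perfetto.

```python
import os
import tempfile
import torch
from torch.profiler import profile, record_function, ProfilerActivity

x = torch.randn(256, 256)

with profile(activities=[ProfilerActivity.CPU]) as prof:
    with record_function("matmul_step"):  # arbitrary label
        (x @ x).sum()

# Write the timeline to a temporary location (the path is just an example).
trace_path = os.path.join(tempfile.mkdtemp(), "trace.json")
prof.export_chrome_trace(trace_path)

print("trace written to:", trace_path)
print("size in bytes:", os.path.getsize(trace_path))
```

In the timeline view, gaps between GPU kernels are easier to spot than in the aggregated table, which makes it a useful complement when you suspect the GPU is waiting on the CPU.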