PyTorch inference batching doesn’t just speed up your model; it fundamentally changes how your GPU processes data, turning a series of individual tasks into a single, highly efficient operation.

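Let’s see this in action. Imagine you have a simple image classification model and you want to process 100 images. Without batching, you call the model 100 times, once per image. The weights stay loaded the whole time, but every call pays its own fixed costs (Python dispatch, kernel launches, and often a host-to-device copy) while giving the GPU too little work to keep it busy. Across 100 calls, that overhead adds up.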

import time

import torch
import torchvision.models as models

# Run on the GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


def sync():
    # CUDA calls are asynchronous: wait for queued work before reading the clock
    if device.type == "cuda":
        torch.cuda.synchronize()


# Load a pre-trained model (the weights API needs torchvision >= 0.13;
# older versions used the now-deprecated pretrained=True)
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).to(device)
model.eval()  # Set model to evaluation mode

# Dummy image data (replace with your actual image loading)
# For demonstration, we'll create a list of random tensors
num_images = 100
image_size = 224
dummy_images = [torch.randn(3, image_size, image_size, device=device) for _ in range(num_images)]

# Warm-up pass so one-time initialization is not counted in either timing
with torch.no_grad():
    model(dummy_images[0].unsqueeze(0))
sync()

# --- Without Batching ---
print("Processing without batching...")
start_time = time.perf_counter()
results_no_batch = []
with torch.no_grad():  # Disable gradient calculation for inference
    for img_tensor in dummy_images:
        # One forward pass per image; unsqueeze(0) adds the batch dimension
        output = model(img_tensor.unsqueeze(0))
        results_no_batch.append(output)
sync()
end_time = time.perf_counter()
print(f"Time taken (no batching): {end_time - start_time:.4f} seconds")

# --- With Batching ---
print("\nProcessing with batching...")
# Stack images into a single batch tensor of shape [100, 3, 224, 224]
batch_tensor = torch.stack(dummy_images)

sync()
start_time = time.perf_counter()
with torch.no_grad():
    outputs_batch = model(batch_tensor)
sync()
end_time = time.perf_counter()
print(f"Time taken (with batching): {end_time - start_time:.4f} seconds")

# On a GPU, the batched pass is typically much faster than the per-image loop.

The core problem batching solves is GPU underutilization. GPUs are massively parallel processors, designed to perform the same operation on many data points simultaneously. When you process images one by one, you’re only using a tiny fraction of the GPU’s potential. Each individual inference request involves overhead: data transfer to the GPU, kernel launches, and memory access patterns that aren’t optimized for parallel execution.

When you batch, you stack multiple input samples into a single tensor (e.g., a tensor of shape [batch_size, channels, height, width]) and feed that one tensor to the model. The GPU can then execute the same operations (matrix multiplications, convolutions, etc.) across all samples in the batch in parallel. The fixed per-call overhead is paid once per batch instead of once per sample, which lets the GPU run much closer to its peak computational capacity.
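To make the shape bookkeeping concrete, here is a minimal sketch of stacking. The three random tensors are stand-ins for real preprocessed images, and the sizes are chosen only for illustration.

```python
import torch

# Three hypothetical "images", each of shape [channels, height, width]
samples = [torch.randn(3, 224, 224) for _ in range(3)]

# torch.stack adds a new leading dimension: [batch_size, channels, height, width]
batch = torch.stack(samples)
print(batch.shape)  # torch.Size([3, 3, 224, 224])

# unsqueeze(0) produces the batch-of-one case used when processing one image at a time
single = samples[0].unsqueeze(0)
print(single.shape)  # torch.Size([1, 3, 224, 224])
```

Note that torch.stack requires every sample to have the same shape, which is why images are usually resized or cropped to a common size before batching.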

The key levers you control are:

  • Batch Size: This is the number of samples you group together. Too small, and you don’t get enough parallelism. Too large, and you might run out of GPU memory or hit diminishing returns if the model operations themselves become the bottleneck.
  • Model Architecture: Some models are more amenable to batching than others. Models with highly sequential operations or those that require dynamic input sizes might pose challenges.
  • Hardware: The GPU’s memory capacity and its compute units (CUDA cores) directly influence how large a batch you can effectively use.
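The batch size lever can be exposed as a single parameter. The following is a rough sketch, not a tuned implementation: predict_in_batches is a hypothetical helper name, and the tiny linear model is a stand-in chosen so the example runs quickly on any machine.

```python
import torch
import torch.nn as nn


def predict_in_batches(model, inputs, batch_size):
    """Run `model` over `inputs` in chunks of at most `batch_size` samples."""
    outputs = []
    with torch.no_grad():
        # torch.split cuts along dim 0; the last chunk may be smaller
        for chunk in torch.split(inputs, batch_size):
            outputs.append(model(chunk))
    return torch.cat(outputs)


# Stand-in model and data (assumptions for illustration only)
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 10)).eval()
inputs = torch.randn(100, 3, 8, 8)

preds = predict_in_batches(model, inputs, batch_size=32)
print(preds.shape)  # torch.Size([100, 10])
```

With a helper like this, you could time several values of batch_size on your own hardware and pick the largest one that fits in memory and still improves throughput.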

The magic happens because the underlying linear algebra operations (like matrix multiplication) are designed to be efficient on parallel hardware. When you perform W @ X where W is a weight matrix and X is a batch of input vectors, the hardware can compute each W @ x_i (where x_i is a single input vector) in parallel. The memory access patterns also become more coalesced, meaning data is read from contiguous memory locations, which is much faster on GPUs.
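A small sketch can confirm that the batched product gives the same numbers as the per-vector products. The matrix sizes here are arbitrary, and the input vectors are stored as the columns of X.

```python
import torch

torch.manual_seed(0)
W = torch.randn(4, 6)  # weight matrix: 6 input features -> 4 outputs
X = torch.randn(6, 5)  # 5 input vectors stored as columns x_0 ... x_4

# One batched matrix multiplication over all inputs
batched = W @ X  # shape [4, 5]

# The same result computed one input vector at a time
one_by_one = torch.stack([W @ X[:, i] for i in range(X.shape[1])], dim=1)

# The two agree up to floating-point rounding
same = torch.allclose(batched, one_by_one, atol=1e-5)
print(same)
```

The arithmetic is identical in both cases. What changes is that the batched form hands the hardware all of the work in one call instead of five small ones.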

Most people think of batching as simply processing more data at once. However, the true power comes from how it transforms the computation from a long series of independent, high-overhead calls into a few large, massively parallel kernel launches with little overhead per sample. The GPU isn’t just doing the same work faster; it’s doing the work differently, in a way that leverages its architecture.

The next step after optimizing batch size for throughput is often dynamic batching, where you group incoming requests on the fly to form batches.
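As a rough illustration of the idea, the sketch below collects requests from a queue until either the batch is full or a short deadline passes. The name collect_batch and the parameters max_batch_size and max_wait_s are hypothetical, and the integers stand in for real requests. Production serving frameworks implement this with more care.

```python
import queue
import time


def collect_batch(request_queue, max_batch_size, max_wait_s):
    """Block for the first request, then gather more until the batch is full
    or max_wait_s seconds have passed since the first request arrived."""
    batch = [request_queue.get()]  # wait for at least one request
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(request_queue.get(timeout=remaining))
        except queue.Empty:
            break
    return batch


# Ten pending requests (plain integers stand in for input tensors)
requests = queue.Queue()
for i in range(10):
    requests.put(i)

first = collect_batch(requests, max_batch_size=4, max_wait_s=0.05)
print(first)  # [0, 1, 2, 3]
```

In a PyTorch service, each queued item would typically be an input tensor, and the collected list would be passed through torch.stack before a single model call. The max_wait_s value trades latency for throughput: a longer wait fills batches more often but delays the first request in each batch.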
