Quantization isn’t just about making your model smaller; it’s about making it faster by leveraging specialized hardware instructions that operate on lower-precision integers.

Let’s see this in action. Imagine a simple linear layer in PyTorch.

import torch
import torch.nn as nn

# A simple model
class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 10)

    def forward(self, x):
        return self.fc(x)

model = SimpleModel()

# Create some random input data
x = torch.randn(1, 10)

# Run the model normally (FP32)
output_fp32 = model(x)

print("FP32 output:", output_fp32)

Now, let’s quantize this model to INT8. This involves a few steps, primarily using torch.quantization.

from torch.quantization import QuantStub, DeQuantStub, fuse_modules, prepare, convert

class QuantizedSimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        # QuantStub marks the input to the quantized section
        self.quant = QuantStub()
        self.fc = nn.Linear(10, 10)
        # DeQuantStub marks the output from the quantized section
        self.dequant = DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.fc(x)
        x = self.dequant(x)
        return x

quantized_model = QuantizedSimpleModel()

# Static quantization follows four steps: fuse, prepare, calibrate, convert.
# Fusion only applies to supported sequences such as Linear + ReLU. This model
# has no ReLU, so there is nothing to fuse here.
# The runnable example further down uses dynamic quantization instead, which
# doesn't require a calibration step.

# For static quantization, you'd typically do:
#
# 1. Fuse modules (only if fc were followed by a ReLU registered as 'relu'):
#    quantized_model.eval()
#    fuse_modules(quantized_model, [['fc', 'relu']], inplace=True)
#
# 2. Prepare for calibration:
#    quantized_model.qconfig = torch.quantization.get_default_qconfig('fbgemm') # or 'qnnpack' for ARM
#    prepare(quantized_model, inplace=True)
#
# 3. Calibrate the model (run with representative data):
#    # Example: for _ in range(num_calibration_batches):
#    #     calibration_data = get_calibration_data()
#    #     quantized_model(calibration_data)
#
# 4. Convert to quantized model:
#    convert(quantized_model, inplace=True)
#
# For dynamic quantization (simpler, good for RNNs/LSTMs/Transformers):
# We'll use a simplified approach here to demonstrate.
# PyTorch's `quantize_dynamic` is a high-level API for this.

# Let's use quantize_dynamic for a clearer INT8 example.
# It converts a copy by default (inplace=False), so we can pass in the FP32
# model from above and leave the original untouched.
model_to_quantize = model

# Quantize only the Linear layer dynamically
quantized_model_dynamic = torch.quantization.quantize_dynamic(
    model_to_quantize,  # The model to quantize
    {nn.Linear},        # A set of layers to dynamically quantize
    dtype=torch.qint8   # The target dtype
)

# Reuse the input x from the FP32 run so the two outputs can be compared

# Run the dynamically quantized model
output_int8 = quantized_model_dynamic(x)

print("INT8 output (dynamic):", output_int8)

The core problem quantization solves is that FP32 arithmetic is expensive: many modern CPUs and GPUs are significantly faster at performing arithmetic operations on 8-bit integers (INT8) than on 32-bit floating-point numbers (FP32). Lower precision helps in three ways:

  1. Reduced Memory Bandwidth: INT8 values occupy 1/4 the memory of FP32 values. This means you can move more data to and from memory per second, which is often a bottleneck.
  2. Specialized Instructions: Modern hardware (CPUs with AVX-512 VNNI, GPUs with INT8-capable Tensor Cores) has dedicated instructions that can perform INT8 matrix multiplications or convolutions much faster than their FP32 counterparts.
  3. Lower Power Consumption: Less data movement and simpler operations generally lead to lower power draw, crucial for mobile and edge devices.
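To put a number on the memory point, here is a back-of-the-envelope sketch in plain Python. The 4096 x 4096 layer shape is an assumed, illustrative size and not taken from the model above; the arithmetic also ignores the few extra bytes needed to store scales and zero-points.

```python
# Rough weight-storage arithmetic for a single linear layer.
# The 4096 x 4096 shape is an illustrative assumption.
in_features, out_features = 4096, 4096
num_weights = in_features * out_features

BYTES_FP32 = 4  # one 32-bit float
BYTES_INT8 = 1  # one 8-bit integer

fp32_mib = num_weights * BYTES_FP32 / 2**20
int8_mib = num_weights * BYTES_INT8 / 2**20

print(f"FP32 weights: {fp32_mib:.0f} MiB")
print(f"INT8 weights: {int8_mib:.0f} MiB")
print(f"Reduction:    {fp32_mib / int8_mib:.0f}x")
```

Every byte that doesn't have to travel from memory to the compute units is bandwidth freed up for the next layer.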

The mental model for quantization involves mapping the continuous range of FP32 values in your model’s weights and activations to a discrete set of INT8 values. This mapping uses a scale factor and a zero-point:

FP32_value = scale * (INT8_value - zero_point)

The scale determines the range of FP32 values that can be represented by the INT8 range (-128 to 127 or 0 to 255), and the zero_point is the INT8 value that corresponds to the FP32 value 0.0.
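The mapping can be sketched in a few lines of plain Python. This is a minimal illustration of the formula above, not PyTorch's implementation; the helper names `choose_qparams`, `quantize`, and `dequantize` and the example range of -1.0 to 3.0 are assumptions made for this sketch.

```python
def choose_qparams(x_min, x_max, q_min=-128, q_max=127):
    """Pick scale and zero_point so [x_min, x_max] maps onto [q_min, q_max]."""
    # Make sure the range contains 0.0 so zero is exactly representable.
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    scale = (x_max - x_min) / (q_max - q_min)
    if scale == 0.0:
        scale = 1.0  # degenerate range: every value is 0.0
    zero_point = round(q_min - x_min / scale)
    zero_point = max(q_min, min(q_max, zero_point))
    return scale, zero_point


def quantize(x, scale, zero_point, q_min=-128, q_max=127):
    """FP32 -> INT8: divide by scale, shift by zero_point, clamp to the INT8 range."""
    q = round(x / scale) + zero_point
    return max(q_min, min(q_max, q))


def dequantize(q, scale, zero_point):
    """INT8 -> FP32: FP32_value = scale * (INT8_value - zero_point)."""
    return scale * (q - zero_point)


scale, zero_point = choose_qparams(-1.0, 3.0)
print("scale:", scale, "zero_point:", zero_point)
for x in (-1.0, 0.0, 0.5, 3.0, 10.0):
    q = quantize(x, scale, zero_point)
    print(f"{x:>5} -> {q:>4} -> {dequantize(q, scale, zero_point):.4f}")
```

Values inside the range come back with an error of at most half a scale step, while a value outside it (10.0 here) is clamped to the largest representable INT8 value.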

Dynamic Quantization: This is the simplest to implement. It quantizes weights offline but quantizes activations on-the-fly during inference. When a layer (like nn.Linear) is encountered, its weights are already INT8. The input activation (FP32) is converted to INT8, the operation is performed using INT8 arithmetic, and the output is converted back to FP32. This is great for models with varying activation ranges (like LSTMs or Transformers) but can have a small overhead due to the on-the-fly activation quantization.
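The flow described above can be imitated in plain Python for a single output neuron. This is a simplified sketch, not what PyTorch does internally: it assumes symmetric per-tensor quantization (zero_point fixed at 0) for both weights and activations, and the names `quantize_symmetric` and `dynamic_linear` as well as the numbers are made up for illustration.

```python
def quantize_symmetric(values, num_bits=8):
    """Symmetric per-tensor quantization: zero_point is fixed at 0."""
    q_max = 2 ** (num_bits - 1) - 1            # 127 for INT8
    max_abs = max(abs(v) for v in values) or 1.0
    scale = max_abs / q_max
    return [round(v / scale) for v in values], scale


# Offline: the weights are quantized once, before inference.
weights = [0.42, -0.17, 0.08, -0.91]
q_weights, w_scale = quantize_symmetric(weights)


def dynamic_linear(activations):
    # At inference time: derive the activation scale from this particular input.
    q_acts, a_scale = quantize_symmetric(activations)
    # The dot product itself only touches integers.
    acc = sum(qa * qw for qa, qw in zip(q_acts, q_weights))
    # Convert the integer accumulator back to a float.
    return acc * a_scale * w_scale


x = [1.5, -0.3, 2.0, 0.7]
fp32_result = sum(a * w for a, w in zip(x, weights))
int8_result = dynamic_linear(x)
print("FP32:", fp32_result)
print("INT8:", int8_result)
```

The extra work in `dynamic_linear` (scanning the input for its range on every call) is the on-the-fly overhead mentioned above.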

Static Quantization: This is generally faster for CNNs. It requires a calibration step.

  1. Fuse Modules: Combine layers that are often used together (e.g., Conv + BatchNorm + ReLU) into a single, fused module. This reduces overhead and allows for more accurate quantization.
  2. Prepare: Insert observers into the model that record the range (min/max) of activations for each layer.
  3. Calibrate: Run the prepared model on a small, representative dataset (calibration data) so the observers can collect those ranges.
  4. Convert: Use the observed ranges to compute the scale and zero_point for activations, quantize the weights, and swap the modules for their quantized versions. Everything between the QuantStub and the DeQuantStub then runs in INT8 arithmetic.
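What an observer does during calibration can be pictured with a toy version in plain Python. This is only a sketch of the idea: `ToyMinMaxObserver` is a hypothetical class written for this example (PyTorch ships its own observer classes), and the calibration batches are made-up numbers.

```python
class ToyMinMaxObserver:
    """Hypothetical stand-in for an activation observer: tracks a running min/max."""

    def __init__(self):
        self.min_val = float("inf")
        self.max_val = float("-inf")

    def observe(self, batch):
        # Called on every calibration batch that flows through the layer.
        self.min_val = min(self.min_val, min(batch))
        self.max_val = max(self.max_val, max(batch))

    def calculate_qparams(self, q_min=0, q_max=255):
        # Turn the observed range into a scale and zero_point.
        lo = min(self.min_val, 0.0)
        hi = max(self.max_val, 0.0)
        scale = (hi - lo) / (q_max - q_min) or 1.0
        zero_point = round(q_min - lo / scale)
        zero_point = max(q_min, min(q_max, zero_point))
        return scale, zero_point


observer = ToyMinMaxObserver()

# Made-up activation values standing in for calibration data.
calibration_batches = [[0.1, 2.3, -0.4], [1.9, -1.2, 0.0], [3.1, 0.6, -0.8]]
for batch in calibration_batches:
    observer.observe(batch)

scale, zero_point = observer.calculate_qparams()
print("observed range:", observer.min_val, observer.max_val)
print("scale:", scale, "zero_point:", zero_point)
```

After calibration the scale and zero_point are frozen, which is why static quantization avoids the per-input range computation that dynamic quantization pays for.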

The most surprising thing about INT8 quantization is how much performance you can gain without significant accuracy loss, especially when using static quantization with a good calibration dataset. The "magic" happens when the model’s operations are entirely performed in INT8, leveraging hardware acceleration. For example, a single _mm256_dpbusd_epi32 intrinsic (the VNNI VPDPBUSD instruction) on a supporting Intel CPU multiplies 32 pairs of 8-bit integers and accumulates the products into eight 32-bit sums in one instruction. That is four times as many multiply-accumulates as a 256-bit FP32 fused multiply-add, which handles only eight values at a time.
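To see what such a dot-product instruction computes, here is a plain-Python model of its arithmetic. It is a sketch of the semantics only: a 256-bit register holds 32 bytes, grouped into eight 32-bit lanes, and each lane adds four unsigned-by-signed 8-bit products to its accumulator. The function names are made up for this example, and the sketch ignores 32-bit overflow, which Python integers don't have.

```python
def dpbusd_lane(acc, a_bytes, b_bytes):
    """One 32-bit lane: acc + sum of four (unsigned 8-bit * signed 8-bit) products."""
    assert len(a_bytes) == len(b_bytes) == 4
    assert all(0 <= a <= 255 for a in a_bytes)      # first operand is unsigned
    assert all(-128 <= b <= 127 for b in b_bytes)   # second operand is signed
    return acc + sum(a * b for a, b in zip(a_bytes, b_bytes))


def dpbusd_256(acc_lanes, a, b):
    """Eight lanes side by side: 32 byte pairs, as in one 256-bit register."""
    assert len(acc_lanes) == 8 and len(a) == len(b) == 32
    return [
        dpbusd_lane(acc_lanes[i], a[4 * i:4 * i + 4], b[4 * i:4 * i + 4])
        for i in range(8)
    ]


a = list(range(32))      # unsigned bytes 0..31
b = [1, -1] * 16         # signed bytes alternating +1 / -1
acc = [0] * 8            # 32-bit accumulators, one per lane

result = dpbusd_256(acc, a, b)
print(result)
```

On real hardware all eight lanes are computed by one instruction, which is where the speedup over a scalar loop like this one comes from.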

The key to successful static quantization is a representative calibration dataset. If your calibration data doesn’t cover the full range of activations your model sees during real-world inference, the calculated scales and zero-points will be suboptimal, leading to accuracy degradation.
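A small plain-Python sketch shows what goes wrong when the calibration range is too narrow. The helper `fake_quantize` and the numbers are assumptions chosen for illustration: an activation of 6.0 shows up at inference time, but calibration only ever saw values between -1.0 and 1.0.

```python
def fake_quantize(x, x_min, x_max, q_min=-128, q_max=127):
    """Quantize x to INT8 and back, using a range [x_min, x_max] from calibration."""
    scale = (x_max - x_min) / (q_max - q_min)
    zero_point = round(q_min - x_min / scale)
    q = max(q_min, min(q_max, round(x / scale) + zero_point))  # clamp to INT8
    return scale * (q - zero_point)


real_activation = 6.0

# Calibration data that missed the large values: 6.0 gets clipped.
narrow = fake_quantize(real_activation, -1.0, 1.0)

# Calibration data that covered them: 6.0 survives with a small rounding error.
wide = fake_quantize(real_activation, -8.0, 8.0)

print("calibrated on [-1, 1]:", narrow)
print("calibrated on [-8, 8]:", wide)
```

The opposite mistake costs accuracy too: a range that is much wider than the real activations wastes most of the 256 available levels, so typical values are rounded more coarsely.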

The next challenge you’ll likely face is optimizing the calibration process for static quantization.
