Quantization isn’t about making models smaller in the sense of fewer parameters; it’s about reducing the precision of the weights, which dramatically shrinks their memory footprint and speeds up inference.

Let’s see this in action. We’ll use ollama run to load and interact with models of different quantization levels.

First, download a model. We’ll use Llama 3 8B, a good middle-ground model.

ollama pull llama3:8b

Now, let’s get its size. The ollama list command reports the on-disk size of every model you have pulled. For reference, an unquantized (FP16) Llama 3 8B is roughly 16 GB: about 8 billion parameters at 2 bytes each.

# Example: list downloaded models and their sizes
ollama list
# The raw blobs live under ~/.ollama/models/ (the location varies by OS and install method).
# For comparison, an FP16 build of this model is around 16 GB.
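
The size arithmetic is easy to check by hand. The sketch below is a back-of-envelope estimate, not something Ollama prints: the parameter count is a rounded assumption, and the bits-per-weight figures are the nominal block layouts of the llama.cpp formats, scale overhead included.

```python
# Back-of-envelope model size: parameters * bits per weight / 8 bytes.
# PARAMS is a rounded parameter count for Llama 3 8B (an assumption here).
PARAMS = 8.03e9

# Nominal bits per weight, including per-block scale overhead for the
# quantized formats (block layouts as defined by llama.cpp).
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8_0": 8.5,   # 32 int8 weights + one FP16 scale = 34 bytes per 32 weights
    "Q5_K": 5.5,   # 176 bytes per 256-weight super-block
    "Q4_K": 4.5,   # 144 bytes per 256-weight super-block
}

def estimate_size_gb(params: float, bits_per_weight: float) -> float:
    """Return the approximate weight storage in decimal gigabytes."""
    return params * bits_per_weight / 8 / 1e9

for name, bits in BITS_PER_WEIGHT.items():
    print(f"{name}: ~{estimate_size_gb(PARAMS, bits):.1f} GB")
```

Real files come out somewhat larger than the pure 4-bit and 5-bit estimates, because the mixed variants store some tensors in a higher-precision format.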

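Now, let’s look at quantized variants. Ollama publishes pre-quantized builds, so there is nothing to quantize yourself. The default llama3:8b tag is already a 4-bit quantization (ollama show llama3:8b reports exactly which one). Let’s compare three explicit variants: Q4_K_M, Q5_K_M, and Q8_0.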

# Download Q4_K_M (tag names follow the Ollama library listing: model:variant)
ollama pull llama3:8b-instruct-q4_K_M

# Download Q5_K_M
ollama pull llama3:8b-instruct-q5_K_M

# Download Q8_0
ollama pull llama3:8b-instruct-q8_0

We can compare the on-disk sizes directly.

# List the downloaded variants with their sizes
ollama list | grep llama3

# Confirm the quantization level of a specific variant
ollama show llama3:8b-instruct-q4_K_M

You’ll see sizes roughly like these (values are approximate and may change):

q4_K_M: about 4.9 GB
q5_K_M: about 5.7 GB
q8_0:   about 8.5 GB

The most striking difference is in memory usage and speed. Let’s time inference.

# Time inference for Q4_K_M (wall-clock time includes loading the model)
time ollama run llama3:8b-instruct-q4_K_M "Tell me a short story about a robot learning to love."

# Time inference for Q5_K_M
time ollama run llama3:8b-instruct-q5_K_M "Tell me a short story about a robot learning to love."

# Time inference for Q8_0
time ollama run llama3:8b-instruct-q8_0 "Tell me a short story about a robot learning to love."

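You’ll typically find that Q4_K_M generates tokens faster and uses less RAM than Q8_0, with Q5_K_M in between. This is because the model’s weights, which are the bulk of its size, are represented using fewer bits, and token generation is usually limited by how fast those weights can be read from memory. Keep in mind that time also counts model loading and that each story has a different length, so run each command more than once, or add the --verbose flag to ollama run to print the generation rate in tokens per second.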
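
For a measurement that is independent of story length, the local Ollama REST API reports token counts and durations alongside the generated text. The sketch below assumes a server on the default port 11434 and reads the eval_count and eval_duration fields (the duration is in nanoseconds). The helper names are made up for this example, and the model tag in the usage comment is an assumption.

```python
import json
import urllib.request

def tokens_per_second(response: dict) -> float:
    """Generation speed from eval_count and eval_duration (nanoseconds)."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)

def generate(model: str, prompt: str, host: str = "http://localhost:11434") -> dict:
    """Send one non-streaming request to a local Ollama server; return the JSON reply."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        f"{host}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)

# Usage, with the server running and the model already pulled:
#   reply = generate("llama3:8b-instruct-q4_K_M",
#                    "Tell me a short story about a robot learning to love.")
#   print(f"{tokens_per_second(reply):.1f} tokens/s")
```

Comparing tokens per second rather than wall-clock time keeps model loading and differing output lengths out of the comparison.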

The core idea of quantization is to map the original floating-point weights (typically FP16 or FP32) to a lower-bit integer representation. For example, Q4 quantization maps weights to 4-bit integers, Q5 to 5-bit, and Q8 to 8-bit. This drastically reduces the memory required to store the model.
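
To make the mapping concrete, here is a minimal sketch of symmetric per-block quantization in the spirit of Q8_0: a block of 32 weights shares one scale, and each weight becomes a small signed integer. It is an illustration in plain Python, not the llama.cpp implementation, and the random weights are stand-ins for real model values.

```python
import random

BLOCK_SIZE = 32  # Q8_0 groups weights in blocks of 32

def quantize_block(block, bits=8):
    """Symmetric quantization: one float scale per block plus signed integers."""
    qmax = 2 ** (bits - 1) - 1            # 127 for 8 bits
    amax = max(abs(w) for w in block)
    scale = amax / qmax if amax > 0 else 1.0
    quants = [round(w / scale) for w in block]
    return scale, quants

def dequantize_block(scale, quants):
    """Reconstruct approximate float weights from the scale and integers."""
    return [scale * q for q in quants]

random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(BLOCK_SIZE)]  # stand-in weights
scale, quants = quantize_block(weights)
restored = dequantize_block(scale, quants)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(f"scale={scale:.6f}, max abs error={max_error:.6f}")
```

The round-trip error of any weight is at most half the scale, which is why a block with one large outlier loses more precision on its small weights.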

Q4_K_M and Q5_K_M use the "K-quant" family of formats from llama.cpp. This isn’t a simple uniform scaling. Weights are grouped into "super-blocks" of 256, each split into smaller sub-blocks. Every sub-block has its own scale and minimum, and those values are themselves stored in a few bits and rescaled by higher-precision (FP16) factors kept once per super-block. This allows for a better trade-off between compression and accuracy than simple, uniform quantization. Q4_K_M is generally considered a very good balance for most use cases, offering substantial size and speed benefits with minimal perceived quality loss. Q8_0 is a simpler, non-K-quant 8-bit format in which each block of 32 weights shares a single FP16 scale. It compresses less than the 4-bit and 5-bit K-quants but still roughly halves the size of FP16.

The trade-off is accuracy. More aggressive quantization (lower bit count) means more information is lost, which can lead to a degradation in the model’s output quality. However, for many tasks, especially with newer quantization techniques like K-Quants, the difference can be imperceptible or only noticeable in highly nuanced tasks. The "M" in Q4_K_M and Q5_K_M stands for "medium", one of the S/M/L mixes defined in llama.cpp: the medium mix keeps a few sensitive tensors in a higher-precision format than the rest, which is a common and effective setting.
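
The information loss can be made visible with a small experiment. The sketch below quantizes the same synthetic weights at 4, 5, and 8 bits with plain symmetric per-block quantization (not the K-quant formats) and reports the round-trip RMS error. The Gaussian weights are an assumption for illustration, and weight error is only a proxy for output quality.

```python
import math
import random

def rms_error(weights, bits, block_size=32):
    """RMS round-trip error of symmetric per-block quantization at a bit width."""
    qmax = 2 ** (bits - 1) - 1
    squared = 0.0
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        amax = max(abs(w) for w in block)
        scale = amax / qmax if amax > 0 else 1.0
        for w in block:
            squared += (w - scale * round(w / scale)) ** 2
    return math.sqrt(squared / len(weights))

random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]  # synthetic stand-in weights
errors = {bits: rms_error(weights, bits) for bits in (4, 5, 8)}
for bits, err in errors.items():
    print(f"{bits}-bit: RMS error {err:.6f}")
```

Each extra bit roughly halves the step size, so the error shrinks quickly as the bit count grows.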

The most surprising thing about quantization is how effectively modern methods like K-Quants can preserve model performance. It’s not just about brute-force bit reduction; it’s about intelligently choosing which weights to represent with higher precision and how to dequantize them efficiently. This allows models that would otherwise be too large to run on consumer hardware to become accessible, often with minimal subjective degradation in quality.

A K-quant works on blocks of weights. Each sub-block stores a scale and a minimum, and the weights themselves are stored as small integers relative to that range. The scales and minimums are in turn stored as low-bit integers (6 bits each in Q4_K) against FP16 factors shared by the whole super-block. The "K" identifies this family of formats, and the S/M/L suffix selects how many tensors are kept in a higher-precision format. This allows the model to reconstruct the weights with a higher effective precision than a naive 4-bit or 5-bit representation would suggest.
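
The two-level idea can be sketched in a few lines. The code below is a simplified illustration loosely modeled on Q4_K, not the llama.cpp implementation: it uses plain floats where the real format uses FP16, skips bit packing, and uses a naive min/max fit. All function names are made up for this example.

```python
import random

SUPER_BLOCK = 256   # weights per super-block
SUB_BLOCK = 32      # weights per sub-block (8 sub-blocks per super-block)

def quantize_super_block(weights):
    """Two-level 4-bit quantization, loosely modeled on Q4_K (simplified)."""
    subs = [weights[i:i + SUB_BLOCK] for i in range(0, SUPER_BLOCK, SUB_BLOCK)]
    # Level 1: each sub-block gets a float scale and a non-negative offset (-min).
    offsets = [max(0.0, -min(s)) for s in subs]
    scales = [(max(s) + o) / 15 for s, o in zip(subs, offsets)]
    # Level 2: store scales and offsets as 6-bit integers that share two
    # super-block-wide factors (FP16 in the real format, floats here).
    d = max(scales) / 63 or 1.0
    dmin = max(offsets) / 63 or 1.0
    scale_q = [round(s / d) for s in scales]
    offset_q = [round(o / dmin) for o in offsets]
    # Quantize the weights against the reconstructed scale and offset.
    quants = []
    for s, sq, oq in zip(subs, scale_q, offset_q):
        step = d * sq or 1.0
        quants.append([min(15, max(0, round((w + dmin * oq) / step))) for w in s])
    return d, dmin, scale_q, offset_q, quants

def dequantize_super_block(d, dmin, scale_q, offset_q, quants):
    """Reconstruct weights as d * scale_q * q - dmin * offset_q."""
    out = []
    for sq, oq, qs in zip(scale_q, offset_q, quants):
        out.extend(d * sq * q - dmin * oq for q in qs)
    return out

random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(SUPER_BLOCK)]  # stand-in weights
packed = quantize_super_block(weights)
restored = dequantize_super_block(*packed)
worst = max(abs(a - b) for a, b in zip(weights, restored))
print(f"max abs error: {worst:.5f}")
```

In this layout a super-block costs 256 × 4 bits for the weights, 8 × 12 bits for the sub-block scales and offsets, and 2 × 16 bits for the shared factors: 1152 bits, or 4.5 bits per weight.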

The next step is understanding how to fine-tune these quantized models, which presents its own set of challenges and techniques.
