Ollama’s num_ctx parameter doesn’t just change how much text a model can "remember"; it fundamentally alters the transformer’s attention window, impacting both performance and memory.

Let’s see num_ctx in action. Imagine we have a large document and we want to ask questions about it using llama3:8b.

First, we need to run the model with a specific context size. The ollama run command has no command-line flag for this, so we start an interactive session and set the parameter from inside it:

ollama run llama3:8b
>>> /set parameter num_ctx 8192

Now, inside this session, we can paste a large chunk of text. For demonstration, let’s use a placeholder for a lengthy article:

>>> [Pasting a 5000-token article here...]
>>> What is the main argument presented in the article?

Ollama will process this entire 5000-token input, and the model, with its 8192-token context window, can attend to all parts of it to formulate an answer. If we had tried this with the default num_ctx (2048 in older Ollama releases, larger in more recent ones), the prompt would likely have been truncated to fit, and the answer could have missed parts of the article.
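The same setting can be passed per request through Ollama’s REST API, in the options object of a /api/generate call. The sketch below is a minimal illustration, not the only way to do it: it assumes a server on the default local address, and the helper names build_generate_request and ask are hypothetical.

```python
import json
import urllib.request

# Assumption: a local Ollama server on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model: str, prompt: str, num_ctx: int) -> dict:
    """Build the JSON body for a non-streaming /api/generate call."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,                  # return one JSON object instead of a stream
        "options": {"num_ctx": num_ctx},  # context size for this request
    }


def ask(model: str, prompt: str, num_ctx: int) -> str:
    """Send the request to a running Ollama server and return the answer text."""
    body = json.dumps(build_generate_request(model, prompt, num_ctx)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

With a server running, ask("llama3:8b", article_text + "\n\nWhat is the main argument?", 8192) would send the article and the question in a single prompt, where article_text is whatever document you loaded.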

The num_ctx parameter is a runtime option: it sets the context length Ollama allocates when it loads the model. Setting it with /set parameter inside a session, or through the options field of an API request, applies only to that session or request and does not rewrite the model’s stored Modelfile. In an API request, for example, the option appears inside the options object like this:

{
  "num_ctx": 8192,
  // ... other parameters
}

This num_ctx value dictates the maximum number of tokens the model can hold in its context at once. Each token in the input prompt, along with each token of generated output, consumes space within this context window. A larger num_ctx means the model can consider more of the conversation history or a longer input document when generating its next token.
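Because prompt and output share the same window, a quick budget check can be sketched as follows. This is only arithmetic over token counts you supply: the function names are hypothetical, and real counts depend on the model’s tokenizer.

```python
def remaining_output_budget(num_ctx: int, prompt_tokens: int) -> int:
    """Tokens left for generation once the prompt occupies the window."""
    return max(num_ctx - prompt_tokens, 0)


def fits_in_context(num_ctx: int, prompt_tokens: int, max_new_tokens: int) -> bool:
    """True if the prompt plus the desired output length fit inside num_ctx."""
    return prompt_tokens + max_new_tokens <= num_ctx


# The 5000-token article from the example, asking for up to 512 output tokens:
print(fits_in_context(8192, 5000, 512))     # True
print(fits_in_context(2048, 5000, 512))     # False
print(remaining_output_budget(8192, 5000))  # 3192
```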

The primary benefit of increasing num_ctx is improved comprehension of long texts and more coherent, context-aware conversations. For tasks like summarizing lengthy documents, answering questions based on extensive reports, or maintaining multi-turn dialogues without losing track of earlier points, a larger context window is crucial.

However, increasing num_ctx comes with significant trade-offs in memory usage and computational cost. Standard self-attention has quadratic compute cost with respect to sequence length, so processing a prompt twice as long takes roughly four times the attention work. Memory behaves differently: the KV cache grows linearly with num_ctx, so doubling the context roughly doubles the cache rather than quadrupling it. llama3:8b was trained with an 8192-token context, but Ollama’s default num_ctx is smaller than that, so raising it to 8192 or 16384 noticeably increases the VRAM needed on top of the model weights. This can quickly lead to out-of-memory errors, or to layers being offloaded to the CPU, if your hardware cannot support the increased demand.

The num_ctx parameter sets the size of the context buffer that Ollama allocates when it loads the model, most notably the KV cache. It does not resize a positional embedding table: Llama-family models such as llama3:8b use rotary position embeddings, which are computed from each token’s position rather than stored per position. Token order still matters to the transformer, but the hard limit comes from the allocated buffer. If the prompt exceeds num_ctx, the tokens that do not fit are dropped, so the model never sees them.

The exact memory impact of num_ctx depends not only on the value itself but also on the model’s architecture (number of layers and key/value heads) and on the precision used (e.g., FP16, Q8_0). For instance, a 7B parameter model needs roughly 14GB of VRAM for its weights alone in FP16. On top of that, doubling the context length can add anywhere from hundreds of megabytes to a few gigabytes for the KV cache, which stores the key and value tensors of every token currently in the context.
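As a rough illustration, the KV cache size can be estimated from the model’s shape. The sketch below assumes the published Llama 3 8B configuration (32 layers, 8 key/value heads, head dimension 128) and an FP16 cache; it ignores compute buffers and anything else the runtime allocates, so treat the numbers as a lower bound rather than what Ollama will actually report.

```python
def kv_cache_bytes(
    num_ctx: int,
    n_layers: int = 32,        # assumption: Llama 3 8B
    n_kv_heads: int = 8,       # assumption: grouped-query attention with 8 KV heads
    head_dim: int = 128,       # assumption: 4096 hidden size / 32 attention heads
    bytes_per_value: int = 2,  # assumption: FP16 cache
) -> int:
    """Estimate KV cache size: one key and one value vector per layer per token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * num_ctx


for ctx in (2048, 8192, 16384):
    gib = kv_cache_bytes(ctx) / 2**30
    print(f"num_ctx={ctx:>5}: {gib:.2f} GiB")
# num_ctx= 2048: 0.25 GiB
# num_ctx= 8192: 1.00 GiB
# num_ctx=16384: 2.00 GiB
```

The estimate scales linearly with num_ctx. A model without grouped-query attention (more key/value heads) would need proportionally more memory per token.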

Setting num_ctx with /set parameter only affects the current session. If you want to permanently change the default context length for a model, you can create a new model from a Modelfile that sets the parameter. For example, to build a variant of llama3:8b that uses a context of 16384 by default:

  1. Create a Modelfile:
    FROM llama3:8b
    PARAMETER num_ctx 16384
    
  2. Create the new model:
    ollama create llama3:8b-long-context -f ./Modelfile
    
  3. Run the new model:
    ollama run llama3:8b-long-context
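If you create several context-size variants, the steps above can be scripted. The following is a small sketch rather than an official workflow: render_modelfile and create_model are hypothetical helper names, and create_model assumes the ollama binary is on your PATH.

```python
import subprocess
from pathlib import Path


def render_modelfile(base_model: str, num_ctx: int) -> str:
    """Return Modelfile text that derives from base_model with a fixed num_ctx."""
    return f"FROM {base_model}\nPARAMETER num_ctx {num_ctx}\n"


def create_model(new_name: str, base_model: str, num_ctx: int, path: str = "Modelfile") -> None:
    """Write the Modelfile and run `ollama create` (requires Ollama installed)."""
    Path(path).write_text(render_modelfile(base_model, num_ctx))
    subprocess.run(["ollama", "create", new_name, "-f", path], check=True)


print(render_modelfile("llama3:8b", 16384))
```

Calling create_model("llama3:8b-long-context", "llama3:8b", 16384) would reproduce steps 1 and 2 from the list.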
    

Many users overlook that the underlying model has a maximum context length it was trained on or designed for. While Ollama allows you to set num_ctx higher than the model’s original training context, performance and coherence can degrade significantly beyond that point, because the model’s positional encoding does not reliably generalize to positions it never saw during training. Always check the model card or documentation for recommended or tested num_ctx values.
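One way to guard against this is to cap the requested value at the model’s trained context length. The sketch below assumes you already have the model_info dictionary returned by Ollama’s /api/show endpoint, and that it contains an architecture-prefixed key such as llama.context_length. That key name is an assumption based on documented examples and may differ between models; the function names are hypothetical.

```python
from typing import Optional


def trained_context_length(model_info: dict) -> Optional[int]:
    """Find the trained context length in a model_info mapping, if present."""
    for key, value in model_info.items():
        # Assumption: the key is prefixed by the architecture, e.g. "llama.context_length".
        if key.endswith(".context_length"):
            return int(value)
    return None


def clamp_num_ctx(requested: int, model_info: dict) -> int:
    """Cap the requested num_ctx at the trained context length when it is known."""
    trained = trained_context_length(model_info)
    return requested if trained is None else min(requested, trained)


# Hand-written sample in the shape of a model_info object, for illustration only.
sample_info = {"general.architecture": "llama", "llama.context_length": 8192}
print(clamp_num_ctx(16384, sample_info))  # 8192
print(clamp_num_ctx(4096, sample_info))   # 4096
```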

The next challenge after optimizing context length is understanding how Ollama handles prompt templating and instruction following with these larger contexts.
