Ollama’s context window isn’t a hard limit you can just "extend" with a single flag; it’s a fundamental architectural constraint of the model itself, determined by its training data and architecture.

Let’s see what that looks like in practice. Imagine you’re chatting with a model and want to keep a long conversation history.

ollama run llama3
>>> What's the capital of France?
Paris.
>>> And what's its population?
The population of Paris is approximately 2.1 million people as of 2023.
>>> Can you summarize our conversation so far?
You asked for the capital of France, and I told you it's Paris. Then you asked for its population, and I provided the approximate figure of 2.1 million people as of 2023.

So far, so good. The model remembers the previous turns. Now, let’s push it.

>>> Now, tell me about the historical significance of the Eiffel Tower, referencing our previous questions about Paris.

If the conversation gets long enough, the model will start to "forget" earlier parts. This isn’t a bug; it’s how transformers work. They process input tokens, and once the context window is full, older tokens are effectively discarded.

The core problem Ollama solves here is making it easy to manage the context window, not break it. It allows you to load and run models that have different inherent context window sizes. You can’t magically make a llama3:8b model, trained with a 4096 token limit, suddenly handle 32000 tokens. You need to use a model that was designed with a larger context window.

Here’s how you interact with models that have different context window sizes using Ollama:

  1. Check Model Capabilities: Before anything else, know the model’s inherent context limit. You can often find this on the model’s page on Ollama Hub or its original source (e.g., Hugging Face). For instance, llama3:8b defaults to 4096 tokens, while codellama:70b might offer larger context variants.

  2. Download Models with Larger Contexts: If you need more context, you must download a model explicitly trained for it. Ollama simplifies this.

    ollama pull codellama:70b-instruct-fp16 # This variant might have a larger context than a smaller codellama
    

    The ollama pull command fetches the model weights. The token limit is baked into these weights and the model’s configuration.

  3. Running with More Context: When you run a model, Ollama automatically respects its defined context window. There’s no special flag to "extend" it beyond what the model supports. You simply use the model:

    ollama run codellama:70b-instruct-fp16
    

    This model, if trained with, say, an 8192 token context, will now accept up to 8192 tokens in its prompt and conversation history.

  4. Managing Prompt Length: The practical limit you hit is not just the model’s theoretical maximum, but the actual number of tokens you send in a single request. This includes the system prompt, user messages, assistant responses, and any retrieved documents.

    • Tokenization is Key: Remember that tokens are not words. A common word might be one token, but punctuation, sub-word units, and even spaces can consume tokens. Use a tokenizer (like tiktoken for OpenAI models, or transformers’ tokenizer for others) to get an estimate. For Ollama, the ollama CLI itself handles this internally.
    • Prompt Engineering: If you’re consistently hitting limits, you need to be more concise. Summarize previous turns, extract only relevant information, or use techniques like RAG (Retrieval Augmented Generation) to inject only the most pertinent external data.

The most surprising thing about managing context windows is how rarely you actually extend the theoretical limit of a model. Instead, the "extension" comes from choosing a different model that was pre-trained with a larger window, or by becoming incredibly efficient with the tokens you do send. The act of running a model with Ollama is inherently tied to its baked-in context size; there’s no runtime magic to expand it.

The next challenge you’ll face is optimizing the quality of the tokens within that window, often through advanced RAG techniques or fine-tuning.

Want structured learning?

Take the full Ollama course →