Ollama’s context_length setting is failing because the prompt you’re sending is longer than the model’s actual maximum context window.
Here’s what’s actually breaking: The Ollama server, when processing a request, checks if your prompt’s token count, plus any conversational history, exceeds the context_length configured for the specific model you’re using. If it does, it rejects the request with a "context length exceeds maximum" error. This isn’t a network issue or a bug in Ollama itself; it’s a fundamental limit of how much information the AI model can process at once.
Common Causes and Fixes:
-
Model’s Native Context Length is Lower Than You Think:
- Diagnosis: Check the model’s documentation or use `ollama show <model_name>`. Look for `context` or `context_length`. For example, `ollama show llama3` might show `context: 8192`.
- Fix: Reduce your prompt and history token count to be less than the reported `context` value. If `ollama show llama3` shows `context: 8192`, your prompt + history must be less than 8192 tokens.
- Why it works: You’re respecting the model’s architectural limitation.
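The same check can be scripted. The sketch below assumes a local Ollama server at its default address (`http://localhost:11434`) and assumes that its `/api/show` endpoint returns a `model_info` object with a key ending in `.context_length` (for example `llama.context_length`). Treat that response shape as an assumption to verify against your Ollama version; the helper names are hypothetical.

```python
import json
import urllib.request


def context_length_from_model_info(model_info):
    """Return the first value whose key ends in '.context_length', or None."""
    for key, value in model_info.items():
        if key.endswith(".context_length"):
            return int(value)
    return None


def fetch_model_info(model, host="http://localhost:11434"):
    """Ask a running Ollama server for model metadata (assumed response shape)."""
    request = urllib.request.Request(
        f"{host}/api/show",
        data=json.dumps({"model": model}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response).get("model_info", {})


def fits_in_context(prompt_tokens, history_tokens, context_length):
    """True when prompt plus history stays under the context length."""
    return prompt_tokens + history_tokens < context_length
```

With a server running, `context_length_from_model_info(fetch_model_info("llama3"))` would give the limit to pass to `fits_in_context`.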
-
Ollama Server’s Default `context_length` is Too High:
- Diagnosis: When you first pulled a model, Ollama might have set a `context_length` in its configuration that’s higher than the model supports. Check your Ollama configuration file, if your installation has one (usually `~/.ollama/config` or `/etc/ollama/config`), and look for a `models` section with a `context` override. Also check the documented places where context size is set: a `PARAMETER num_ctx` line in the model’s Modelfile, the `num_ctx` option in API requests, and the `OLLAMA_CONTEXT_LENGTH` environment variable.
- Fix: Edit the `~/.ollama/config` file. Find the entry for your model and explicitly set `context` to a value less than or equal to the model’s native limit. For example, if the model’s native context is 8192, then 4096 is a safe override:

```json
{
  "models": [
    {
      "model": "llama3",
      "context": 4096
    }
  ]
}
```

  Then restart the Ollama server (`ollama serve` or `systemctl restart ollama`).
- Why it works: You’re forcing Ollama to use a context length that the model can actually handle, overriding any potentially misconfigured global or model-specific setting.
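Ollama’s REST API also accepts a per-request context size through the `num_ctx` field of `options`, which gives a way to cap the value without editing any file. Below is a minimal sketch that builds such a request body and clamps the value to the model’s native limit. The function name and the `native_limit` argument are illustrative; you supply the native limit yourself, for example from `ollama show`.

```python
import json


def build_generate_payload(model, prompt, requested_ctx, native_limit):
    """Build an /api/generate request body with num_ctx capped at native_limit."""
    if requested_ctx <= 0 or native_limit <= 0:
        raise ValueError("context sizes must be positive")
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": min(requested_ctx, native_limit)},
    }


# A request asking for 32768 tokens is capped at the assumed native limit of 8192.
payload = build_generate_payload("llama3", "Hello", requested_ctx=32768, native_limit=8192)
print(json.dumps(payload, indent=2))
```

Sending the payload (for example with `urllib.request` or `requests`) is left out so the sketch runs without a server.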
-
Prompt Engineering with Long Text:
- Diagnosis: You’re feeding a very long document or a large amount of text directly into the prompt without summarization or chunking.
- Fix: Implement a "chunking" strategy. Break your large text into smaller, manageable pieces. Process each piece separately and, if necessary, summarize the results before feeding them into the next stage. For example, a 20,000-token document for an 8192-token model needs at least 3 chunks (20,000 / 8192 ≈ 2.4, rounded up), and possibly more once you reserve room for instructions and the model’s response.
- Why it works: You’re ensuring that no single input to the model exceeds its token limit, even if the total information is vast.
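The chunking arithmetic can be sketched as follows. This is an illustration under stated assumptions, not a definitive implementation: the document is already a list of tokens, `reserved` is a hypothetical budget held back for instructions and the response, and `overlap` optionally repeats a few tokens between neighbouring chunks.

```python
def chunk_tokens(tokens, context_limit, reserved=1024, overlap=0):
    """Split a token list into pieces that each fit in context_limit - reserved."""
    chunk_size = context_limit - reserved
    if chunk_size <= 0:
        raise ValueError("reserved must be smaller than context_limit")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last chunk already reaches the end of the document
    return chunks


# Stand-in for a 20,000-token document and an 8192-token model.
document = list(range(20_000))
chunks = chunk_tokens(document, context_limit=8192, reserved=1024)
print(len(chunks), [len(c) for c in chunks])
```

Each chunk can then be sent as its own request, with the per-chunk outputs summarized in a final pass if needed.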
-
Conversational History Accumulation:
- Diagnosis: Your chat history has grown too long over multiple turns. Each new turn adds to the previous context.
- Fix: Implement a "summarization" or "truncation" strategy for your conversation history. Before sending a new prompt, check the token count of the history. If it’s getting close to the limit, either summarize older messages into a single, shorter message or simply discard the oldest messages.
```python
# Example using tiktoken (install with: pip install tiktoken).
# tiktoken ships OpenAI tokenizers, so counts for Ollama models are only estimates.
# See https://github.com/openai/openai-python/blob/main/chatml.md for how messages are converted to tokens.
import tiktoken


def num_tokens_from_messages(messages, model="gpt-4"):
    """Returns the approximate number of tokens used by a list of messages."""
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        # Unknown model name (e.g. "llama3"): fall back to a general-purpose encoding.
        encoding = tiktoken.get_encoding("cl100k_base")

    if model == "gpt-3.5-turbo-0301":
        tokens_per_message = 4  # per-message overhead for this older model
        tokens_per_name = -1  # if there's a name, the role is removed
    else:
        # Overheads used by newer OpenAI chat models; a rough stand-in for other models.
        tokens_per_message = 3
        tokens_per_name = 1

    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for key, value in message.items():
            num_tokens += len(encoding.encode(value))
            if key == "name":
                num_tokens += tokens_per_name
    num_tokens += 3  # every reply is primed with <|start|>assistant<|message|>
    return num_tokens


# 'conversation_history' is a list of dicts like [{"role": "user", "content": "..."}].
# 'ollama_model_context_limit' is the max tokens for your model (e.g., 8192).
conversation_history = [{"role": "user", "content": "Hello"}]
ollama_model_context_limit = 8192
estimated_new_message_tokens = 200  # placeholder budget for the next message

# Before adding a new message ("llama3" is not known to tiktoken, so the fallback is used):
current_tokens = num_tokens_from_messages(conversation_history, model="llama3")
if current_tokens + estimated_new_message_tokens > ollama_model_context_limit:
    # Implement summarization or truncation here
    pass
```

- Why it works: You’re actively managing the size of the context window by pruning or condensing older parts of the conversation.
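The truncation option from the Fix above can be sketched as below. The function names are hypothetical, and `rough_count` is a deliberately crude stand-in (one token per whitespace-separated word); swap in whatever token counter you trust for your model. System messages are kept and moved to the front, which assumes they normally appear first anyway.

```python
def truncate_history(messages, max_tokens, count_tokens):
    """Drop the oldest non-system messages until the history fits in max_tokens."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]

    def total(msgs):
        return sum(count_tokens(m["content"]) for m in msgs)

    while rest and total(system + rest) > max_tokens:
        rest.pop(0)  # discard the oldest turn first
    return system + rest


def rough_count(text):
    """Very rough estimate: one token per whitespace-separated word."""
    return len(text.split())


history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "first question about context windows"},
    {"role": "assistant", "content": "a fairly long first answer " * 3},
    {"role": "user", "content": "second question"},
]
trimmed = truncate_history(history, max_tokens=25, count_tokens=rough_count)
print([m["role"] for m in trimmed])
```

Dropping whole messages is the simplest policy. Summarizing the discarded turns into one short message preserves more context at the cost of an extra model call.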
-
Incorrect Model Specification:
- Diagnosis: The model name in your API call or `ollama run` command is misspelled or missing its tag, so the request does not reach the model you intended. A name without a tag resolves to the `latest` tag, which may not be the variant (or context size) you expect.
- Fix: Double-check the exact model name. Run `ollama list` to see your available models and their tags. Ensure your command or API request uses the correct name, e.g., `ollama run llama3:8b` instead of `ollama run llama3`.
- Why it works: You’re ensuring the correct, larger-context model is loaded and used for the request.
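A name check like this can run before any request is sent. The sketch below assumes you already have the installed model names (for example, copied from `ollama list`) as a Python list. The function name is hypothetical, and treating any name without a colon as `:latest` is a simplification that ignores registry hosts with ports.

```python
import difflib


def resolve_model_name(requested, available):
    """Return the matching installed name, or raise with close suggestions."""
    # A bare name such as "llama3" is treated as "llama3:latest".
    name = requested if ":" in requested else f"{requested}:latest"
    if name in available:
        return name
    suggestions = difflib.get_close_matches(name, available, n=3, cutoff=0.6)
    raise LookupError(f"model {name!r} not installed; close matches: {suggestions}")


# Example list of installed models (illustrative values).
installed = ["llama3:8b", "llama3:latest", "mistral:7b"]
print(resolve_model_name("llama3", installed))
```

Failing fast with suggestions makes a typo visible immediately instead of surfacing later as a confusing context or lookup error.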
-
Ollama Server Not Restarted After Config Change:
- Diagnosis: You edited `~/.ollama/config` to set a new `context` value, but the Ollama server process is still running with the old configuration.
- Fix: Always restart the Ollama server after modifying its configuration file. On Linux: `sudo systemctl restart ollama`. On macOS: stop and restart Ollama from the menu bar application. On Windows: restart the Ollama service.
- Why it works: The running server process needs to reload its configuration to apply changes.
The next error you’ll hit after resolving this is likely a model not found error if you’ve misspelled a model name, or a tokenization error if your prompt contains characters that cannot be tokenized by the specific model’s tokenizer.