Ollama is rejecting your request because the prompt you’re sending is longer than the model’s context window.

Here’s what’s actually breaking: when the Ollama server processes a request, it counts the tokens in your prompt plus any conversational history and compares the total with the context length in effect for that model (set through the num_ctx parameter). If the total is larger, the request fails with a "context length exceeds maximum" error. Depending on the Ollama version, an over-long prompt may instead be truncated with a warning in the server log, so check the log as well if the model seems to ignore the start of your input. Either way, this isn’t a network issue or a bug in Ollama itself; it’s a limit on how much text the model can process at once.

Common Causes and Fixes:

  1. Model’s Native Context Length is Lower Than You Think:

    • Diagnosis: Check the model’s documentation or run ollama show <model_name> and look for the context length field. For example, ollama show llama3 might report context length 8192.
    • Fix: Reduce your prompt and history so that their token count, plus the tokens you expect the model to generate, fits within the reported value. If ollama show llama3 reports a context length of 8192, then prompt + history + response must total no more than 8192 tokens.
    • Why it works: You’re respecting the model’s architectural limitation.
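The same lookup can be scripted. The sketch below assumes a local Ollama server at the default address and assumes that the /api/show response carries a model_info mapping with an architecture-prefixed key such as llama.context_length. The helper names are illustrative and not part of any Ollama client library.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumption: default local Ollama address


def native_context_length(model_info):
    """Return the first '<arch>.context_length' value in model_info, or None."""
    for key, value in model_info.items():
        if key.endswith(".context_length"):
            return int(value)
    return None


def fetch_model_info(model):
    """POST to /api/show and return the 'model_info' mapping (empty if absent)."""
    request = urllib.request.Request(
        f"{OLLAMA_URL}/api/show",
        data=json.dumps({"model": model}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response).get("model_info", {})


# Usage (requires a running server):
#   limit = native_context_length(fetch_model_info("llama3"))
```

If the key is missing for your model, fall back to the value printed by ollama show.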
  2. Ollama’s Configured Context Length Doesn’t Match the Model:

    • Diagnosis: The context length Ollama actually uses is the num_ctx parameter, which can differ from the model’s native limit. It can come from the model’s Modelfile, from the options field of an API request, or, in versions that support it, from the OLLAMA_CONTEXT_LENGTH environment variable on the server. Run ollama show <model_name> and compare any num_ctx listed under its parameters with the model’s context length. Note that the default is often well below the native limit (2048 tokens in many releases).
    • Fix: Set num_ctx explicitly to a value less than or equal to the model’s native limit. One way is a Modelfile that derives a variant of your model. For example:
      # Modelfile
      FROM llama3
      # Example: the model's native context is 8192, so 4096 is a safe override
      PARAMETER num_ctx 4096
      
      Then build and run the variant with ollama create llama3-4k -f Modelfile followed by ollama run llama3-4k (llama3-4k is just an example name). If you set OLLAMA_CONTEXT_LENGTH instead, restart the Ollama server (sudo systemctl restart ollama on Linux) so the new value is picked up.
    • Why it works: You’re forcing Ollama to use a context length that the model can actually handle, overriding any misconfigured default or model-specific setting.
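The context length can also be set per request through the num_ctx entry in the options field of the Ollama REST API. The sketch below only builds the request body for POST /api/chat. The function name and the 4096 value are illustrative assumptions, and sending the body is left to whichever HTTP client you already use.

```python
import json


def build_chat_request(model, messages, num_ctx):
    """Return a JSON-serializable body for POST /api/chat with an explicit num_ctx."""
    if num_ctx <= 0:
        raise ValueError("num_ctx must be a positive number of tokens")
    return {
        "model": model,
        "messages": messages,
        "stream": False,
        "options": {"num_ctx": num_ctx},  # context window for this request
    }


body = build_chat_request(
    "llama3",
    [{"role": "user", "content": "Why is the sky blue?"}],
    num_ctx=4096,  # keep this at or below the model's native limit
)
print(json.dumps(body, indent=2))
```

A per-request value is convenient for experiments, while a Modelfile keeps the setting attached to the model itself.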
  3. Prompt Engineering with Long Text:

    • Diagnosis: You’re feeding a very long document or a large amount of text directly into the prompt without summarization or chunking.
    • Fix: Implement a "chunking" strategy. Break down your large text into smaller, manageable pieces. Process each piece separately, and if necessary, summarize the results before feeding them into the next stage. For example, a 20,000-token document for an 8192-token model needs at least 3 chunks (20,000 / 8192 ≈ 2.4, rounded up), and usually more once you reserve room in each request for instructions and the model’s response.
    • Why it works: You’re ensuring that no single input to the model exceeds its token limit, even if the total information is vast.
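A minimal sketch of the chunking arithmetic, assuming the document is already tokenized into a list. The function names and the reserved amounts (500 tokens of instructions, 2048 tokens for the reply) are illustrative assumptions rather than Ollama settings.

```python
def chunk_budget(context_limit, instruction_tokens, reserved_for_reply):
    """Tokens left for document text once instructions and the reply are reserved."""
    budget = context_limit - instruction_tokens - reserved_for_reply
    if budget <= 0:
        raise ValueError("nothing left for document text; lower the reserved amounts")
    return budget


def split_into_chunks(tokens, chunk_size):
    """Split a token list into consecutive pieces of at most chunk_size tokens."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


document_tokens = list(range(20_000))  # stand-in for a real 20,000-token document
budget = chunk_budget(context_limit=8192, instruction_tokens=500, reserved_for_reply=2048)
chunks = split_into_chunks(document_tokens, budget)
print(f"{len(chunks)} chunks of at most {budget} tokens each")
```

Splitting on fixed token counts can cut sentences in half; splitting on paragraph boundaries, or overlapping consecutive chunks slightly, are common refinements.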
  4. Conversational History Accumulation:

    • Diagnosis: Your chat history has grown too long over multiple turns. Each new turn adds to the previous context.
    • Fix: Implement a "summarization" or "truncation" strategy for your conversation history. Before sending a new prompt, check the token count of the history. If it’s getting close to the limit, either summarize older messages into a single, shorter message or simply discard the oldest messages.
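      # Example using tiktoken (install with: pip install tiktoken).
      # tiktoken only ships OpenAI tokenizers, so counts for Ollama models
      # such as llama3 are approximate; leave a safety margin.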
      import tiktoken
      
      def num_tokens_from_messages(messages, model="gpt-4"):
          """Returns the number of tokens used by a list of messages."""
          try:
              encoding = tiktoken.encoding_for_model(model)
          except KeyError:
              encoding = tiktoken.get_encoding("cl100k_base")
          if model in {
              "gpt-3.5-turbo-0613",
              "gpt-3.5-turbo-16k-0613",
              "gpt-4-0314",
              "gpt-4-32k-0314",
              "gpt-4-0613",
              "gpt-4-32k-0613",
              "gpt-4-1106-preview",
              "gpt-4-0125-preview",
              "gpt-4-vision-preview",
          }:
              tokens_per_message = 3
              tokens_per_name = 1
          elif model == "gpt-3.5-turbo-0301":
              tokens_per_message = 4  # every message roughly
              tokens_per_name = -1  # if there's a name, the role is removed
          elif "gpt-3.5-turbo" in model:
              # print("Warning: gpt-3.5-turbo may update over time. Returning num tokens assuming gpt-3.5-turbo-0613.")
              return num_tokens_from_messages(messages, model="gpt-3.5-turbo-0613")
          elif "gpt-4" in model:
              # print("Warning: gpt-4 may update over time. Returning num tokens assuming gpt-4-0613.")
              return num_tokens_from_messages(messages, model="gpt-4-0613")
          else:
              raise NotImplementedError(
                  f"""num_tokens_from_messages() is not implemented for model {model}. See https://github.com/openai/openai-python/blob/main/chatml.md for how messages are converted to tokens."""
              )
          num_tokens = 0
          for message in messages:
              num_tokens += tokens_per_message
              for key, value in message.items():
                  num_tokens += len(encoding.encode(value))
                  if key == "name":
                      num_tokens += tokens_per_name
          num_tokens += 3  # every reply is primed with <|start|>assistant<|message|>
          return num_tokens
      
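      # 'conversation_history' is a list of dicts like [{"role": "user", "content": "..."}]
      conversation_history = [{"role": "user", "content": "Hello!"}]
      new_message = {"role": "user", "content": "Summarize our conversation so far."}
      ollama_model_context_limit = 8192  # max tokens for your model
      
      # Before adding a new message:
      # tiktoken has no llama3 tokenizer (passing model="llama3" raises
      # NotImplementedError), so count with "gpt-4" as an approximation.
      current_tokens = num_tokens_from_messages(conversation_history, model="gpt-4")
      estimated_new_message_tokens = num_tokens_from_messages([new_message], model="gpt-4")
      if current_tokens + estimated_new_message_tokens > ollama_model_context_limit:
          # Implement summarization or truncation here
          pass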
      
    • Why it works: You’re actively managing the size of the context window by pruning or condensing older parts of the conversation.
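The truncation option can be sketched as follows. The helper takes any token-counting function, so a rough word count is used here purely for illustration. The function names are hypothetical, and the sketch assumes that system messages, if any, sit at the start of the history.

```python
def truncate_history(messages, max_tokens, count_tokens):
    """Drop the oldest non-system messages until the history fits max_tokens."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]

    def total(msgs):
        return sum(count_tokens(m["content"]) for m in msgs)

    # Remove from the front (oldest first) while the history is over budget.
    while rest and total(system + rest) > max_tokens:
        rest.pop(0)
    return system + rest


def rough_count(text):
    """Very rough stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


history = [
    {"role": "system", "content": "be brief"},
    {"role": "user", "content": "one two three"},
    {"role": "assistant", "content": "four five"},
    {"role": "user", "content": "six seven eight nine"},
]
trimmed = truncate_history(history, max_tokens=8, count_tokens=rough_count)
print([m["content"] for m in trimmed])
```

Dropping the oldest turns is the simplest policy; summarizing them into one short message preserves more context at the cost of an extra model call.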
  5. Incorrect Model Specification:

    • Diagnosis: Your API call or ollama run command names a different model or tag than you intended, such as a variant configured with a smaller context window. (A name that doesn’t exist at all typically produces a model-not-found error rather than a context error.)
    • Fix: Double-check the exact model name. Run ollama list to see your available models and their tags, then make sure your command or API request uses the exact name and tag shown there, e.g., ollama run llama3:8b.
    • Why it works: You’re ensuring that the model you intended, with the context length you expect, is the one loaded and used for the request.
  6. Ollama Server Not Restarted After Config Change:

    • Diagnosis: You changed a server-level context setting (for example, the OLLAMA_CONTEXT_LENGTH environment variable), but the Ollama server process is still running with the old value.
    • Fix: Always restart the Ollama server after changing its configuration. On Linux with systemd: sudo systemctl daemon-reload followed by sudo systemctl restart ollama. On macOS and Windows: quit Ollama from the menu bar or system tray icon and start it again.
    • Why it works: The running server process needs to reload its configuration to apply changes.

The next error you’ll hit after resolving this is likely a model not found error if you’ve misspelled a model name, or a tokenization error if your prompt contains characters that cannot be tokenized by the specific model’s tokenizer.
