You can fit dramatically more context into your LLM prompts than you probably think, and the trick isn’t just making your prompt shorter.

Let’s see it in action. Imagine you have a large document, say, the entirety of Moby Dick, and you want to ask the LLM questions about it.

# import openai  # Only needed once you actually send a prompt; nothing below calls the API

# Assume this is a very long string containing the text of Moby Dick
with open("moby_dick.txt", "r", encoding="utf-8") as f:
    moby_dick_text = f.read()

# This is a simple prompt that would likely exceed token limits for many models
# prompt = f"Analyze the symbolism of the whale in Moby Dick. Here is the full text:\n\n{moby_dick_text}"

# Instead, we'll use a technique to compress the context.
# For demonstration, let's use a hypothetical compression function.
# In a real scenario, this would involve summarization, chunking, or specialized encoding.

def hypothetical_compress_context(text: str, target_tokens: int) -> str:
    # This is a placeholder. Real compression involves techniques like:
    # 1. Summarization of sections.
    # 2. Extracting key entities and relationships.
    # 3. Using more efficient tokenization schemes.
    # 4. Specialized models trained for compression.

    # For this example, we simply truncate and say so in a note.
    # Truncation is only a stand-in: everything past the cutoff is lost.
    tokens_in_text = len(text.split()) # Very rough token estimate (word count)
    if tokens_in_text > target_tokens:
        # Keep roughly 80% of the budget for context, leaving room for the question
        summary_chunk_size = int(target_tokens * 0.8)
        compressed_text = text[:summary_chunk_size*5] # Rough estimate: ~5 characters per word
        return f"[Truncated excerpt: only the opening of a longer document appears below. Later material is not included.]\n\n{compressed_text}"
    else:
        return text

# Let's say our target model has a context window of 4096 tokens
MAX_CONTEXT_TOKENS = 4096

# We'll approximate token count by word count for simplicity in this example.
# Word count undercounts: a common rule of thumb for English is about 4 tokens per 3 words.
# A real implementation would use `tiktoken` or a model-specific tokenizer.
approx_tokens_moby_dick = len(moby_dick_text.split())
print(f"Approximate tokens in Moby Dick: {approx_tokens_moby_dick}")

# At roughly 200,000 words, Moby Dick is far too large for a 4096-token window.
# Let's use a much smaller stand-in document so the example stays readable.
example_doc = (
    "This is the first part of a very long document. It details the initial voyage of Captain Ahab and his crew aboard the Pequod. They set sail from Nantucket with a singular, obsessive mission: to hunt and kill the white whale, Moby Dick. The narrative introduces key characters like Ishmael, the narrator, Starbuck, the pragmatic first mate, and Stubb, the jovial second mate. The early chapters establish the ominous atmosphere and the growing sense of dread associated with the hunt. The ship's journey takes them through various oceans, encountering other whaling ships and facing the perils of the sea. Ahab's monomania is evident from the outset, driving every decision and influencing the mood of the entire crew. He views Moby Dick not just as an animal, but as the embodiment of all evil and malice in the world. The descriptions of the ocean and the whaling process are detailed and immersive, painting a vivid picture of 19th-century maritime life. The first encounter with Moby Dick is brief but terrifying, leaving the Pequod damaged and Ahab's resolve even stronger. The pursuit continues relentlessly, with the whale always seeming one step ahead, a phantom of the deep. The crew, though some are fearful, is largely bound by a mixture of duty, superstition, and Ahab's charismatic, albeit dangerous, leadership. The philosophical musings on fate, free will, and the nature of existence are woven throughout the narrative, often prompted by the immense and indifferent power of the ocean and its creatures. The document continues to describe the arduous nature of whaling, the dangers involved, and the immense effort required to capture even a single whale. It also delves into the cultural and economic significance of whaling in that era. The psychological toll on the crew, especially under Ahab's increasingly erratic command, becomes a central theme. "
    "The symbolic weight of the whale grows with each mention, representing nature's power, the unknowable, and perhaps even a divine judgment. The narrative builds tension as they approach the whale's known hunting grounds, anticipating the inevitable confrontation. The sheer scale of Moby Dick, its immense power and elusiveness, makes it a formidable antagonist. The story explores themes of obsession, revenge, and the human struggle against overwhelming forces. The philosophical undertones suggest that the pursuit itself is a metaphor for man's existential quest. The detailed descriptions of the sea and its inhabitants contribute to the epic scope of the narrative. The crew's growing unease is palpable, as they realize the futility and danger of their mission. The whale becomes a mythic figure, larger than life, embodying primal forces. The narrative arc is driven by this relentless pursuit, leading towards a climactic encounter. The vastness of the ocean mirrors the immensity of Ahab's obsession. The ethical implications of hunting such a creature are implicitly questioned. The symbolic resonance of the white whale transcends the literal narrative, inviting deeper interpretation. The document concludes with the final, fateful chase. The sheer power and resilience of Moby Dick are emphasized. The human cost of such a pursuit is tragically realized. The ultimate confrontation is depicted with intense drama and visceral detail. The fate of the Pequod and its crew is sealed by the whale's might. The overwhelming force of nature is presented as an unstoppable, indifferent power. The symbolic interpretation of the whale as a force of nature or destiny is solidified. The narrative's exploration of obsession and revenge reaches its tragic conclusion. The ultimate confrontation serves as a powerful allegory for the human condition. The vastness of the ocean serves as a backdrop for this epic struggle. "
    "The philosophical questions raised throughout the text find their resolution in the tragic climax. The document is a comprehensive account of the Pequod's ill-fated voyage and its singular obsession."
)

approx_tokens_example_doc = len(example_doc.split())
print(f"Approximate tokens in example document: {approx_tokens_example_doc}")

# At a few hundred words, this stand-in fits easily in a 4096-token window.
# Treat it as a proxy for a document too large to send whole.

# A naive prompt:
# naive_prompt = f"Based on the following text, describe Captain Ahab's primary motivation and how it affects his crew:\n\n{example_doc}"
# With a truly long document in place of example_doc, this prompt would exceed 4096 tokens.

# Using a conceptual compression approach:
# We can't truly compress in this example without a real summarization model,
# but we can show the effect: send the model a smaller, focused input
# instead of the raw text.

# The key is to *not* dump the entire raw text if it's too large.
# Instead, you provide a *summary* or a *selection* of the text. The LLM can
# still reason over it, as long as it contains what the question needs.

# A more robust approach for large documents involves:
# 1. Chunking: Split the document into smaller, manageable pieces.
# 2. Summarization: Generate a summary for each chunk.
# 3. Contextualization: Include relevant summaries or even full chunks based on the query.
# 4. Retrieval Augmented Generation (RAG): Use a vector database to find the most relevant chunks and feed *only those* to the LLM.

# For this specific example, let's illustrate the *idea* with the simplest
# form of selection: keep only the sentences that mention the query's keywords.

# Imagine we have a function that extracts salient points relevant to a query.
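def extract_salient_points(text: str, query_keywords: list[str]) -> str:
    # This is a simplified placeholder. A real system would use NLP techniques
    # to identify sentences/paragraphs most relevant to the keywords.
    lowered_keywords = [keyword.lower() for keyword in query_keywords]
    relevant_sentences = []
    for sentence in text.split('.'):  # Naive split; it also breaks on abbreviations and decimals
        sentence = sentence.strip()
        if sentence and any(keyword in sentence.lower() for keyword in lowered_keywords):
            relevant_sentences.append(sentence)
    if not relevant_sentences:
        return ""  # Nothing matched; avoid returning a lone "."
    return ". ".join(relevant_sentences) + "."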

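# Matching is by literal substring. The phrase "Ahab's motivation" never appears in the text, so match on "Ahab".
query_keywords = ["Ahab", "crew", "Moby Dick"]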
salient_context = extract_salient_points(example_doc, query_keywords)

# Now, construct the prompt with the extracted salient points.
# This significantly reduces the token count while providing focused information.
engineered_prompt = f"""
Analyze Captain Ahab's primary motivation and its effect on his crew, based on the following context derived from a larger document about the Pequod's voyage:

Context:
{salient_context}

Please provide a detailed analysis.
"""

print("\n--- Engineered Prompt ---")
print(engineered_prompt)
print(f"\nApproximate tokens in engineered prompt (word count): {len(engineered_prompt.split())}")

# The LLM then processes this engineered prompt.
# The key insight is that the LLM doesn't always need the *entire* raw text.
# It needs the *information* relevant to the task.
# By intelligently pre-processing and selecting, you can pack more *meaningful* context.

# In a real RAG system, the process would look like:
# 1. User asks: "What is Ahab's motivation?"
# 2. System embeds the query.
# 3. System searches a vector database of document chunks for similarity.
# 4. System retrieves top N most similar chunks.
# 5. System constructs a prompt like: "Based on the following retrieved passages, answer the question: [retrieved_passage_1] [retrieved_passage_2] ... Question: What is Ahab's motivation?"

# The most surprising thing about fitting more context is that the token limit
# itself is rarely the real constraint. What matters is how much of the window
# carries information relevant to the task. Attention can weigh a structured,
# focused context far more effectively than the same number of tokens of raw text,
# so careful selection gives you more usable context than a raw token count suggests.
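
The five RAG steps listed in the comments above can be sketched end to end with nothing but the standard library. This is a toy under stated assumptions, not a production design: a bag-of-words count stands in for a learned embedding model, a plain Python list stands in for the vector database, and every name here (`split_into_chunks`, `embed`, `retrieve_top_chunks`, `build_rag_prompt`, `sample_text`) is hypothetical and introduced only for illustration.

```python
import math
import re
from collections import Counter


def split_into_chunks(text: str, sentences_per_chunk: int = 3) -> list[str]:
    """Indexing step: group consecutive sentences into small chunks."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [
        " ".join(sentences[i:i + sentences_per_chunk])
        for i in range(0, len(sentences), sentences_per_chunk)
    ]


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts. An assumption standing in for a real model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_chunks(chunks: list[str], question: str, top_n: int = 2) -> list[str]:
    """Steps 2-4: embed the query, score every chunk, keep the top N."""
    query_vector = embed(question)
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(embed(chunk), query_vector),
        reverse=True,
    )
    return ranked[:top_n]


def build_rag_prompt(passages: list[str], question: str) -> str:
    """Step 5: place only the retrieved passages in the prompt."""
    joined = "\n\n".join(
        f"[Passage {i}] {passage}" for i, passage in enumerate(passages, start=1)
    )
    return (
        "Based on the following retrieved passages, answer the question.\n\n"
        f"{joined}\n\nQuestion: {question}"
    )


# Hypothetical sample text, kept tiny so the ranking is easy to follow.
sample_text = (
    "Ahab commands the Pequod. His motivation is revenge on the white whale. "
    "The ocean is vast and cold. Sailors mend the sails every morning. "
    "The cook prepares a stew. Supper is served at dusk."
)
question = "What is Ahab's motivation?"  # Step 1: the user's question

chunks = split_into_chunks(sample_text, sentences_per_chunk=2)
top_chunks = retrieve_top_chunks(chunks, question, top_n=1)
rag_prompt = build_rag_prompt(top_chunks, question)
print(rag_prompt)
```

Word-count similarity only rewards shared vocabulary, so it will miss paraphrases that a learned embedding would catch. The shape of the pipeline is the point: the prompt carries one retrieved chunk plus the question, and the rest of the document never reaches the model.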

The next frontier is not just fitting more, but making sure the LLM can *intelligently retrieve and synthesize* information from vast, unstructured knowledge bases without explicit manual chunking or summarization.
