DSPy Prompt Optimization: Automate Prompt Improvement (2026)

DSPy’s magic is that it treats prompts not as static strings but as tunable parameters of a program, which it can compile and automatically optimize.

Imagine you have a simple prompt to extract the name and email from a block of text.

import dspy

# Set up the language model
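llm = dspy.LM("openai/gpt-3.5-turbo")  # "provider/model" string; reads OPENAI_API_KEY from the environment
dspy.settings.configure(lm=llm)

# Define a signature: the docstring becomes the task instructions, and every
# field must be declared as an input or an output
class ExtractInfo(dspy.Signature):
    """Extract the name and email address from the given text."""
    text: str = dspy.InputField()
    name: str = dspy.OutputField()
    email: str = dspy.OutputField()

# Create a program: dspy.Predict turns the signature into a single LM call
extract_program = dspy.Predict(ExtractInfo)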

# Example usage
text_to_process = "Please contact John Doe at john.doe@example.com for more information."
prediction = extract_program(text=text_to_process)

print(f"Name: {prediction.name}")
print(f"Email: {prediction.email}")

This looks like a regular function call, but extract_program is more than just a template. DSPy’s optimizers can take this extract_program and automatically tune the prompt it sends (its instructions and few-shot examples) so that it scores better on a metric you choose.

The core problem DSPy solves is the manual, iterative, and often guesswork-driven process of prompt engineering. Traditionally, you’d write a prompt, test it, see where it fails, tweak the prompt, and repeat – a slow, inefficient loop. DSPy automates this tuning.

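Internally, a DSPy program is built from "modules" that you compose. Each module wraps one or more LM calls: dspy.Predict makes a single call, dspy.ChainOfThought asks the LM for its reasoning before the outputs, and your own dspy.Module subclasses can chain several of these into a pipeline. When you call extract_program, DSPy isn’t sending a hand-written string. It formats the signature’s instructions and fields, plus any few-shot demonstrations, into a prompt, sends it to the LM, and parses the reply back into the name and email fields. Our ExtractInfo example is a single Predict module, so each call typically makes one LM request.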
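To make the module idea concrete, here is a toy, pure-Python stand-in for a single-call predictor. It is a sketch under stated assumptions, not DSPy’s implementation: ToyPredict and fake_lm are hypothetical names, the "field: value" prompt layout is invented for illustration (real DSPy adapters use their own format), and the LM is a stub that returns a canned reply instead of calling an API.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ToyPredict:
    """Hypothetical stand-in for a single-call predictor (not DSPy's API)."""
    instructions: str
    input_fields: list[str]
    output_fields: list[str]
    demos: list[dict[str, str]] = field(default_factory=list)  # tunable few-shot examples

    def render(self, **inputs: str) -> str:
        """Build prompt text from the instructions, the demos, and the inputs."""
        lines = [self.instructions, ""]
        for demo in self.demos:
            lines += [f"{name}: {demo[name]}" for name in self.input_fields + self.output_fields]
            lines.append("")
        lines += [f"{name}: {inputs[name]}" for name in self.input_fields]
        return "\n".join(lines)

    def __call__(self, lm: Callable[[str], str], **inputs: str) -> dict[str, str]:
        """Send the rendered prompt to `lm` and parse 'field: value' lines from its reply."""
        reply = lm(self.render(**inputs))
        parsed = {}
        for line in reply.splitlines():
            key, sep, value = line.partition(":")
            if sep and key.strip() in self.output_fields:
                parsed[key.strip()] = value.strip()
        return parsed


def fake_lm(prompt: str) -> str:
    """Hypothetical model stub: returns a canned reply instead of calling an API."""
    return "name: John Doe\nemail: john.doe@example.com"


extract = ToyPredict(
    instructions="Extract the name and email address from the given text.",
    input_fields=["text"],
    output_fields=["name", "email"],
)
print(extract(fake_lm, text="Please contact John Doe at john.doe@example.com."))
# {'name': 'John Doe', 'email': 'john.doe@example.com'}
```

Because the instructions and demos are plain data here, an optimizer can change them and re-render the prompt without anyone editing a string by hand, which is the property DSPy builds on.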

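Here’s how the optimization works: DSPy provides optimizers (originally called teleprompters; running one is still called "compiling") that take three things: your program, a set of training examples you provide, and a metric you define for success (e.g., whether the extracted name and email are exactly right). The training examples usually need only inputs and final labels; demonstrations for intermediate steps are generated by running the program itself. The optimizer runs the program on those examples, scores each output with the metric, and searches for prompt parameters that maximize the score. Depending on the optimizer, those parameters are the few-shot demonstrations in each prompt (BootstrapFewShot, BootstrapFewShotWithRandomSearch), the instruction text itself (COPRO), both at once (MIPROv2, which uses Bayesian optimization to choose among candidate instructions and demo sets), or even the LM’s weights (BootstrapFinetune).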
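The loop described above can be sketched in plain Python. This is an illustration under loud assumptions, not DSPy’s code: the "LM" is a deterministic fake (fake_lm, hypothetical) that only returns the full name when the prompt asks for it or shows a worked example, the search is exhaustive over a tiny grid (instruction variant × number of demos) instead of Bayesian optimization, and build_prompt and score are hypothetical helpers. The data reuses this article’s examples.

```python
import itertools

# Toy data from this article: demos come from TRAINSET, candidates are scored on DEVSET.
TRAINSET = [
    {"text": "Contact Jane Smith at jane.s@company.net.", "name": "Jane Smith", "email": "jane.s@company.net"},
    {"text": "Reach out to Bob Johnson via bob.j@service.org.", "name": "Bob Johnson", "email": "bob.j@service.org"},
]
DEVSET = [
    {"text": "Please contact John Doe at john.doe@example.com for more information.",
     "name": "John Doe", "email": "john.doe@example.com"},
]
INSTRUCTIONS = [
    "Extract the name and email address from the given text.",
    "Extract the person's full name and email address from the given text.",
]


def build_prompt(instruction: str, demos: list[dict], text: str) -> str:
    """Instruction, then optional worked examples, then the new input."""
    parts = [instruction]
    for d in demos:
        parts.append(f"Example:\ntext: {d['text']}\nname: {d['name']}\nemail: {d['email']}")
    parts.append(f"text: {text}")
    return "\n\n".join(parts)


def fake_lm(prompt: str) -> dict[str, str]:
    """Hypothetical deterministic 'model': returns only the first name unless the
    prompt asks for a full name or shows a worked example."""
    words = prompt.rsplit("text: ", 1)[1].split()  # the final input text
    capitalized = [w for w in words[1:] if w[:1].isupper()]
    email = next(w for w in words if "@" in w).rstrip(".")
    full = "full name" in prompt or "Example:" in prompt
    return {"name": " ".join(capitalized) if full else capitalized[0], "email": email}


def exact_match(example: dict, pred: dict) -> bool:
    return pred["name"] == example["name"] and pred["email"] == example["email"]


def score(instruction: str, demos: list[dict], dataset: list[dict]) -> float:
    """Average metric over the dataset for one prompt configuration."""
    hits = sum(exact_match(ex, fake_lm(build_prompt(instruction, demos, ex["text"]))) for ex in dataset)
    return hits / len(dataset)


# Search space: every instruction x the first k training examples as demos (k = 0..2).
candidates = [(ins, TRAINSET[:k]) for ins, k in itertools.product(INSTRUCTIONS, range(len(TRAINSET) + 1))]
# Highest score wins; ties go to fewer demos (shorter prompts).
best_instruction, best_demos = max(candidates, key=lambda c: (score(c[0], c[1], DEVSET), -len(c[1])))
print(best_instruction)  # Extract the person's full name and email address from the given text.
print(len(best_demos))   # 0
```

Real optimizers differ mainly in how they propose candidates (bootstrapped demos, LM-written instructions) and how they search once the candidate space is too large to enumerate.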

Let’s say you want to improve the ExtractInfo program. You’d define a metric:

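def exact_match_metric(example, prediction, trace=None):
    # DSPy calls metrics as metric(example, prediction, trace); trace is None
    # during evaluation and is set while an optimizer is bootstrapping
    return (prediction.name == example.name and
            prediction.email == example.email)

# Training examples: with_inputs marks which fields are inputs; the rest are labels
trainset = [
    dspy.Example(text="Contact Jane Smith at jane.s@company.net.", name="Jane Smith", email="jane.s@company.net").with_inputs("text"),
    dspy.Example(text="Reach out to Bob Johnson via bob.j@service.org.", name="Bob Johnson", email="bob.j@service.org").with_inputs("text"),
    # Add more diverse examples
]

# Compile the program with an optimizer; BootstrapFewShot looks for
# few-shot demonstrations whose outputs pass the metric
optimizer = dspy.BootstrapFewShot(
    metric=exact_match_metric,
    max_bootstrapped_demos=2,  # max demos generated by running the program
    max_labeled_demos=2,       # max demos taken directly from the trainset
)
compiled_extract_program = optimizer.compile(extract_program, trainset=trainset)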
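The metric above returns a strict bool. The DSPy metrics documentation also describes returning a float for evaluation and using the optional trace argument (set while an optimizer bootstraps demonstrations, None otherwise) to switch to strict pass/fail. The sketch below shows that pattern; partial_credit_metric is a hypothetical name, and SimpleNamespace objects stand in for dspy.Example and the prediction so the snippet runs without DSPy or an API key.

```python
from types import SimpleNamespace


def partial_credit_metric(example, prediction, trace=None):
    """Hypothetical variant of exact_match_metric with partial credit.

    trace is None during evaluation: return a score in [0, 1].
    trace is set while bootstrapping: return strict pass/fail so that
    only fully correct runs become demonstrations.
    """
    name_ok = prediction.name.strip() == example.name.strip()
    # emails are compared case-insensitively
    email_ok = prediction.email.strip().lower() == example.email.strip().lower()
    if trace is None:
        return (name_ok + email_ok) / 2.0
    return name_ok and email_ok


# SimpleNamespace objects stand in for dspy.Example and the prediction
gold = SimpleNamespace(name="Jane Smith", email="jane.s@company.net")
half_right = SimpleNamespace(name="Jane", email="Jane.S@company.net")
print(partial_credit_metric(gold, half_right))            # 0.5
print(partial_credit_metric(gold, half_right, trace=[]))  # False
```

With the float branch, evaluation reports average partial credit instead of all-or-nothing accuracy, while bootstrapping still keeps only runs that got both fields right.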

When you call compiled_extract_program, the optimization has already happened inside compile(). The returned program carries the few-shot demonstrations BootstrapFewShot selected: runs of the program on training examples that passed your metric, topped up with raw labeled examples from the trainset. The prompt that actually gets sent can therefore differ significantly from your one-line signature docstring. Here it gains worked examples, and with an instruction optimizer such as MIPROv2 the instructions themselves may be rewritten as well.

The surprising power of DSPy lies in its ability to abstract away the prompt itself. You declare what you want the LLM to do with a Signature; DSPy’s adapters turn that signature into prompt text, and its optimizers work out which instructions and examples make the LLM do it well, as judged by your metric. This means you can focus on the task’s logic, data, and success criteria rather than the linguistic nuances of prompt construction.

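A common misconception is that DSPy just adds few-shot examples. Some optimizers do only that, but others rewrite each predictor’s instructions (COPRO, MIPROv2) or fine-tune the LM’s weights (BootstrapFinetune). What optimizers do not change is your program’s structure: which modules exist and how they connect is code you write, and the optimizer tunes the parameters of every predictor inside it. Because the metric scores the program’s final output, that tuning covers the entire execution trace rather than a single string: in a multi-step pipeline, demonstrations for each step are drawn from whole runs that succeeded end to end.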
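BootstrapFewShot illustrates this trace-level idea: it runs the program on training inputs, keeps only runs whose final prediction passes the metric, and records each predictor’s inputs and outputs from those runs as demonstrations for that predictor. The sketch below mimics that for a hypothetical two-step pipeline in plain Python. It is not DSPy’s code: find_name and find_email are simple functions standing in for LM calls, run_pipeline and bootstrap_demos are hypothetical helpers, and the Ann Lee training example is invented so that one run fails.

```python
# Toy training data: the Jane and Bob examples come from this article; the Ann Lee
# example is hypothetical and chosen so that the name step fails.
TRAINSET = [
    {"text": "Contact Jane Smith at jane.s@company.net.", "name": "Jane Smith", "email": "jane.s@company.net"},
    {"text": "Email Ann Lee at ann@lee.io or call Support.", "name": "Ann Lee", "email": "ann@lee.io"},
    {"text": "Reach out to Bob Johnson via bob.j@service.org.", "name": "Bob Johnson", "email": "bob.j@service.org"},
]


def find_name(text: str) -> str:
    """Step 1 (stands in for an LM call): join capitalized words after the first."""
    return " ".join(w for w in text.split()[1:] if w[:1].isupper())


def find_email(text: str) -> str:
    """Step 2 (stands in for an LM call): first token containing '@'."""
    return next((w.rstrip(".") for w in text.split() if "@" in w), "")


def run_pipeline(text: str, trace: list) -> dict[str, str]:
    """Run both steps, recording (step, inputs, outputs) for each one in `trace`."""
    name = find_name(text)
    trace.append(("find_name", {"text": text}, {"name": name}))
    email = find_email(text)
    trace.append(("find_email", {"text": text}, {"email": email}))
    return {"name": name, "email": email}


def exact_match(example: dict, pred: dict) -> bool:
    return pred["name"] == example["name"] and pred["email"] == example["email"]


def bootstrap_demos(trainset: list[dict], metric, max_demos: int = 2) -> dict[str, list[dict]]:
    """Collect per-step demos only from runs whose final output passes the metric."""
    demos: dict[str, list[dict]] = {"find_name": [], "find_email": []}
    for example in trainset:
        trace: list = []
        prediction = run_pipeline(example["text"], trace)
        if not metric(example, prediction):
            continue  # drop every step of a failed run, even steps that were right
        for step, inputs, outputs in trace:
            if len(demos[step]) < max_demos:
                demos[step].append({**inputs, **outputs})
    return demos


demos = bootstrap_demos(TRAINSET, exact_match)
print([d["name"] for d in demos["find_name"]])    # ['Jane Smith', 'Bob Johnson']
print([d["email"] for d in demos["find_email"]])  # ['jane.s@company.net', 'bob.j@service.org']
```

Note that the Ann Lee run’s email step was correct, yet its demo is still dropped: credit is assigned by the end-to-end metric, which is what lets a single metric tune every step of a pipeline.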

The next step after optimizing a single program is often understanding how to compose multiple DSPy programs into more complex workflows.