OpenAI Evals: Test and Benchmark Your LLM Apps (2026)

OpenAI Evals is not just a testing framework; it’s a way to quantify the subjective quality of LLM outputs by defining objective success criteria.

Let’s see it in action. Imagine you’ve built an LLM app that summarizes news articles. You want to ensure your summaries are not only concise but also factually accurate and capture the main points.

First, you’ll need to define some evaluation criteria. For factual accuracy, you might want to check if specific entities mentioned in the original article (like names, dates, or locations) appear in the summary. For conciseness, you could set a word count limit. For capturing main points, you might compare the summary’s semantic similarity to the original article.
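The conciseness criterion is the easiest of the three to make concrete. Here is a minimal sketch, assuming a plain whitespace word count and a limit of 60 words. The function name and the limit are illustrative choices for this example, not part of Evals.

```python
def within_word_limit(summary: str, max_words: int = 60) -> bool:
    """Return True when the summary has at most max_words whitespace-separated words."""
    return len(summary.split()) <= max_words


short_summary = "A fox jumped over a dog on October 26th, 2023."
long_summary = " ".join(["word"] * 61)

print(within_word_limit(short_summary))  # True: 10 words
print(within_word_limit(long_summary))   # False: 61 words
```

A boolean check like this works as a pass/fail gate. If you would rather penalize length gradually, return a score that decays as the summary grows past the limit.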

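Here’s a simplified Python sketch of an evaluation that scores two of these criteria: entity coverage as a proxy for factual accuracy, and ROUGE-L overlap with a reference summary as a rough stand-in for capturing the main points: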

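# Self-contained sketch: Eval, ExactMatch, and RougeL are minimal local
# stand-ins so the example runs on its own. They are not the evals
# package's classes; adapt the names to the version you have installed.

class Eval:
    """Minimal base class: stores the eval name and a model callable."""
    def __init__(self, name, run_model):
        self.name = name
        self.run_model = run_model  # callable: prompt -> completion text

class ExactMatch:
    """Fraction of expected strings found verbatim in the prediction."""
    def process_sample(self, sample):
        expected = sample["expected"]
        if not expected:
            return 1.0
        found = [item for item in expected if item in sample["prediction"]]
        return len(found) / len(expected)

class RougeL:
    """ROUGE-L F1 over lowercased whitespace tokens."""
    def process_sample(self, sample):
        pred = sample["prediction"].lower().split()
        ref = sample["reference"].lower().split()
        # Dynamic-programming table for the longest common subsequence
        lcs = [[0] * (len(ref) + 1) for _ in range(len(pred) + 1)]
        for i, p in enumerate(pred, 1):
            for j, r in enumerate(ref, 1):
                lcs[i][j] = lcs[i - 1][j - 1] + 1 if p == r else max(lcs[i - 1][j], lcs[i][j - 1])
        overlap = lcs[-1][-1]
        if overlap == 0:
            return {"rougeL": 0.0}
        precision, recall = overlap / len(pred), overlap / len(ref)
        return {"rougeL": 2 * precision * recall / (precision + recall)}

class NewsSummaryEval(Eval):
    def __init__(self, run_model, name="news-summary-eval",
                 prompt_template="Summarize this article: {article}"):
        super().__init__(name=name, run_model=run_model)
        self.prompt_template = prompt_template
        self.metrics = {
            "exact_match_entities": ExactMatch(),
            "rouge_l_summary": RougeL(),
        }

    def process_sample(self, sample):
        prompt = self.prompt_template.format(article=sample["article"])
        model_output = self.run_model(prompt)  # replace with your real LLM call

        # Fraction of golden entities that appear verbatim in the summary
        entity_score = self.metrics["exact_match_entities"].process_sample(
            {"prediction": model_output, "expected": sample["golden_entities"]}
        )
        # ROUGE-L score against the golden summary
        rouge_score = self.metrics["rouge_l_summary"].process_sample(
            {"prediction": model_output, "reference": sample["golden_summary"]}
        )
        return {
            "exact_match_entities": entity_score,
            "rouge_l_summary": rouge_score["rougeL"],
        }

# Example usage with dummy data
eval_data = [
    {
        "article": "The quick brown fox jumps over the lazy dog. The event happened on October 26th, 2023, in a field near London.",
        "golden_summary": "A fox jumped over a dog on October 26th, 2023.",
        "golden_entities": ["fox", "dog", "October 26th, 2023", "London"]
    },
    # ... more samples
]

# Canned output standing in for a real model so the example runs offline
def fake_model(prompt):
    return "A swift fox leaped over a slumbering canine on 26/10/2023 in a meadow close to London."

eval_instance = NewsSummaryEval(run_model=fake_model)
sample_results = eval_instance.process_sample(eval_data[0])
# exact_match_entities is 0.5 here: "fox" and "London" match,
# while "dog" and the reformatted date do not
print(sample_results)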

This NewsSummaryEval class defines a custom evaluation that reports two scores, exact_match_entities and rouge_l_summary. process_sample takes a data point, builds a prompt, (hypothetically) calls your LLM, and then scores the output against the provided golden data. The entity score is the fraction of golden entities that appear verbatim in the generated summary. Because this is a plain substring check, a reformatted date such as 26/10/2023 does not count as a match for October 26th, 2023. The ROUGE-L score measures how much the generated summary overlaps with the reference (golden) summary, based on their longest common subsequence.

The core problem Evals addresses is the inherent difficulty in evaluating LLM outputs, which are often free-form text. Traditional software testing relies on deterministic outputs, but LLMs are probabilistic. Evals bridges this gap by allowing you to define what constitutes a good output in a measurable way. You can create custom metrics tailored to your specific application’s needs, whether it’s code generation, creative writing, or factual question answering.

Internally, Evals is built around the concept of Eval objects. These objects encapsulate the logic for a specific evaluation. They define:

Metrics: The quantitative measures of quality (e.g., accuracy, ROUGE, BLEU, custom functions).
Data: The input samples and corresponding ground truth (golden) data.
Prompting: How to format inputs into prompts for the LLM.
Model Interaction: How to send prompts to an LLM and receive its output.
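The data component is often just a file of samples, one JSON object per line (JSONL). Here is a minimal sketch of writing and loading such a file with the standard library. It assumes the field names from the summarization example (article, golden_summary, golden_entities). The file name and the load_samples helper are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def load_samples(path: Path) -> list[dict]:
    """Read one JSON object per non-empty line."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


samples = [
    {
        "article": "The quick brown fox jumps over the lazy dog.",
        "golden_summary": "A fox jumped over a dog.",
        "golden_entities": ["fox", "dog"],
    },
]

# Write the samples to a temporary JSONL file, then read them back
with tempfile.TemporaryDirectory() as tmp:
    data_path = Path(tmp) / "news_summaries.jsonl"
    data_path.write_text(
        "\n".join(json.dumps(s) for s in samples) + "\n", encoding="utf-8"
    )
    loaded = load_samples(data_path)

print(loaded == samples)  # True
```

Keeping inputs and golden data together in each record means a sample can be scored without looking anything up elsewhere.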

You control the evaluation by defining your Eval classes and providing them with appropriate datasets. The command-line tool (oaieval in the open-source repository) then orchestrates the execution: it runs the chosen model against the dataset, records each sample’s result, and aggregates the metric scores. When you invoke it, you specify which model to use and which registered eval, and therefore which dataset, to run.
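Aggregation is usually a plain reduction over per-sample scores. Here is a minimal sketch, assuming each sample produced a dictionary of metric names to floats like the one returned by process_sample. The aggregate_scores helper is hypothetical and is not part of the Evals command-line tool.

```python
from collections import defaultdict
from statistics import mean


def aggregate_scores(per_sample: list[dict[str, float]]) -> dict[str, float]:
    """Average each metric over the samples that reported it."""
    collected = defaultdict(list)
    for result in per_sample:
        for metric, value in result.items():
            collected[metric].append(value)
    return {metric: mean(values) for metric, values in collected.items()}


per_sample_results = [
    {"exact_match_entities": 0.5, "rouge_l_summary": 0.25},
    {"exact_match_entities": 1.0, "rouge_l_summary": 0.75},
]
print(aggregate_scores(per_sample_results))
# {'exact_match_entities': 0.75, 'rouge_l_summary': 0.5}
```

A mean is the simplest summary. For small datasets it helps to report the sample count or a spread measure alongside it, so a change of one or two samples is not mistaken for a real regression.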

The most surprising part is how readily you can repurpose standard NLP metrics like ROUGE or BLEU, or even invent entirely new ones, to evaluate tasks that seem purely qualitative. For instance, you can create a metric that checks if a generated poem adheres to a specific rhyme scheme and meter, or if a customer service response exhibits empathy by looking for specific sentiment-laden keywords, all within the Evals framework. This makes LLM evaluation far more systematic and reproducible than manual human review alone.
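As one illustration of a custom metric for a qualitative trait, the keyword-based empathy check could be sketched as follows. The phrase list and the function name are assumptions made for this example. A real metric would need a vetted vocabulary and probably more than keyword matching.

```python
import re

# Hypothetical phrase list; tune it for your own domain
EMPATHY_PHRASES = ("sorry", "understand", "apologize", "thank you for your patience")


def empathy_score(response: str, phrases=EMPATHY_PHRASES) -> float:
    """Fraction of the phrases that occur in the response as whole words, ignoring case."""
    text = response.lower()
    hits = sum(
        1 for phrase in phrases
        if re.search(r"\b" + re.escape(phrase) + r"\b", text)
    )
    return hits / len(phrases) if phrases else 0.0


reply = "I'm sorry for the delay. I understand how frustrating this is."
print(empathy_score(reply))  # 0.5: "sorry" and "understand" match
```

The word-boundary match keeps "misunderstanding" from counting as "understand", which a plain substring check would get wrong.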

Once you’ve mastered custom evals, you’ll want to explore how to chain multiple evals together to create complex evaluation pipelines.