Promptfoo is a surprisingly effective tool for turning prompt engineering from an art into a science.
Let’s see it in action. Imagine we have a simple prompt for a language model that’s supposed to summarize articles.
prompt.txt:
Summarize the following article in 3 sentences:
{{article}}
And we have a few articles to test it on.
vars.jsonl:
{"article": "The quick brown fox jumps over the lazy dog. This is a classic pangram used for testing typefaces and keyboards, as it contains all the letters of the English alphabet. The sentence is often used in typing tests and demonstrations."}
{"article": "Artificial intelligence is rapidly advancing, with new models and applications emerging daily. Machine learning, a subset of AI, enables systems to learn from data without explicit programming. Ethical considerations and potential societal impacts are crucial areas of ongoing research and discussion."}
Now, we want to evaluate how well our prompt performs on these articles. We’ll use a simple metric: is the summary at least 20 characters long?
First, we need to install promptfoo.
npm install -g promptfoo
Then, we can run our test:
promptfoo eval --file prompt.txt --vars vars.jsonl --include ".*" --filter "(\.summary.length >= 20)" --metrics "(\.summary.length)"
Here’s what’s happening:
eval: This is the command to run an evaluation.
--file prompt.txt: Specifies the prompt file.
--vars vars.jsonl: Provides the variables to substitute into the prompt. jsonl (JSON Lines) is used so each line is a separate test case.
--include ".*": This is a filter for which variables to include. .* means all variables.
--filter "(\.summary.length >= 20)": This is a critical part. It tells promptfoo to only keep results where the generated summary has a length of 20 characters or more. We’re defining a passing condition.
--metrics "(\.summary.length)": This tells promptfoo to calculate the length of the summary for each test case.
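The pass condition in the filter expression can also be written out as a small standalone function. This is a minimal sketch, not promptfoo's own code: summaryLength, wordCount, and passesLengthCheck are hypothetical names, and the threshold of 20 is taken from the filter above. A word counter is included for comparison, because the character count and the word count of the same summary differ widely.

```javascript
// Hypothetical helpers illustrating the length check; not part of promptfoo.

// Number of characters in the summary (what `.length` gives on a string).
function summaryLength(summary) {
  return summary.length;
}

// Number of whitespace-separated words, for comparison.
function wordCount(summary) {
  const trimmed = summary.trim();
  return trimmed === '' ? 0 : trimmed.split(/\s+/).length;
}

// Pass condition from the filter expression: length >= 20.
function passesLengthCheck(summary, minLength = 20) {
  return summaryLength(summary) >= minLength;
}

const sample = 'The quick brown fox jumps over the lazy dog.';
console.log(summaryLength(sample), wordCount(sample), passesLengthCheck(sample));
```

Which count you threshold on matters: a 20-character minimum is passed by almost any sentence, while a 20-word minimum is a much stricter bar.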
Promptfoo will execute this, calling out to an LLM (you’ll need to have an API key configured, typically via environment variables like OPENAI_API_KEY). It will then present a table of results.
The output might look something like this:
Running test cases...
[======================================================================] 100% 2/2
Test cases finished.
Results:
┌─────────┬──────────────────────────────────────────────────────────┬───────────────────────────────────────────┐
│ PASSED │ ARTICLE │ SUMMARY │
├─────────┼──────────────────────────────────────────────────────────┼───────────────────────────────────────────┤
│ ✅ │ The quick brown fox jumps over the lazy dog. This is a...│ The quick brown fox jumps over the lazy...│
│ ✅ │ Artificial intelligence is rapidly advancing, with new...│ Artificial intelligence is rapidly advanc...│
└─────────┴──────────────────────────────────────────────────────────┴───────────────────────────────────────────┘
Metrics:
┌─────────┬───────────────────┐
│ PASSED │ SUMMARY.LENGTH │
├─────────┼───────────────────┤
│ ✅ │ 21 │
│ ✅ │ 27 │
└─────────┴───────────────────┘
Stats:
Total: 2
Passed: 2
Failed: 0
In this simplified example, we’re just checking the length. But promptfoo can do much more: compare outputs, run custom evaluation functions (written in JavaScript or Python), measure latency, and even check for toxicity or bias.
The mental model for promptfoo is that it orchestrates a series of prompt executions against a language model, systematically varying inputs and parameters, and then quantitatively assessing the outputs against predefined criteria. You define your "variables" (inputs, system prompts, model parameters), your "prompt template", and then your "evaluation criteria" (metrics, filters, custom assertions). Promptfoo then acts as the engine that runs these tests, collects the data, and provides a structured report.
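That mental model can be sketched as a toy loop. Everything here is illustrative rather than promptfoo's implementation: renderPrompt, fakeModel, and runEval are hypothetical names, and fakeModel is a stand-in that returns the first sentence of the article so the example runs offline. A real run would call an LLM asynchronously.

```javascript
// A toy version of the evaluation loop: template + variables -> output -> criteria.

// Replace each {{name}} placeholder with the matching variable value.
// Unknown placeholders are left untouched.
function renderPrompt(template, vars) {
  return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (match, name) =>
    name in vars ? String(vars[name]) : match
  );
}

// Hypothetical stand-in for an LLM: "summarizes" by returning the first
// sentence of everything after the instruction line.
function fakeModel(prompt) {
  const article = prompt.split('\n').slice(1).join(' ');
  const firstSentence = article.split('. ')[0];
  return firstSentence.endsWith('.') ? firstSentence : firstSentence + '.';
}

// Run every test case through the model and apply every criterion to the output.
function runEval(template, testCases, model, criteria) {
  return testCases.map((vars) => {
    const prompt = renderPrompt(template, vars);
    const output = model(prompt);
    const checks = criteria.map(({ name, check }) => ({ name, pass: check(output, vars) }));
    return { vars, output, checks, pass: checks.every((c) => c.pass) };
  });
}

const template = 'Summarize the following article in 3 sentences:\n{{article}}';
const testCases = [
  { article: 'The quick brown fox jumps over the lazy dog. This is a classic pangram.' },
  { article: 'Short. Text.' },
];
const criteria = [{ name: 'min-20-chars', check: (output) => output.length >= 20 }];

const report = runEval(template, testCases, fakeModel, criteria);
const passed = report.filter((r) => r.pass).length;
console.log(`Passed ${passed} of ${report.length}`);
```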
The true power of promptfoo lies in its ability to abstract away the repetitive task of manually crafting prompts, feeding them to an LLM, and then manually inspecting the output for each variation. It allows you to define a suite of tests that can be run repeatedly, ensuring that changes to your prompts or the underlying LLM don’t degrade performance.
A common pattern is to use promptfoo to compare different models or different versions of the same model on a benchmark dataset. You can set up a promptfooconfig.yaml configuration file to define multiple models, each with its own parameters, and then run an evaluation that measures, for instance, the "quality" of the output (as judged by another LLM, or a custom JS function) for each model. This allows for data-driven decisions about which model to use for a particular task.
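To make the comparison step concrete, here is a rough sketch of how per-model scores might be rolled up once an evaluation has produced them. The rows, the model names model-a and model-b, the summarizeByModel helper, and the 0.5 pass threshold are all made up for illustration; they are not promptfoo output.

```javascript
// Made-up result rows for illustration only: one row per (model, test case).
const results = [
  { model: 'model-a', score: 1 },
  { model: 'model-a', score: 0.5 },
  { model: 'model-a', score: 0 },
  { model: 'model-b', score: 1 },
  { model: 'model-b', score: 1 },
  { model: 'model-b', score: 0.5 },
];

// Group rows by model and compute the mean score and pass rate for each,
// sorted so the best mean score comes first.
function summarizeByModel(rows, passThreshold = 0.5) {
  const byModel = new Map();
  for (const { model, score } of rows) {
    const entry = byModel.get(model) ?? { count: 0, total: 0, passed: 0 };
    entry.count += 1;
    entry.total += score;
    if (score >= passThreshold) entry.passed += 1;
    byModel.set(model, entry);
  }
  return [...byModel]
    .map(([model, { count, total, passed }]) => ({
      model,
      meanScore: total / count,
      passRate: passed / count,
    }))
    .sort((a, b) => b.meanScore - a.meanScore);
}

const summary = summarizeByModel(results);
console.log(summary);
```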
When you’re defining custom evaluation functions in JavaScript, you can directly access the prompt, the variable values, and the model’s response. This allows for incredibly granular checks. For example, you could write a function that parses the LLM’s output to ensure it adheres to a specific JSON schema, or one that checks if a generated summary accurately reflects the sentiment of the original article using a sentiment analysis library.
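As an illustration of the first example, here is a sketch of a check that the model's output parses as JSON and has the expected fields. The (output, context) signature and the { pass, score, reason } return shape reflect my understanding of promptfoo's JavaScript assertions and should be treated as assumptions to verify against its documentation. REQUIRED_FIELDS is a hypothetical schema, and a full JSON Schema validator would be the sturdier choice for anything beyond flat objects.

```javascript
// Sketch of a custom check: does the output parse as JSON with the expected fields?
// Assumed, not confirmed: promptfoo passes (output, context) and accepts a
// { pass, score, reason } result. REQUIRED_FIELDS is a hypothetical schema.
const REQUIRED_FIELDS = { title: 'string', summary: 'string' };

// `context` (prompt and variable values) is accepted but unused in this check.
function checkJsonShape(output, context = {}) {
  let parsed;
  try {
    parsed = JSON.parse(output);
  } catch (err) {
    return { pass: false, score: 0, reason: `Output is not valid JSON: ${err.message}` };
  }
  if (parsed === null || typeof parsed !== 'object' || Array.isArray(parsed)) {
    return { pass: false, score: 0, reason: 'Output is JSON but not an object' };
  }
  // Collect one message per missing or mistyped field.
  const problems = Object.entries(REQUIRED_FIELDS)
    .filter(([key, type]) => typeof parsed[key] !== type)
    .map(([key, type]) => `"${key}" should be a ${type}`);
  const total = Object.keys(REQUIRED_FIELDS).length;
  return {
    pass: problems.length === 0,
    score: (total - problems.length) / total,
    reason: problems.length === 0 ? 'All required fields present' : problems.join('; '),
  };
}

// Export for use as an external check file when a module system is present.
if (typeof module !== 'undefined') module.exports = checkJsonShape;
```

Returning a reason alongside the pass flag makes failures much easier to diagnose when you are reading a report with dozens of rows.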
Promptfoo’s configuration can be expressed in a YAML file, which offers more structure for complex evaluations. You can define multiple prompts, multiple sets of variables, and multiple models all within a single promptfooconfig.yaml file, allowing for comprehensive test suites.
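To show why a single file scales well, the sketch below mirrors that structure as a plain JavaScript object and expands it into the individual runs it implies. The key names (prompts, providers, tests, vars) follow promptfoo's config format as I understand it, while the file names, the provider IDs provider-a and provider-b, and the expandMatrix helper are placeholders for illustration. Check the real schema before relying on it.

```javascript
// A config-shaped object mirroring the structure described above.
// Key names are assumed from promptfoo's config format; values are placeholders.
const config = {
  prompts: ['file://prompt.txt', 'file://prompt_v2.txt'],
  providers: ['provider-a', 'provider-b'],
  tests: [
    { vars: { article: 'First article text...' } },
    { vars: { article: 'Second article text...' } },
  ],
};

// Every prompt is paired with every provider and every test case.
function expandMatrix({ prompts, providers, tests }) {
  const runs = [];
  for (const prompt of prompts) {
    for (const provider of providers) {
      for (const test of tests) {
        runs.push({ prompt, provider, vars: test.vars });
      }
    }
  }
  return runs;
}

const runs = expandMatrix(config);
console.log(`${runs.length} runs from ${config.prompts.length} prompts, ` +
  `${config.providers.length} providers, ${config.tests.length} test cases`);
```

The run count grows multiplicatively, which is worth keeping in mind when each run is a paid API call.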
The next step after mastering basic evaluation is integrating promptfoo into your CI/CD pipeline.