Choosing the right examples for few-shot prompting is more about understanding the underlying mechanics of the LLM than about picking the "best" ones in a human sense.

Let’s see it in action. Imagine we want an LLM to classify customer feedback into "Positive," "Negative," or "Neutral."

Here’s a basic prompt without any examples (zero-shot):

Classify the following customer feedback:

Feedback: "The app crashes constantly. It's unusable."
Classification:

With no examples, the LLM has to guess both the label set and the output format. It might answer with a full sentence, invent its own labels, or handle borderline feedback inconsistently.

Now, let’s add a few examples (few-shot):

Classify the following customer feedback:

Feedback: "I love the new feature! So easy to use."
Classification: Positive

Feedback: "The interface is clunky and confusing."
Classification: Negative

Feedback: "The update was installed successfully."
Classification: Neutral

Feedback: "The app crashes constantly. It's unusable."
Classification:

See how the examples provide context and demonstrate the desired label set and output format? The LLM now has a clearer picture of what "Positive," "Negative," and "Neutral" mean in this specific context.

The Mental Model: How Few-Shot Prompting Works

At its core, few-shot prompting is a form of in-context learning. The LLM doesn’t learn in the traditional sense of updating its weights. Instead, it uses the provided examples to condition its next prediction. Think of it like this: the LLM has a vast latent space of knowledge. The few-shot examples act as a spotlight, guiding the LLM’s attention to the relevant parts of that latent space for the current task.

The structure of the prompt is crucial. It typically follows this pattern:

[Task Description (optional)]
[Example 1 Input] [Example 1 Output]
[Example 2 Input] [Example 2 Output]
...
[Target Input] [Target Output, left blank for the LLM to fill in]

The "Input" and "Output" can be anything: a sentence to classify, a question to answer, a piece of text to summarize, a code snippet to explain, etc. The LLM infers the input-to-output mapping from the examples and applies it to the target input.
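As a rough illustration of that pattern, here is a minimal Python sketch that assembles the feedback prompt from a list of (input, output) pairs. The function name build_few_shot_prompt and its parameters are hypothetical, chosen for this example; no particular LLM library is assumed, and the resulting string would be passed to whatever client you use.

```python
from typing import List, Tuple


def build_few_shot_prompt(
    task: str,
    examples: List[Tuple[str, str]],
    target: str,
    input_label: str = "Feedback",
    output_label: str = "Classification",
) -> str:
    """Assemble a task description, worked examples, and the unanswered target."""
    blocks = [task] if task else []  # the task description is optional
    for text, label in examples:
        blocks.append(f'{input_label}: "{text}"\n{output_label}: {label}')
    # The target uses the same layout but leaves the output slot empty,
    # so the model's next tokens fill it in.
    blocks.append(f'{input_label}: "{target}"\n{output_label}:')
    return "\n\n".join(blocks)


examples = [
    ("I love the new feature! So easy to use.", "Positive"),
    ("The interface is clunky and confusing.", "Negative"),
    ("The update was installed successfully.", "Neutral"),
]

prompt = build_few_shot_prompt(
    "Classify the following customer feedback:",
    examples,
    "The app crashes constantly. It's unusable.",
)
print(prompt)
```

Generating the prompt from data rather than typing it by hand keeps the labels and spacing identical across every example, which is one of the levers discussed next.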

Levers You Control:

  1. Number of Examples: Too few, and the pattern might be too subtle. Too many, and you risk exceeding the LLM’s context window or diluting the signal. For most tasks, 3-5 examples are a good starting point.
  2. Quality of Examples: This is where the "choosing the best examples" part comes in. "Best" means representative and discriminative.
    • Representative: Examples should cover the range of possible inputs and outputs you expect. If you have subtle negative feedback, include a subtle negative example.
    • Discriminative: Examples should clearly highlight the differences between categories. If "Neutral" is hard to distinguish from "Slightly Positive," your "Neutral" example needs to be unambiguously neutral.
  3. Order of Examples: Order is often less critical, but it can subtly influence the model, especially if the examples are very similar. The LLM might weigh later examples slightly more.
  4. Formatting: Consistent formatting (indentation, labels like "Feedback:" and "Classification:") helps the LLM parse the prompt and understand the structure.
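The first two levers, how many examples and how representative they are, can be handled together when you have a larger pool to draw from. The sketch below is one possible approach rather than a standard recipe: it caps the number of examples at k while cycling through the labels so that every category appears before any category repeats. The helper name pick_balanced is hypothetical.

```python
from collections import defaultdict
from itertools import zip_longest


def pick_balanced(examples, k):
    """Pick up to k examples, cycling through labels so each one is represented."""
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))
    # Interleave: the first example of every label, then the second of every label, ...
    rounds = zip_longest(*by_label.values())
    interleaved = [ex for rnd in rounds for ex in rnd if ex is not None]
    return interleaved[:k]


pool = [
    ("I love the new feature! So easy to use.", "Positive"),
    ("The interface is clunky and confusing.", "Negative"),
    ("The app is slow to load and frequently freezes.", "Negative"),
    ("The update was installed successfully.", "Neutral"),
]

picked = pick_balanced(pool, 3)
for text, label in picked:
    print(f"{label}: {text}")
```

With k set to 3 and three labels in the pool, each label gets exactly one slot. Raising k adds second examples only after every label is covered.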

The "Best" Examples Strategy

The real trick to choosing "best" examples isn’t about picking the most eloquent or complex ones. It’s about picking examples that are structurally similar to your target input and that clearly demonstrate the desired transformation or classification.

Consider this: if you think of the LLM as a highly sophisticated pattern-matching engine, you want to give it patterns that are easy to see and replicate. If your task is sentiment analysis, and you have a target input like "The service was okay, but the food took too long," you’d want examples that show:

  • A clear "Positive" (e.g., "Loved the ambiance!")
  • A clear "Negative" (e.g., "Terrible food, wouldn’t recommend.")
  • A clear "Neutral" (e.g., "The restaurant opened at 5 PM.")
  • And crucially, an example that bridges the gap, perhaps a slightly negative or mixed sentiment, to show how the LLM should categorize it.
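One crude way to approximate "structurally similar to your target input" is word overlap. The sketch below ranks a candidate pool by Jaccard similarity (shared words divided by total distinct words) against the target. This is an assumption made for illustration, not a method prescribed here: common words like "the" inflate the score, and embedding-based similarity is a frequent alternative. The mixed-sentiment example and its "Negative" label are hypothetical, as are the function names.

```python
import re


def tokens(text):
    """Lowercase the text and split it into a set of words."""
    return set(re.findall(r"[a-z']+", text.lower()))


def jaccard(a, b):
    """Shared words divided by total distinct words (0.0 if both are empty)."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def rank_by_similarity(pool, target):
    """Sort candidate examples from most to least similar to the target input."""
    return sorted(pool, key=lambda ex: jaccard(ex[0], target), reverse=True)


target = "The service was okay, but the food took too long"

pool = [
    ("Loved the ambiance!", "Positive"),
    ("Terrible food, wouldn't recommend.", "Negative"),
    ("The restaurant opened at 5 PM.", "Neutral"),
    # Hypothetical "bridging" example with mixed sentiment; the label is an assumption.
    ("The food was fine, but the wait was too long.", "Negative"),
]

ranked = rank_by_similarity(pool, target)
for text, label in ranked:
    print(f"{jaccard(text, target):.2f}  {label}: {text}")
```

The bridging example ranks first because it shares the "X was fine, but Y took too long" shape with the target. In practice you would likely combine a ranking like this with a check that every label still appears among the chosen examples.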

For our feedback classification, if we notice that many negative comments are about performance, we might include an example like:

Feedback: "The app is slow to load and frequently freezes."
Classification: Negative

This helps the LLM associate performance issues with negative sentiment.

The LLM doesn’t "understand" your business context like a human does. It understands statistical relationships between tokens. Your examples are the bridge, showing it which statistical relationships map to which desired outputs. You’re essentially showing it a few pages from the "user manual" of how to behave for this specific task.

The most common pitfall is providing examples that are too similar to each other, or that don’t cover the edge cases you expect. For instance, if you only provide overwhelmingly positive or negative examples, the LLM might struggle with nuanced, middle-ground feedback.
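Both pitfalls can be checked mechanically before a prompt is sent. The following sketch uses difflib.SequenceMatcher from the Python standard library to flag example pairs that are nearly identical, and a set difference to report labels with no example at all. The function name audit_examples and the 0.8 threshold are arbitrary choices for this illustration, and the second example in the list is a hypothetical near-duplicate.

```python
from difflib import SequenceMatcher
from itertools import combinations


def audit_examples(examples, expected_labels, threshold=0.8):
    """Return (labels with no example, pairs of near-duplicate example texts)."""
    missing = set(expected_labels) - {label for _, label in examples}
    near_dupes = [
        (a, b)
        for (a, _), (b, _) in combinations(examples, 2)
        # ratio() is 1.0 for identical strings and falls toward 0.0 as they diverge
        if SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold
    ]
    return missing, near_dupes


examples = [
    ("The app is slow to load and frequently freezes.", "Negative"),
    # Hypothetical near-duplicate of the example above.
    ("The app is slow to load and often freezes.", "Negative"),
    ("I love the new feature! So easy to use.", "Positive"),
]

missing, near_dupes = audit_examples(examples, ["Positive", "Negative", "Neutral"])
print("Labels with no example:", missing)
print("Near-duplicate pairs:", near_dupes)
```

Here the audit reports that "Neutral" has no example and that the two performance complaints are close enough to count as one. Character-level similarity only catches surface duplicates, so examples that say the same thing in different words still need a manual look.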

The next step after mastering example selection is understanding how to structure more complex prompts using techniques like Chain-of-Thought, where you ask the LLM to "think step-by-step."
