Self-consistency prompting improves accuracy by having the LLM sample several independent reasoning paths for the same question, then picking the final answer that appears most often.

Imagine you’ve got a math problem: "If a train leaves station A at 2 PM traveling at 60 mph, and station B is 180 miles away, what time does it arrive at station B?"

A naive LLM might just churn out an answer:

2 PM + (180 miles / 60 mph) = 2 PM + 3 hours = 5 PM

But what if it made a mistake? Maybe it misread the speed, or did the division wrong. Self-consistency says, "Let’s ask the LLM to do that calculation multiple times, each time trying to reason slightly differently or just having a fresh start, and see which answer it gives most often."

Here’s what that might look like in practice. We’ll use a hypothetical generate_response function that takes a prompt, a temperature (higher temperature means more randomness and more varied outputs), and the number of completions to generate.

def generate_response(prompt, temperature, n_completions):
    # In a real scenario, this would call an LLM API.
    # For demonstration, we simulate varied outputs; the prompt itself is not used.
    print(f"--- Generating {n_completions} completions with temp={temperature} ---")
    responses = []
    for i in range(n_completions):
        if temperature > 0.5 and i % 4 == 1:  # Simulate an error path (2nd completion)
            responses.append(f"Completion {i+1}: The train arrives at 6 PM. (Error in calculation)")
        elif temperature > 0.5 and i % 4 == 3:  # Simulate a different error path (4th completion)
            responses.append(f"Completion {i+1}: The train arrives at 4:30 PM. (Units mismatch)")
        else:  # Correct reasoning path
            responses.append(f"Completion {i+1}: The train arrives at 5 PM. (180 miles / 60 mph = 3 hours; 2 PM + 3 hours = 5 PM)")
    return responses

prompt = "A train leaves station A at 2 PM traveling at 60 mph. Station B is 180 miles away. What time does it arrive at station B? Show your work."

# Generate multiple responses with a higher temperature to encourage diversity
completions = generate_response(prompt, temperature=0.7, n_completions=5)
for comp in completions:
    print(comp)

# Now, extract the final answers and count them
import re
final_answers = []
for comp in completions:
    match = re.search(r"arrives at (\d{1,2}(:\d{2})? ?(AM|PM)?)", comp)
    if match:
        final_answers.append(match.group(1))

from collections import Counter
answer_counts = Counter(final_answers)
print("\n--- Majority Vote ---")
print(f"Answer counts: {answer_counts}")
most_common_answer = answer_counts.most_common(1)[0][0]
print(f"Most common answer: {most_common_answer}")

Simulated output:

--- Generating 5 completions with temp=0.7 ---
Completion 1: The train arrives at 5 PM. (180 miles / 60 mph = 3 hours; 2 PM + 3 hours = 5 PM)
Completion 2: The train arrives at 6 PM. (Error in calculation)
Completion 3: The train arrives at 5 PM. (180 miles / 60 mph = 3 hours; 2 PM + 3 hours = 5 PM)
Completion 4: The train arrives at 4:30 PM. (Units mismatch)
Completion 5: The train arrives at 5 PM. (180 miles / 60 mph = 3 hours; 2 PM + 3 hours = 5 PM)

--- Majority Vote ---
Answer counts: Counter({'5 PM': 3, '6 PM': 1, '4:30 PM': 1})
Most common answer: 5 PM

This technique works because LLMs, despite their capabilities, are not deterministic calculators. They are probabilistic models, and on complex reasoning tasks they can stumble in many different ways. By sampling several chains of thought (usually with a higher temperature parameter in the LLM call, which increases randomness), you let the model explore different reasoning paths. If a clear majority of those paths converge on the same answer, that is a good, though not infallible, sign that the answer is correct. The "majority vote" acts as a simple consensus mechanism.
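The vote can also report how strongly the samples agree, which gives a rough confidence signal alongside the answer. The sketch below is illustrative only; vote_with_agreement is a hypothetical helper name, not part of any library.

```python
from collections import Counter


def vote_with_agreement(answers):
    """Return (winning answer, share of the votes it received)."""
    if not answers:
        raise ValueError("need at least one answer to vote on")
    counts = Counter(answers)
    # most_common(1) gives the single (answer, count) pair with the highest count
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(answers)


answers = ["5 PM", "6 PM", "5 PM", "4:30 PM", "5 PM"]
winner, agreement = vote_with_agreement(answers)
print(winner, agreement)  # 5 PM 0.6
```

A low agreement share (say, a winner with only two votes out of five) is a reasonable cue to sample more completions or to flag the question for review.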

The core problem self-consistency solves is brittle reasoning. A single query might get a correct answer, but it could also produce a subtly wrong one because of a minor hallucination or calculation error somewhere in its chain of reasoning. Self-consistency mitigates this by requiring an answer to hold up across multiple independent attempts. It’s like asking several experts to solve a problem and trusting the consensus.

The key levers you control are:

  • n_completions: The number of independent reasoning paths to generate. More completions generally lead to better accuracy but increase computational cost and latency.
  • temperature: Controls the randomness of the LLM’s output. A higher temperature (e.g., 0.7 to 1.0) is usually preferred for self-consistency to encourage diverse reasoning paths. Too low a temperature might lead to repetitive outputs, defeating the purpose.
  • Prompting Strategy: How you frame the original query is crucial. It needs to elicit reasoning. Prompts that ask "Show your work," "Explain step-by-step," or "Think through this problem" are good candidates.
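The three levers above can be wired into one small function. This is a minimal sketch under stated assumptions: sample_fn stands in for whatever LLM call you use, and make_fake_sampler, extract_after_marker, and the "Answer:" convention are hypothetical choices made for illustration, not a real API.

```python
from collections import Counter
from itertools import cycle
from typing import Callable, Iterable, Optional


def self_consistency(
    prompt: str,
    sample_fn: Callable[[str, float], str],
    extract_fn: Callable[[str], Optional[str]],
    n_completions: int = 5,
    temperature: float = 0.7,
    reasoning_suffix: str = " Show your work, then end with 'Answer: <answer>'.",
) -> Optional[str]:
    """Sample several completions and return the most frequent extracted answer."""
    answers = []
    for _ in range(n_completions):
        completion = sample_fn(prompt + reasoning_suffix, temperature)
        answer = extract_fn(completion)
        if answer is not None:  # skip completions we could not parse
            answers.append(answer)
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]


def make_fake_sampler(responses: Iterable[str]) -> Callable[[str, float], str]:
    """Hypothetical stand-in for an LLM call: replays canned completions in order."""
    replay = cycle(responses)
    return lambda prompt, temperature: next(replay)


def extract_after_marker(completion: str) -> Optional[str]:
    """Return the text after the last 'Answer:' marker, or None if it is missing."""
    if "Answer:" not in completion:
        return None
    return completion.rsplit("Answer:", 1)[1].strip()


canned = [
    "180 / 60 = 3 hours, so 2 PM + 3 hours. Answer: 5 PM",
    "60 * 3 = 180, but I add 4 hours by mistake. Answer: 6 PM",
    "Distance over speed is 3 hours. Answer: 5 PM",
    "I am not sure how to proceed.",
    "Three hours after 2 PM. Answer: 5 PM",
]
result = self_consistency(
    "A train leaves station A at 2 PM traveling at 60 mph. Station B is 180 miles away. "
    "What time does it arrive at station B?",
    sample_fn=make_fake_sampler(canned),
    extract_fn=extract_after_marker,
)
print(result)  # 5 PM
```

Passing the sampler in as a parameter keeps the voting logic independent of any particular provider, so the same function can be exercised with canned completions before it is pointed at a real model.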

The process of extracting the final answer from each completion is also important. You need a reliable way to parse the desired output (e.g., the final numerical answer, the classification label, the summarized sentence) from the LLM’s verbose reasoning. Regular expressions, keyword extraction, or even another smaller LLM call can be used for this.
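Extraction usually needs a normalization step too, so that "5 PM", "5:00 p.m." and "5pm" count as the same vote. Here is one possible sketch for clock times using a regular expression; extract_time is a hypothetical helper, and treating the last time mentioned as the final answer is an assumption that will not hold for every completion.

```python
import re
from typing import Optional

# Hour, optional minutes, then am/pm in forms such as "PM", "pm", or "p.m."
TIME_PATTERN = re.compile(
    r"\b(\d{1,2})(?::(\d{2}))?\s*([AaPp])\.?[Mm]\.?(?![A-Za-z])"
)


def extract_time(completion: str) -> Optional[str]:
    """Return the last clock time in the text as 'H:MM AM/PM', or None."""
    matches = TIME_PATTERN.findall(completion)
    if not matches:
        return None
    # Assumption: the final answer is the last time the completion mentions.
    hour, minute, meridiem = matches[-1]
    return f"{int(hour)}:{minute or '00'} {meridiem.upper()}M"


print(extract_time("The train arrives at 5 PM."))          # 5:00 PM
print(extract_time("So 2 PM + 3 hours = 5:00 p.m."))       # 5:00 PM
print(extract_time("It gets there at 4:30pm"))             # 4:30 PM
print(extract_time("I cannot determine the arrival."))     # None
```

Asking the model to end with a fixed marker line and parsing only that line is often more reliable than scanning free text, at the cost of a slightly longer prompt.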

Self-consistency doesn’t just average out errors; it leverages the structure of the problem and the model’s inherent (though imperfect) ability to reason. If a problem has a unique, logically sound solution, multiple diverse attempts to find it are statistically more likely to converge on that solution than on any specific incorrect one. The "noise" of random errors tends to cancel out when aggregated, while the "signal" of correct reasoning reinforces itself. This holds when the errors are scattered; a systematic mistake that the model repeats on most attempts will win the vote too.
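A small simulation can make this argument concrete. It is a toy model, not a measurement of any real LLM: it assumes each completion is independently correct with probability p_correct, and that wrong completions are spread evenly over a fixed set of distinct wrong answers. All names and parameter values are illustrative.

```python
import random
from collections import Counter


def plurality_accuracy(p_correct, n_wrong_answers, n_completions, trials=5000, seed=0):
    """Estimate how often the most frequent answer is the correct one."""
    rng = random.Random(seed)  # seeded so repeated runs give the same estimate
    wrong = [f"wrong_{i}" for i in range(n_wrong_answers)]
    hits = 0
    for _ in range(trials):
        answers = [
            "correct" if rng.random() < p_correct else rng.choice(wrong)
            for _ in range(n_completions)
        ]
        # Ties resolve to whichever tied answer was seen first
        winner = Counter(answers).most_common(1)[0][0]
        hits += winner == "correct"
    return hits / trials


# Each single completion is right only 40% of the time,
# but the remaining 60% is split across four different wrong answers.
for n in (1, 5, 15, 25):
    print(n, plurality_accuracy(0.4, 4, n))
```

Under these assumptions the estimated accuracy rises with the number of completions even though a single completion is wrong more often than right, because no individual wrong answer collects as many votes as the correct one.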

The next step is often exploring how to make these reasoning paths even more diverse and less prone to common failure modes, leading to techniques like Tree of Thoughts.
