The temperature parameter in LLM sampling doesn’t just make output "more creative"; it reshapes the probability distribution over the next token, which changes how likely the model is to sample less common, and occasionally more interesting, words.
Let’s see this in action. Imagine we have a simple prompt: "The best way to learn a new skill is to…"
With temperature=0.1 (very low, near-deterministic):
```json
{
  "prompt": "The best way to learn a new skill is to...",
  "temperature": 0.1,
  "max_tokens": 50,
  "response": "The best way to learn a new skill is to practice it consistently. Repetition helps solidify the knowledge and muscle memory, making it easier to recall and apply the skill in different situations. This deliberate practice, combined with seeking feedback, is crucial for mastery."
}
```
Now, with temperature=0.8 (moderate, balanced):
```json
{
  "prompt": "The best way to learn a new skill is to...",
  "temperature": 0.8,
  "max_tokens": 50,
  "response": "The best way to learn a new skill is to dive in headfirst and embrace the learning curve. Start with the fundamentals, but don't be afraid to experiment and make mistakes. Seek out mentors, join communities, and actively apply what you learn in real-world scenarios. The journey is as important as the destination."
}
```
And finally, with temperature=1.2 (high, creative/random):
```json
{
  "prompt": "The best way to learn a new skill is to...",
  "temperature": 1.2,
  "max_tokens": 50,
  "response": "The best way to learn a new skill is to weave it into the fabric of your dreams, a whispered secret between consciousness and the universe. Imagine yourself already proficient, then let that vision guide your hands and mind. Seek out the obscure texts, the forgotten masters, and the unconventional paths; for in the margins of the known, true mastery often resides."
}
```
At its core, temperature controls how the LLM selects the next token from its vocabulary. When the model predicts the next token, it doesn’t have to pick the single most likely one. Instead, it produces a probability distribution over all possible next tokens and samples from it. The temperature parameter is applied to this distribution before sampling.
Mathematically, the temperature is used to re-scale the logits (the raw, unnormalized scores) of each token before they are converted into probabilities using the softmax function. The formula for the softmax with temperature is:
$P(w_i | \text{context}) = \frac{\exp(\text{logit}(w_i) / T)}{\sum_{j} \exp(\text{logit}(w_j) / T)}$
where $T$ is the temperature.
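The formula above can be sketched in a few lines of plain Python. This is a minimal illustration, not any particular library's implementation; the function name `softmax_with_temperature` and the example logits are assumptions made up for this sketch.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw logits into probabilities after dividing each logit by T."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Subtracting the max keeps exp() from overflowing; the result is unchanged.
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three candidate tokens (illustrative values only).
example_logits = [2.0, 1.0, 0.0]

for t in (0.1, 0.8, 1.2):
    probs = softmax_with_temperature(example_logits, t)
    print(f"T={t}: {[round(p, 4) for p in probs]}")
```

Running it shows the top token's share of the probability mass shrinking as $T$ grows, while the ranking of the tokens stays the same.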
- Low Temperature (e.g., 0.1-0.4): When $T$ is well below 1, dividing the logits by $T$ magnifies the differences between them. The softmax then assigns a very high probability to the most likely token and very low probabilities to the rest. The model becomes nearly deterministic and focused, sticking closely to the most probable continuations. This suits tasks where precision matters, such as factual question answering, summarization, or code generation.
- Moderate Temperature (e.g., 0.5-0.9): As $T$ rises toward 1, the logit differences are magnified less, so the distribution is flatter than at low temperature (though still sharper than the model’s unscaled distribution at $T = 1$). Less likely tokens have a higher chance of being sampled. This leads to more varied and "creative" outputs that still generally stay on topic, which is good for creative writing, brainstorming, or generating diverse responses.
- High Temperature (e.g., 1.0+): At $T = 1$ the logits are unchanged and the model samples from its unscaled distribution. With $T > 1$, the logits are divided by a number greater than 1, shrinking the differences between them. The distribution becomes flatter than the unscaled one and approaches a uniform distribution as $T$ grows large, so even unlikely tokens have a meaningful chance of being selected. The output can become unpredictable or incoherent. This can be useful for exploring novelty, but values well above 1 are generally not suitable for coherent text generation.
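As a rough worked example of this scaling, consider two tokens $w_a$ and $w_b$ whose logits differ by 1 (a hypothetical gap chosen purely for illustration). In the softmax formula the normalizing sum cancels when you take the ratio of their probabilities:

$\frac{P(w_a | \text{context})}{P(w_b | \text{context})} = \exp\left(\frac{\text{logit}(w_a) - \text{logit}(w_b)}{T}\right)$

Plugging in the three temperatures used in the examples earlier:

$T = 0.1: \exp(10) \approx 22026, \qquad T = 0.8: \exp(1.25) \approx 3.49, \qquad T = 1.2: \exp(0.833) \approx 2.30$

So the same one-point logit gap makes $w_a$ roughly 22,000 times more likely than $w_b$ at $T = 0.1$, but only a little over twice as likely at $T = 1.2$.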
A common misconception is that temperature directly "injects creativity." Instead, it manipulates the sampling process. It doesn’t invent new concepts; it increases the model’s willingness to pick words that are statistically less probable according to its training data, leading to outputs that appear more creative because they deviate from the most common linguistic patterns.
Consider the scenario where the model has just generated "The best way to learn a new skill is to". The next most probable tokens might be "practice" (high probability), "study" (medium probability), "read" (medium-low probability), and "experiment" (low probability). With a low temperature, "practice" is almost guaranteed. With a moderate temperature, "study" or "read" become more likely. With a high temperature, "experiment" or even something much less common like "ponder" or "dream" might get sampled.
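The scenario above can be simulated with a small sketch. The logit values assigned to "practice", "study", "read", and "experiment" are invented for illustration (a real model's logits are not given here), and `sample_counts` is a hypothetical helper name.

```python
import math
import random
from collections import Counter

# Assumed logits for the four candidate tokens; illustrative values only.
CANDIDATES = {"practice": 3.0, "study": 2.0, "read": 1.5, "experiment": 0.5}


def token_probabilities(logits_by_token, temperature):
    """Softmax with temperature over a {token: logit} mapping."""
    scaled = {tok: z / temperature for tok, z in logits_by_token.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def sample_counts(logits_by_token, temperature, n=10_000, seed=0):
    """Draw n next-token samples and count how often each token appears."""
    probs = token_probabilities(logits_by_token, temperature)
    rng = random.Random(seed)  # fixed seed so repeated runs match
    tokens = list(probs)
    draws = rng.choices(tokens, weights=[probs[t] for t in tokens], k=n)
    return Counter(draws)


for t in (0.1, 0.8, 1.2):
    print(f"T={t}: {dict(sample_counts(CANDIDATES, t))}")
```

With these assumed logits, "practice" dominates almost completely at the lowest temperature, while "experiment" starts to appear in a noticeable share of draws at the highest one.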
The parameter that most people overlook when tuning temperature is top_p (nucleus sampling). When top_p is used in conjunction with temperature, it further refines the sampling. top_p limits the sampling pool to the smallest set of tokens whose cumulative probability exceeds p. This means even if a high temperature makes many tokens possible, top_p can still prune away the truly nonsensical ones by only considering those that cumulatively make up, say, 90% of the probability mass. If you’re using a high temperature and getting gibberish, you might need to introduce or lower top_p.
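A minimal sketch of the nucleus filtering step described above might look like this. It assumes temperature has already been applied to produce `probs`, and it stops once the cumulative probability reaches `top_p`; real libraries differ in details such as tie handling, so treat `top_p_filter` as a hypothetical helper rather than any library's API.

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of most-probable tokens whose cumulative
    probability reaches top_p, zero out the rest, and renormalize."""
    # Token indices sorted from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = []
    cumulative = 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    kept_mass = sum(probs[i] for i in kept)
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / kept_mass
    return filtered


# Illustrative distribution over four tokens (made-up numbers).
flat_probs = [0.5, 0.3, 0.15, 0.05]
print(top_p_filter(flat_probs, top_p=0.9))  # the last token is dropped
```

The surviving tokens are renormalized so they still sum to 1, which is why pruning the tail slightly raises the probability of everything that remains.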
If you’ve set your temperature to 0.1 and are getting repetitive output, that is partly expected: a low temperature keeps steering the model toward the same high-probability continuations. The next things to look at are the prompt itself and other decoding parameters such as frequency_penalty or presence_penalty.
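To give a sense of what these penalties do, here is a sketch of one commonly documented formulation: each token's logit is reduced by its count so far times the frequency penalty, plus a flat presence penalty if it has appeared at all. The function name `apply_penalties` and the use of integer token ids as list indices are assumptions for this example; providers may implement the details differently.

```python
from collections import Counter


def apply_penalties(logits, generated_token_ids,
                    frequency_penalty=0.0, presence_penalty=0.0):
    """Lower the logits of tokens that already appear in the output.

    logits: list of raw scores, indexed by token id.
    generated_token_ids: token ids produced so far.
    """
    counts = Counter(generated_token_ids)
    adjusted = list(logits)
    for token_id, count in counts.items():
        # The frequency term grows with each repeat; the presence term is
        # a one-time penalty for any token that has been used at all.
        adjusted[token_id] -= count * frequency_penalty + presence_penalty
    return adjusted


# Token 0 was generated twice, token 2 once, token 1 never.
print(apply_penalties([1.0, 1.0, 1.0], [0, 0, 2],
                      frequency_penalty=0.5, presence_penalty=0.2))
```

The adjusted logits would then go through the temperature-scaled softmax as usual, so repeated tokens become less likely without changing the temperature itself.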