Logprobs are the secret sauce for understanding how confident an LLM is about its output, token by token.

Let’s see it in action. Imagine you’re using the OpenAI API to generate text. You can ask for logprobs in your completion request.

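import openai  # legacy SDK interface (openai<1.0)

openai.api_key = "YOUR_API_KEY"

# text-davinci-003 has since been retired; substitute a completions model you can access.
response = openai.Completion.create(
    model="text-davinci-003",
    prompt="The quick brown fox jumps over the",
    max_tokens=10,
    logprobs=5,  # Request logprobs for the top 5 most likely tokens
)

print(response.choices[0].text)      # the generated continuation
print(response.choices[0].logprobs)  # per-token log probabilities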

The text part is what you’d expect:

 lazy dog.

But the logprobs section is where the magic is. Here is a simplified, illustrative view (the values and alternative tokens are made up for the example, and the exact field layout depends on the endpoint and SDK version):

{
  "content": [
    {
      "token": " lazy",
      "top_logprobs": [
        {
          " lazy": -0.05,
          " sleeping": -3.70,
          " old": -4.60,
          " fence": -5.10,
          " big": -5.50
        }
      ]
    },
    {
      "token": " dog",
      "top_logprobs": [
        {
          " dog": -0.08,
          " dogs": -3.20,
          " cat": -4.00,
          " hound": -4.80,
          " fox": -5.30
        }
      ]
    },
    {
      "token": ".",
      "top_logprobs": [
        {
          ".": -0.02,
          ",": -4.60,
          "!": -5.50,
          ";": -5.80,
          "?": -6.90
        }
      ]
    }
  ]
}

This output tells you that for the first token generated after "The quick brown fox jumps over the", the model assigned a log probability of -0.05 to the token " lazy", which corresponds to a probability of about 0.95. It also lists the 5 most likely tokens at that position along with their log probabilities. Log probabilities are never positive, and the closer one is to zero, the higher the probability. So " lazy" was the most likely next token, followed by the other candidates.

Logprobs are fundamentally the logarithm of the probability of a specific token appearing next in the sequence, given the preceding tokens. At each step the model internally calculates a probability distribution over its entire vocabulary, and the logprobs option lets you peek at the top of that distribution. A higher logprob (less negative) means the model assigned more probability to that token as the continuation.
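To make the relationship concrete, here is a minimal sketch that converts a logprob back into a plain probability. The helper name `logprob_to_prob` and the sample values are assumptions chosen for illustration; they do not come from a real API response.

```python
import math

def logprob_to_prob(logprob: float) -> float:
    """Convert a natural-log probability back to a probability in [0, 1]."""
    return math.exp(logprob)

# Hypothetical logprobs, chosen only for illustration.
for lp in (-0.05, -0.693, -5.0):
    print(f"logprob {lp:>7.3f} -> probability {logprob_to_prob(lp):.4f}")
```

A logprob of 0 maps to a probability of exactly 1, and every step further below zero shrinks the probability exponentially.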

The top_logprobs field is crucial here. When you set logprobs=5, you get not just the logprob of the chosen token but also the logprobs of the 5 most likely tokens at each position. This is incredibly useful for:

  • Evaluating Model Confidence: If the top logprob is very low (e.g., -5.0), the model is highly uncertain. If it’s close to 0 (e.g., -0.1), it’s very confident.
  • Detecting Ambiguity: If the top few logprobs are very close in value, it suggests the text is ambiguous, and the model could have gone in several directions.
  • Building More Robust Applications: You can use logprobs to filter out low-confidence generations, implement beam search-like behavior (where you explore multiple high-probability paths), or even detect when a prompt might be too vague.
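The confidence and ambiguity checks in the list above can be sketched roughly as follows. This assumes you already have one position's alternatives as a plain dict mapping token to logprob; the function name `summarize_step` and the sample dicts are hypothetical.

```python
import math
from typing import Dict

def summarize_step(top_logprobs: Dict[str, float]) -> dict:
    """Summarize one generation step from its token -> logprob alternatives."""
    # Sort explicitly instead of relying on the order the API returned.
    ranked = sorted(top_logprobs.items(), key=lambda kv: kv[1], reverse=True)
    best_token, best_lp = ranked[0]
    runner_up_lp = ranked[1][1] if len(ranked) > 1 else float("-inf")
    return {
        "best_token": best_token,
        "best_prob": math.exp(best_lp),    # confidence in the top candidate
        "margin": best_lp - runner_up_lp,  # small margin suggests ambiguity
    }

# Hypothetical alternatives for two different positions.
confident = {" lazy": -0.05, " sleeping": -3.7}
ambiguous = {" red": -1.0, " blue": -1.1, " green": -1.6}

print(summarize_step(confident))
print(summarize_step(ambiguous))
```

A large margin between the top two logprobs points to a confident step, while a margin near zero means the model was close to a coin flip between candidates.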

For instance, if you’re building a chatbot that needs to provide factual information, you might set a threshold. If the average per-token logprob of the generated answer falls below, say, -1.0, you could have the bot respond with "I’m not sure about that" instead of a potentially incorrect answer.
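A minimal sketch of such a threshold check, assuming you have already extracted the generated tokens' logprobs into a list. The names `mean_logprob` and `answer_or_fallback`, and the choice of averaging per token, are illustrative assumptions rather than part of any API.

```python
from typing import List

FALLBACK = "I'm not sure about that."

def mean_logprob(token_logprobs: List[float]) -> float:
    """Average logprob per generated token."""
    return sum(token_logprobs) / len(token_logprobs)

def answer_or_fallback(text: str, token_logprobs: List[float],
                       threshold: float = -1.0) -> str:
    """Return the answer only if its average token logprob clears the threshold."""
    if not token_logprobs or mean_logprob(token_logprobs) < threshold:
        return FALLBACK
    return text

# Hypothetical logprobs for two generated answers.
print(answer_or_fallback("Paris", [-0.1, -0.2]))
print(answer_or_fallback("Lyon", [-2.0, -3.0]))
```

Averaging keeps long answers from being penalized just for their length, which a raw sum of logprobs would do.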

The raw probabilities are calculated by a softmax function applied to the model’s internal scores (logits). The logprob is simply the natural logarithm of this probability. A probability of 0.5 becomes log(0.5) ≈ -0.693. A probability of 0.9 becomes log(0.9) ≈ -0.105. This means higher probabilities result in logprobs closer to zero.
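The softmax-then-log step can be sketched in a few lines. The logits below are made-up numbers, and `log_softmax` is a hand-rolled helper for illustration, not a call into any model library.

```python
import math
from typing import List

def log_softmax(logits: List[float]) -> List[float]:
    """Return log(softmax(logits)) using the max-shift trick for stability."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]

# Hypothetical logits for a three-token vocabulary.
logits = [2.0, 1.0, 0.1]
logprobs = log_softmax(logits)

for lp in logprobs:
    print(f"logprob {lp:.3f} -> probability {math.exp(lp):.3f}")
```

Exponentiating the resulting logprobs gives probabilities that sum to 1, and the largest logit always ends up with the logprob closest to zero.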

When you request logprobs with a value greater than 0, you get the log probabilities for the top k tokens at each position. The token field in the output shows the actual token that was generated by the model. The top_logprobs field is a list of dictionaries, where each dictionary maps a token to its log probability. The order within top_logprobs is usually from highest probability (least negative logprob) to lowest.

This reveals that the model doesn’t just pick one word; it has a whole ranked list of possibilities it’s considering at each step. Understanding these ranks and their associated probabilities gives you a much deeper insight into the generation process than just looking at the final text.

You can also request logprobs for the entire sequence by setting logprobs=1 and echo=True in your API call. This will return the logprob for each token that was part of your original prompt, as well as the generated tokens. This is useful for evaluating how likely your prompt is according to the model, or for calculating the perplexity of a given text.
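Perplexity follows directly from the per-token logprobs: it is the exponential of the negative mean logprob. The sketch below assumes the logprobs have already been collected into a list and that the first echoed token may arrive without a score (None), since it has no preceding context. The function name `perplexity` is illustrative.

```python
import math
from typing import List, Optional

def perplexity(token_logprobs: List[Optional[float]]) -> float:
    """exp(-mean logprob) over the tokens that actually have a score."""
    scored = [lp for lp in token_logprobs if lp is not None]
    if not scored:
        raise ValueError("no scored tokens")
    return math.exp(-sum(scored) / len(scored))

# Hypothetical logprobs for an echoed prompt; the first token has no score.
print(perplexity([None, -0.3, -1.2, -0.05]))
```

As a sanity check, a text in which every token had probability 0.5 has a perplexity of 2: the model was, on average, choosing between two equally likely options at each step.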

The interplay between the token and top_logprobs is key. The token is the winner of the current step. With greedy decoding (temperature 0) that is the candidate with the highest logprob; with sampling, a lower-ranked candidate can occasionally win. The top_logprobs show you the competition and the model’s uncertainty. If the chosen token has a much higher logprob than the others in top_logprobs, the generation is likely to be stable and confident. If the logprobs are clustered, the generation might be more sensitive to small changes in the prompt or model parameters.
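One way to inspect that interplay is to check where the chosen token ranks among the reported alternatives. This is a small sketch; `chosen_rank` and the sample dict are hypothetical names and values.

```python
from typing import Dict, Optional

def chosen_rank(token: str, top_logprobs: Dict[str, float]) -> Optional[int]:
    """1-based rank of the chosen token among the alternatives, or None if absent."""
    ranked = sorted(top_logprobs, key=top_logprobs.get, reverse=True)
    return ranked.index(token) + 1 if token in ranked else None

# Hypothetical alternatives for one position.
alternatives = {" lazy": -0.05, " sleeping": -3.7, " old": -4.6}

print(chosen_rank(" lazy", alternatives))      # top candidate was chosen
print(chosen_rank(" old", alternatives))       # a lower-ranked candidate
print(chosen_rank(" purple", alternatives))    # not among the reported top k
```

A rank other than 1 indicates that sampling picked something other than the model’s favorite at that step.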

The next step is often to use these logprobs for tasks like measuring perplexity or implementing custom sampling strategies beyond simple greedy decoding.
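As a rough illustration of a custom sampling strategy, the sketch below applies a temperature to one position's top-k logprobs, renormalizes them, and draws a token. It only sees the reported candidates, not the full vocabulary, so it approximates what the model itself would do. The function names and sample values are assumptions.

```python
import math
import random
from typing import Dict, Optional

def tempered_probs(top_logprobs: Dict[str, float],
                   temperature: float = 1.0) -> Dict[str, float]:
    """Rescale logprobs by temperature and renormalize over the given tokens."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = {tok: lp / temperature for tok, lp in top_logprobs.items()}
    m = max(scaled.values())  # shift by the max for numerical stability
    weights = {tok: math.exp(s - m) for tok, s in scaled.items()}
    total = sum(weights.values())
    return {tok: w / total for tok, w in weights.items()}

def sample_token(top_logprobs: Dict[str, float], temperature: float = 1.0,
                 rng: Optional[random.Random] = None) -> str:
    """Draw one token from the temperature-adjusted distribution."""
    if rng is None:
        rng = random.Random()
    probs = tempered_probs(top_logprobs, temperature)
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]

# Hypothetical alternatives for one position.
alternatives = {" red": -1.0, " blue": -1.1, " green": -1.6}
print(tempered_probs(alternatives, temperature=0.5))
print(sample_token(alternatives, temperature=0.5))
```

Lower temperatures concentrate the probability on the top candidate and approach greedy decoding, while higher temperatures flatten the distribution.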
