The most surprising thing about OpenAI’s tokenizers is that they’re not magic; they’re just efficient compression algorithms, and understanding their mechanics unlocks predictable, cost-effective LLM usage.
Let’s see tiktoken in action. Imagine you’re sending a prompt to GPT-4. Before it even leaves your machine, you want to know how many tokens that prompt will consume. tiktoken is the library for that.
import tiktoken
# Load the cl100k_base encoding, used by gpt-4, gpt-3.5-turbo, and text-embedding-ada-002
encoding = tiktoken.get_encoding("cl100k_base")
# Example text
text_to_count = "This is a sample sentence to count tokens."
# Encode the text into token integers
tokens = encoding.encode(text_to_count)
# Count the number of tokens
num_tokens = len(tokens)
print(f"The text: '{text_to_count}' has {num_tokens} tokens.")
print(f"Token IDs: {tokens}")
# You can also decode tokens back to text
decoded_text = encoding.decode(tokens)
print(f"Decoded text: '{decoded_text}'")
This snippet shows the core workflow: load an encoding (here cl100k_base, the one used by gpt-4 and gpt-3.5-turbo), encode a string into a list of integer token IDs, and take the length of that list as the token count. Decoding the full list returns the original string. Individual tokens, however, are byte sequences that may be word fragments or even partial UTF-8 characters, so a single token is not always readable on its own.
The problem tiktoken solves is straightforward: LLMs don’t process raw characters or words. They process numerical representations called tokens. These tokens are not fixed-size; they can be sub-word units, whole words, or even sequences of characters. Different models use different tokenization schemes, meaning the same text can result in a different token count depending on the model. tiktoken provides a consistent, fast way to determine these counts for OpenAI’s models. This is vital for managing API costs (since you’re billed per token) and for respecting model context window limits (the maximum number of tokens a model can process at once).
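Once a token count is known, both checks mentioned above reduce to simple arithmetic. The following is a minimal sketch, assuming a hypothetical price of $0.03 per 1,000 input tokens and a hypothetical 8,192-token context window. Both figures and both function names are placeholders for illustration, so look up the current pricing and limits for the model you actually use.

```python
# Hypothetical figures for illustration only; look up real values for your model.
PRICE_PER_1K_INPUT_TOKENS_USD = 0.03
CONTEXT_WINDOW_TOKENS = 8192


def estimate_input_cost(num_tokens: int,
                        price_per_1k: float = PRICE_PER_1K_INPUT_TOKENS_USD) -> float:
    """Return the estimated cost in USD of sending num_tokens input tokens."""
    return num_tokens / 1000 * price_per_1k


def fits_in_context(prompt_tokens: int,
                    reserved_for_reply: int,
                    context_window: int = CONTEXT_WINDOW_TOKENS) -> bool:
    """Check that the prompt plus the space reserved for the reply fits in the window."""
    return prompt_tokens + reserved_for_reply <= context_window


if __name__ == "__main__":
    prompt_tokens = 1500  # e.g. len(encoding.encode(prompt))
    print(f"Estimated input cost: ${estimate_input_cost(prompt_tokens):.4f}")
    print("Fits in context:", fits_in_context(prompt_tokens, reserved_for_reply=500))
```

Reserving room for the reply matters because the context window covers the prompt and the generated output together.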
Internally, tiktoken uses Byte Pair Encoding (BPE). Training starts from a base vocabulary of the 256 possible byte values and iteratively merges the most frequent adjacent pair of tokens into a new, longer token, until a predefined vocabulary size is reached. The resulting vocabulary and merge rules depend on the training data and are specific to each tiktoken encoding. When you encode text, tiktoken replays these learned merges in priority order to turn your input into a sequence of tokens from its vocabulary. The result is usually compact, though it is not guaranteed to be the shortest possible sequence. The get_encoding function loads these predefined vocabularies and merge rules.
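To make the training loop concrete, here is a toy sketch of byte-level BPE training in pure Python. It is not tiktoken's implementation: the function name train_bpe is made up for this example, it trains on a single short string, and it skips the pre-splitting and performance work a real tokenizer needs.

```python
from collections import Counter


def train_bpe(text: str, num_merges: int) -> list[tuple[bytes, bytes]]:
    """Learn up to num_merges merge rules from text, starting from single bytes."""
    # Start with every byte of the UTF-8 encoding as its own token.
    tokens = [bytes([b]) for b in text.encode("utf-8")]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        # Pick the most frequent adjacent pair; ties go to the pair seen first.
        (left, right), count = pair_counts.most_common(1)[0]
        if count < 2:
            break  # nothing repeats, so another merge would not compress anything
        merges.append((left, right))
        # Replace each non-overlapping occurrence of the pair with the merged token.
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return merges


if __name__ == "__main__":
    for rank, pair in enumerate(train_bpe("low lower lowest low low", 5)):
        print(rank, pair)
```

The position of a pair in the returned list is its rank. Earlier merges were more frequent in the training text and are applied first when encoding.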
The actual mapping from text to tokens isn’t a simple lookup table for every possible string. tiktoken first splits the input into chunks with a regular expression (roughly: words with their leading space, short runs of digits, punctuation, and whitespace), then applies the BPE merges within each chunk, always preferring the merge that was learned earliest. Merges never cross chunk boundaries. For example, "apple pie" is split into "apple" and " pie" before any merging happens, so it cannot become a single token, whereas a common word with its leading space, such as " apple", may well be one. This is why even seemingly small changes in text, such as adding a leading space or changing capitalization, can lead to unexpected changes in token counts. The cl100k_base encoding, for instance, has a vocabulary of about 100,000 tokens.
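The sketch below shows how byte-level BPE encoders commonly work: a regular expression pre-splits the text, then the earliest-learned merge is applied repeatedly within each chunk. The pattern is a deliberately simplified stand-in rather than tiktoken’s real one, and pre_split, bpe_encode, and the three merge rules are invented for this example.

```python
import re

# Simplified stand-in pattern (an assumption, not tiktoken's actual regex):
# a word with an optional leading space, punctuation, or a run of whitespace.
SPLIT_PATTERN = re.compile(r" ?\w+| ?[^\w\s]+|\s+")


def pre_split(text: str) -> list[str]:
    """Split text into chunks; merges are never applied across chunk boundaries."""
    return SPLIT_PATTERN.findall(text)


def bpe_encode(chunk: bytes, ranks: dict[tuple[bytes, bytes], int]) -> list[bytes]:
    """Split chunk into bytes, then repeatedly apply the lowest-ranked applicable merge."""
    parts = [bytes([b]) for b in chunk]
    while len(parts) > 1:
        # Collect every adjacent pair that has a learned merge, with its rank.
        candidates = [
            (ranks[pair], i)
            for i, pair in enumerate(zip(parts, parts[1:]))
            if pair in ranks
        ]
        if not candidates:
            break
        # The lowest rank wins, i.e. the merge that was learned earliest.
        _, i = min(candidates)
        parts[i:i + 2] = [parts[i] + parts[i + 1]]
    return parts


if __name__ == "__main__":
    # Hypothetical merge rules; a real encoding has on the order of 100,000.
    toy_ranks = {(b"l", b"o"): 0, (b"lo", b"w"): 1, (b"e", b"r"): 2}
    print(pre_split("apple pie"))           # ['apple', ' pie']
    print(bpe_encode(b"lower", toy_ranks))  # [b'low', b'er']
```

Because merge order is driven by rank rather than by token length, the output depends on which merges were learned and in what order, not only on which strings exist in the vocabulary.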
One common misconception is that tokens directly correspond to words. While many common words are single tokens, punctuation and parts of words can be their own tokens. For example, the word "tokenization" might be broken down into "token" and "ization". A leading space is usually folded into the token of the word that follows it, so " token" and "token" are different tokens. This is why counting tokens is more nuanced than simply splitting by spaces and counting the resulting words.
If tiktoken reports a different token count than the OpenAI Playground or the API response, first check that you are using the encoding that corresponds to the model you are querying. For instance, gpt-3.5-turbo and gpt-4 models use cl100k_base, older models like text-davinci-003 used p50k_base, and even older ones used r50k_base (also known as gpt2). Using the wrong encoding is a frequent cause of discrepancies. If the encoding is right and a chat request is still off by a few tokens, note that the chat format adds a small number of tokens per message on top of the message text itself.
The next step in managing LLM interactions is understanding how to construct prompts that fit within the model’s context window while maximizing their effectiveness.