What is a Token?
A token is the basic unit of text that language models process and generate. Rather than working with individual characters or words, modern LLMs break text into tokens using a process called tokenization. A token might represent a complete word, a part of a word (a subword), a single character, or a special symbol such as a punctuation mark. For example, "tokenization" might be split into tokens like "token" and "ization", while common words like "the" are typically single tokens.
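You can inspect this splitting directly. The sketch below uses OpenAI's tiktoken library with the cl100k_base encoding (used by several OpenAI chat models); the exact splits shown are specific to that encoding and will differ for other models.

```python
# pip install tiktoken
import tiktoken

# cl100k_base is one of OpenAI's encodings; other models and
# providers use different vocabularies and will split differently.
enc = tiktoken.get_encoding("cl100k_base")

for text in ["the", "tokenization"]:
    ids = enc.encode(text)                     # text -> token IDs
    pieces = [enc.decode([i]) for i in ids]    # each ID back to its text piece
    print(f"{text!r} -> {len(ids)} token(s): {pieces}")
```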
The specific tokenization scheme varies by model. GPT models use Byte Pair Encoding (BPE), which creates tokens based on frequently occurring character sequences in training data. This means common words become single tokens while rare or long words are split into multiple tokens. Numbers, punctuation, and special characters may each be separate tokens or combined with adjacent characters. The number of tokens in a text generally exceeds the word count but is less than the character count; for English text, a common rule of thumb is roughly four characters, or about three-quarters of a word, per token.
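To make the BPE idea concrete, here is a minimal toy trainer: start from individual characters and repeatedly merge the most frequent adjacent pair. This is a sketch of the core merge loop only, not any production tokenizer; the sample corpus and merge count are illustrative.

```python
from collections import Counter

def train_bpe(words: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Toy BPE trainer: repeatedly merge the most frequent adjacent pair."""
    seqs = [list(w) for w in words]          # start from individual characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for s in seqs:
            pairs.update(zip(s, s[1:]))      # count adjacent symbol pairs
        if not pairs:
            break
        best = max(pairs, key=pairs.get)     # most frequent pair wins
        merges.append(best)
        merged = "".join(best)
        for s in seqs:                       # replace the pair everywhere
            i = 0
            while i < len(s) - 1:
                if (s[i], s[i + 1]) == best:
                    s[i:i + 2] = [merged]
                else:
                    i += 1
    return merges

# Frequent sequences ("lo", then "low") get merged first, which is why
# common words end up as fewer, larger tokens than rare ones.
print(train_bpe(["low", "low", "lower", "lowest", "newest"], 4))
```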
Understanding tokens is crucial for working with LLMs because most constraints and costs are defined in terms of tokens rather than words or characters. Context windows are measured in tokens (e.g., 128K tokens), API pricing is typically per token, and chunk sizes for RAG systems should be specified in tokens to ensure they fit within model limits. Different models use different tokenization schemes, so the same text may produce different token counts across models. Most LLM providers offer tools to count tokens for their specific models, helping developers manage context budgets and costs.
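For OpenAI models, tiktoken is one such counting tool; other providers ship their own counters. The sketch below assumes a hypothetical 512-token chunk budget for a RAG pipeline and shows checking and trimming a chunk against it.

```python
# pip install tiktoken
import tiktoken

MAX_CHUNK_TOKENS = 512  # hypothetical budget for a single RAG chunk

enc = tiktoken.encoding_for_model("gpt-4")  # selects that model's encoding

def fits_budget(chunk: str, limit: int = MAX_CHUNK_TOKENS) -> bool:
    """Check whether a chunk fits the token budget for its target model."""
    return len(enc.encode(chunk)) <= limit

def truncate_to_budget(chunk: str, limit: int = MAX_CHUNK_TOKENS) -> str:
    """Trim a chunk to at most `limit` tokens (note: may cut mid-word)."""
    ids = enc.encode(chunk)
    return enc.decode(ids[:limit])

text = "Long document text to be chunked for retrieval. " * 100
print(fits_budget(text))                              # False for this text
print(len(enc.encode(truncate_to_budget(text))))      # at most 512
```

Because token counts are model-specific, the same budget check should be rerun with the target model's own tokenizer whenever you switch providers.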