Tokenization

Intermediate · Techniques · Last updated: 2025-01-15

What is Tokenization?


Tokenization is the process of converting raw text into a sequence of tokens that a language model can process. As one of the first operations applied to any input text, this preprocessing step breaks the text into discrete units according to the model's vocabulary and tokenization algorithm. The resulting token sequence is the input representation that the model processes through its neural network layers.
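
A minimal sketch of this round trip, assuming the Hugging Face transformers library and its publicly hosted "gpt2" tokenizer (any tokenizer that matches a given model works the same way):

```python
from transformers import AutoTokenizer

# Load the tokenizer that matches the target model ("gpt2" here as an example).
tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Tokenization converts raw text into tokens."
tokens = tokenizer.tokenize(text)   # subword strings, e.g. ['Token', 'ization', ...]
ids = tokenizer.encode(text)        # the integer IDs the model actually consumes
restored = tokenizer.decode(ids)    # decoding the IDs recovers the original text

print(tokens)
print(ids)
print(restored)
```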


Modern tokenization typically uses subword algorithms such as Byte Pair Encoding (BPE), WordPiece, or SentencePiece. These methods balance vocabulary size against coverage by learning to split text into common subwords based on frequency patterns in the training data. Frequent words become single tokens for efficiency, while rare words are decomposed into subword pieces. This approach provides open-ended coverage (any text can be tokenized as a sequence of subword pieces) while keeping the vocabulary manageable (typically 30K-100K tokens).
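
The core idea of BPE can be illustrated in a few lines: start from characters, repeatedly count adjacent symbol pairs, and merge the most frequent pair into a new vocabulary entry. The toy word-frequency corpus and merge count below are made up for illustration; production tokenizers apply the same idea (often at byte level) over much larger corpora.

```python
from collections import Counter

def get_pair_counts(corpus):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in corpus.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(corpus, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy corpus: each word is a tuple of characters plus an end-of-word marker.
corpus = {
    tuple("lower") + ("</w>",): 5,
    tuple("lowest") + ("</w>",): 2,
    tuple("newer") + ("</w>",): 6,
    tuple("wider") + ("</w>",): 3,
}

merges = []
for _ in range(10):  # learn 10 merges for this toy example
    pair_counts = get_pair_counts(corpus)
    if not pair_counts:
        break
    best = max(pair_counts, key=pair_counts.get)
    merges.append(best)
    corpus = merge_pair(corpus, best)

print(merges[:5])        # most frequent pairs are merged first, e.g. ('e', 'r')
print(list(corpus)[:2])  # words are now represented as learned subword pieces
```

Replaying the learned merge list on new text reproduces the same splits, which is why the exact tokenizer used during training must also be used at inference time.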


The tokenization scheme significantly impacts model behavior and efficiency. It determines how many tokens a given text requires (which affects context window usage and cost), how well the model handles different languages or domains (text poorly represented in the tokenizer's training data tends to split into more, shorter tokens), and what the model treats as natural units of text. Understanding tokenization helps with optimizing prompts, managing context budgets, and debugging unexpected model behavior. Developers must use the same tokenizer the model was trained with, because different tokenizers produce incompatible token sequences.
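
A quick way to budget context is to count tokens directly. The sketch below assumes the tiktoken library and its published "cl100k_base" encoding; the encoding choice and example strings are illustrative only.

```python
import tiktoken

# Load one of tiktoken's published encodings (illustrative choice).
enc = tiktoken.get_encoding("cl100k_base")

english = "Tokenization affects cost and context usage."
print(len(enc.encode(english)), "tokens for", len(english), "characters")

# Text from languages or domains under-represented in the tokenizer's training
# data typically splits into more, shorter tokens per character.
german = "Tokenisierung beeinflusst Kosten und Kontextnutzung."
print(len(enc.encode(german)), "tokens for", len(german), "characters")
```

Counting with a tokenizer that does not match the model gives misleading budgets, which is another reason to pair each model with its own tokenizer.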

