Tokens are how language models read and generate text. A single token can be a complete word, part of a word, or even a single character, depending on the tokenization scheme used.
Examples
- “Hello” might be 1 token
- “unhappiness” might be split into [“un”, “happiness”] = 2 tokens
- “ChatGPT” might be [“Chat”, “G”, “PT”] = 3 tokens
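The splits above can be sketched with a greedy longest-match subword tokenizer. The tiny vocabulary below is hypothetical, chosen just to reproduce these examples; real tokenizers learn vocabularies of tens of thousands of entries from data.

```python
# Toy vocabulary (hypothetical) mirroring the examples above.
VOCAB = {"Hello", "un", "happiness", "Chat", "G", "PT"}

def tokenize(text: str) -> list[str]:
    """Greedily split text into the longest vocabulary entries,
    falling back to single characters for unknown spans."""
    tokens = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):   # try the longest match first
            if text[i:j] in VOCAB:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])          # unknown character = 1 token
            i += 1
    return tokens

print(tokenize("Hello"))        # 1 token
print(tokenize("unhappiness"))  # 2 tokens
print(tokenize("ChatGPT"))      # 3 tokens
```

Production tokenizers use learned merge rules rather than greedy matching, but the output shape (a list of subword strings) is the same.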
Importance
Token limits define how much text a model can process at once (context window). Understanding tokenization is crucial for prompt engineering and cost estimation, as most LLM APIs charge per token.
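Because APIs bill input and output tokens separately, a cost estimate is simple arithmetic. The per-token prices below are hypothetical placeholders; check your provider's pricing page for real figures.

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  input_price_per_1k: float = 0.0005,
                  output_price_per_1k: float = 0.0015) -> float:
    """Return the estimated cost in dollars for one API call.
    Prices are illustrative, quoted per 1,000 tokens."""
    return (input_tokens / 1000) * input_price_per_1k \
         + (output_tokens / 1000) * output_price_per_1k

# e.g. a 2,000-token prompt producing a 500-token reply
print(f"${estimate_cost(2000, 500):.5f}")
```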
Common Tokenizers
- BPE (Byte Pair Encoding): Used by GPT-family models, typically as byte-level BPE
- WordPiece: Used by BERT and related models
- SentencePiece: Language-agnostic tokenization, used by models such as T5 and Llama
Related Terms
BPE
Byte Pair Encoding - a subword tokenization algorithm that iteratively merges frequent character pairs to create a vocabulary.
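The merge loop can be sketched in a few lines. This is a minimal training sketch, not a production tokenizer: the corpus and merge count are illustrative, and tie-breaking between equally frequent pairs follows dictionary insertion order.

```python
from collections import Counter

def bpe_merges(words: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn BPE merge rules: repeatedly merge the most frequent
    adjacent symbol pair across the corpus."""
    # Start with each word as a sequence of single characters.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair wins
        merges.append(best)
        # Rewrite the corpus with the chosen pair fused into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return merges

print(bpe_merges(["low", "lower", "lowest", "low"], 2))
```

On this toy corpus the first two merges fuse `l`+`o` and then `lo`+`w`, so frequent words collapse into whole-word tokens while rarer words stay split into subwords.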
Context Window
The maximum number of tokens an LLM can process at once, including both input prompt and generated output. Also called context length.
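Since the window covers both input and output, applications typically reserve part of the budget for the reply and trim the prompt to fit. A minimal sketch, using word counts as a stand-in for real token counts (actual APIs count model tokens, not words):

```python
def fit_to_context(history: list[str], limit: int,
                   reserve_for_output: int) -> list[str]:
    """Drop the oldest messages until the input fits in the part of
    the context window not reserved for the model's output."""
    budget = limit - reserve_for_output
    kept = list(history)
    # Word count stands in for a real tokenizer here (assumption).
    while kept and sum(len(m.split()) for m in kept) > budget:
        kept.pop(0)  # evict the oldest message first
    return kept

print(fit_to_context(["a b c", "d e", "f"], limit=5, reserve_for_output=2))
```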
Tokenization
The process of breaking text into smaller units (tokens) that language models can process, using algorithms like BPE or WordPiece.