Large Language Models

Tokens are the units in which language models read and generate text: input is split into tokens before the model sees it, and output is produced one token at a time. A single token can be a complete word, part of a word, or a single character, depending on the tokenization scheme used.

Examples

  • “Hello” might be 1 token
  • “unhappiness” might be split into [“un”, “happiness”] = 2 tokens
  • “ChatGPT” might be [“Chat”, “G”, “PT”] = 3 tokens
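These splits are illustrative; the exact segmentation depends on the tokenizer's vocabulary. A minimal sketch for inspecting real splits, assuming the tiktoken package (OpenAI's open-source BPE tokenizer library) is installed:

    import tiktoken

    # cl100k_base is the encoding used by GPT-4-era models
    enc = tiktoken.get_encoding("cl100k_base")

    for word in ["Hello", "unhappiness", "ChatGPT"]:
        ids = enc.encode(word)
        pieces = [enc.decode([i]) for i in ids]
        print(f"{word!r} -> {len(ids)} token(s): {pieces}")

Different vocabularies (GPT-2's r50k_base, GPT-4's cl100k_base, and so on) split the same word differently, which is why the counts above are hedged with "might".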

Importance

A model's context window is its token limit: the maximum number of tokens it can process at once, counting both the prompt and the generated output. Understanding tokenization is therefore crucial for prompt engineering and cost estimation, since most LLM APIs charge per token.
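As a concrete illustration, here is a rough per-prompt cost estimate, again assuming tiktoken is available; the price constant below is a hypothetical placeholder, not any provider's actual rate:

    import tiktoken

    PRICE_PER_1K_INPUT_TOKENS = 0.001  # hypothetical rate in USD, not a real price

    def estimate_prompt_cost(prompt: str, model: str = "gpt-4") -> float:
        # Look up the tokenizer matching the model, count tokens,
        # and scale by the per-1K-token rate.
        enc = tiktoken.encoding_for_model(model)
        n_tokens = len(enc.encode(prompt))
        return n_tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS

    print(estimate_prompt_cost("How many tokens is this prompt?"))

The same counting logic helps with staying inside the context window: count the tokens in a prompt before sending it, and truncate or summarize if the total (prompt plus expected output) would exceed the limit.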

Common Tokenizers

  • BPE (Byte Pair Encoding): Used by GPT-family models; see the sketch after this list
  • WordPiece: Used by BERT and related models
  • SentencePiece: A language-agnostic toolkit that trains BPE or unigram models directly on raw text, without requiring whitespace pre-tokenization
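To make the BPE entry concrete, here is a toy sketch of the core merge loop: start from characters and repeatedly merge the most frequent adjacent pair. This is a teaching simplification; production tokenizers work on bytes, record merge ranks for later encoding, and apply pre-tokenization rules.

    from collections import Counter

    def learn_bpe_merges(words, num_merges):
        # Each word starts as a tuple of single characters.
        corpus = Counter(tuple(w) for w in words)
        merges = []
        for _ in range(num_merges):
            # Count every adjacent symbol pair, weighted by word frequency.
            pairs = Counter()
            for word, freq in corpus.items():
                for a, b in zip(word, word[1:]):
                    pairs[(a, b)] += freq
            if not pairs:
                break
            best = max(pairs, key=pairs.get)  # most frequent pair
            merges.append(best)
            # Rewrite the corpus with the chosen pair fused into one symbol.
            new_corpus = Counter()
            for word, freq in corpus.items():
                out, i = [], 0
                while i < len(word):
                    if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                        out.append(word[i] + word[i + 1])
                        i += 2
                    else:
                        out.append(word[i])
                        i += 1
                new_corpus[tuple(out)] += freq
            corpus = new_corpus
        return merges

    # On this tiny corpus the first merges build up the shared stem "low".
    print(learn_bpe_merges(["low", "lower", "lowest", "low"], 3))
    # -> [('l', 'o'), ('lo', 'w'), ('low', 'e')]

WordPiece follows the same merge-and-grow structure but scores candidate pairs by a likelihood criterion rather than raw frequency.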

Tags

llm, fundamentals, nlp

Added: January 15, 2025