Tokenization
The process of breaking text into smaller units (tokens) that language models can process, using algorithms like BPE or WordPiece.
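In practice, a subword tokenizer maps text to integer token IDs and back. The sketch below is purely illustrative and assumes the third-party tiktoken package (an open-source BPE tokenizer library) is installed; any subword tokenizer follows the same encode/decode pattern.

```python
# Illustrative sketch: requires the third-party tiktoken package.
import tiktoken

# Load a pretrained BPE vocabulary ("cl100k_base" is one of tiktoken's built-ins).
enc = tiktoken.get_encoding("cl100k_base")

text = "Tokenization breaks text into subword units."
ids = enc.encode(text)                    # text -> list of integer token IDs
pieces = [enc.decode([i]) for i in ids]   # decode each ID to see the subword pieces

print(ids)                       # the integer IDs the model would actually see
print(pieces)                    # the subword strings the text was split into
print(enc.decode(ids) == text)   # round-trip: decoding recovers the original text
```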
Related Concepts
- Token: the unit that tokenization produces; a model's input and output are both sequences of tokens.
- BPE: a frequency-based merge algorithm for building subword vocabularies, used by GPT-family tokenizers.
- WordPiece: a likelihood-based subword algorithm, best known from BERT's tokenizer.
- Subword: the granularity most modern tokenizers target, between characters and whole words, so rare words split into known pieces instead of mapping to an unknown token.
Why It Matters
Tokenization determines how text maps onto a model's fixed vocabulary, so it directly affects context-window usage, per-token pricing, and how well a model handles rare words, numbers, code, and non-English text. Understanding it is a prerequisite for reasoning about prompt length, cost, and many model failure modes.
Learn More
This term is part of a comprehensive AI/ML glossary. The related terms below cover tokens and the two most common subword algorithms in more depth.
Related Terms
BPE
Byte Pair Encoding - a subword tokenization algorithm that starts from individual characters and iteratively merges the most frequent adjacent pair of symbols to build its vocabulary.
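A minimal sketch of the BPE training loop, using a toy word-frequency corpus; the corpus and merge count below are illustrative, not taken from any particular implementation.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Learn BPE merges from a word-frequency dict {word: count}.
    Each word starts as a tuple of single characters."""
    corpus = {tuple(word): count for word, count in words.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, count in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair wins
        merges.append(best)
        # Apply the merge: replace each occurrence of the pair with one symbol.
        merged = {}
        for symbols, count in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] = count
        corpus = merged
    return merges

# Toy corpus (word -> frequency).
toy = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
print(learn_bpe_merges(toy, 5))
# Early merges pick up frequent pairs such as ('e', 's'), then ('es', 't').
```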
Token
The basic unit of text that a language model processes, typically representing a word, subword, or character. Tokens are the fundamental building blocks for LLM input and output.
WordPiece
A subword tokenization algorithm used by BERT; like BPE it builds a vocabulary by merging, but it chooses merges that maximize the likelihood of the training data rather than raw pair frequency.
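At inference time, WordPiece splits each word by greedy longest-match-first lookup against the vocabulary, with a continuation prefix (## in BERT) marking non-initial pieces. A toy sketch with a hand-built vocabulary (the vocab below is illustrative; a real one is learned from data):

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word, BERT-style."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece         # longest vocabulary piece starting here
                break
            end -= 1
        if match is None:
            return [unk]  # no piece matched: the whole word becomes [UNK]
        pieces.append(match)
        start = end
    return pieces

# Toy vocabulary for demonstration only.
vocab = {"token", "##ization", "##ize", "un", "##related"}
print(wordpiece_tokenize("tokenization", vocab))  # ['token', '##ization']
print(wordpiece_tokenize("unrelated", vocab))     # ['un', '##related']
print(wordpiece_tokenize("xyz", vocab))           # ['[UNK]']
```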