Large Language Models
Subword Tokenization
Breaking words into smaller units, balancing vocabulary size with representation granularity.
Subword tokenization lets a model represent rare and unseen words as sequences of known pieces (e.g., "unhappiness" → "un", "happi", "ness") instead of collapsing them into a single unknown token, while keeping the vocabulary small enough to be practical, as the example below illustrates.
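A quick way to see this in practice is through an off-the-shelf tokenizer. The sketch below assumes the Hugging Face transformers package and the bert-base-uncased checkpoint, neither of which is part of this note; it is an illustration, not a prescribed setup.

```python
# Minimal demonstration: BERT's WordPiece tokenizer splits a rare
# word into known subword pieces. Requires: pip install transformers
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("unhappiness"))
# Prints subword pieces, e.g. something like ['un', '##hap', '##pi', '##ness'];
# the exact splits depend on the learned vocabulary. The '##' prefix marks
# a piece that continues the previous token rather than starting a new word.
```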
Related Concepts
- Tokenization
- BPE
- WordPiece
Tags
large-language-models tokenization bpe wordpiece
Related Terms
BPE
Byte Pair Encoding - a subword tokenization algorithm that, starting from characters, iteratively merges the most frequent pair of adjacent symbols to build a vocabulary (see the sketch after this list).
Tokenization
The process of breaking text into smaller units (tokens) that language models can process, using algorithms like BPE or WordPiece.
WordPiece
A subword tokenization algorithm used by BERT; like BPE it builds a vocabulary by merging subword pairs, but it selects merges that maximize the likelihood of the training data rather than simply merging the most frequent pair.
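To make the BPE definition concrete, here is a minimal sketch of the merge loop on a toy corpus, in the spirit of the reference example from Sennrich et al.'s BPE paper. The corpus words, their counts, the merge count, and the end-of-word marker </w> are illustrative choices, not part of this note.

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Rewrite the vocabulary with the chosen pair fused into one symbol."""
    bigram = re.escape(" ".join(pair))
    # Match the pair only at whole-symbol boundaries, never inside a symbol.
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Toy corpus: each word is pre-split into characters, with an end-of-word
# marker </w>; the counts are hypothetical word frequencies.
vocab = {
    "l o w </w>": 5,
    "l o w e r </w>": 2,
    "n e w e s t </w>": 6,
    "w i d e s t </w>": 3,
}

for step in range(10):  # the number of merges is a tunable hyperparameter
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    print(f"merge {step + 1}: {best}")
```

Each printed merge becomes a vocabulary entry, so frequent fragments like "est" emerge after a few steps; a WordPiece trainer would follow the same loop but rank candidate merges by likelihood gain instead of raw frequency.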