Large Language Models
Vocabulary Size
The number of distinct tokens a language model can represent, fixed by its tokenizer's vocabulary; modern LLMs typically use 30K–100K+ tokens.
Vocabulary size matters because it sets the number of rows in the model's embedding matrix and the width of its output layer, and it trades off against sequence length: a larger vocabulary encodes the same text in fewer tokens at the cost of more embedding parameters, as sketched below.
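As a minimal sketch (assuming the Hugging Face transformers library and GPT-2's tokenizer, neither of which this entry names), a tokenizer's vocabulary size can be read directly off the tokenizer object:

```python
from transformers import AutoTokenizer

# GPT-2's BPE tokenizer ships with a vocabulary of 50,257 tokens.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
print(tokenizer.vocab_size)  # 50257

# Vocabulary size fixes the embedding matrix shape:
# parameters = vocab_size * embedding_dim (GPT-2 small uses d_model = 768).
print(f"{tokenizer.vocab_size * 768:,} embedding parameters")  # 38,597,376
```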
Related Concepts
- Tokenization
- Token
- Embedding
Tags
large-language-models tokenization token embedding
Related Terms
Embedding
A dense vector representation of discrete data (words, tokens) in continuous space, capturing semantic relationships.
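As a minimal sketch (assuming PyTorch, with illustrative sizes), an embedding layer is a lookup table that maps token IDs to dense vectors:

```python
import torch
import torch.nn as nn

# A lookup table: one learned vector per token in the vocabulary.
vocab_size, embedding_dim = 50_000, 512  # illustrative, not from any real model
embedding = nn.Embedding(vocab_size, embedding_dim)

# A batch of token IDs maps to a batch of dense vectors.
token_ids = torch.tensor([[15, 2048, 7]])  # shape: (1, 3)
vectors = embedding(token_ids)             # shape: (1, 3, 512)
print(vectors.shape)                       # torch.Size([1, 3, 512])
```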
Token
The basic unit of text that a language model processes, typically representing a word, subword, or character. Tokens are the fundamental building blocks for LLM input and output.
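As a minimal sketch (again assuming the transformers library and GPT-2's tokenizer), text round-trips through integer token IDs; the exact splits depend on the vocabulary:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

ids = tokenizer.encode("Tokenization is fundamental.")
print(ids)                                   # integer token IDs
print(tokenizer.convert_ids_to_tokens(ids))  # subword pieces, e.g. 'Token' + 'ization'
print(tokenizer.decode(ids))                 # decodes back to the original text
```

Note that GPT-2's tokenizer marks a leading space with the 'Ġ' character in its token strings.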
Tokenization
The process of breaking text into smaller units (tokens) that language models can process, using algorithms such as byte-pair encoding (BPE) or WordPiece.
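As a toy sketch of the BPE merge loop (a hand-made corpus, not any production tokenizer, which would operate on bytes and a large corpus), the core rule is to repeatedly merge the most frequent adjacent pair of symbols:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Rewrite every word, fusing each occurrence of `pair` into one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Tiny corpus: word -> frequency, each word pre-split into characters.
words = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6, tuple("wider"): 3}

for step in range(4):  # each merge adds one new symbol to the vocabulary
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)
    print(f"merge {step + 1}: {pair} -> '{''.join(pair)}'")
```

Each merge grows the vocabulary by one symbol, which is how vocabulary size becomes a tunable hyperparameter: training simply stops after a target number of merges.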