Large Language Models
Vocabulary Size
The number of distinct tokens a language model can represent, fixed by its tokenizer's vocabulary; modern LLMs typically use 30K–100K+ tokens.
Vocabulary size matters because it sets the number of rows in the model's embedding matrix and the width of its output layer, and it trades off against sequence length: a larger vocabulary encodes the same text in fewer tokens at the cost of more embedding parameters, as sketched below.
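As a minimal sketch (assuming the Hugging Face transformers library and GPT-2's tokenizer, neither of which this entry names), a tokenizer's vocabulary size can be read directly off the tokenizer object:

```python
from transformers import AutoTokenizer

# GPT-2's BPE tokenizer ships with a vocabulary of 50,257 tokens.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
print(tokenizer.vocab_size)  # 50257

# Vocabulary size fixes the embedding matrix shape:
# parameters = vocab_size * embedding_dim (GPT-2 small uses d_model = 768).
print(f"{tokenizer.vocab_size * 768:,} embedding parameters")  # 38,597,376
```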
Related Concepts
- Tokenization
- Token
- Embedding
Tags
large-language-models tokenization token embedding
Related Terms
Embedding
A dense vector representation of discrete data (words, tokens) in continuous space, capturing semantic relationships.
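As a minimal sketch (assuming PyTorch, with illustrative sizes), an embedding layer is a lookup table that maps token IDs to dense vectors:

```python
import torch
import torch.nn as nn

# A lookup table: one learned vector per token in the vocabulary.
vocab_size, embedding_dim = 50_000, 512  # illustrative, not from any real model
embedding = nn.Embedding(vocab_size, embedding_dim)

# A batch of token IDs maps to a batch of dense vectors.
token_ids = torch.tensor([[15, 2048, 7]])  # shape: (1, 3)
vectors = embedding(token_ids)             # shape: (1, 3, 512)
print(vectors.shape)                       # torch.Size([1, 3, 512])
```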
Token
The basic unit of text that a language model processes, typically representing a word, subword, or character. Tokens are the fundamental building blocks for LLM input and output.
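As a minimal sketch (again assuming the transformers library and GPT-2's tokenizer), text round-trips through integer token IDs; the exact splits depend on the vocabulary:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

ids = tokenizer.encode("Tokenization is fundamental.")
print(ids)                                   # integer token IDs
print(tokenizer.convert_ids_to_tokens(ids))  # subword pieces, e.g. 'Token' + 'ization'
print(tokenizer.decode(ids))                 # decodes back to the original text
```

Note that GPT-2's tokenizer marks a leading space with the 'Ġ' character in its token strings.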
Tokenization
The process of breaking text into smaller units (tokens) that language models can process, using algorithms such as byte-pair encoding (BPE) or WordPiece.
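As a toy sketch of the BPE merge loop (a hand-made corpus, not any production tokenizer, which would operate on bytes and a large corpus), the core rule is to repeatedly merge the most frequent adjacent pair of symbols:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Rewrite every word, fusing each occurrence of `pair` into one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Tiny corpus: word -> frequency, each word pre-split into characters.
words = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6, tuple("wider"): 3}

for step in range(4):  # each merge adds one new symbol to the vocabulary
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)
    print(f"merge {step + 1}: {pair} -> '{''.join(pair)}'")
```

Each merge grows the vocabulary by one symbol, which is how vocabulary size becomes a tunable hyperparameter: training simply stops after a target number of merges.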