AI & ML Glossary

A comprehensive reference covering 522+ essential terms in Artificial Intelligence, Machine Learning, Deep Learning, Large Language Models, and related topics. Built as a knowledge resource for developers, researchers, and AI enthusiasts.

AI Ethics & Safety

Adversarial Attack

Intentionally crafted inputs designed to fool AI models into making incorrect predictions, exposing vulnerabilities.

ai-ethics-safety adversarial-example robustness

Adversarial Example

An input with imperceptible perturbations that causes a model to make a wrong prediction, highlighting model fragility.

ai-ethics-safety adversarial-attack robustness

AI Alignment

Ensuring AI systems behave in accordance with human values and intentions, a central challenge in AI safety.

ai-ethics-safety ai-safety rlhf

AI Safety

Research and practices aimed at ensuring AI systems are safe, reliable, and beneficial, especially as capabilities increase.

ai-ethics-safety ai-alignment robustness

Bias in AI

Systematic errors or unfair outcomes in AI systems, often reflecting biases in training data or model design.

ai-ethics-safety fairness ethics

Differential Privacy

A mathematical framework for quantifying and limiting privacy loss when releasing information about datasets.

ai-ethics-safety privacy privacy-preserving-ml
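
For intuition, a minimal sketch of the classic Laplace mechanism, which satisfies ε-differential privacy by adding noise scaled to a query's sensitivity (the query and values below are purely illustrative):

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon):
    """Add Laplace noise with scale = sensitivity / epsilon (sketch)."""
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_value + noise

# Releasing a count query: one person can change the count by at most 1,
# so sensitivity = 1. Smaller epsilon -> more noise -> stronger privacy.
print(laplace_mechanism(true_value=1000, sensitivity=1, epsilon=0.5))
```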

Explainability

The ability to explain how an AI model makes decisions in human-understandable terms, crucial for trust and accountability.

ai-ethics-safety interpretability xai

Fairness

Ensuring AI systems treat all individuals and groups equitably, without discrimination based on protected attributes.

ai-ethics-safety bias-in-ai ethics

Federated Learning

Training models across decentralized devices holding local data, without exchanging the data itself, preserving privacy.

ai-ethics-safety privacy-preserving-ml distributed-training

Interpretability

Understanding the internal workings of AI models, including which features influence predictions and why.

ai-ethics-safety explainability black-box

Model Card

Documentation describing a model's characteristics, intended use, limitations, and ethical considerations for transparent deployment.

ai-ethics-safety documentation transparency

Privacy-Preserving ML

Techniques for training and deploying models while protecting individual privacy (federated learning, differential privacy).

ai-ethics-safety differential-privacy federated-learning

Robustness

A model's ability to maintain performance under distribution shifts, adversarial attacks, or noisy inputs.

ai-ethics-safety adversarial-attack generalization

AI Infrastructure & Deployment

A/B Testing

Comparing two model versions in production by routing traffic to each and measuring performance differences.

ai-infrastructure-deployment experimentation production

Batch Processing

Processing multiple predictions together in batches rather than one at a time, improving throughput efficiency.

ai-infrastructure-deployment inference throughput

Data Drift

Changes in input data distribution over time that can degrade model performance in production.

ai-infrastructure-deployment model-monitoring production

Edge Deployment

Running models on edge devices (phones, IoT) rather than cloud servers for lower latency and privacy.

ai-infrastructure-deployment inference mobile

GPU

Graphics Processing Unit - hardware accelerator with thousands of cores, essential for parallel computation in deep learning.

ai-infrastructure-deployment tpu cuda

Inference

Using a trained model to make predictions on new data, the deployment phase after training is complete.

ai-infrastructure-deployment training deployment

Knowledge Distillation

Training a smaller 'student' model to mimic a larger 'teacher' model, transferring knowledge while reducing size.

ai-infrastructure-deployment model-compression teacher-student
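
A minimal PyTorch sketch of the standard (Hinton-style) distillation loss, blending a temperature-softened KL term against the teacher with the usual hard-label cross-entropy; the hyperparameters T and alpha are illustrative:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend soft-target KL (teacher) with hard-label cross-entropy (sketch)."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                     # rescale gradients after temperature softening
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```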

MLOps

Practices for deploying, monitoring, and maintaining machine learning models in production, combining ML and DevOps principles.

ai-infrastructure-deployment deployment model-monitoring

Model Compression

Techniques to reduce model size and computational requirements (quantization, pruning, distillation) for efficient deployment.

ai-infrastructure-deployment quantization pruning

Model Drift

Degradation of model performance over time due to changes in the relationship between features and target.

ai-infrastructure-deployment model-monitoring production

Model Monitoring

Tracking model performance, data distribution, and predictions in production to detect issues and degradation.

ai-infrastructure-deployment mlops production

Model Serving

Deploying trained models as services that can handle prediction requests in production environments.

ai-infrastructure-deployment inference deployment

Pruning

Removing unnecessary weights or neurons from a trained model to reduce size and computation while maintaining performance.

ai-infrastructure-deployment model-compression sparsity

Quantization

Reducing the numerical precision of model weights and activations (e.g., FP32 → INT8) to decrease size and increase inference speed with minimal accuracy loss.

ai-infrastructure-deployment model-compression inference
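
A minimal NumPy sketch of affine INT8 quantization; real toolchains add calibration, per-channel scales, and quantization-aware training:

```python
import numpy as np

def quantize_int8(w):
    """Affine-quantize a float32 tensor to int8 (assumes w has nonzero range)."""
    scale = (w.max() - w.min()) / 255.0            # map the range onto 256 levels
    zero_point = np.round(-w.min() / scale) - 128
    q = np.clip(np.round(w / scale + zero_point), -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, s, z = quantize_int8(w)
print(np.abs(w - dequantize(q, s, z)).max())       # small reconstruction error
```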

TPU

Tensor Processing Unit - Google's custom hardware accelerator designed specifically for machine learning workloads.

ai-infrastructure-deployment gpu hardware-acceleration

Advanced Concepts

Active Learning

Iteratively selecting the most informative unlabeled examples for annotation to efficiently improve models with limited labels.

advanced-concepts semi-supervised-learning data-annotation

Attention Head

An individual attention mechanism in multi-head attention, learning specific patterns of relationships between tokens.

advanced-concepts multi-head-attention self-attention

AutoML

Automated Machine Learning - automating the process of model selection, architecture search, and hyperparameter tuning.

advanced-concepts neural-architecture-search hyperparameter-optimization

Catastrophic Forgetting

The tendency of neural networks to abruptly lose previously learned knowledge when trained on new tasks.

advanced-concepts continual-learning lifelong-learning

Causal Inference

Determining cause-and-effect relationships from data, going beyond correlation to understand causal mechanisms.

advanced-concepts causality intervention

Continual Learning

Learning new tasks sequentially without forgetting previously learned tasks, addressing catastrophic forgetting.

advanced-concepts lifelong-learning catastrophic-forgetting

Contrastive Learning

A self-supervised learning approach that learns representations by contrasting similar and dissimilar examples.

advanced-concepts self-supervised-learning simclr

Curriculum Learning

Training strategy where examples are presented from easy to hard, mimicking human learning for improved convergence.

advanced-concepts training-strategy learning

Hyperparameter

Configuration settings external to the model (learning rate, batch size) that must be set before training begins.

advanced-concepts learning-rate batch-size

Hyperparameter Tuning

The process of finding optimal hyperparameter values through techniques like grid search, random search, or Bayesian optimization.

advanced-concepts hyperparameter grid-search

Latent Space

A compressed, learned representation space where similar data points are close together, used in autoencoders and VAEs.

advanced-concepts embedding representation-learning

Meta-Learning

Learning to learn - training models that can quickly adapt to new tasks with minimal data, often applied to few-shot learning scenarios.

advanced-concepts few-shot-learning transfer-learning

Multi-Task Learning

Training a single model on multiple related tasks simultaneously to improve generalization and efficiency.

advanced-concepts transfer-learning shared-representations

Multimodal Learning

Training models on multiple types of data (text, images, audio) to understand relationships across modalities.

advanced-concepts clip vision-language

Neural Architecture Search

Automated methods for discovering optimal neural network architectures, using techniques like reinforcement learning or evolution.

advanced-concepts automl architecture

Online Learning

Models that learn continuously from streaming data, updating incrementally as new data arrives.

advanced-concepts streaming-data incremental-learning

Out-of-Distribution

Data that differs significantly from the training distribution, where models often perform poorly or unreliably.

advanced-concepts distribution-shift robustness

Representation Learning

Learning useful features or representations of data automatically, rather than hand-crafting them.

advanced-concepts feature-learning deep-learning

Self-Supervised Learning

Learning representations from unlabeled data by creating supervised tasks from the data itself (masked prediction, contrastive learning).

advanced-concepts unsupervised-learning pre-training

Semi-Supervised Learning

Learning from a combination of labeled and unlabeled data, leveraging abundant unlabeled data to improve performance.

advanced-concepts supervised-learning unsupervised-learning

Computer Vision

3D Reconstruction

Creating 3D models from 2D images using geometry and deep learning.

computer-vision 3d-vision

Anchor Box

Predefined boxes of various sizes and ratios serving as references for object detection.

computer-vision object-detection bounding-box

Bounding Box

A rectangular box defined by coordinates that localizes an object in an image, used in object detection.

computer-vision object-detection localization

Center Crop

Extracting the central region of an image, often used during inference.

computer-vision image-preprocessing crop

Color Jittering

Randomly adjusting brightness, contrast, saturation, and hue for image augmentation.

computer-vision image-augmentation data-augmentation

Computer Vision

The field of AI focused on enabling computers to understand and interpret visual information from images and videos.

computer-vision cnn image-classification

Data Augmentation in Vision

Creating training image variations through rotation, flipping, cropping, color jittering to improve model robustness.

computer-vision data-augmentation

Depth Estimation

Predicting distance of objects from the camera using monocular or stereo images.

computer-vision 3d-vision

Face Detection

Locating faces in images, a precursor to recognition and analysis.

computer-vision object-detection

Facial Recognition

Identifying or verifying people from face images using deep learning.

computer-vision biometrics

Faster R-CNN

An object detection architecture with RPN for efficient region proposals.

computer-vision object-detection r-cnn

Feature Pyramid Network

A CNN architecture creating multi-scale feature representations for detecting objects at different sizes.

computer-vision object-detection multi-scale

Image Augmentation

Applying transformations (rotation, flip, crop, color) to increase training data diversity.

computer-vision data-augmentation

Image Classification

Assigning a single label or category to an entire image, a fundamental computer vision task.

computer-vision cnn

Image Generation

Creating new images from scratch or from text descriptions using generative models (GANs, diffusion models, VAEs).

computer-vision gan diffusion-model

Image Inpainting

Filling in missing or corrupted regions of images using context and generative models.

computer-vision generative-model

Image Normalization

Scaling pixel values to standard ranges (e.g., mean=0, std=1) to improve training.

computer-vision normalization preprocessing

Image Preprocessing

Transforming images before model input (resizing, normalization, color adjustment).

computer-vision normalization

ImageNet

A large-scale dataset of over 14M images across 20K+ categories; its 1000-class subset (ILSVRC) historically served as the standard benchmark for image classification models.

computer-vision benchmark

Instance Segmentation

Combining object detection and segmentation to identify individual object instances at the pixel level.

computer-vision semantic-segmentation object-detection

Mask R-CNN

Extending Faster R-CNN to instance segmentation by adding a mask prediction branch.

computer-vision instance-segmentation faster-r-cnn

Non-Maximum Suppression

Filtering overlapping detection boxes by keeping only the most confident predictions.

computer-vision object-detection post-processing
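
A minimal NumPy sketch of greedy NMS over [x1, y1, x2, y2] boxes:

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Keep the highest-scoring box, drop rivals overlapping it too much."""
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        # intersection of the current best box with the remaining boxes
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_thresh]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))   # [0, 2]: the near-duplicate box 1 is suppressed
```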

Object Detection

Identifying and localizing multiple objects in an image with bounding boxes and class labels (YOLO, R-CNN, RetinaNet).

computer-vision bounding-box

Optical Flow

Estimating motion patterns between video frames by tracking pixel movements.

computer-vision video-analysis

Pose Estimation

Detecting body keypoints to determine human pose from images or video.

computer-vision keypoint-detection

R-CNN

Region-based CNN - an object detection approach using selective search and CNN features.

computer-vision object-detection two-stage

Random Crop

Extracting random patches from images for augmentation and training.

computer-vision image-augmentation data-augmentation

Region Proposal Network

A network generating candidate object locations for two-stage detectors like Faster R-CNN.

computer-vision object-detection faster-r-cnn

RetinaNet

A single-stage object detector using focal loss to handle class imbalance.

computer-vision object-detection focal-loss

Semantic Segmentation

Classifying every pixel in an image into categories, creating a pixel-level understanding of scenes.

computer-vision instance-segmentation

Style Transfer

Transferring artistic style from one image to another while preserving content.

computer-vision generative-model

Super-Resolution

Enhancing image resolution using deep learning to recover high-frequency details.

computer-vision image-enhancement

Transfer Learning in Vision

Using pre-trained vision models (ImageNet) as feature extractors or fine-tuning for specific visual tasks.

computer-vision pre-training imagenet

YOLO

You Only Look Once - a real-time object detection architecture treating detection as regression.

computer-vision object-detection real-time

Data & Features

Data Leakage

When information from outside the training data is used to create the model, leading to overly optimistic performance estimates.

data-features overfitting validation

Data Preprocessing

Cleaning, transforming, and preparing raw data for model training (handling missing values, normalization, encoding).

data-features data-cleaning normalization

Dataset

A collection of data examples used for training, validating, or testing machine learning models.

data-features training-data test-set

Feature

An individual measurable property or characteristic of data used as input to machine learning models.

data-features feature-engineering input

Feature Importance

Measures indicating which features contribute most to model predictions, useful for interpretation and selection.

data-features interpretability feature-selection

Ground Truth

The correct or true labels/values for data, used as targets during training and as the reference for evaluation.

data-features labeled-data annotation

Imbalanced Dataset

A dataset where classes have significantly different numbers of examples, causing models to bias toward majority classes.

data-features class-imbalance resampling

Labeled Data

Data with associated target outputs or annotations, required for supervised learning tasks.

data-features supervised-learning annotation

Normalization

Scaling features to a standard range (typically 0-1 using min-max scaling) to improve model training and convergence. Often used interchangeably with standardization (mean=0, std=1), though technically distinct.

data-features standardization data-preprocessing
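
A quick NumPy illustration of both conventions:

```python
import numpy as np

x = np.array([3.0, 7.0, 10.0, 20.0])

minmax = (x - x.min()) / (x.max() - x.min())   # min-max scaling to [0, 1]
zscore = (x - x.mean()) / x.std()              # standardization: mean 0, std 1

print(minmax)   # [0.    0.235 0.412 1.   ]
print(zscore)   # centered values with unit standard deviation
```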

One-Hot Encoding

Converting categorical variables into binary vectors with one element set to 1 and others to 0.

data-features encoding categorical-data
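
A minimal sketch with a toy category set:

```python
import numpy as np

categories = ["red", "green", "blue"]
index = {c: i for i, c in enumerate(categories)}

def one_hot(value):
    """Binary vector with a 1 at the category's position, 0 elsewhere."""
    vec = np.zeros(len(categories), dtype=int)
    vec[index[value]] = 1
    return vec

print(one_hot("green"))   # [0 1 0]
```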

Synthetic Data

Artificially generated data created to augment training sets, protect privacy, or simulate rare scenarios.

data-features data-augmentation generative-model

Training Data

The portion of a dataset used to train a model by adjusting its parameters to minimize loss.

data-features dataset test-set

Emerging & Advanced

AlphaFold

DeepMind's breakthrough AI system for predicting protein structures with near-experimental accuracy.

emerging-advanced protein-folding computational-biology

AlphaGo

DeepMind's Go-playing AI that defeated world champions using deep RL and tree search.

emerging-advanced reinforcement-learning game-playing

Behavioral Cloning

Supervised learning of a policy from state-action pairs in expert demonstrations.

emerging-advanced imitation-learning supervised-learning

Domain Randomization

Training with randomized simulation parameters to improve transfer to real-world environments.

emerging-advanced sim-to-real transfer-learning

Graph Attention Network

A GNN using attention mechanisms to weight neighbor contributions when aggregating information.

emerging-advanced gnn attention

Hierarchical RL

Learning policies at multiple levels of abstraction, with high-level goals and low-level skills.

emerging-advanced reinforcement-learning temporal-abstraction

Imitation Learning

Learning policies by mimicking expert behavior from demonstrations.

emerging-advanced reinforcement-learning behavioral-cloning

Inverse Reinforcement Learning

Learning reward functions from expert demonstrations, inferring what is being optimized.

emerging-advanced reinforcement-learning imitation-learning

Message Passing

The fundamental operation in GNNs where nodes exchange and aggregate information with neighbors.

emerging-advanced gnn graph-learning

Meta-RL

Learning to adapt quickly to new RL tasks from experience on related tasks.

emerging-advanced meta-learning reinforcement-learning

Model-Based RL

Reinforcement learning using learned environment models for planning and improving sample efficiency.

emerging-advanced reinforcement-learning world-model

Model-Free RL

Reinforcement learning directly learning policies or value functions without modeling environment dynamics.

emerging-advanced reinforcement-learning q-learning

Monte Carlo Tree Search

A search algorithm combining tree search with random sampling, used in game-playing AIs.

emerging-advanced alphago game-playing

Multi-Agent RL

Reinforcement learning with multiple agents that interact and potentially cooperate or compete.

emerging-advanced reinforcement-learning game-theory

Neural ODE

Neural Ordinary Differential Equations - modeling continuous-depth networks as ODEs, enabling adaptive computation.

emerging-advanced deep-learning continuous-model

Node Embedding

Learning vector representations of graph nodes that capture structural and feature information.

emerging-advanced gnn embedding

Offline RL

Learning policies from fixed datasets without environment interaction, enabling learning from logs.

emerging-advanced reinforcement-learning batch-rl

Protein Folding

Predicting 3D protein structures from amino acid sequences, revolutionized by AlphaFold.

emerging-advanced alphafold computational-biology

Sim-to-Real Transfer

Transferring policies trained in simulation to real-world deployment, crucial for robotics.

emerging-advanced reinforcement-learning robotics

World Model

A learned model of environment dynamics that can predict future states, used in model-based RL.

emerging-advanced reinforcement-learning prediction

Emerging Techniques

Constitutional AI

Training AI systems using principles and rules rather than only human feedback, developed by Anthropic for Claude.

emerging-techniques ai-alignment rlhf

Emergent Abilities

Capabilities that appear suddenly in large language models at certain scales, not present in smaller models.

emerging-techniques scaling-laws llm

Instruction Tuning

Fine-tuning LLMs on diverse instruction-following tasks to improve zero-shot performance on new instructions.

emerging-techniques fine-tuning zero-shot

Mixture of Experts

An architecture where multiple specialized sub-networks (experts) process inputs, with a gating network routing to relevant experts.

emerging-techniques ensemble sparse-models

Multimodal Model

Models processing multiple data types (text, images, audio) jointly, like GPT-4V, Gemini, or CLIP.

emerging-techniques multimodal-learning vision-language

Neural Scaling Laws

Empirical relationships showing how model performance improves predictably with model size, data, and compute.

emerging-techniques model-size compute

Prompt Tuning

Learning continuous prompt embeddings while keeping the LLM frozen, an efficient alternative to fine-tuning.

emerging-techniques prompt-engineering peft

Retrieval-Interleaved Generation

Dynamically retrieving information during generation rather than just before, allowing models to gather facts as needed.

emerging-techniques rag retrieval

Tool Use

LLMs learning to call external tools, APIs, or functions to extend capabilities beyond text generation (calculators, search, code execution).

emerging-techniques function-calling agent

Jargon & Slang

Alignment Tax

Performance degradation that may occur when making models safer and more aligned with human values.

jargon-slang ai-alignment rlhf

Benchmark Gaming

Optimizing models specifically for benchmark performance rather than real-world capabilities, inflating scores artificially.

jargon-slang benchmark evaluation

Black Box

A model whose internal workings are difficult to understand or interpret, common with complex neural networks.

jargon-slang interpretability explainability

Compute

Informal term for computational resources (GPUs, TPUs, time) required for training or running AI models.

jargon-slang gpu tpu

Curse of Dimensionality

Challenges arising when working with high-dimensional data, including data sparsity and computational complexity.

jargon-slang dimensionality-reduction feature-selection

Foundation Model

Large pre-trained models serving as a base for various downstream tasks (GPT, BERT, CLIP, SAM).

jargon-slang pre-training transfer-learning

Grounding

Connecting model outputs to real-world facts, sources, or evidence to improve factuality and reduce hallucinations.

jargon-slang hallucination rag

Hallucination

When language models generate plausible-sounding but factually incorrect or nonsensical information.

jargon-slang llm factuality

No Free Lunch Theorem

The principle that no single ML algorithm works best for all problems - algorithm choice depends on the specific task.

jargon-slang model-selection algorithm

Shot

An example provided in a prompt - zero-shot (no examples), one-shot (one example), few-shot (a few examples).

jargon-slang few-shot-learning zero-shot

Large Language Models

Assistant Response

The output generated by the language model in response to user prompts.

large-language-models generation output

Attention Mask

A binary mask indicating which tokens should be attended to, used to handle padding and causal masking.

large-language-models attention padding

Attention Score

The weight determining how much each value contributes to the output, computed from query-key similarity.

large-language-models attention-mechanism query-key-value

Autoregressive Model

A model that generates output one token at a time, using previously generated tokens as input for the next prediction.

large-language-models gpt causal-language-modeling

Beam Search

A generation algorithm that maintains top-k candidates at each step, balancing quality and diversity.

large-language-models generation sampling

BERT

Bidirectional Encoder Representations from Transformers - a model that understands context by looking at text from both directions.

large-language-models transformer encoder-only

Bidirectional Attention

Allowing tokens to attend to both past and future context, used in encoder models like BERT.

large-language-models attention bert

BLEU Score

A metric for evaluating machine translation quality by comparing n-gram overlap between generated and reference text.

large-language-models evaluation translation

BOS Token

Beginning Of Sequence token - marks the start of a sequence in language models.

large-language-models special-token eos-token

BPE

Byte Pair Encoding - a subword tokenization algorithm that iteratively merges frequent character pairs to create a vocabulary.

large-language-models tokenization subword
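
A toy sketch of the BPE training loop on a tiny word-frequency table: count adjacent symbol pairs, merge the most frequent one, repeat:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus (one BPE step)."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of the pair with its concatenation."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1]); i += 2
            else:
                out.append(symbols[i]); i += 1
        merged[tuple(out)] = freq
    return merged

# word frequencies, with each word split into characters
words = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6}
for _ in range(3):                       # learn 3 merges
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)
    print(pair)                           # e.g. ('w', 'e') is merged first
```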

Causal Language Modeling

Training a model to predict the next token given previous tokens, the foundation of autoregressive models like GPT.

large-language-models gpt autoregressive

Causal Mask

An attention mask ensuring tokens can only attend to previous positions, crucial for autoregressive generation.

large-language-models attention-mask autoregressive
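
A minimal NumPy illustration:

```python
import numpy as np

seq_len = 5
# True where attention is allowed: each token sees itself and earlier positions
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
print(causal_mask.astype(int))
# attention scores are set to -inf where the mask is False, before the softmax
```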

Chain-of-Thought

A prompting technique where the model explains its reasoning step-by-step before giving a final answer, improving complex reasoning.

large-language-models prompt-engineering reasoning

Context Window

The maximum number of tokens an LLM can process at once, including both input prompt and generated output. Also called context length.

large-language-models architecture limitations

Conversation History

Previous messages in a multi-turn dialogue, provided as context for coherent conversations.

large-language-models context-window chat

Cross-Attention

Attention between two different sequences, where queries come from one and keys/values from another.

large-language-models attention-mechanism encoder-decoder

Decoder-Only Model

A transformer architecture with only decoder layers, using causal masking for autoregressive generation (GPT family).

large-language-models gpt autoregressive

Embedding

A dense vector representation of discrete data (words, tokens) in continuous space, capturing semantic relationships.

large-language-models word2vec vector-space

Encoder-Decoder

An architecture where the encoder processes input and the decoder generates output, used in translation and other sequence-to-sequence tasks.

large-language-models transformer seq2seq

Encoder-Only Model

A transformer with only encoder layers and bidirectional attention, suited for understanding tasks (BERT family).

large-language-models bert bidirectional-attention

EOS Token

End Of Sequence token - signals when the model has finished generating a complete output.

large-language-models stop-token generation

Few-Shot Learning

Learning to perform a task from a small number of examples provided in the prompt, without parameter updates.

large-language-models in-context-learning zero-shot

GPT

Generative Pre-trained Transformer - an autoregressive language model architecture that predicts the next token given previous context.

large-language-models transformer autoregressive

Greedy Decoding

Always selecting the most likely next token during generation, fast but can lead to repetitive or suboptimal outputs.

large-language-models generation beam-search

In-Context Learning

The ability of LLMs to learn from examples and instructions provided in the input prompt without training.

large-language-models few-shot-learning prompting

Instruction Following

The ability of language models to understand and execute instructions provided in prompts.

large-language-models instruction-tuning prompt-engineering

Large Language Model

A neural network trained on vast amounts of text data, capable of understanding and generating human-like text across diverse tasks.

large-language-models transformer pre-training

Length Penalty

Adjusting generation scores based on output length to avoid bias toward shorter or longer sequences.

large-language-models generation beam-search

Masked Language Modeling

A pre-training objective where random tokens are masked and the model learns to predict them from context.

large-language-models bert pre-training

Multi-Head Attention

Running multiple attention operations in parallel with different learned projections, capturing diverse relational patterns.

large-language-models self-attention transformer

Nucleus Sampling

Sampling from the smallest token set with cumulative probability exceeding p (also called top-p sampling).

large-language-models top-p-sampling generation

Padding Token

A special token used to make sequences the same length in a batch, typically ignored during computation.

large-language-models special-token batching

Perplexity

A metric measuring how well a language model predicts text - lower perplexity indicates better prediction.

large-language-models language-modeling evaluation
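
A quick illustration with made-up token probabilities; perplexity is the exponentiated average negative log-likelihood:

```python
import numpy as np

# probabilities the model assigned to the actual next tokens (illustrative)
token_probs = np.array([0.2, 0.5, 0.1, 0.4])
perplexity = np.exp(-np.mean(np.log(token_probs)))
print(perplexity)   # ≈ 3.98; a perfect model would score 1.0
```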

Positional Encoding

Adding position information to token embeddings so the model understands word order in sequences.

large-language-models transformer embedding
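
A NumPy sketch of the sinusoidal scheme from the original Transformer paper (learned positional embeddings are a common alternative):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Even dimensions get sine, odd dimensions cosine, at varying frequencies."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

print(sinusoidal_positional_encoding(seq_len=50, d_model=16).shape)  # (50, 16)
```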

Prompt Engineering

The practice of designing and optimizing input prompts to get desired outputs from language models. A crucial skill for effectively using LLMs.

large-language-models technique practical

Prompt Template

A reusable structure for crafting prompts with placeholders for variables, improving consistency.

large-language-models prompt-engineering template

Query-Key-Value

The three learned projections in attention mechanisms used to compute attention weights and outputs.

large-language-models attention-mechanism self-attention

Repetition Penalty

A technique reducing the likelihood of previously generated tokens to avoid repetitive outputs.

large-language-models generation sampling

ROUGE Score

Metrics for evaluating text summarization by measuring overlap of n-grams, word sequences, and word pairs with references.

large-language-models evaluation summarization

Scaled Dot-Product Attention

The attention computation using dot product of queries and keys, scaled by dimension to stabilize gradients.

large-language-models attention-mechanism transformer
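
A minimal NumPy sketch of the computation softmax(QKᵀ/√d_k)·V:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ V

Q = np.random.randn(4, 8); K = np.random.randn(6, 8); V = np.random.randn(6, 8)
print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 8): one output per query
```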

Self-Attention

A mechanism where each token attends to all other tokens in the sequence to understand contextual relationships.

large-language-models attention-mechanism transformer

SentencePiece

A language-agnostic tokenization library that treats text as a raw stream of Unicode characters (including whitespace), requiring no pre-segmentation into words.

large-language-models tokenization bpe

Sequence-to-Sequence

Models that transform input sequences to output sequences, used for translation, summarization, and generation.

large-language-models encoder-decoder machine-translation

Special Token

Reserved tokens with special meanings like [CLS], [SEP], [MASK], [PAD] used in various model architectures.

large-language-models token bert

Stop Token

A special token signaling the end of generation, causing the model to stop producing more tokens.

large-language-models generation special-token

Subword Tokenization

Breaking words into smaller units, balancing vocabulary size with representation granularity.

large-language-models tokenization bpe

System Prompt

Initial instructions defining the model's role, behavior, and constraints for the conversation.

large-language-models prompt-engineering instruction

Temperature

A sampling parameter controlling randomness in generation - lower values make output more deterministic, higher more creative.

large-language-models sampling top-k
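
A minimal sketch of temperature-scaled sampling over raw logits (the logit values are illustrative):

```python
import numpy as np

def sample_with_temperature(logits, temperature=1.0):
    """Lower temperature sharpens the distribution; higher flattens it."""
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())   # numerically stable softmax
    probs /= probs.sum()
    return np.random.choice(len(logits), p=probs)

logits = np.array([2.0, 1.0, 0.5, 0.1])
print(sample_with_temperature(logits, temperature=0.2))  # almost always token 0
```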

Token

The basic unit of text that a language model processes, typically representing a word, subword, or character. Tokens are the fundamental building blocks for LLM input and output.

large-language-models fundamentals nlp

Tokenization

The process of breaking text into smaller units (tokens) that language models can process, using algorithms like BPE or WordPiece.

large-language-models token bpe

Top-k Sampling

A generation strategy that samples from only the k most likely next tokens, balancing quality and diversity.

large-language-models temperature top-p

Top-p Sampling

Nucleus sampling - selecting from the smallest set of tokens whose cumulative probability exceeds p, providing dynamic vocabulary.

large-language-models temperature top-k
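
A minimal NumPy sketch (the probabilities are illustrative):

```python
import numpy as np

def top_p_sample(probs, p=0.9):
    """Sample from the smallest token set whose probability mass exceeds p."""
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1    # size of the nucleus
    nucleus = order[:cutoff]
    renormed = probs[nucleus] / probs[nucleus].sum()
    return np.random.choice(nucleus, p=renormed)

probs = np.array([0.5, 0.3, 0.1, 0.05, 0.05])
print(top_p_sample(probs, p=0.9))   # samples only from tokens 0-2
```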

User Prompt

The input provided by the user to the language model, containing questions or instructions.

large-language-models prompt prompt-engineering

Vocabulary Size

The number of distinct tokens a language model can process, typically 30K-100K+ tokens.

large-language-models tokenization token

WordPiece

A subword tokenization algorithm used by BERT, similar to BPE but with different merging criteria.

large-language-models tokenization bpe

Zero-Shot Learning

A model's ability to perform tasks it wasn't explicitly trained on, using only instructions or descriptions.

large-language-models few-shot-learning transfer-learning

Machine Learning Fundamentals

Bayesian Inference

Using Bayes' theorem to update beliefs about parameters given data, incorporating uncertainty.

machine-learning-fundamentals map prior

Bias-Variance Tradeoff

The balance between a model's bias (systematic error) and variance (sensitivity to training data fluctuations).

machine-learning-fundamentals overfitting underfitting

Classification

A supervised learning task where the model predicts discrete class labels (categories) for input data.

machine-learning-fundamentals supervised-learning binary-classification

Clustering

An unsupervised learning technique that groups similar data points together based on their features or characteristics.

machine-learning-fundamentals k-means hierarchical-clustering

Concentration of Measure

A phenomenon where random variables in high-dimensional spaces concentrate around their mean or median, with most probability mass within a narrow band.

machine-learning-fundamentals curse-of-dimensionality high-dimensions

Conjugate Prior

A prior that when combined with a likelihood results in a posterior of the same family, simplifying Bayesian inference.

machine-learning-fundamentals bayesian-inference prior

Covariance

A measure of how two variables change together, indicating the direction of their linear relationship.

machine-learning-fundamentals correlation variance

Cross-Validation

A technique for assessing model performance by partitioning data into subsets, training on some and validating on others.

machine-learning-fundamentals k-fold validation
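
For example, 5-fold cross-validation with scikit-learn (the model and dataset are chosen only for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean(), scores.std())   # average accuracy across the 5 folds
```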

Curse of Dimensionality

Phenomena where algorithms become inefficient as dimensionality increases, including data sparsity and distance concentration.

machine-learning-fundamentals high-dimensions dimensionality-reduction

Decision Tree

A tree-structured model that makes decisions by splitting data based on feature values, interpretable but prone to overfitting.

machine-learning-fundamentals random-forest classification

Dimensionality Reduction

Techniques to reduce the number of input features while preserving important information (PCA, t-SNE, autoencoders).

machine-learning-fundamentals pca t-sne

Empirical Risk Minimization

The principle of choosing a model that minimizes error on training data, fundamental to supervised learning.

machine-learning-fundamentals loss-function training

Ensemble Learning

Combining multiple models to produce better predictions than any individual model (bagging, boosting, stacking).

machine-learning-fundamentals random-forest boosting

Entropy

A measure of uncertainty or randomness in a random variable from information theory.

machine-learning-fundamentals information-theory cross-entropy
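
A quick illustration in bits:

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(p) = -sum p_i * log2(p_i), in bits."""
    return -np.sum(p * np.log2(p))

print(entropy(np.array([0.5, 0.5])))   # 1.0 bit (fair coin, maximally uncertain)
print(entropy(np.array([0.9, 0.1])))   # ≈ 0.47 bits (more predictable)
```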

Evidence Lower Bound

A lower bound on log likelihood used in variational inference and VAEs for tractable optimization.

machine-learning-fundamentals variational-inference vae

Expectation-Maximization

An iterative algorithm for finding maximum likelihood estimates in models with latent variables.

machine-learning-fundamentals latent-variable mle

Feature Engineering

The process of selecting, transforming, and creating input features to improve model performance.

machine-learning-fundamentals feature-selection feature-extraction

Feature Selection

Choosing the most relevant features from available data to reduce dimensionality and improve model performance.

machine-learning-fundamentals feature-engineering dimensionality-reduction

Gaussian Process

A non-parametric Bayesian approach for regression and classification, defining distributions over functions.

machine-learning-fundamentals bayesian-inference kernel-method

Gibbs Sampling

An MCMC method that samples from conditional distributions to approximate joint distributions.

machine-learning-fundamentals mcmc sampling

Gradient Boosting

An ensemble technique that builds models sequentially, each correcting errors of previous ones (XGBoost, LightGBM, CatBoost).

machine-learning-fundamentals ensemble-learning boosting

Hidden Markov Model

A statistical model with hidden states that transition probabilistically, generating observable outputs.

machine-learning-fundamentals latent-variable sequence-model

Inductive Bias

Assumptions built into a learning algorithm that guide it toward certain solutions over others.

machine-learning-fundamentals learning-algorithm prior-knowledge

Information Bottleneck

A principle for learning representations that compress input while retaining information relevant to prediction.

machine-learning-fundamentals representation-learning compression

Jensen-Shannon Divergence

A symmetric measure of similarity between probability distributions, related to KL divergence.

machine-learning-fundamentals kl-divergence distance-metric

KL Divergence

Kullback-Leibler divergence - a measure of how one probability distribution differs from another.

machine-learning-fundamentals information-theory entropy
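
A minimal sketch over discrete distributions; note that the measure is asymmetric:

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_i p_i * log(p_i / q_i)."""
    return np.sum(p * np.log(p / q))

p = np.array([0.6, 0.3, 0.1])
q = np.array([0.4, 0.4, 0.2])
print(kl_divergence(p, q), kl_divergence(q, p))   # the two directions differ
```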

Latent Variable

Hidden or unobserved variables in a model that influence observed data but aren't directly measured.

machine-learning-fundamentals hidden-state em-algorithm

Manifold Hypothesis

The assumption that high-dimensional data lies on or near a lower-dimensional manifold, justifying dimensionality reduction.

machine-learning-fundamentals dimensionality-reduction representation-learning

Markov Chain Monte Carlo

Sampling methods for approximating distributions, especially for Bayesian inference in complex models.

machine-learning-fundamentals bayesian-inference sampling

Maximum A Posteriori

Parameter estimation that incorporates prior beliefs, maximizing posterior probability rather than just likelihood.

machine-learning-fundamentals bayesian-inference mle

Maximum Likelihood Estimation

Finding model parameters that maximize the probability of observing the training data.

machine-learning-fundamentals parameter-estimation probability

Mutual Information

A measure of dependence between variables, quantifying how much knowing one reduces uncertainty about the other.

machine-learning-fundamentals information-theory dependence

Occam's Razor

The principle that simpler models should be preferred when they perform equally well, reducing overfitting.

machine-learning-fundamentals model-selection simplicity

Overfitting

When a model learns training data too well, including noise and outliers, causing poor generalization to new data.

machine-learning-fundamentals generalization regularization

PAC Learning

Probably Approximately Correct - a theoretical framework for analyzing learning algorithm guarantees.

machine-learning-fundamentals learning-theory generalization

Posterior Distribution

In Bayesian methods, the updated belief about parameters after observing data.

machine-learning-fundamentals bayesian-inference prior

Principal Component Analysis

A dimensionality reduction technique that transforms data into orthogonal components ordered by variance explained.

machine-learning-fundamentals dimensionality-reduction eigenvalues
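
A minimal NumPy sketch via SVD of the centered data:

```python
import numpy as np

def pca(X, k):
    """Project data onto the top-k principal components."""
    X_centered = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained = S**2 / np.sum(S**2)        # fraction of variance per component
    return X_centered @ Vt[:k].T, explained[:k]

X = np.random.randn(100, 5)
Z, var = pca(X, k=2)
print(Z.shape, var)                         # (100, 2) projection
```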

Prior Distribution

In Bayesian methods, the initial belief about parameters before observing data.

machine-learning-fundamentals bayesian-inference posterior

Rademacher Complexity

A measure of how well a model class can fit random noise, indicating capacity and generalization ability.

machine-learning-fundamentals learning-theory generalization

Random Forest

An ensemble of decision trees trained on random subsets of data and features, reducing overfitting through averaging.

machine-learning-fundamentals decision-tree ensemble-learning

Regression

A supervised learning task where the model predicts continuous numerical values rather than discrete categories.

machine-learning-fundamentals supervised-learning linear-regression

Reinforcement Learning

Learning through interaction with an environment, receiving rewards or penalties to learn optimal behavior policies.

machine-learning-fundamentals agent environment

Supervised Learning

A machine learning paradigm where models learn from labeled training data to make predictions on new, unseen data.

machine-learning-fundamentals classification regression

Support Vector Machine

A supervised learning algorithm that finds the optimal hyperplane to separate classes with maximum margin.

machine-learning-fundamentals kernel-trick classification

Train-Test Split

Dividing a dataset into separate portions for training the model and evaluating its performance on unseen data.

machine-learning-fundamentals validation-set test-set
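
For example, an 80/20 split with scikit-learn (the toy arrays are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.arange(100).reshape(50, 2), np.arange(50)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42   # 80/20 split, reproducible shuffle
)
print(len(X_train), len(X_test))            # 40 10
```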

Underfitting

When a model is too simple to capture the underlying pattern in data, performing poorly on both training and test sets.

machine-learning-fundamentals model-complexity bias-variance-tradeoff

Universal Approximation Theorem

The theorem stating that a feedforward network with a single hidden layer and sufficiently many units can approximate any continuous function on a compact domain to arbitrary accuracy.

machine-learning-fundamentals neural-network theory

Unsupervised Learning

Learning from unlabeled data to discover hidden patterns, structures, or relationships without explicit target outputs.

machine-learning-fundamentals clustering dimensionality-reduction

Variational Inference

Approximating complex distributions by optimizing over a simpler family, an alternative to MCMC.

machine-learning-fundamentals bayesian-inference optimization

VC Dimension

A measure of model capacity - the largest set of points a model can shatter (classify in all possible ways).

machine-learning-fundamentals model-capacity learning-theory

Wasserstein Distance

Earth Mover's Distance - a metric measuring the minimum cost to transform one distribution into another.

machine-learning-fundamentals distance-metric wgan

Model Architectures

Attention Mechanism

A technique that allows neural networks to focus on relevant parts of the input when producing each output, assigning different weights to different input elements.

model-architectures attention nlp

BART

Bidirectional and Auto-Regressive Transformer - combines BERT-like encoder with GPT-like decoder for sequence-to-sequence tasks.

model-architectures encoder-decoder transformer

CLIP

Contrastive Language-Image Pre-training - a model jointly trained on images and text, enabling zero-shot image classification.

model-architectures vision-language zero-shot

Diffusion Model

A generative model that learns to denoise data, achieving state-of-the-art image generation (Stable Diffusion, DALL-E 2).

model-architectures image-generation generative-model

EfficientNet

A family of CNNs that scale depth, width, and resolution simultaneously using compound scaling for optimal efficiency.

model-architectures cnn computer-vision

Inception

A CNN architecture (GoogLeNet) using parallel convolutions of different sizes to capture multi-scale features efficiently.

model-architectures cnn multi-scale

ResNet

Residual Network - a CNN architecture using skip connections to enable training of very deep networks (up to 1000+ layers).

model-architectures residual-connection cnn

Stable Diffusion

A latent diffusion model for text-to-image generation that operates in compressed latent space for efficiency.

model-architectures diffusion-model text-to-image

T5

Text-to-Text Transfer Transformer - frames all NLP tasks as text-to-text problems using a unified encoder-decoder architecture.

model-architectures transformer encoder-decoder

Transformer

A neural network architecture introduced in 'Attention is All You Need' (2017) that relies entirely on self-attention mechanisms, becoming the foundation for modern LLMs.

model-architectures llm attention

U-Net

A CNN architecture with encoder-decoder structure and skip connections, widely used for image segmentation tasks.

model-architectures semantic-segmentation cnn

VGG

A CNN architecture known for its simplicity, using small 3x3 convolutions stacked deeply, influential in computer vision.

model-architectures cnn computer-vision

Vision Transformer

Applying the transformer architecture to computer vision by treating image patches as tokens, achieving state-of-the-art results.

model-architectures transformer computer-vision

Model Evaluation

Accuracy

The proportion of correct predictions out of total predictions, a basic classification metric.

model-evaluation precision recall

AUC

Area Under the Curve - measures the area under the ROC curve, indicating classification model performance (0.5 is chance level, 1.0 is perfect).

model-evaluation roc-curve classification

Baseline Model

A simple reference model (random, majority class, simple heuristic) used to benchmark more complex models against.

model-evaluation evaluation benchmark

Benchmark

A standardized dataset and task used to compare model performance across different approaches (ImageNet, GLUE, SuperGLUE).

model-evaluation evaluation leaderboard

Confusion Matrix

A table showing true positives, true negatives, false positives, and false negatives for classification evaluation.

model-evaluation precision recall

F1 Score

The harmonic mean of precision and recall, providing a single metric that balances both concerns.

model-evaluation precision recall
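
A quick worked example from confusion-matrix counts (the counts are illustrative):

```python
def f1_score(tp, fp, fn):
    precision = tp / (tp + fp)   # fraction of predicted positives that are correct
    recall = tp / (tp + fn)      # fraction of actual positives that are found
    return 2 * precision * recall / (precision + recall)

print(f1_score(tp=80, fp=20, fn=40))   # precision 0.8, recall ≈ 0.667, F1 ≈ 0.727
```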

Mean Absolute Error

The average absolute difference between predictions and actual values, a regression metric less sensitive to outliers than MSE.

model-evaluation mse rmse

Precision

The proportion of true positives among all positive predictions - measures how many predicted positives are actually positive.

model-evaluation recall f1-score

R-squared

Coefficient of determination - measures the proportion of variance in the target variable explained by the model.

model-evaluation regression evaluation

Recall

The proportion of true positives among all actual positives - measures how many actual positives were correctly identified.

model-evaluation precision f1-score

ROC Curve

Receiver Operating Characteristic curve - plots true positive rate vs false positive rate at various classification thresholds.

model-evaluation auc classification

Test Set

A final portion of data unseen during training and validation, used for unbiased evaluation of model performance.

model-evaluation train-test-split validation-set

Validation Set

A portion of data held out from training, used to tune hyperparameters and monitor overfitting.

model-evaluation train-test-split cross-validation

Model Evaluation & Metrics

Average Precision

The weighted mean of precisions at each threshold, where the weight is the increase in recall from the previous threshold.

model-evaluation-metrics precision-recall-curve map

Cohen's Kappa

A metric measuring agreement between raters/models accounting for chance agreement.

model-evaluation-metrics evaluation agreement

Contrastive Loss

A loss function for learning similarity metrics, bringing similar pairs together and separating dissimilar ones.

model-evaluation-metrics loss-function metric-learning

CTC Loss

Connectionist Temporal Classification loss for sequence tasks without alignment, used in speech recognition.

model-evaluation-metrics loss-function sequence-modeling

Dice Coefficient

A metric measuring overlap between predicted and ground truth segmentations, common in medical imaging.

model-evaluation-metrics segmentation iou

False Negative

Positive cases that are incorrectly predicted as negative (Type II error) in classification.

model-evaluation-metrics confusion-matrix recall

False Positive

Incorrectly predicted positive cases (Type I error) in classification.

model-evaluation-metrics confusion-matrix precision

False Positive Rate

The proportion of negatives incorrectly classified as positive.

model-evaluation-metrics specificity roc-curve

Focal Loss

A modified cross-entropy loss that down-weights easy examples, helping with class imbalance.

model-evaluation-metrics loss-function imbalanced-data
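
A minimal sketch of the binary form, omitting the optional class-weighting factor α:

```python
import numpy as np

def focal_loss(p, y, gamma=2.0):
    """(1 - p_t)^gamma down-weights easy, confidently-correct examples."""
    p_t = np.where(y == 1, p, 1 - p)    # probability assigned to the true class
    return -np.mean((1 - p_t) ** gamma * np.log(p_t))

p = np.array([0.9, 0.6, 0.2])           # predicted P(class = 1)
y = np.array([1, 1, 0])
print(focal_loss(p, y))
```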

Hinge Loss

A loss function for maximum-margin classification, used in SVMs.

model-evaluation-metrics loss-function svm

Huber Loss

A loss function that's quadratic for small errors and linear for large errors, robust to outliers.

model-evaluation-metrics loss-function regression

Intersection over Union

A metric for object detection measuring overlap between predicted and ground truth bounding boxes.

model-evaluation-metrics object-detection bounding-box
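
A minimal sketch for two axis-aligned boxes in [x1, y1, x2, y2] form:

```python
def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

print(iou([0, 0, 10, 10], [5, 5, 15, 15]))   # 25 / 175 ≈ 0.143
```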

Jaccard Index

Measuring similarity between sets, equivalent to IoU for binary segmentation.

model-evaluation-metrics iou segmentation

Log Loss

Logarithmic loss measuring the accuracy of probabilistic predictions, penalizing confident wrong predictions.

model-evaluation-metrics cross-entropy-loss probability

Matthews Correlation Coefficient

A balanced measure for binary classification considering all confusion matrix values, robust to imbalance.

model-evaluation-metrics confusion-matrix evaluation

Precision-Recall Curve

A curve showing the tradeoff between precision and recall at different thresholds.

model-evaluation-metrics precision recall

Sensitivity

Same as recall - the proportion of actual positives correctly identified.

model-evaluation-metrics recall true-positive-rate

Specificity

The proportion of actual negatives correctly identified.

model-evaluation-metrics true-negative-rate evaluation

Triplet Loss

A loss for learning embeddings by pulling similar examples together and pushing dissimilar ones apart.

model-evaluation-metrics loss-function metric-learning

True Negative

Correctly predicted negative cases in classification.

model-evaluation-metrics confusion-matrix specificity

True Positive

Correctly predicted positive cases in classification.

model-evaluation-metrics confusion-matrix precision

Natural Language Processing

Anaphora Resolution

Determining what a pronoun or noun phrase refers back to in text.

natural-language-processing coreference-resolution nlp

Cloze Task

A task where words are removed from text and must be predicted, used for evaluation and pre-training.

natural-language-processing nlp masked-language-modeling

Constituency Parsing

Analyzing sentence structure into nested constituents (noun phrases, verb phrases, etc.).

natural-language-processing nlp syntax

Coreference Resolution

Identifying all expressions in text that refer to the same entity (e.g., linking pronouns to nouns).

natural-language-processing nlp named-entity-recognition

Dependency Parsing

Analyzing grammatical structure by identifying relationships between words (subject, object, modifier).

natural-language-processing nlp syntax

Dialogue State Tracking

Maintaining a representation of conversation state in dialogue systems.

natural-language-processing nlp conversational-ai

Entity Linking

Linking entity mentions in text to entries in a knowledge base.

natural-language-processing named-entity-recognition knowledge-graph

Information Extraction

Automatically extracting structured information from unstructured text.

natural-language-processing nlp named-entity-recognition

Intent Recognition

Identifying the user's intention or goal from their utterance in dialogue systems.

natural-language-processing nlp conversational-ai

Language Modeling

Learning probability distributions over sequences of words to predict what comes next.

natural-language-processing llm nlp

Language Understanding

The ability to comprehend meaning, context, intent, and nuance in natural language.

natural-language-processing nlu nlp

Latent Dirichlet Allocation

A generative probabilistic model for topic modeling that represents documents as mixtures of topics.

natural-language-processing topic-modeling nlp

Lemmatization

Reducing words to their base or dictionary form (running → run) using linguistic knowledge.

natural-language-processing stemming nlp

Machine Translation

Automatically translating text from one language to another using neural models (typically encoder-decoder architectures).

natural-language-processing nlp encoder-decoder

N-gram

A contiguous sequence of n items (words, characters) from text, used in language modeling and feature extraction.

natural-language-processing language-modeling tokenization
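
A minimal sketch extracting bigrams (n = 2) from word tokens:

```python
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams("the cat sat on the mat".split(), 2))
# [('the', 'cat'), ('cat', 'sat'), ('sat', 'on'), ('on', 'the'), ('the', 'mat')]
```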

Named Entity Recognition

Identifying and classifying named entities (people, organizations, locations) in text into predefined categories.

natural-language-processing nlp information-extraction

Natural Language Inference

Determining logical relationships (entailment, contradiction, neutral) between sentence pairs.

natural-language-processing nlp reasoning

Natural Language Processing

The field of AI focused on enabling computers to understand, interpret, and generate human language.

natural-language-processing llm tokenization

Paraphrase Detection

Determining if two text segments express the same meaning in different words.

natural-language-processing nlp semantic-similarity

Part-of-Speech Tagging

Labeling words in text with their grammatical roles (noun, verb, adjective, etc.).

natural-language-processing nlp syntax

Question Answering

Systems that automatically answer questions posed in natural language, often by reading and comprehending text passages.

natural-language-processing nlp reading-comprehension

Reading Comprehension

Answering questions about a text passage, testing understanding.

natural-language-processing question-answering nlp

Relation Extraction

Identifying semantic relationships between entities in text.

natural-language-processing information-extraction nlp

Semantic Role Labeling

Identifying the semantic relationships between predicates and their arguments in sentences.

natural-language-processing nlp semantics

Semantic Similarity

Measuring how similar two pieces of text are in meaning, often using embedding-based distance metrics.

natural-language-processing embeddings cosine-similarity
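
For example, cosine similarity between two toy embedding vectors (real embeddings would come from a trained encoder):

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

e1, e2 = np.array([0.2, 0.9, 0.1]), np.array([0.25, 0.8, 0.05])
print(cosine_similarity(e1, e2))   # close to 1.0 → similar meaning
```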

Sentiment Analysis

Determining the emotional tone or opinion expressed in text (positive, negative, neutral).

natural-language-processing nlp opinion-mining

Slot Filling

Extracting specific pieces of information (slots) needed to fulfill a user's intent.

natural-language-processing nlp information-extraction

Stemming

Reducing words to their root form by stripping affixes (mainly suffixes), simpler than lemmatization but less linguistically accurate.

natural-language-processing lemmatization nlp

Stop Words

Common words (the, is, at) often removed in NLP preprocessing as they carry little semantic meaning.

natural-language-processing nlp preprocessing

Text Classification

Assigning categories or labels to text documents, a fundamental NLP task.

natural-language-processing nlp classification

Text Generation

Automatically creating coherent text using language models, from simple completion to creative writing.

natural-language-processing llm gpt

Text Summarization

Generating concise summaries of longer texts, either extractive (selecting sentences) or abstractive (generating new text).

natural-language-processing nlp rouge

Textual Entailment

Determining if one text fragment logically follows from another.

natural-language-processing nli nlp

TF-IDF

Term Frequency-Inverse Document Frequency - a statistical measure of word importance in documents, used for information retrieval.

natural-language-processing information-retrieval feature-extraction
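
For example, with scikit-learn's TfidfVectorizer (the documents are toy examples):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)       # sparse (3 docs x vocab) matrix
print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(2))              # rare terms get higher weights
```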

Topic Modeling

Discovering abstract topics in document collections, often using techniques like LDA.

natural-language-processing nlp unsupervised-learning

Word Sense Disambiguation

Determining which meaning of a word is used in a particular context.

natural-language-processing nlp semantics

Word2Vec

A technique for learning word embeddings that capture semantic relationships (Skip-gram and CBOW models).

natural-language-processing embedding word-embeddings

Neural Networks & Deep Learning

Activation Function

A non-linear function applied to neuron outputs that introduces non-linearity, enabling networks to learn complex patterns.

neural-networks-deep-learning relu sigmoid

Attention Is All You Need

The seminal 2017 paper by Vaswani et al. introducing the Transformer architecture that revolutionized NLP.

neural-networks-deep-learning transformer self-attention

Autoencoder

An unsupervised neural network that learns to compress data into a latent representation and reconstruct it, useful for dimensionality reduction.

neural-networks-deep-learning encoder decoder

Batch Normalization

A technique that normalizes layer inputs to stabilize and accelerate training by reducing internal covariate shift.

neural-networks-deep-learning layer-normalization training-stability
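
A minimal NumPy sketch of the training-mode forward pass (the running statistics used at inference are omitted):

    import numpy as np

    def batch_norm(x, gamma, beta, eps=1e-5):
        # Normalize each feature over the batch dimension, then scale and shift.
        mean = x.mean(axis=0)
        var = x.var(axis=0)
        x_hat = (x - mean) / np.sqrt(var + eps)
        return gamma * x_hat + beta

    x = np.random.randn(32, 4) * 10 + 5   # batch of 32, 4 badly scaled features
    y = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
    print(y.mean(axis=0).round(3), y.std(axis=0).round(3))  # ~0 and ~1 per feature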

Boltzmann Machine

A stochastic recurrent neural network that can learn probability distributions over binary data.

neural-networks-deep-learning rbm energy-based-model

Capsule Network

An architecture using capsules (groups of neurons) that preserve spatial relationships, addressing limitations of CNNs.

neural-networks-deep-learning cnn dynamic-routing

Convolution

A mathematical operation that applies filters/kernels to input data to extract features like edges, textures, and patterns.

neural-networks-deep-learning cnn filter
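
A naive NumPy sketch of a 2D "valid" convolution (strictly, the cross-correlation most deep learning libraries compute):

    import numpy as np

    def conv2d(image, kernel):
        # Slide the kernel over the image with no padding, stride 1.
        kh, kw = kernel.shape
        oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
        out = np.zeros((oh, ow))
        for i in range(oh):
            for j in range(ow):
                out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
        return out

    image = np.array([[1, 1, 0, 0]] * 4, dtype=float)
    edge_kernel = np.array([[1.0, -1.0]])   # responds to horizontal intensity changes
    print(conv2d(image, edge_kernel))       # nonzero only at the 1 -> 0 edge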

Convolutional Neural Network

A deep learning architecture designed for processing grid-like data (images) using convolutional layers that learn spatial hierarchies.

neural-networks-deep-learning computer-vision convolution

Deep Belief Network

A generative model composed of multiple layers of RBMs, historically important for unsupervised pre-training.

neural-networks-deep-learning rbm generative-model

Deep Learning

A subset of machine learning that uses neural networks with multiple layers (deep neural networks) to learn hierarchical representations of data.

fundamentals machine-learning architecture

Depthwise Separable Convolution

An efficient convolution that factorizes standard convolution into depthwise and pointwise steps, reducing parameters.

neural-networks-deep-learning convolution mobilenet

Dilated Convolution

Convolution with gaps between kernel elements, expanding the receptive field without increasing parameters.

neural-networks-deep-learning convolution receptive-field

Discriminator

In GANs, the network that tries to distinguish between real and generated data, providing training signal to the generator.

neural-networks-deep-learning gan generator

Dropout

A regularization technique that randomly deactivates neurons during training to prevent overfitting and improve generalization.

neural-networks-deep-learning regularization overfitting
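
A minimal sketch of inverted dropout, the variant most frameworks implement:

    import numpy as np

    def dropout(x, p=0.5, training=True):
        # Scale surviving activations at train time so no rescaling is needed at inference.
        if not training:
            return x
        mask = (np.random.rand(*x.shape) >= p).astype(x.dtype)
        return x * mask / (1.0 - p)

    activations = np.ones((2, 4))
    print(dropout(activations, p=0.5))  # roughly half the entries zeroed, the rest scaled to 2.0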

Echo State Network

A recurrent network with a fixed random reservoir and trained readout layer, efficient for time series processing.

neural-networks-deep-learning reservoir-computing liquid-state-machine

ELU

Exponential Linear Unit - an activation function that allows negative values, helping with vanishing gradients.

neural-networks-deep-learning activation-function relu

Feature Map

The output of applying a convolutional filter to an input, representing detected features at various spatial locations.

neural-networks-deep-learning cnn convolution

Feedforward Network

A neural network where information flows in one direction from input to output without cycles.

neural-networks-deep-learning neural-network backpropagation

Gated Recurrent Unit

A simplified variant of LSTM with fewer parameters, using an update gate and a reset gate to control information flow.

neural-networks-deep-learning lstm rnn

GELU

Gaussian Error Linear Unit - a smooth activation function combining properties of dropout and ReLU, used in BERT and GPT.

neural-networks-deep-learning activation-function relu

Generative Adversarial Network

A framework where two networks (generator and discriminator) compete, with the generator learning to create realistic data.

neural-networks-deep-learning generator discriminator

Generator

In GANs, the network that creates synthetic data attempting to fool the discriminator into thinking it's real.

neural-networks-deep-learning gan discriminator

Group Normalization

Normalizing groups of channels independently, more stable than batch normalization for small batch sizes.

neural-networks-deep-learning batch-normalization layer-normalization

He Initialization

Weight initialization designed for ReLU activations, preventing vanishing/exploding gradients in deep networks.

neural-networks-deep-learning weight-initialization xavier-initialization

Hopfield Network

A recurrent network that serves as content-addressable memory, capable of pattern completion and associative memory.

neural-networks-deep-learning recurrent-network energy-function

Leaky ReLU

A variant of ReLU allowing small negative values (f(x) = x if x > 0, else αx where α ≈ 0.01), preventing dead neurons.

neural-networks-deep-learning relu activation-function

Liquid State Machine

A reservoir computing model where a recurrent network acts as a dynamic reservoir for temporal pattern recognition.

neural-networks-deep-learning reservoir-computing echo-state-network

Long Short-Term Memory

A type of RNN architecture with gates that can learn long-term dependencies, solving the vanishing gradient problem.

neural-networks-deep-learning rnn gru

Maxout

An activation function that outputs the maximum of multiple linear functions, providing universal approximation.

neural-networks-deep-learning activation-function relu

Mish

A smooth, non-monotonic activation function (x * tanh(softplus(x))) providing better gradients than ReLU.

neural-networks-deep-learning activation-function relu

Multi-Layer Perceptron

A feedforward neural network with multiple layers of perceptrons, capable of learning non-linear functions.

neural-networks-deep-learning perceptron feedforward-network

Neural Network

A computational model inspired by biological neural networks, consisting of interconnected nodes (neurons) organized in layers that process information through weighted connections.

fundamentals architecture deep-learning

Neuromorphic Computing

Hardware and algorithms designed to mimic the brain's structure and function, enabling efficient spike-based computation.

neural-networks-deep-learning spiking-neural-network brain-inspired

Perceptron

The simplest neural network unit, a single-layer binary classifier that inspired modern deep learning.

neural-networks-deep-learning neural-network linear-classifier

Pooling

A down-sampling operation in CNNs that reduces spatial dimensions while retaining important features (max pooling, average pooling).

neural-networks-deep-learning cnn max-pooling

PReLU

Parametric ReLU - like Leaky ReLU but with learnable negative slope parameter.

neural-networks-deep-learning leaky-relu relu

Radial Basis Function Network

A neural network using radial basis functions as activation functions, useful for function approximation and interpolation.

neural-networks-deep-learning neural-network activation-function

Recurrent Neural Network

A neural network architecture with loops that allow information to persist, designed for sequential data like text and time series.

neural-networks-deep-learning lstm gru

ReLU

Rectified Linear Unit - an activation function that outputs the input if positive, zero otherwise. f(x) = max(0, x).

neural-networks-deep-learning activation-function leaky-relu

Residual Connection

Skip connections that allow gradients to flow directly through a network, enabling training of very deep networks (ResNet).

neural-networks-deep-learning resnet skip-connection

Restricted Boltzmann Machine

A simpler variant of Boltzmann machines with no intra-layer connections, used for unsupervised learning and dimensionality reduction.

neural-networks-deep-learning boltzmann-machine deep-belief-network

Sigmoid

An activation function that squashes values to range (0,1), often used for binary classification and gates in LSTMs.

neural-networks-deep-learning activation-function logistic-function

Softmax

An activation function that converts a vector of values into a probability distribution, commonly used for multi-class classification.

neural-networks-deep-learning classification probability-distribution
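
A minimal numerically stable implementation:

    import numpy as np

    def softmax(logits):
        # Subtracting the max prevents overflow and does not change the result.
        z = logits - np.max(logits)
        e = np.exp(z)
        return e / e.sum()

    print(softmax(np.array([2.0, 1.0, 0.1])))  # ~[0.659 0.242 0.099], sums to 1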

Spectral Normalization

Constraining the spectral norm of weight matrices to stabilize GAN training.

neural-networks-deep-learning gan normalization

Spiking Neural Network

Networks inspired by biological neurons that communicate through discrete spikes, incorporating temporal dynamics.

neural-networks-deep-learning neuromorphic-computing temporal-coding

Squeeze-and-Excitation

A channel attention mechanism that adaptively recalibrates channel-wise feature responses, improving CNNs.

neural-networks-deep-learning attention cnn

Swish

A smooth activation function (x * sigmoid(x)) that often outperforms ReLU, discovered through neural architecture search.

neural-networks-deep-learning activation-function relu

Transposed Convolution

An operation that upsamples feature maps, often used in decoders and generative models (also called deconvolution).

neural-networks-deep-learning convolution upsampling

Variational Autoencoder

A generative model that learns a probabilistic latent space, allowing sampling of new data points similar to training data.

neural-networks-deep-learning autoencoder generative-model

Weight Normalization

Reparameterizing weight vectors to improve optimization by decoupling magnitude from direction.

neural-networks-deep-learning normalization optimization

Xavier Initialization

A weight initialization strategy maintaining variance across layers, improving training of deep networks.

neural-networks-deep-learning weight-initialization he-initialization
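
A minimal sketch of both schemes; the standard deviations follow Var = 2/(fan_in + fan_out) for Xavier and Var = 2/fan_in for He:

    import numpy as np

    fan_in, fan_out = 512, 256

    # Xavier/Glorot: balances variance in both directions (tanh/sigmoid-friendly).
    xavier_std = np.sqrt(2.0 / (fan_in + fan_out))
    # He: scales variance for ReLU, which zeroes roughly half the activations.
    he_std = np.sqrt(2.0 / fan_in)

    W_xavier = np.random.randn(fan_out, fan_in) * xavier_std
    W_he = np.random.randn(fan_out, fan_in) * he_std
    print(round(xavier_std, 4), round(he_std, 4))  # 0.051, 0.0625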

Practical Deployment

Blue-Green Deployment

Running two identical environments to enable zero-downtime model updates and rollbacks.

practical-deployment deployment mlops

Calibration

Ensuring predicted probabilities accurately reflect the true likelihood of outcomes.

practical-deployment probability confidence

Canary Deployment

Gradually rolling out new model versions to a subset of traffic before full deployment.

practical-deployment deployment ab-testing

CI/CD for ML

Continuous integration and deployment practices adapted for machine learning pipelines.

practical-deployment mlops automation

Data Versioning

Tracking different versions of datasets to ensure reproducibility and manage changes.

practical-deployment mlops reproducibility

Experiment Tracking

Recording hyperparameters, metrics, and artifacts from training runs for comparison and reproducibility.

practical-deployment mlops training

Feature Store

A centralized platform for managing, storing, and serving features for ML models.

practical-deployment mlops feature-engineering

gRPC

A high-performance RPC framework often used for low-latency model serving.

practical-deployment api model-serving

Model Caching

Storing frequently requested predictions to reduce latency and computation.

practical-deployment inference performance

Model Endpoint

A deployed service exposing a model's predictions via API requests.

practical-deployment model-serving api

Model Lineage

Tracking the origin and dependencies of models including data, code, and parameters.

practical-deployment mlops reproducibility

Model Performance Degradation

Decline in model quality over time due to distribution shift or changing patterns.

practical-deployment model-drift monitoring

Model Registry

A centralized repository for tracking, versioning, and managing trained models.

practical-deployment mlops versioning

Model Reproducibility

The ability to recreate exact model results given the same code, data, and environment.

practical-deployment mlops versioning

Model Retraining

Periodically updating models with new data to maintain performance as distributions change.

practical-deployment mlops model-drift

Model Versioning

Tracking different versions of models to enable reproducibility and rollback.

practical-deployment mlops version-control

Prediction Confidence

A measure of model certainty in its predictions, important for reliability and user trust.

practical-deployment uncertainty-quantification inference

Request Batching

Combining multiple inference requests into batches to improve throughput.

practical-deployment inference batch-processing

REST API

A web service interface commonly used for serving model predictions over HTTP.

practical-deployment api model-serving

Shadow Deployment

Running a new model alongside production without affecting user experience, for validation.

practical-deployment deployment testing

Reinforcement Learning

Actor-Critic

RL architecture with two components: an actor (policy) that selects actions and a critic (value function) that evaluates them.

reinforcement-learning policy-gradient

Agent

In RL, the learner or decision-maker that takes actions in an environment to maximize cumulative reward.

reinforcement-learning environment policy

Deep Q-Network

Combining Q-learning with deep neural networks to handle high-dimensional state spaces, enabling RL for complex tasks like Atari games.

reinforcement-learning q-learning

Environment

In RL, the world the agent interacts with, providing states, accepting actions, and returning rewards.

reinforcement-learning agent state

Exploration vs Exploitation

The RL dilemma of trying new actions (exploration) versus using known good actions (exploitation) to maximize reward.

reinforcement-learning epsilon-greedy

Markov Decision Process

A mathematical framework for modeling sequential decision-making with states, actions, rewards, and transition probabilities.

reinforcement-learning state

Policy

A strategy or mapping from states to actions that defines the agent's behavior in reinforcement learning.

reinforcement-learning agent action

Policy Gradient

RL methods that directly optimize the policy by computing gradients of expected reward with respect to policy parameters.

reinforcement-learning policy reinforce

PPO

Proximal Policy Optimization - a stable and efficient policy gradient algorithm widely used in RLHF for training LLMs.

reinforcement-learning policy-gradient rlhf

Q-Learning

A model-free RL algorithm that learns action-value functions (Q-values) to determine optimal actions in each state.

reinforcement-learning value-function
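
A minimal tabular sketch on a toy four-state chain where only the rightmost state gives reward (epsilon-greedy handles exploration vs exploitation):

    import random

    n_states, n_actions = 4, 2            # states 0..3; actions: 0 = left, 1 = right
    Q = [[0.0] * n_actions for _ in range(n_states)]
    alpha, gamma, epsilon = 0.1, 0.9, 0.1

    for episode in range(500):
        s = 0
        while s != 3:                     # state 3 is terminal and rewarding
            if random.random() < epsilon:                 # explore
                a = random.randrange(n_actions)
            else:                                         # exploit, random tie-break
                best = max(Q[s])
                a = random.choice([i for i in range(n_actions) if Q[s][i] == best])
            s_next = min(s + 1, 3) if a == 1 else max(s - 1, 0)
            r = 1.0 if s_next == 3 else 0.0
            # Move Q(s,a) toward the bootstrapped target r + gamma * max_a' Q(s', a').
            Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
            s = s_next

    print([round(max(row), 2) for row in Q[:3]])  # grows toward the reward, approx [0.81, 0.9, 1.0]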

Reward

A scalar feedback signal indicating how good an action was, used to train reinforcement learning agents.

reinforcement-learning reward-function

Value Function

A function estimating expected cumulative reward from a state (state-value) or state-action pair (action-value/Q-value).

reinforcement-learning q-learning

Specialized AI Topics

Adversarial Perturbation

Small, carefully crafted changes to an input that fool models while remaining imperceptible to humans.

specialized-ai-topics adversarial-example attack

Adversarial Training

Training on adversarial examples to improve model robustness against attacks.

specialized-ai-topics adversarial-attack robustness

AI Governance

Policies, frameworks, and practices for responsible development and deployment of AI systems.

specialized-ai-topics ethics policy

Algorithmic Accountability

Ensuring AI systems can be held accountable for their decisions and impacts.

specialized-ai-topics ethics governance

Attention Visualization

Visualizing attention weights to understand which inputs the model focuses on.

specialized-ai-topics attention interpretability

Backdoor Attack

Maliciously training models to behave normally except when specific triggers are present.

specialized-ai-topics security adversarial-attack

Certified Robustness

Provable guarantees that a model's prediction won't change within a specified perturbation radius of the input.

specialized-ai-topics robustness verification

Data Poisoning

Corrupting training data to manipulate model behavior or introduce vulnerabilities.

specialized-ai-topics security backdoor-attack

Explainable AI

Methods and techniques for making AI decision-making transparent and interpretable to humans.

specialized-ai-topics interpretability explainability

Grad-CAM

Gradient-weighted Class Activation Mapping - visualizing which image regions influenced CNN predictions.

specialized-ai-topics explainability cnn

Homomorphic Encryption

Encryption allowing computation on encrypted data, enabling private model inference.

specialized-ai-topics privacy encryption

LIME

Local Interpretable Model-agnostic Explanations - explaining individual predictions by approximating with simpler models.

specialized-ai-topics explainability interpretability

Membership Inference

Determining if a specific example was in the training dataset, a privacy concern.

specialized-ai-topics privacy attack

Model Extraction

Stealing a model's functionality by querying it and training a copy.

specialized-ai-topics security attack

Model Inversion

Attacks that reconstruct training data or private information from model parameters or outputs.

specialized-ai-topics privacy security

Model Watermarking

Embedding identifiable signatures in models to prove ownership and detect theft.

specialized-ai-topics security intellectual-property

Saliency Map

A visualization highlighting input regions most important for model predictions.

specialized-ai-topics visualization interpretability

Secure Multi-Party Computation

Protocols allowing parties to jointly compute functions while keeping inputs private.

specialized-ai-topics privacy security

SHAP

SHapley Additive exPlanations - a unified approach to explaining model predictions using game theory.

specialized-ai-topics explainability feature-importance

Trusted Execution Environment

Secure hardware areas for protected computation, used for private AI inference.

specialized-ai-topics security privacy

Specialized Domains

Anomaly Detection

Identifying unusual patterns or outliers in data that don't conform to expected behavior, used for fraud detection and monitoring.

specialized-domains outlier unsupervised-learning

Audio Processing

Techniques for analyzing, transforming, and understanding audio signals for tasks like speech recognition and music generation.

specialized-domains speech-recognition spectrogram

Collaborative Filtering

Recommendation technique using patterns from multiple users to predict preferences, assuming similar users like similar items.

specialized-domains recommender-system matrix-factorization

Graph Neural Network

Neural networks designed to operate on graph-structured data, learning representations of nodes, edges, and entire graphs.

specialized-domains graph-convolution node-embedding

Knowledge Graph

A structured representation of knowledge as entities and their relationships, used for reasoning and information retrieval.

specialized-domains graph entity

Optical Character Recognition

Converting images of text (scanned documents, photos) into machine-readable text using computer vision.

specialized-domains computer-vision text-detection

Recommender System

AI systems that suggest items (products, content) to users based on preferences, behavior, and similarity.

specialized-domains collaborative-filtering content-based-filtering

Retrieval-Augmented Generation

Augmenting LLM generation with retrieved relevant documents, improving factuality and enabling knowledge updates without retraining.

specialized-domains llm information-retrieval

Speech Recognition

Converting spoken language into text using acoustic models and language models, now dominated by deep learning.

specialized-domains asr audio-processing

Text-to-Speech

Synthesizing natural-sounding speech from text, using neural vocoders and attention-based models.

specialized-domains tts speech-synthesis

Time Series Forecasting

Predicting future values based on historical sequential data, using models like ARIMA, LSTMs, or Transformers.

specialized-domains rnn lstm

Vector Database

A database optimized for storing and searching high-dimensional vectors (embeddings), enabling semantic search and RAG.

specialized-domains embedding semantic-search

Technical Terms

Backward Pass

The process of computing gradients by propagating error signals backward through the network during training.

technical-terms backpropagation gradient

Bias

A learnable offset added to neuron inputs, allowing the model to fit data that doesn't pass through the origin.

technical-terms parameter neural-network

Bottleneck

A layer or section with reduced dimensions that compresses information, used in autoencoders and efficient architectures.

technical-terms autoencoder architecture

Filter

Synonym for kernel - the learnable weight matrix applied in convolutions to extract features.

technical-terms kernel convolution

FLOPS

Floating Point Operations Per Second - a measure of hardware computational speed; total operation counts (FLOPs) are used to quantify training and inference costs.

technical-terms compute performance

Forward Pass

The process of passing input through the network to generate predictions during training or inference.

technical-terms inference backpropagation

Hidden Layer

Intermediate layers between input and output that learn hierarchical representations in neural networks.

technical-terms layer neural-network

Inference Latency

The time delay between submitting input and receiving output from a deployed model, critical for real-time applications.

technical-terms inference performance

Kernel

A small matrix of weights used in convolutional layers to detect specific features or patterns in input data.

technical-terms convolution filter

Layer

A collection of neurons/operations that process data together; neural networks are composed of stacked layers.

technical-terms neural-network hidden-layer

Padding

Adding borders of zeros (or other values) around input to control output spatial dimensions in convolutions.

technical-terms convolution cnn

Parameter

Learnable values (weights and biases) in a neural network that are adjusted during training to minimize loss.

technical-terms weight training

Receptive Field

The region of input that influences a particular neuron's output, growing larger in deeper layers of CNNs.

technical-terms cnn convolution

Skip Connection

Direct connections bypassing one or more layers, helping gradient flow and enabling deeper networks.

technical-terms residual-connection resnet

Softmax Temperature

A parameter controlling the smoothness of the softmax probability distribution - lower values make the peaks sharper, higher values make the distribution more uniform.

technical-terms softmax temperature
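
A minimal sketch showing the effect of dividing logits by a temperature T:

    import numpy as np

    def softmax_t(logits, T=1.0):
        z = (logits - np.max(logits)) / T   # divide logits by the temperature
        e = np.exp(z)
        return e / e.sum()

    logits = np.array([2.0, 1.0, 0.1])
    print(softmax_t(logits, T=0.5))   # sharper, closer to one-hot
    print(softmax_t(logits, T=1.0))   # standard softmax
    print(softmax_t(logits, T=2.0))   # flatter, closer to uniform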

Stride

The step size by which a convolutional filter or pooling window moves across the input.

technical-terms convolution cnn

Throughput

The number of predictions or tokens a model can process per unit of time, a key deployment performance metric.

technical-terms inference latency

Weight

Learnable parameters connecting neurons in neural networks, determining the strength of connections.

technical-terms parameter neural-network

Training & Optimization

AdaGrad

An optimizer that adapts learning rates for each parameter based on historical gradients, useful for sparse data.

training-optimization optimizer adaptive-learning-rate

Adam Optimizer

An adaptive learning rate optimization algorithm combining momentum and RMSprop, widely used for training neural networks.

training-optimization optimizer learning-rate

AdamW

Adam with decoupled weight decay, providing better regularization and often superior performance.

training-optimization adam-optimizer weight-decay

Automatic Mixed Precision

Automatically managing precision during training to maximize speed while maintaining stability.

training-optimization mixed-precision-training fp16

Backpropagation

The algorithm for computing gradients of the loss with respect to network weights, enabling training through gradient descent.

training-optimization gradient-descent chain-rule

Batch Gradient Descent

Computing gradients using the entire dataset, providing stable but slow updates.

training-optimization gradient-descent mini-batch

Batch Size

The number of training examples processed together in one forward/backward pass.

training-optimization mini-batch training

Cosine Annealing

A learning rate schedule following a cosine curve, smoothly decreasing the rate over training.

training-optimization learning-rate-schedule training
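
A minimal sketch of the schedule, sweeping half a cosine from a maximum down to a minimum rate:

    import math

    def cosine_annealing(step, total_steps, lr_max=1e-3, lr_min=1e-5):
        # Follows half a cosine curve from lr_max at step 0 to lr_min at total_steps.
        cos = math.cos(math.pi * step / total_steps)
        return lr_min + 0.5 * (lr_max - lr_min) * (1 + cos)

    for step in (0, 250, 500, 750, 1000):
        print(step, round(cosine_annealing(step, 1000), 6))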

Cross-Entropy Loss

A loss function for classification that measures the difference between predicted and true probability distributions.

training-optimization loss-function classification
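
For a single example with a one-hot target, cross-entropy reduces to the negative log probability of the true class; a minimal sketch:

    import numpy as np

    def cross_entropy(probs, target_index, eps=1e-12):
        # Negative log probability assigned to the true class (eps avoids log(0)).
        return -np.log(probs[target_index] + eps)

    probs = np.array([0.7, 0.2, 0.1])   # model's predicted distribution
    print(cross_entropy(probs, 0))      # confident and correct -> low loss (~0.36)
    print(cross_entropy(probs, 2))      # true class got only 0.1 -> high loss (~2.30)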

CutMix

Data augmentation combining image patches and labels from two examples, improving robustness.

training-optimization data-augmentation mixup

Cutout

Data augmentation randomly masking out square regions of images during training.

training-optimization data-augmentation computer-vision

Cyclical Learning Rate

Varying learning rate between bounds in cycles, potentially escaping local minima.

training-optimization learning-rate-schedule training

Data Augmentation

Creating variations of training data through transformations (rotation, cropping, noise) to improve model generalization.

training-optimization training generalization

Data Parallelism

Replicating the model across devices, each processing different data batches.

training-optimization distributed-training parallel-training

Distillation Temperature

A hyperparameter in knowledge distillation controlling how soft the teacher's outputs are.

training-optimization knowledge-distillation temperature

Early Stopping

Stopping training when validation performance stops improving, preventing overfitting.

training-optimization regularization validation
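
A minimal patience-based sketch over a toy validation curve:

    val_losses = [0.9, 0.7, 0.55, 0.5, 0.51, 0.52, 0.53, 0.6]  # toy validation curve

    best_loss, patience, bad_epochs = float("inf"), 3, 0
    for epoch, val_loss in enumerate(val_losses):
        if val_loss < best_loss:
            best_loss, bad_epochs = val_loss, 0  # improvement: reset counter (and checkpoint)
        else:
            bad_epochs += 1                      # no improvement this epoch
            if bad_epochs >= patience:
                print(f"early stop at epoch {epoch}, best val loss {best_loss}")
                break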

Epoch

One complete pass through the entire training dataset during the training process.

training-optimization training batch

Exploding Gradient

A problem where gradients become extremely large during backpropagation, causing unstable training and NaN values.

training-optimization gradient-clipping training-stability

Fine-Tuning

The process of further training a pre-trained model on a specific dataset to adapt it for a particular task or domain.

training-optimization technique

Flash Attention

An efficient attention algorithm that reduces memory usage and increases speed by tiling the computation and recomputing intermediates instead of storing the full attention matrix.

training-optimization attention optimization

Gradient Accumulation

Summing gradients over multiple batches before updating, simulating larger effective batch sizes.

training-optimization training batch-size
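
A minimal PyTorch-style sketch (assuming the torch package is available; the model and data here are toy stand-ins):

    import torch
    import torch.nn as nn

    model = nn.Linear(10, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()
    accum_steps = 4  # effective batch = 4 x micro-batch size

    optimizer.zero_grad()
    for step in range(100):
        x, y = torch.randn(8, 10), torch.randn(8, 1)  # toy micro-batch
        loss = loss_fn(model(x), y) / accum_steps     # scale so accumulated gradients average
        loss.backward()                               # gradients accumulate in .grad
        if (step + 1) % accum_steps == 0:
            optimizer.step()                          # one update per accum_steps micro-batches
            optimizer.zero_grad()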

Gradient Checkpointing

Trading computation for memory by recomputing activations during backprop instead of storing them.

training-optimization memory-efficiency training

Gradient Clipping

Limiting gradient magnitudes during training to prevent exploding gradients and stabilize training.

training-optimization exploding-gradient training-stability
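
A minimal sketch of clipping by global norm, the variant behind utilities like torch.nn.utils.clip_grad_norm_:

    import numpy as np

    def clip_by_global_norm(grads, max_norm=1.0):
        # Rescale all gradients together if their combined L2 norm exceeds max_norm.
        total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
        if total_norm > max_norm:
            scale = max_norm / total_norm
            grads = [g * scale for g in grads]
        return grads

    grads = [np.array([3.0, 4.0]), np.array([12.0])]  # global norm = 13
    print(clip_by_global_norm(grads, max_norm=1.0))   # same directions, norm now 1.0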

Gradient Descent

An optimization algorithm that iteratively adjusts parameters in the direction of steepest descent of the loss function.

training-optimization sgd learning-rate
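
A minimal sketch fitting y = w*x to toy data by descending the mean-squared-error gradient:

    xs = [1.0, 2.0, 3.0, 4.0]
    ys = [2.1, 3.9, 6.2, 7.8]   # roughly y = 2x

    w, lr = 0.0, 0.01
    for step in range(200):
        # d/dw of mean((w*x - y)^2) = mean(2 * (w*x - y) * x)
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= lr * grad          # step against the gradient
    print(round(w, 3))          # ~1.99, close to the true slope 2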

L1 Regularization

Adding the sum of absolute weights to the loss function, promoting sparsity and feature selection.

training-optimization regularization l2-regularization

L2 Regularization

Adding the sum of squared weights to the loss function, penalizing large weights and improving generalization.

training-optimization regularization l1-regularization

Label Smoothing

Softening target labels to prevent overconfidence and improve generalization.

training-optimization regularization training
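
A minimal sketch of one common variant, which mixes the one-hot target with a uniform distribution:

    import numpy as np

    def smooth_labels(target_index, num_classes, eps=0.1):
        # Spread eps uniformly over all classes; the true class keeps 1 - eps extra.
        smoothed = np.full(num_classes, eps / num_classes)
        smoothed[target_index] += 1.0 - eps
        return smoothed

    print(smooth_labels(1, 4))  # [0.025 0.925 0.025 0.025], still sums to 1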

LAMB Optimizer

Layer-wise Adaptive Moments optimizer for Batch training - enables very large batch training for transformers.

training-optimization optimizer large-batch

Learning Rate

A hyperparameter controlling the step size in gradient descent - too high causes instability, too low slows convergence.

training-optimization gradient-descent optimization

Learning Rate Decay

Gradually reducing the learning rate during training to fine-tune convergence.

training-optimization learning-rate-schedule training

Learning Rate Schedule

A strategy for adjusting the learning rate during training (decay, warm-up, cosine annealing) to improve convergence.

training-optimization learning-rate warmup

Lookahead Optimizer

A wrapper that maintains fast and slow weights, periodically updating slow weights, improving convergence.

training-optimization optimizer convergence

LoRA

Low-Rank Adaptation - a parameter-efficient fine-tuning method that updates only small low-rank matrices instead of full weights.

training-optimization fine-tuning peft
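
A minimal NumPy sketch of the idea: the frozen weight W is augmented with a trainable low-rank update scaled by alpha/r (B starts at zero so training begins from the pre-trained behavior):

    import numpy as np

    d, r, alpha = 512, 8, 16           # hidden size, low rank, scaling factor
    W = np.random.randn(d, d)          # frozen pre-trained weight
    A = np.random.randn(r, d) * 0.01   # trainable, r x d
    B = np.zeros((d, r))               # trainable, initialized to zero

    def lora_forward(x):
        # Output = frozen path + low-rank update; only A and B are trained.
        return x @ W.T + (x @ A.T @ B.T) * (alpha / r)

    x = np.random.randn(1, d)
    print(lora_forward(x).shape)       # (1, 512); update adds 2*d*r params instead of d*d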

Loss Function

A function measuring the difference between model predictions and true values, guiding the training process.

training-optimization cross-entropy mse

Mean Squared Error

A loss function for regression that computes the average squared difference between predictions and targets.

training-optimization loss-function regression

Mini-Batch Gradient Descent

Computing gradients on small batches of data, balancing SGD's noise with full-batch GD's stability.

training-optimization sgd gradient-descent

Mixed Precision Training

Using lower precision (FP16) for some computations while keeping FP32 for stability, speeding up training.

training-optimization training optimization

Mixup

Data augmentation creating synthetic examples by interpolating between training examples and their labels.

training-optimization data-augmentation regularization
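
A minimal sketch; the interpolation weight is drawn from a Beta(alpha, alpha) distribution as in the original formulation:

    import numpy as np

    def mixup(x1, y1, x2, y2, alpha=0.2):
        # Interpolate both inputs and labels with the same random weight.
        lam = np.random.beta(alpha, alpha)
        return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

    x1, y1 = np.ones((2, 2)), np.array([1.0, 0.0])   # class 0 (one-hot)
    x2, y2 = np.zeros((2, 2)), np.array([0.0, 1.0])  # class 1
    x_mix, y_mix = mixup(x1, y1, x2, y2)
    print(y_mix)  # soft label, e.g. [0.87 0.13]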

Momentum

An optimization technique that accelerates gradient descent by accumulating past gradients, helping escape local minima.

training-optimization sgd optimization
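
A minimal sketch of the update rule on a single parameter minimizing f(w) = w^2:

    w, velocity = 5.0, 0.0
    lr, beta = 0.1, 0.9

    for step in range(100):
        grad = 2 * w                       # gradient of f(w) = w^2
        velocity = beta * velocity + grad  # accumulate past gradients
        w -= lr * velocity                 # update with the smoothed direction
    print(round(w, 4))                     # oscillates toward the minimum at 0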

Nesterov Momentum

A momentum variant that looks ahead before computing gradients, often converging faster.

training-optimization momentum sgd

Pipeline Parallelism

Splitting model layers across devices and processing micro-batches in pipeline fashion.

training-optimization distributed-training model-parallelism

Pre-training

Training a model on a large dataset (often self-supervised) before fine-tuning on specific tasks, enabling transfer learning.

training-optimization fine-tuning transfer-learning

Regularization

Techniques to prevent overfitting by adding constraints or penalties to the model (L1, L2, dropout, early stopping).

training-optimization l1-regularization l2-regularization

RLHF

Reinforcement Learning from Human Feedback - training models using human preferences to align behavior with human values.

training-optimization reinforcement-learning alignment

RMSprop

An optimizer using moving average of squared gradients to adapt learning rates, addressing AdaGrad's diminishing rates.

training-optimization optimizer adagrad

Sharded Data Parallelism

Distributing model states across devices to train models larger than single-device memory.

training-optimization distributed-training data-parallelism

Step Decay

Reducing learning rate by a factor at specific epochs, a simple scheduling strategy.

training-optimization learning-rate-decay learning-rate-schedule

Stochastic Gradient Descent

A variant of gradient descent that updates parameters using gradients computed on a single random training example at a time (though often used to refer to mini-batch gradient descent).

training-optimization gradient-descent mini-batch

Student Model

The smaller model in knowledge distillation learning to mimic the teacher's behavior.

training-optimization knowledge-distillation teacher-model

Teacher Model

The larger, more accurate model in knowledge distillation that guides student training.

training-optimization knowledge-distillation student-model

Tensor Parallelism

Splitting individual layers/tensors across devices for very large models.

training-optimization model-parallelism distributed-training

Training

The process of teaching a machine learning model by adjusting its parameters based on data to minimize prediction errors.

fundamentals training machine-learning

Transfer Learning

Leveraging knowledge learned from one task/domain to improve performance on a related task with less data.

training-optimization pre-training fine-tuning

Vanishing Gradient

A problem where gradients become extremely small during backpropagation, preventing deep networks from learning effectively.

training-optimization backpropagation deep-networks

Warmup

Gradually increasing the learning rate at training start to stabilize optimization.

training-optimization learning-rate-schedule training

Weight Decay

A regularization technique that shrinks weights toward zero during optimization. Equivalent to L2 regularization in standard SGD, but differs when using adaptive optimizers like Adam.

training-optimization l2-regularization regularization
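
A single-step sketch contrasting the two formulations under plain SGD, where they coincide:

    w, lr, wd, grad = 1.0, 0.1, 0.01, 0.5

    # L2 regularization: the penalty's gradient (wd * w) is added to the loss gradient.
    w_l2 = w - lr * (grad + wd * w)

    # Decoupled weight decay: shrink the weight directly, separately from the gradient step.
    w_decoupled = w * (1 - lr * wd) - lr * grad

    print(w_l2, w_decoupled)  # identical (0.949) under plain SGD; they differ under Adam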

ZeRO

Zero Redundancy Optimizer - techniques for memory-efficient distributed training by partitioning optimizer states.

training-optimization distributed-training memory-efficiency