AI & ML Glossary
A comprehensive reference covering 522+ essential terms in Artificial Intelligence, Machine Learning, Deep Learning, Large Language Models, and related topics. Built as a knowledge resource for developers, researchers, and AI enthusiasts.
AI Ethics & Safety
Adversarial Attack
Intentionally crafted inputs designed to fool AI models into making incorrect predictions, exposing vulnerabilities.
Adversarial Example
An input with imperceptible perturbations that causes a model to make a wrong prediction, highlighting model fragility.
AI Alignment
Ensuring AI systems behave in accordance with human values and intentions, a central challenge in AI safety.
AI Safety
Research and practices aimed at ensuring AI systems are safe, reliable, and beneficial, especially as capabilities increase.
Bias in AI
Systematic errors or unfair outcomes in AI systems, often reflecting biases in training data or model design.
Differential Privacy
A mathematical framework for quantifying and limiting privacy loss when releasing information about datasets.
Explainability
The ability to explain how an AI model makes decisions in human-understandable terms, crucial for trust and accountability.
Fairness
Ensuring AI systems treat all individuals and groups equitably, without discrimination based on protected attributes.
Federated Learning
Training models across decentralized devices holding local data, without exchanging the data itself, preserving privacy.
Interpretability
Understanding the internal workings of AI models, including which features influence predictions and why.
Model Card
Documentation describing a model's characteristics, intended use, limitations, and ethical considerations for transparent deployment.
Privacy-Preserving ML
Techniques for training and deploying models while protecting individual privacy (federated learning, differential privacy).
Robustness
A model's ability to maintain performance under distribution shifts, adversarial attacks, or noisy inputs.
AI Infrastructure & Deployment
A/B Testing
Comparing two model versions in production by routing traffic to each and measuring performance differences.
Batch Processing
Processing multiple predictions together in batches rather than one at a time, improving throughput efficiency.
Data Drift
Changes in input data distribution over time that can degrade model performance in production.
Edge Deployment
Running models on edge devices (phones, IoT) rather than cloud servers for lower latency and privacy.
GPU
Graphics Processing Unit - hardware accelerator with thousands of cores, essential for parallel computation in deep learning.
Inference
Using a trained model to make predictions on new data, the deployment phase after training is complete.
Knowledge Distillation
Training a smaller 'student' model to mimic a larger 'teacher' model, transferring knowledge while reducing size.
MLOps
Practices for deploying, monitoring, and maintaining machine learning models in production, combining ML and DevOps principles.
Model Compression
Techniques to reduce model size and computational requirements (quantization, pruning, distillation) for efficient deployment.
Model Drift
Degradation of model performance over time due to changes in the relationship between features and target.
Model Monitoring
Tracking model performance, data distribution, and predictions in production to detect issues and degradation.
Model Serving
Deploying trained models as services that can handle prediction requests in production environments.
Pruning
Removing unnecessary weights or neurons from a trained model to reduce size and computation while maintaining performance.
Quantization
Reducing model precision (FP32 → INT8) to decrease size and increase inference speed with minimal accuracy loss.
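A minimal NumPy sketch of symmetric INT8 quantization, for illustration only; real toolkits (PyTorch, TensorRT) add calibration, per-channel scales, and fused kernels.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Map float32 values to int8 using a single scale factor."""
    scale = np.abs(x).max() / 127.0          # largest magnitude maps to 127
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

weights = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(weights)
print(np.abs(weights - dequantize(q, scale)).max())  # small round-trip error
```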
TPU
Tensor Processing Unit - Google's custom hardware accelerator designed specifically for machine learning workloads.
Advanced Concepts
Active Learning
Iteratively selecting the most informative unlabeled examples for annotation to efficiently improve models with limited labels.
Attention Head
An individual attention mechanism in multi-head attention, learning specific patterns of relationships between tokens.
AutoML
Automated Machine Learning - automating the process of model selection, architecture search, and hyperparameter tuning.
Catastrophic Forgetting
The tendency of neural networks to completely forget previously learned information when learning new tasks.
Causal Inference
Determining cause-and-effect relationships from data, going beyond correlation to understand causal mechanisms.
Continual Learning
Learning new tasks sequentially without forgetting previously learned tasks, addressing catastrophic forgetting.
Contrastive Learning
A self-supervised learning approach that learns representations by contrasting similar and dissimilar examples.
Curriculum Learning
Training strategy where examples are presented from easy to hard, mimicking human learning for improved convergence.
Hyperparameter
Configuration settings external to the model (learning rate, batch size) that must be set before training begins.
Hyperparameter Tuning
The process of finding optimal hyperparameter values through techniques like grid search, random search, or Bayesian optimization.
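A minimal grid-search sketch using scikit-learn (assumed installed); the parameter grid below is illustrative, not a recommendation.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)
param_grid = {"n_estimators": [50, 100], "max_depth": [3, 5, None]}

# Exhaustively try every combination, scoring each with 5-fold cross-validation
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```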
Latent Space
A compressed, learned representation space where similar data points are close together, used in autoencoders and VAEs.
Meta-Learning
Learning to learn - training models that can quickly adapt to new tasks with minimal data, often applied to few-shot learning scenarios.
Multi-Task Learning
Training a single model on multiple related tasks simultaneously to improve generalization and efficiency.
Multimodal Learning
Training models on multiple types of data (text, images, audio) to understand relationships across modalities.
Neural Architecture Search
Automated methods for discovering optimal neural network architectures, using techniques like reinforcement learning or evolution.
Online Learning
Models that learn continuously from streaming data, updating incrementally as new data arrives.
Out-of-Distribution
Data that differs significantly from the training distribution, where models often perform poorly or unreliably.
Representation Learning
Learning useful features or representations of data automatically, rather than hand-crafting them.
Self-Supervised Learning
Learning representations from unlabeled data by creating supervised tasks from the data itself (masked prediction, contrastive learning).
Semi-Supervised Learning
Learning from a combination of labeled and unlabeled data, leveraging abundant unlabeled data to improve performance.
Computer Vision
3D Reconstruction
Creating 3D models from 2D images using geometry and deep learning.
Anchor Box
Predefined boxes of various sizes and aspect ratios serving as reference templates for object detection.

Bounding Box
A rectangular box defined by coordinates that localizes an object in an image, used in object detection.
Center Crop
Extracting the central region of an image, often used during inference.
Color Jittering
Randomly adjusting brightness, contrast, saturation, and hue for image augmentation.
Computer Vision
The field of AI focused on enabling computers to understand and interpret visual information from images and videos.
Data Augmentation in Vision
Creating training image variations through rotation, flipping, cropping, color jittering to improve model robustness.
Depth Estimation
Predicting distance of objects from the camera using monocular or stereo images.
Face Detection
Locating faces in images, a precursor to recognition and analysis.
Facial Recognition
Identifying or verifying people from face images using deep learning.
Faster R-CNN
An object detection architecture with RPN for efficient region proposals.
Feature Pyramid Network
A CNN architecture creating multi-scale feature representations for detecting objects at different sizes.
Image Augmentation
Applying transformations (rotation, flip, crop, color) to increase training data diversity.
Image Classification
Assigning a single label or category to an entire image, a fundamental computer vision task.
Image Generation
Creating new images from scratch or from text descriptions using generative models (GANs, diffusion models, VAEs).
Image Inpainting
Filling in missing or corrupted regions of images using context and generative models.
Image Normalization
Scaling pixel values to a standard range or distribution (e.g., [0, 1], or mean=0 and std=1) to improve training.
Image Preprocessing
Transforming images before model input (resizing, normalization, color adjustment).
ImageNet
A large-scale dataset of 14M images in 20K categories, historically used as the benchmark for image classification models.
Instance Segmentation
Combining object detection and segmentation to identify individual object instances at the pixel level.
Mask R-CNN
Extending Faster R-CNN to instance segmentation by adding a mask prediction branch.
Non-Maximum Suppression
Filtering overlapping detection boxes by keeping only the most confident predictions.
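A minimal NumPy sketch of greedy non-maximum suppression; boxes are [x1, y1, x2, y2], and production detectors use optimized, vectorized versions.

```python
import numpy as np

def nms(boxes: np.ndarray, scores: np.ndarray, iou_threshold: float = 0.5):
    order = scores.argsort()[::-1]               # highest-confidence box first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU between the kept box and the remaining candidates
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_rest = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                    (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + area_rest - inter)
        order = order[1:][iou < iou_threshold]   # drop boxes that overlap too much
    return keep
```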
Object Detection
Identifying and localizing multiple objects in an image with bounding boxes and class labels (YOLO, R-CNN, RetinaNet).
Optical Flow
Estimating motion patterns between video frames by tracking pixel movements.
Pose Estimation
Detecting body keypoints to determine human pose from images or video.
R-CNN
Region-based CNN - an object detection approach using selective search and CNN features.
Random Crop
Extracting random patches from images for augmentation and training.
Region Proposal Network
A network generating candidate object locations for two-stage detectors like Faster R-CNN.
RetinaNet
A single-stage object detector using focal loss to handle class imbalance.
Semantic Segmentation
Classifying every pixel in an image into categories, creating a pixel-level understanding of scenes.
Style Transfer
Transferring artistic style from one image to another while preserving content.
Super-Resolution
Enhancing image resolution using deep learning to recover high-frequency details.
Transfer Learning in Vision
Using pre-trained vision models (ImageNet) as feature extractors or fine-tuning for specific visual tasks.
YOLO
You Only Look Once - a real-time object detection architecture treating detection as regression.
Data & Features
Data Leakage
When information from outside the training data is used to create the model, leading to overly optimistic performance estimates.
Data Preprocessing
Cleaning, transforming, and preparing raw data for model training (handling missing values, normalization, encoding).
Dataset
A collection of data examples used for training, validating, or testing machine learning models.
Feature
An individual measurable property or characteristic of data used as input to machine learning models.
Feature Importance
Measures indicating which features contribute most to model predictions, useful for interpretation and selection.
Ground Truth
The correct or true labels/values for data, used as targets during training and as the reference for evaluation.
Imbalanced Dataset
A dataset where classes have significantly different numbers of examples, causing models to bias toward majority classes.
Labeled Data
Data with associated target outputs or annotations, required for supervised learning tasks.
Normalization
Scaling features to a standard range (typically 0-1 using min-max scaling) to improve model training and convergence. Often used interchangeably with standardization (mean=0, std=1), though technically distinct.
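A minimal NumPy sketch contrasting min-max normalization with standardization (z-scoring); the values are made up.

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 100.0])

min_max = (x - x.min()) / (x.max() - x.min())   # rescales to [0, 1]
z_score = (x - x.mean()) / x.std()              # mean 0, std 1

print(min_max)
print(z_score)
```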
One-Hot Encoding
Converting categorical variables into binary vectors with one element set to 1 and others to 0.
Synthetic Data
Artificially generated data created to augment training sets, protect privacy, or simulate rare scenarios.
Training Data
The portion of a dataset used to train a model by adjusting its parameters to minimize loss.
Emerging & Advanced
AlphaFold
DeepMind's breakthrough AI system for predicting protein structures with near-experimental accuracy.
AlphaGo
DeepMind's Go-playing AI that defeated world champions using deep RL and tree search.
Behavioral Cloning
Supervised learning of a policy from state-action pairs in expert demonstrations.
Domain Randomization
Training with randomized simulation parameters to improve transfer to real-world environments.
Graph Attention Network
A GNN using attention mechanisms to weight neighbor contributions when aggregating information.
Hierarchical RL
Learning policies at multiple levels of abstraction, with high-level goals and low-level skills.
Imitation Learning
Learning policies by mimicking expert behavior from demonstrations.
Inverse Reinforcement Learning
Learning reward functions from expert demonstrations, inferring what is being optimized.
Message Passing
The fundamental operation in GNNs where nodes exchange and aggregate information with neighbors.
Meta-RL
Learning to adapt quickly to new RL tasks from experience on related tasks.
Model-Based RL
Reinforcement learning using learned environment models for planning and improving sample efficiency.
Model-Free RL
Reinforcement learning directly learning policies or value functions without modeling environment dynamics.
Monte Carlo Tree Search
A search algorithm combining tree search with random sampling, used in game-playing AIs.
Multi-Agent RL
Reinforcement learning with multiple agents that interact and potentially cooperate or compete.
Neural ODE
Neural Ordinary Differential Equations - modeling continuous-depth networks as ODEs, enabling adaptive computation.
Node Embedding
Learning vector representations of graph nodes that capture structural and feature information.
Offline RL
Learning policies from fixed datasets without environment interaction, enabling learning from logs.
Protein Folding
Predicting 3D protein structures from amino acid sequences, revolutionized by AlphaFold.
Sim-to-Real Transfer
Transferring policies trained in simulation to real-world deployment, crucial for robotics.
World Model
A learned model of environment dynamics that can predict future states, used in model-based RL.
Emerging Techniques
Constitutional AI
Training AI systems using principles and rules rather than only human feedback, developed by Anthropic for Claude.
Emergent Abilities
Capabilities that appear suddenly in large language models at certain scales, not present in smaller models.
Instruction Tuning
Fine-tuning LLMs on diverse instruction-following tasks to improve zero-shot performance on new instructions.
Mixture of Experts
An architecture where multiple specialized sub-networks (experts) process inputs, with a gating network routing to relevant experts.
Multimodal Model
Models processing multiple data types (text, images, audio) jointly, like GPT-4V, Gemini, or CLIP.
Neural Scaling Laws
Empirical relationships showing how model performance improves predictably with model size, data, and compute.
Prompt Tuning
Learning continuous prompt embeddings while keeping the LLM frozen, an efficient alternative to fine-tuning.
Retrieval-Interleaved Generation
Dynamically retrieving information during generation rather than just before, allowing models to gather facts as needed.
Tool Use
LLMs learning to call external tools, APIs, or functions to extend capabilities beyond text generation (calculators, search, code execution).
Jargon & Slang
Alignment Tax
Performance degradation that may occur when making models safer and more aligned with human values.
Benchmark Gaming
Optimizing models specifically for benchmark performance rather than real-world capabilities, inflating scores artificially.
Black Box
A model whose internal workings are difficult to understand or interpret, common with complex neural networks.
Compute
Informal term for computational resources (GPUs, TPUs, time) required for training or running AI models.
Curse of Dimensionality
Challenges arising when working with high-dimensional data, including data sparsity and computational complexity.
Foundation Model
Large pre-trained models serving as a base for various downstream tasks (GPT, BERT, CLIP, SAM).
Grounding
Connecting model outputs to real-world facts, sources, or evidence to improve factuality and reduce hallucinations.
Hallucination
When language models generate plausible-sounding but factually incorrect or nonsensical information.
No Free Lunch Theorem
The principle that no single ML algorithm works best for all problems - algorithm choice depends on the specific task.
Shot
An example provided in a prompt - zero-shot (no examples), one-shot (one example), few-shot (a few examples).
Large Language Models
Assistant Response
The output generated by the language model in response to user prompts.
Attention Mask
A binary mask indicating which tokens should be attended to, used to handle padding and causal masking.
Attention Score
The weight determining how much each value contributes to the output, computed from query-key similarity.
Autoregressive Model
A model that generates output one token at a time, using previously generated tokens as input for the next prediction.
Beam Search
A generation algorithm that maintains top-k candidates at each step, balancing quality and diversity.
BERT
Bidirectional Encoder Representations from Transformers - a model that understands context by looking at text from both directions.
Bidirectional Attention
Allowing tokens to attend to both past and future context, used in encoder models like BERT.
BLEU Score
A metric for evaluating machine translation quality by comparing n-gram overlap between generated and reference text.
BOS Token
Beginning Of Sequence token - marks the start of a sequence in language models.
BPE
Byte Pair Encoding - a subword tokenization algorithm that iteratively merges frequent character pairs to create a vocabulary.
Causal Language Modeling
Training a model to predict the next token given previous tokens, the foundation of autoregressive models like GPT.
Causal Mask
An attention mask ensuring tokens can only attend to previous positions, crucial for autoregressive generation.
Chain-of-Thought
A prompting technique where the model explains its reasoning step-by-step before giving a final answer, improving complex reasoning.
Context Window
The maximum number of tokens an LLM can process at once, including both input prompt and generated output. Also called context length.
Conversation History
Previous messages in a multi-turn dialogue, provided as context for coherent conversations.
Cross-Attention
Attention between two different sequences, where queries come from one and keys/values from another.
Decoder-Only Model
A transformer architecture with only decoder layers, using causal masking for autoregressive generation (GPT family).
Embedding
A dense vector representation of discrete data (words, tokens) in continuous space, capturing semantic relationships.
Encoder-Decoder
An architecture where the encoder processes input and the decoder generates output, used in translation and sequence-to-sequence tasks.
Encoder-Only Model
A transformer with only encoder layers and bidirectional attention, suited for understanding tasks (BERT family).
EOS Token
End Of Sequence token - signals when the model has finished generating a complete output.
Few-Shot Learning
Learning to perform a task from a small number of examples provided in the prompt, without parameter updates.
GPT
Generative Pre-trained Transformer - an autoregressive language model architecture that predicts the next token given previous context.
Greedy Decoding
Always selecting the most likely next token during generation, fast but can lead to repetitive or suboptimal outputs.
In-Context Learning
The ability of LLMs to learn from examples and instructions provided in the input prompt without training.
Instruction Following
The ability of language models to understand and execute instructions provided in prompts.
Large Language Model
A neural network trained on vast amounts of text data, capable of understanding and generating human-like text across diverse tasks.
Length Penalty
Adjusting generation scores based on output length to avoid bias toward shorter or longer sequences.
Masked Language Modeling
A pre-training objective where random tokens are masked and the model learns to predict them from context.
Multi-Head Attention
Running multiple attention operations in parallel with different learned projections, capturing diverse relational patterns.
Nucleus Sampling
Sampling from the smallest token set with cumulative probability exceeding p (also called top-p sampling).
Padding Token
A special token used to make sequences the same length in a batch, typically ignored during computation.
Perplexity
A metric measuring how well a language model predicts text - lower perplexity indicates better prediction.
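A minimal NumPy sketch: perplexity is the exponential of the average negative log-likelihood the model assigns to the observed tokens (the probabilities below are made up).

```python
import numpy as np

token_probs = np.array([0.2, 0.5, 0.1, 0.4])       # model's probability for each true token
perplexity = np.exp(-np.mean(np.log(token_probs)))
print(perplexity)                                  # lower is better; 1.0 would be perfect
```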
Positional Encoding
Adding position information to token embeddings so the model understands word order in sequences.
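A minimal NumPy sketch of the sinusoidal positional encoding from the original Transformer paper, where even dimensions use sine and odd dimensions use cosine of position-dependent frequencies.

```python
import numpy as np

def sinusoidal_encoding(seq_len: int, d_model: int) -> np.ndarray:
    pos = np.arange(seq_len)[:, None]                    # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]                 # (1, d_model / 2)
    angles = pos / np.power(10000, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                         # even dimensions
    pe[:, 1::2] = np.cos(angles)                         # odd dimensions
    return pe

print(sinusoidal_encoding(seq_len=4, d_model=8).shape)  # (4, 8)
```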
Prompt Engineering
The practice of designing and optimizing input prompts to get desired outputs from language models. A crucial skill for effectively using LLMs.
Prompt Template
A reusable structure for crafting prompts with placeholders for variables, improving consistency.
Query-Key-Value
The three learned projections in attention mechanisms used to compute attention weights and outputs.
Repetition Penalty
A technique reducing the likelihood of previously generated tokens to avoid repetitive outputs.
ROUGE Score
Metrics for evaluating text summarization by measuring overlap of n-grams, word sequences, and word pairs with references.
Scaled Dot-Product Attention
The attention computation using dot product of queries and keys, scaled by dimension to stabilize gradients.
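A minimal NumPy sketch of Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V for a single head, ignoring batching and masking.

```python
import numpy as np

def attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # query-key similarity, scaled
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over keys
    return weights @ V                                 # weighted sum of values

Q = np.random.randn(3, 4)   # 3 query tokens, dimension 4
K = np.random.randn(5, 4)   # 5 key tokens
V = np.random.randn(5, 4)
print(attention(Q, K, V).shape)   # (3, 4)
```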
Self-Attention
A mechanism where each token attends to all other tokens in the sequence to understand contextual relationships.
SentencePiece
A language-agnostic tokenization library that treats text as a raw stream of Unicode characters (including whitespace), supporting BPE and unigram tokenization.
Sequence-to-Sequence
Models that transform input sequences to output sequences, used for translation, summarization, and generation.
Special Token
Reserved tokens with special meanings like [CLS], [SEP], [MASK], [PAD] used in various model architectures.
Stop Token
A special token signaling the end of generation, causing the model to stop producing more tokens.
Subword Tokenization
Breaking words into smaller units, balancing vocabulary size with representation granularity.
System Prompt
Initial instructions defining the model's role, behavior, and constraints for the conversation.
Temperature
A sampling parameter controlling randomness in generation - lower values make output more deterministic, higher more creative.
Token
The basic unit of text that a language model processes, typically representing a word, subword, or character. Tokens are the fundamental building blocks for LLM input and output.
Tokenization
The process of breaking text into smaller units (tokens) that language models can process, using algorithms like BPE or WordPiece.
Top-k Sampling
A generation strategy that samples from only the k most likely next tokens, balancing quality and diversity.
Top-p Sampling
Nucleus sampling - selecting from the smallest set of tokens whose cumulative probability exceeds p, providing dynamic vocabulary.
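A minimal NumPy sketch of sampling with temperature and top-p (nucleus) filtering; the logits are made-up next-token scores, and real LLM APIs expose these knobs as parameters.

```python
import numpy as np

def sample(logits: np.ndarray, temperature: float = 1.0, top_p: float = 0.9) -> int:
    probs = np.exp(logits / temperature)
    probs /= probs.sum()                                   # softmax with temperature
    order = np.argsort(probs)[::-1]                        # most likely tokens first
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1        # smallest set with mass >= top_p
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()  # renormalize within the nucleus
    return int(np.random.choice(nucleus, p=nucleus_probs))

print(sample(np.array([2.0, 1.0, 0.5, -1.0]), temperature=0.7, top_p=0.9))
```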
User Prompt
The input provided by the user to the language model, containing questions or instructions.
Vocabulary Size
The number of distinct tokens a language model can process, typically 30K-100K+ tokens.
WordPiece
A subword tokenization algorithm used by BERT, similar to BPE but with different merging criteria.
Zero-Shot Learning
A model's ability to perform tasks it wasn't explicitly trained on, using only instructions or descriptions.
Machine Learning Fundamentals
Bayesian Inference
Using Bayes' theorem to update beliefs about parameters given data, incorporating uncertainty.
Bias-Variance Tradeoff
The balance between a model's bias (systematic error) and variance (sensitivity to training data fluctuations).
Classification
A supervised learning task where the model predicts discrete class labels (categories) for input data.
Clustering
An unsupervised learning technique that groups similar data points together based on their features or characteristics.
Concentration of Measure
A phenomenon where random variables in high-dimensional spaces concentrate around their mean or median, with most probability mass within a narrow band.
Conjugate Prior
A prior that when combined with a likelihood results in a posterior of the same family, simplifying Bayesian inference.
Covariance
A measure of how two variables change together, indicating the direction of their linear relationship.
Cross-Validation
A technique for assessing model performance by partitioning data into subsets, training on some and validating on others.
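A minimal k-fold cross-validation sketch using scikit-learn (assumed installed).

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean(), scores.std())   # average accuracy across the 5 folds
```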
Curse of Dimensionality
Phenomena where algorithms become inefficient as dimensionality increases, including data sparsity and distance concentration.
Decision Tree
A tree-structured model that makes decisions by splitting data based on feature values, interpretable but prone to overfitting.
Dimensionality Reduction
Techniques to reduce the number of input features while preserving important information (PCA, t-SNE, autoencoders).
Empirical Risk Minimization
The principle of choosing a model that minimizes error on training data, fundamental to supervised learning.
Ensemble Learning
Combining multiple models to produce better predictions than any individual model (bagging, boosting, stacking).
Entropy
A measure of uncertainty or randomness in a random variable from information theory.
Evidence Lower Bound
A lower bound on log likelihood used in variational inference and VAEs for tractable optimization.
Expectation-Maximization
An iterative algorithm for finding maximum likelihood estimates in models with latent variables.
Feature Engineering
The process of selecting, transforming, and creating input features to improve model performance.
Feature Selection
Choosing the most relevant features from available data to reduce dimensionality and improve model performance.
Gaussian Process
A non-parametric Bayesian approach for regression and classification, defining distributions over functions.
Gibbs Sampling
An MCMC method that samples from conditional distributions to approximate joint distributions.
Gradient Boosting
An ensemble technique that builds models sequentially, each correcting errors of previous ones (XGBoost, LightGBM, CatBoost).
Hidden Markov Model
A statistical model with hidden states that transition probabilistically, generating observable outputs.
Inductive Bias
Assumptions built into a learning algorithm that guide it toward certain solutions over others.
Information Bottleneck
A principle for learning representations that compress input while retaining information relevant to prediction.
Jensen-Shannon Divergence
A symmetric measure of similarity between probability distributions, related to KL divergence.
KL Divergence
Kullback-Leibler divergence - a measure of how one probability distribution differs from another.
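A minimal NumPy sketch of KL(P || Q) = sum_x P(x) log(P(x) / Q(x)) for two discrete distributions (values made up).

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])

kl_pq = np.sum(p * np.log(p / q))
kl_qp = np.sum(q * np.log(q / p))
print(kl_pq, kl_qp)   # note: KL divergence is not symmetric
```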
Latent Variable
Hidden or unobserved variables in a model that influence observed data but aren't directly measured.
Manifold Hypothesis
The assumption that high-dimensional data lies on or near a lower-dimensional manifold, justifying dimensionality reduction.
Markov Chain Monte Carlo
Sampling methods for approximating distributions, especially for Bayesian inference in complex models.
Maximum A Posteriori
Parameter estimation that incorporates prior beliefs, maximizing posterior probability rather than just likelihood.
Maximum Likelihood Estimation
Finding model parameters that maximize the probability of observing the training data.
Mutual Information
A measure of dependence between variables, quantifying how much knowing one reduces uncertainty about the other.
Occam's Razor
The principle that simpler models should be preferred when they perform equally well, reducing overfitting.
Overfitting
When a model learns training data too well, including noise and outliers, causing poor generalization to new data.
PAC Learning
Probably Approximately Correct - a theoretical framework for analyzing learning algorithm guarantees.
Posterior Distribution
In Bayesian methods, the updated belief about parameters after observing data.
Principal Component Analysis
A dimensionality reduction technique that transforms data into orthogonal components ordered by variance explained.
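A minimal NumPy sketch of PCA via the SVD of the centered data matrix; scikit-learn's PCA class does this (and more) in practice.

```python
import numpy as np

X = np.random.randn(100, 5)                 # 100 samples, 5 features
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

k = 2
X_reduced = X_centered @ Vt[:k].T           # project onto the top-2 principal components
explained_variance = (S ** 2) / (len(X) - 1)
print(X_reduced.shape, explained_variance[:k])
```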
Prior Distribution
In Bayesian methods, the initial belief about parameters before observing data.
Rademacher Complexity
A measure of how well a model class can fit random noise, indicating capacity and generalization ability.
Random Forest
An ensemble of decision trees trained on random subsets of data and features, reducing overfitting through averaging.
Regression
A supervised learning task where the model predicts continuous numerical values rather than discrete categories.
Reinforcement Learning
Learning through interaction with an environment, receiving rewards or penalties to learn optimal behavior policies.
Supervised Learning
A machine learning paradigm where models learn from labeled training data to make predictions on new, unseen data.
Support Vector Machine
A supervised learning algorithm that finds the optimal hyperplane to separate classes with maximum margin.
Train-Test Split
Dividing a dataset into separate portions for training the model and evaluating its performance on unseen data.
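A minimal sketch using scikit-learn's train_test_split (assumed installed).

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y   # hold out 20% for evaluation
)
print(len(X_train), len(X_test))
```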
Underfitting
When a model is too simple to capture the underlying pattern in data, performing poorly on both training and test sets.
Universal Approximation Theorem
The theorem stating that a neural network with a single hidden layer of sufficient width can approximate any continuous function on a compact domain.
Unsupervised Learning
Learning from unlabeled data to discover hidden patterns, structures, or relationships without explicit target outputs.
Variational Inference
Approximating complex distributions by optimizing over a simpler family, an alternative to MCMC.
VC Dimension
A measure of model capacity - the largest set of points a model can shatter (classify in all possible ways).
Wasserstein Distance
Earth Mover's Distance - a metric measuring the minimum cost to transform one distribution into another.
Model Architectures
Attention Mechanism
A technique that allows neural networks to focus on relevant parts of the input when producing each output, assigning different weights to different input elements.
BART
Bidirectional and Auto-Regressive Transformer - combines BERT-like encoder with GPT-like decoder for sequence-to-sequence tasks.
CLIP
Contrastive Language-Image Pre-training - a model jointly trained on images and text, enabling zero-shot image classification.
Diffusion Model
A generative model that learns to denoise data, achieving state-of-the-art image generation (Stable Diffusion, DALL-E 2).
EfficientNet
A family of CNNs that scale depth, width, and resolution simultaneously using compound scaling for optimal efficiency.
Inception
A CNN architecture (GoogLeNet) using parallel convolutions of different sizes to capture multi-scale features efficiently.
ResNet
Residual Network - a CNN architecture using skip connections to enable training of very deep networks (up to 1000+ layers).
Stable Diffusion
A latent diffusion model for text-to-image generation that operates in compressed latent space for efficiency.
T5
Text-to-Text Transfer Transformer - frames all NLP tasks as text-to-text problems using a unified encoder-decoder architecture.
Transformer
A neural network architecture introduced in 'Attention is All You Need' (2017) that relies entirely on self-attention mechanisms, becoming the foundation for modern LLMs.
U-Net
A CNN architecture with encoder-decoder structure and skip connections, widely used for image segmentation tasks.
VGG
A CNN architecture known for its simplicity, using small 3x3 convolutions stacked deeply, influential in computer vision.
Vision Transformer
Applying the transformer architecture to computer vision by treating image patches as tokens, achieving state-of-the-art results.
Model Evaluation
Accuracy
The proportion of correct predictions out of total predictions, a basic classification metric.
AUC
Area Under the Curve - the area under the ROC curve, summarizing classification performance across all thresholds (1.0 is perfect, 0.5 is random guessing).
Baseline Model
A simple reference model (random, majority class, simple heuristic) used to benchmark more complex models against.
Benchmark
A standardized dataset and task used to compare model performance across different approaches (ImageNet, GLUE, SuperGLUE).
Confusion Matrix
A table showing true positives, true negatives, false positives, and false negatives for classification evaluation.
F1 Score
The harmonic mean of precision and recall, providing a single metric that balances both concerns.
Mean Absolute Error
The average absolute difference between predictions and actual values, a regression metric less sensitive to outliers than MSE.
Precision
The proportion of true positives among all positive predictions - measures how many predicted positives are actually positive.
R-squared
Coefficient of determination - measures the proportion of variance in the target variable explained by the model.
Recall
The proportion of true positives among all actual positives - measures how many actual positives were correctly identified.
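A minimal NumPy sketch computing precision, recall, and F1 from binary predictions (the labels below are made up).

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))   # true positives
fp = np.sum((y_pred == 1) & (y_true == 0))   # false positives
fn = np.sum((y_pred == 0) & (y_true == 1))   # false negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)
```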
ROC Curve
Receiver Operating Characteristic curve - plots true positive rate vs false positive rate at various classification thresholds.
Test Set
A final portion of data unseen during training and validation, used for unbiased evaluation of model performance.
Validation Set
A portion of data held out from training, used to tune hyperparameters and monitor overfitting.
Model Evaluation & Metrics
Average Precision
The weighted mean of precisions at each threshold, where the weight is the increase in recall from the previous threshold.
Cohen's Kappa
A metric measuring agreement between raters/models accounting for chance agreement.
Contrastive Loss
A loss function for learning similarity metrics, bringing similar pairs together and separating dissimilar ones.
CTC Loss
Connectionist Temporal Classification loss for sequence tasks without alignment, used in speech recognition.
Dice Coefficient
A metric measuring overlap between predicted and ground truth segmentations, common in medical imaging.
False Negative
Positive cases that are incorrectly predicted as negative (Type II error) in classification.
False Positive
Incorrectly predicted positive cases (Type I error) in classification.
False Positive Rate
The proportion of negatives incorrectly classified as positive.
Focal Loss
A modified cross-entropy loss that down-weights easy examples, helping with class imbalance.
Hinge Loss
A loss function for maximum-margin classification, used in SVMs.
Huber Loss
A loss function that's quadratic for small errors and linear for large errors, robust to outliers.
Intersection over Union
A metric for object detection measuring overlap between predicted and ground truth bounding boxes.
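A minimal NumPy sketch of IoU between two axis-aligned boxes given as [x1, y1, x2, y2].

```python
import numpy as np

def iou(a: np.ndarray, b: np.ndarray) -> float:
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])          # intersection corners
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

print(iou(np.array([0, 0, 10, 10]), np.array([5, 5, 15, 15])))  # about 0.14
```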
Jaccard Index
A measure of similarity between sets, equivalent to IoU for binary segmentation.
Log Loss
Logarithmic loss measuring the accuracy of probabilistic predictions, penalizing confident wrong predictions.
Matthews Correlation Coefficient
A balanced measure for binary classification considering all confusion matrix values, robust to imbalance.
Precision-Recall Curve
A curve showing the tradeoff between precision and recall at different thresholds.
Sensitivity
Same as recall - the proportion of actual positives correctly identified.
Specificity
The proportion of actual negatives correctly identified.
Triplet Loss
A loss for learning embeddings by pulling similar examples together and pushing dissimilar ones apart.
True Negative
Correctly predicted negative cases in classification.
True Positive
Correctly predicted positive cases in classification.
Natural Language Processing
Anaphora Resolution
Determining what a pronoun or noun phrase refers back to in text.
Cloze Task
A task where words are removed from text and must be predicted, used for evaluation and pre-training.
Constituency Parsing
Analyzing sentence structure into nested constituents (noun phrases, verb phrases, etc.).
Coreference Resolution
Identifying all expressions in text that refer to the same entity (e.g., linking pronouns to nouns).
Dependency Parsing
Analyzing grammatical structure by identifying relationships between words (subject, object, modifier).
Dialogue State Tracking
Maintaining a representation of conversation state in dialogue systems.
Entity Linking
Linking entity mentions in text to entries in a knowledge base.
Information Extraction
Automatically extracting structured information from unstructured text.
Intent Recognition
Identifying the user's intention or goal from their utterance in dialogue systems.
Language Modeling
Learning probability distributions over sequences of words to predict what comes next.
Language Understanding
The ability to comprehend meaning, context, intent, and nuance in natural language.
Latent Dirichlet Allocation
A generative probabilistic model for topic modeling that represents documents as mixtures of topics.
Lemmatization
Reducing words to their base or dictionary form (running → run) using linguistic knowledge.
Machine Translation
Automatically translating text from one language to another using neural models (typically encoder-decoder architectures).
N-gram
A contiguous sequence of n items (words, characters) from text, used in language modeling and feature extraction.
Named Entity Recognition
Identifying and classifying named entities (people, organizations, locations) in text into predefined categories.
Natural Language Inference
Determining logical relationships (entailment, contradiction, neutral) between sentence pairs.
Natural Language Processing
The field of AI focused on enabling computers to understand, interpret, and generate human language.
Paraphrase Detection
Determining if two text segments express the same meaning in different words.
Part-of-Speech Tagging
Labeling words in text with their grammatical roles (noun, verb, adjective, etc.).
Question Answering
Systems that automatically answer questions posed in natural language, often by reading and comprehending text passages.
Reading Comprehension
Answering questions about a text passage, testing understanding.
Relation Extraction
Identifying semantic relationships between entities in text.
Semantic Role Labeling
Identifying the semantic relationships between predicates and their arguments in sentences.
Semantic Similarity
Measuring how similar two pieces of text are in meaning, often using embedding-based distance metrics.
Sentiment Analysis
Determining the emotional tone or opinion expressed in text (positive, negative, neutral).
Slot Filling
Extracting specific pieces of information (slots) needed to fulfill a user's intent.
Stemming
Reducing words to their root form by removing affixes (typically suffixes), simpler than lemmatization but less linguistically accurate.
Stop Words
Common words (the, is, at) often removed in NLP preprocessing as they carry little semantic meaning.
Text Classification
Assigning categories or labels to text documents, a fundamental NLP task.
Text Generation
Automatically creating coherent text using language models, from simple completion to creative writing.
Text Summarization
Generating concise summaries of longer texts, either extractive (selecting sentences) or abstractive (generating new text).
Textual Entailment
Determining if one text fragment logically follows from another.
TF-IDF
Term Frequency-Inverse Document Frequency - a statistical measure of word importance in documents, used for information retrieval.
Topic Modeling
Discovering abstract topics in document collections, often using techniques like LDA.
Word Sense Disambiguation
Determining which meaning of a word is used in a particular context.
Word2Vec
A technique for learning word embeddings that capture semantic relationships (Skip-gram and CBOW models).
Neural Networks & Deep Learning
Activation Function
A non-linear function applied to neuron outputs that introduces non-linearity, enabling networks to learn complex patterns.
Attention Is All You Need
The seminal 2017 paper by Vaswani et al. introducing the Transformer architecture that revolutionized NLP.
Autoencoder
An unsupervised neural network that learns to compress data into a latent representation and reconstruct it, useful for dimensionality reduction.
Batch Normalization
A technique that normalizes layer inputs to stabilize and accelerate training by reducing internal covariate shift.
Boltzmann Machine
A stochastic recurrent neural network that can learn probability distributions over binary data.
Capsule Network
An architecture using capsules (groups of neurons) that preserve spatial relationships, addressing limitations of CNNs.
Convolution
A mathematical operation that applies filters/kernels to input data to extract features like edges, textures, and patterns.
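A minimal NumPy sketch of a 2D convolution (strictly, the cross-correlation used in CNN layers): the kernel slides over the input and produces a weighted sum at each position.

```python
import numpy as np

def conv2d(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.random.rand(5, 5)
edge_kernel = np.array([[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]])  # simple vertical-edge filter
print(conv2d(image, edge_kernel).shape)  # (3, 3)
```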
Convolutional Neural Network
A deep learning architecture designed for processing grid-like data (images) using convolutional layers that learn spatial hierarchies.
Deep Belief Network
A generative model composed of multiple layers of RBMs, historically important for unsupervised pre-training.
Deep Learning
A subset of machine learning that uses neural networks with multiple layers (deep neural networks) to learn hierarchical representations of data.
Depthwise Separable Convolution
An efficient convolution that factorizes standard convolution into depthwise and pointwise steps, reducing parameters.
Dilated Convolution
Convolution with gaps between kernel elements, expanding the receptive field without increasing parameters.
Discriminator
In GANs, the network that tries to distinguish between real and generated data, providing training signal to the generator.
Dropout
A regularization technique that randomly deactivates neurons during training to prevent overfitting and improve generalization.
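A minimal NumPy sketch of "inverted" dropout: activations are randomly zeroed during training and the survivors are rescaled so expected values stay unchanged at inference.

```python
import numpy as np

def dropout(x: np.ndarray, p: float = 0.5, training: bool = True) -> np.ndarray:
    if not training or p == 0.0:
        return x                                      # no-op at inference time
    mask = (np.random.rand(*x.shape) >= p).astype(x.dtype)
    return x * mask / (1.0 - p)                       # rescale surviving activations

activations = np.ones((2, 4))
print(dropout(activations, p=0.5))
```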
Echo State Network
A recurrent network with a fixed random reservoir and trained readout layer, efficient for time series processing.
ELU
Exponential Linear Unit - an activation function that allows negative values, helping with vanishing gradients.
Feature Map
The output of applying a convolutional filter to an input, representing detected features at various spatial locations.
Feedforward Network
A neural network where information flows in one direction from input to output without cycles.
Gated Recurrent Unit
A simplified variant of LSTM with fewer parameters, using an update gate and a reset gate to control information flow.
GELU
Gaussian Error Linear Unit - a smooth activation function combining properties of dropout and ReLU, used in BERT and GPT.
Generative Adversarial Network
A framework where two networks (generator and discriminator) compete, with the generator learning to create realistic data.
Generator
In GANs, the network that creates synthetic data attempting to fool the discriminator into thinking it's real.
Group Normalization
Normalizing groups of channels independently, more stable than batch normalization for small batch sizes.
He Initialization
Weight initialization designed for ReLU activations, preventing vanishing/exploding gradients in deep networks.
Hopfield Network
A recurrent network that serves as content-addressable memory, capable of pattern completion and associative memory.
Leaky ReLU
A variant of ReLU allowing small negative values (f(x) = x if x > 0, else αx where α ≈ 0.01), preventing dead neurons.
Liquid State Machine
A reservoir computing model where a recurrent network acts as a dynamic reservoir for temporal pattern recognition.
Long Short-Term Memory
A type of RNN architecture with gates that can learn long-term dependencies, solving the vanishing gradient problem.
Maxout
An activation function that outputs the maximum of multiple linear functions, providing universal approximation.
Mish
A smooth, non-monotonic activation function (x * tanh(softplus(x))) providing better gradients than ReLU.
Multi-Layer Perceptron
A feedforward neural network with multiple layers of perceptrons, capable of learning non-linear functions.
Neural Network
A computational model inspired by biological neural networks, consisting of interconnected nodes (neurons) organized in layers that process information through weighted connections.
Neuromorphic Computing
Hardware and algorithms designed to mimic the brain's structure and function, enabling efficient spike-based computation.
Perceptron
The simplest neural network unit, a single-layer binary classifier that inspired modern deep learning.
Pooling
A down-sampling operation in CNNs that reduces spatial dimensions while retaining important features (max pooling, average pooling).
PReLU
Parametric ReLU - like Leaky ReLU but with learnable negative slope parameter.
Radial Basis Function Network
A neural network using radial basis functions as activation functions, useful for function approximation and interpolation.
Recurrent Neural Network
A neural network architecture with loops that allow information to persist, designed for sequential data like text and time series.
ReLU
Rectified Linear Unit - an activation function that outputs the input if positive, zero otherwise. f(x) = max(0, x).
Residual Connection
Skip connections that allow gradients to flow directly through a network, enabling training of very deep networks (ResNet).
Restricted Boltzmann Machine
A simpler variant of Boltzmann machines with no intra-layer connections, used for unsupervised learning and dimensionality reduction.
Sigmoid
An activation function that squashes values to range (0,1), often used for binary classification and gates in LSTMs.
Softmax
An activation function that converts a vector of values into a probability distribution, commonly used for multi-class classification.
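A minimal NumPy sketch of a numerically stable softmax: subtracting the maximum before exponentiating avoids overflow without changing the result.

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max()                  # stability trick
    e = np.exp(z)
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))   # sums to 1.0
```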
Spectral Normalization
Constraining the spectral norm of weight matrices to stabilize GAN training.
Spiking Neural Network
Networks inspired by biological neurons that communicate through discrete spikes, incorporating temporal dynamics.
Squeeze-and-Excitation
A channel attention mechanism that adaptively recalibrates channel-wise feature responses, improving CNNs.
Swish
A smooth activation function (x * sigmoid(x)) that often outperforms ReLU, discovered through neural architecture search.
Transposed Convolution
An operation that upsamples feature maps, often used in decoders and generative models (also called deconvolution).
Variational Autoencoder
A generative model that learns a probabilistic latent space, allowing sampling of new data points similar to training data.
Weight Normalization
Reparameterizing weight vectors to improve optimization by decoupling magnitude from direction.
Xavier Initialization
A weight initialization strategy maintaining variance across layers, improving training of deep networks.
Practical Applications
Autonomous Vehicles
Self-driving cars using computer vision, sensor fusion, and decision-making AI for navigation and control.
Chatbot
A conversational AI system that interacts with users through natural language, powered by NLP and LLMs.
Code Generation
AI systems that write code from natural language descriptions, powered by models like Codex, GitHub Copilot, or Claude Code.
Content Moderation
Using AI to automatically detect and filter inappropriate, harmful, or policy-violating content.
Drug Discovery
Using AI to accelerate pharmaceutical research by predicting molecular properties, protein structures, and drug-target interactions.
Fraud Detection
Identifying fraudulent transactions or activities using anomaly detection and pattern recognition in financial data.
Medical Diagnosis
AI systems assisting healthcare professionals in diagnosing diseases from medical images, patient data, and symptoms.
Personalization
Tailoring content, recommendations, or experiences to individual users based on their preferences and behavior.
Practical Deployment
Blue-Green Deployment
Running two identical environments to enable zero-downtime model updates and rollbacks.
Calibration
Ensuring predicted probabilities accurately reflect true likelihood of outcomes.
Canary Deployment
Gradually rolling out a new model version to a subset of traffic before full deployment.
CI/CD for ML
Continuous integration and deployment practices adapted for machine learning pipelines.
Data Versioning
Tracking different versions of datasets to ensure reproducibility and manage changes.
Experiment Tracking
Recording hyperparameters, metrics, and artifacts from training runs for comparison and reproducibility.
Feature Store
A centralized platform for managing, storing, and serving features for ML models.
gRPC
A high-performance RPC framework often used for low-latency model serving.
Model Caching
Storing frequently requested predictions to reduce latency and computation.
Model Endpoint
A deployed service exposing a model's predictions via API requests.
Model Lineage
Tracking the origin and dependencies of models including data, code, and parameters.
Model Performance Degradation
Decline in model quality over time due to distribution shift or changing patterns.
Model Registry
A centralized repository for tracking, versioning, and managing trained models.
Model Reproducibility
The ability to recreate exact model results given the same code, data, and environment.
Model Retraining
Periodically updating models with new data to maintain performance as distributions change.
Model Versioning
Tracking different versions of models to enable reproducibility and rollback.
Prediction Confidence
A measure of model certainty in its predictions, important for reliability and user trust.
Request Batching
Combining multiple inference requests into batches to improve throughput.
REST API
A web service interface commonly used for serving model predictions over HTTP.
Shadow Deployment
Running a new model alongside production without affecting user experience, for validation.
Reinforcement Learning
Actor-Critic
RL architecture with two components: an actor (policy) that selects actions and a critic (value function) that evaluates them.
Agent
In RL, the learner or decision-maker that takes actions in an environment to maximize cumulative reward.
Deep Q-Network
Combining Q-learning with deep neural networks to handle high-dimensional state spaces, enabling RL for complex tasks like Atari games.
Environment
In RL, the world the agent interacts with, providing states, accepting actions, and returning rewards.
Exploration vs Exploitation
The RL dilemma of trying new actions (exploration) versus using known good actions (exploitation) to maximize reward.
Markov Decision Process
A mathematical framework for modeling sequential decision-making with states, actions, rewards, and transition probabilities.
Policy
A strategy or mapping from states to actions that defines the agent's behavior in reinforcement learning.
Policy Gradient
RL methods that directly optimize the policy by computing gradients of expected reward with respect to policy parameters.
PPO
Proximal Policy Optimization - a stable and efficient policy gradient algorithm widely used in RLHF for training LLMs.
Q-Learning
A model-free RL algorithm that learns action-value functions (Q-values) to determine optimal actions in each state.
Reward
A scalar feedback signal indicating how good an action was, used to train reinforcement learning agents.
Value Function
A function estimating expected cumulative reward from a state (state-value) or state-action pair (action-value/Q-value).
Specialized AI Topics
Adversarial Perturbation
Small, carefully crafted changes to an input that fool models while remaining imperceptible to humans.
Adversarial Training
Training on adversarial examples to improve model robustness against attacks.
AI Governance
Policies, frameworks, and practices for responsible development and deployment of AI systems.
Algorithmic Accountability
Ensuring AI systems can be held accountable for their decisions and impacts.
Attention Visualization
Visualizing attention weights to understand which inputs the model focuses on.
Backdoor Attack
Maliciously training models to behave normally except when specific triggers are present.
Certified Robustness
Provable guarantees that a model's prediction won't change within a specified input perturbation.
Data Poisoning
Corrupting training data to manipulate model behavior or introduce vulnerabilities.
Explainable AI
Methods and techniques for making AI decision-making transparent and interpretable to humans.
Grad-CAM
Gradient-weighted Class Activation Mapping - visualizing which image regions influenced CNN predictions.
Homomorphic Encryption
Encryption allowing computation on encrypted data, enabling private model inference.
LIME
Local Interpretable Model-agnostic Explanations - explaining individual predictions by approximating with simpler models.
Membership Inference
Determining if a specific example was in the training dataset, a privacy concern.
Model Extraction
Stealing a model's functionality by querying it and training a copy.
Model Inversion
Attacks that reconstruct training data or private information from model parameters or outputs.
Model Watermarking
Embedding identifiable signatures in models to prove ownership and detect theft.
Saliency Map
A visualization highlighting input regions most important for model predictions.
Secure Multi-Party Computation
Protocols allowing parties to jointly compute functions while keeping inputs private.
SHAP
SHapley Additive exPlanations - a unified approach to explaining model predictions using game theory.
Trusted Execution Environment
Secure hardware areas for protected computation, used for private AI inference.
Specialized Domains
Anomaly Detection
Identifying unusual patterns or outliers in data that don't conform to expected behavior, used for fraud detection and monitoring.
Audio Processing
Techniques for analyzing, transforming, and understanding audio signals for tasks like speech recognition and music generation.
Collaborative Filtering
Recommendation technique using patterns from multiple users to predict preferences, assuming similar users like similar items.
Graph Neural Network
Neural networks designed to operate on graph-structured data, learning representations of nodes, edges, and entire graphs.
Knowledge Graph
A structured representation of knowledge as entities and their relationships, used for reasoning and information retrieval.
Optical Character Recognition
Converting images of text (scanned documents, photos) into machine-readable text using computer vision.
Recommender System
AI systems that suggest items (products, content) to users based on preferences, behavior, and similarity.
Retrieval-Augmented Generation
Augmenting LLM generation with retrieved relevant documents, improving factuality and enabling knowledge updates without retraining.
Speech Recognition
Converting spoken language into text using acoustic models and language models, now dominated by deep learning.
Text-to-Speech
Synthesizing natural-sounding speech from text, using neural vocoders and attention-based models.
Time Series Forecasting
Predicting future values based on historical sequential data, using models like ARIMA, LSTMs, or Transformers.
Vector Database
A database optimized for storing and searching high-dimensional vectors (embeddings), enabling semantic search and RAG.
Technical Terms
Backward Pass
The process of computing gradients by propagating error signals backward through the network during training.
Bias
A learnable offset added to neuron inputs, allowing the model to fit data that doesn't pass through the origin.
Bottleneck
A layer or section with reduced dimensions that compresses information, used in autoencoders and efficient architectures.
Filter
Synonym for kernel - the learnable weight matrix applied in convolutions to extract features.
FLOPS
Floating Point Operations Per Second - a measure of hardware computational performance; the related count FLOPs (total floating point operations) is often used to quantify training and inference costs.
Forward Pass
The process of passing input through the network to generate predictions during training or inference.
Hidden Layer
Intermediate layers between input and output that learn hierarchical representations in neural networks.
Inference Latency
The time delay between submitting input and receiving output from a deployed model, critical for real-time applications.
Kernel
A small matrix of weights used in convolutional layers to detect specific features or patterns in input data.
Layer
A collection of neurons/operations that process data together; neural networks are composed of stacked layers.
Padding
Adding borders of zeros (or other values) around input to control output spatial dimensions in convolutions.
Parameter
Learnable values (weights and biases) in a neural network that are adjusted during training to minimize loss.
Receptive Field
The region of input that influences a particular neuron's output, growing larger in deeper layers of CNNs.
Skip Connection
Direct connections bypassing one or more layers, helping gradient flow and enabling deeper networks.
Softmax Temperature
A parameter controlling the smoothness of probability distributions in softmax - lower values make the distribution sharper, higher values make it more uniform.
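A minimal sketch of temperature-scaled softmax (the function and variable names are illustrative):

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits divided by a temperature T: softmax(z / T)."""
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()                    # subtract the max for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

logits = [2.0, 1.0, 0.1]
print(softmax_with_temperature(logits, 0.5))   # sharper: mass concentrates on the top logit
print(softmax_with_temperature(logits, 1.0))   # standard softmax
print(softmax_with_temperature(logits, 2.0))   # flatter: closer to uniform
```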
Stride
The step size by which a convolutional filter or pooling window moves across the input.
Throughput
The number of predictions or tokens a model can process per unit of time, a key deployment performance metric.
Weight
Learnable parameters connecting neurons in neural networks, determining the strength of connections.
Training & Optimization
AdaGrad
An optimizer that adapts learning rates for each parameter based on historical gradients, useful for sparse data.
Adam Optimizer
An adaptive learning rate optimization algorithm combining momentum and RMSprop, widely used for training neural networks.
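A minimal, illustrative sketch of the Adam update rule on a toy quadratic loss. The beta and epsilon values are the commonly cited defaults; the learning rate is raised above the usual default so the toy example converges quickly.

```python
import numpy as np

# One-parameter-vector Adam on the toy loss 0.5 * ||w||^2, whose gradient is simply w.
lr, beta1, beta2, eps = 0.05, 0.9, 0.999, 1e-8

w = np.array([1.0, -2.0])
m = np.zeros_like(w)   # first-moment (momentum-like) estimate
v = np.zeros_like(w)   # second-moment (RMSprop-like) estimate

for t in range(1, 501):
    grad = w                                   # gradient of the toy loss
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)               # bias correction
    v_hat = v / (1 - beta2 ** t)
    w -= lr * m_hat / (np.sqrt(v_hat) + eps)

print(w)   # close to the minimum at the origin
```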
AdamW
Adam with decoupled weight decay, providing better regularization and often superior performance.
Automatic Mixed Precision
Automatically choosing between lower-precision (e.g., FP16) and full-precision (FP32) operations during training to maximize speed while maintaining numerical stability.
Backpropagation
The algorithm for computing gradients of the loss with respect to network weights, enabling training through gradient descent.
Batch Gradient Descent
Computing gradients using the entire dataset, providing stable but slow updates.
Batch Size
The number of training examples processed together in one forward/backward pass.
Cosine Annealing
A learning rate schedule following a cosine curve, smoothly decreasing the rate over training.
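A minimal sketch of a cosine annealing schedule; the `lr_max` and `lr_min` values are illustrative.

```python
import math

def cosine_annealing_lr(step, total_steps, lr_max=3e-4, lr_min=0.0):
    """Learning rate following half a cosine, decaying from lr_max to lr_min."""
    progress = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

total = 1000
for step in (0, 250, 500, 750, 1000):
    print(step, round(cosine_annealing_lr(step, total), 6))
# step 0 gives lr_max, step 500 gives the midpoint, step 1000 gives lr_min
```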
Cross-Entropy Loss
A loss function for classification that measures the difference between predicted and true probability distributions.
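A worked example, assuming a one-hot target so the loss reduces to the negative log-probability of the true class:

```python
import numpy as np

def cross_entropy(probs, target_index, eps=1e-12):
    """Cross-entropy between predicted probabilities and a one-hot target: -log p(true class)."""
    return -np.log(probs[target_index] + eps)

probs = np.array([0.7, 0.2, 0.1])   # predicted class probabilities
print(cross_entropy(probs, 0))      # confident and correct: small loss (~0.36)
print(cross_entropy(probs, 2))      # true class given low probability: large loss (~2.30)
```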
CutMix
Data augmentation combining image patches and labels from two examples, improving robustness.
Cutout
Data augmentation randomly masking out square regions of images during training.
Cyclical Learning Rate
Varying learning rate between bounds in cycles, potentially escaping local minima.
Data Augmentation
Creating variations of training data through transformations (rotation, cropping, noise) to improve model generalization.
Data Parallelism
Replicating the model across devices, each processing different data batches.
Distillation Temperature
A hyperparameter in knowledge distillation controlling how soft the teacher's outputs are.
Early Stopping
Stopping training when validation performance stops improving, preventing overfitting.
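A minimal sketch of patience-based early stopping; `train_one_epoch` and `evaluate` are toy stand-ins for real training and validation code, and the patience value is illustrative.

```python
import random

random.seed(0)

def train_one_epoch():
    pass                                        # placeholder for a real training loop

def evaluate(epoch):
    # Toy validation loss that improves, then plateaus with a little noise.
    return max(1.0 - 0.1 * epoch, 0.3) + random.uniform(0.0, 0.02)

best_val_loss = float("inf")
patience, epochs_without_improvement = 3, 0

for epoch in range(100):
    train_one_epoch()
    val_loss = evaluate(epoch)
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        epochs_without_improvement = 0          # improvement: reset the counter
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            print(f"Stopping early at epoch {epoch}, best val loss {best_val_loss:.3f}")
            break
```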
Epoch
One complete pass through the entire training dataset during the training process.
Exploding Gradient
A problem where gradients become extremely large during backpropagation, causing unstable training and NaN values.
Fine-Tuning
The process of further training a pre-trained model on a specific dataset to adapt it for a particular task or domain.
Flash Attention
An efficient, exact attention algorithm that reduces memory usage and increases speed by tiling the computation and recomputing intermediates instead of materializing the full attention matrix.
Gradient Accumulation
Summing gradients over multiple batches before updating, simulating larger effective batch sizes.
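A minimal sketch on a toy least-squares problem, averaging gradients over several micro-batches before each parameter update; the batch sizes and learning rate are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

w = np.zeros(3)
lr, accumulation_steps, micro_batch = 0.1, 4, 4

accumulated = np.zeros_like(w)
for step in range(400):
    idx = rng.choice(len(X), size=micro_batch, replace=False)
    grad = X[idx].T @ (X[idx] @ w - y[idx]) / micro_batch    # micro-batch gradient
    accumulated += grad
    if (step + 1) % accumulation_steps == 0:
        w -= lr * accumulated / accumulation_steps           # one update per 4 micro-batches
        accumulated[:] = 0.0

print(w)   # close to true_w
```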
Gradient Checkpointing
Trading computation for memory by recomputing activations during backprop instead of storing them.
Gradient Clipping
Limiting gradient magnitudes during training to prevent exploding gradients and stabilize training.
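A minimal sketch of clipping by global L2 norm; the `max_norm` threshold is illustrative.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale a list of gradient arrays so their combined L2 norm is at most max_norm."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if global_norm > max_norm:
        scale = max_norm / global_norm
        grads = [g * scale for g in grads]
    return grads

grads = [np.array([3.0, 4.0]), np.array([12.0])]        # global norm = 13
clipped = clip_by_global_norm(grads, max_norm=1.0)
print(np.sqrt(sum(np.sum(g ** 2) for g in clipped)))    # 1.0
```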
Gradient Descent
An optimization algorithm that iteratively adjusts parameters in the direction of steepest descent of the loss function.
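A minimal sketch of gradient descent on a one-dimensional toy function:

```python
# Gradient descent on f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w, lr = 0.0, 0.1
for _ in range(100):
    grad = 2 * (w - 3)
    w -= lr * grad        # step in the direction of steepest descent
print(w)                  # approaches the minimizer w = 3
```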
L1 Regularization
Adding the sum of absolute weights to the loss function, promoting sparsity and feature selection.
L2 Regularization
Adding the sum of squared weights to the loss function, penalizing large weights and improving generalization.
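A minimal sketch showing how the L1 and L2 penalties from the two entries above are added to a base loss; the lambda values are illustrative.

```python
import numpy as np

def regularized_loss(base_loss, weights, l1_lambda=0.0, l2_lambda=0.0):
    """Base loss plus an optional L1 (sum of |w|) and L2 (sum of w^2) penalty."""
    l1_penalty = l1_lambda * np.sum(np.abs(weights))
    l2_penalty = l2_lambda * np.sum(weights ** 2)
    return base_loss + l1_penalty + l2_penalty

weights = np.array([0.5, -1.5, 2.0])
print(regularized_loss(1.0, weights, l1_lambda=0.01))   # 1.0 + 0.01 * 4.0  = 1.04
print(regularized_loss(1.0, weights, l2_lambda=0.01))   # 1.0 + 0.01 * 6.5  = 1.065
```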
Label Smoothing
Softening target labels to prevent overconfidence and improve generalization.
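A minimal sketch, assuming a one-hot target and a smoothing factor of 0.1:

```python
import numpy as np

def smooth_labels(one_hot, epsilon=0.1):
    """Put (1 - eps) of the mass on the true class and spread eps uniformly over all classes."""
    num_classes = one_hot.shape[-1]
    return one_hot * (1 - epsilon) + epsilon / num_classes

one_hot = np.array([0.0, 1.0, 0.0, 0.0])
print(smooth_labels(one_hot))   # [0.025, 0.925, 0.025, 0.025]
```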
LAMB Optimizer
Layer-wise Adaptive Moments optimizer for Batch training - enables very large batch training for transformers.
Learning Rate
A hyperparameter controlling the step size in gradient descent - too high causes instability, too low slows convergence.
Learning Rate Decay
Gradually reducing the learning rate during training to fine-tune convergence.
Learning Rate Schedule
A strategy for adjusting the learning rate during training (decay, warm-up, cosine annealing) to improve convergence.
Lookahead Optimizer
A wrapper that maintains fast and slow weights, periodically updating slow weights, improving convergence.
LoRA
Low-Rank Adaptation - a parameter-efficient fine-tuning method that updates only small low-rank matrices instead of full weights.
Loss Function
A function measuring the difference between model predictions and true values, guiding the training process.
Mean Squared Error
A loss function for regression that computes the average squared difference between predictions and targets.
Mini-Batch Gradient Descent
Computing gradients on small batches of data, balancing SGD's noise with full-batch GD's stability.
Mixed Precision Training
Using lower precision (FP16) for some computations while keeping FP32 for stability, speeding up training.
Mixup
Data augmentation creating synthetic examples by interpolating between training examples and their labels.
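A minimal sketch of mixup on two toy examples; the `alpha` value is a commonly used choice but illustrative here.

```python
import numpy as np

rng = np.random.default_rng(0)

def mixup(x1, y1, x2, y2, alpha=0.2):
    """Interpolate two examples and their one-hot labels with a Beta(alpha, alpha) weight."""
    lam = rng.beta(alpha, alpha)
    x_mixed = lam * x1 + (1 - lam) * x2
    y_mixed = lam * y1 + (1 - lam) * y2
    return x_mixed, y_mixed

x1, y1 = np.ones((4, 4)), np.array([1.0, 0.0])    # toy "image" of class 0
x2, y2 = np.zeros((4, 4)), np.array([0.0, 1.0])   # toy "image" of class 1
x_mixed, y_mixed = mixup(x1, y1, x2, y2)
print(y_mixed)   # a soft label between the two classes
```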
Momentum
An optimization technique that accelerates gradient descent by accumulating past gradients, helping escape local minima.
Nesterov Momentum
A momentum variant that looks ahead before computing gradients, often converging faster.
Pipeline Parallelism
Splitting model layers across devices and processing micro-batches in a pipelined fashion so devices work concurrently.
Pre-training
Training a model on a large dataset (often self-supervised) before fine-tuning on specific tasks, enabling transfer learning.
Regularization
Techniques to prevent overfitting by adding constraints or penalties to the model (L1, L2, dropout, early stopping).
RLHF
Reinforcement Learning from Human Feedback - training models using human preferences to align behavior with human values.
RMSprop
An optimizer using a moving average of squared gradients to adapt learning rates, addressing AdaGrad's diminishing rates.
Sharded Data Parallelism
Distributing model states across devices to train models larger than single-device memory.
Step Decay
Reducing learning rate by a factor at specific epochs, a simple scheduling strategy.
Stochastic Gradient Descent
A variant of gradient descent that updates parameters using the gradient computed on a single random training example at a time; in practice the term is often used for mini-batch gradient descent.
Student Model
The smaller model in knowledge distillation learning to mimic the teacher's behavior.
Teacher Model
The larger, more accurate model in knowledge distillation that guides student training.
Tensor Parallelism
Splitting individual layers/tensors across devices for very large models.
Training
The process of teaching a machine learning model by adjusting its parameters based on data to minimize prediction errors.
Transfer Learning
Leveraging knowledge learned from one task/domain to improve performance on a related task with less data.
Vanishing Gradient
A problem where gradients become extremely small during backpropagation, preventing deep networks from learning effectively.
Warmup
Gradually increasing the learning rate at training start to stabilize optimization.
Weight Decay
A regularization technique that shrinks weights toward zero during optimization. Equivalent to L2 regularization in standard SGD, but differs when using adaptive optimizers like Adam.
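A minimal sketch contrasting the two formulations under plain SGD, where they coincide; the learning-rate, decay, and gradient values are illustrative.

```python
import numpy as np

lr, wd = 0.1, 0.01
w = np.array([1.0, -2.0])
grad = np.array([0.3, 0.3])          # illustrative gradient of the unregularized loss

# L2 regularization: fold wd * w into the gradient, then take an SGD step.
w_l2 = w - lr * (grad + wd * w)

# Decoupled weight decay: take the SGD step, then shrink the weights by lr * wd.
w_decoupled = w - lr * grad - lr * wd * w

print(w_l2, w_decoupled)             # identical for plain SGD; they diverge under Adam, which AdamW addresses
```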
ZeRO
Zero Redundancy Optimizer - a family of techniques for memory-efficient distributed training that partition optimizer states, gradients, and parameters across devices.