Mixture of experts (MoE) is an AI model architecture that uses multiple, specialized submodels, or "experts," to handle tasks more efficiently than a single, monolithic model.
Mixture-of-experts models scale by activating only the portions of the network that matter for a given input token. Instead of running every parameter on every step, a learned routing mechanism sparsely selects which subnetworks should participate, allowing the model to grow capacity without paying the full compute cost. This selective activation is what lets MoE architectures push parameter counts into extreme regimes while keeping AI inference practical. MoE models are built from four core elements:
In MoE-based large language models (LLMs) like DeepSeek-R1 and gpt-oss-120B, tokens first traverse the same self-attention blocks used in dense architectures. The MoE pathway begins immediately afterward: The gating network examines the attention outputs, selects a targeted subset of experts, and routes each token accordingly. This sparse routing repeats across successive MoE layers, progressively shaping the model’s internal representation until the final output is produced.
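As a rough illustration of that routing step, the sketch below shows a simplified top-k gated MoE layer in PyTorch. The expert count, layer sizes, and top-k value are illustrative assumptions rather than the configuration of DeepSeek-R1 or gpt-oss-120B, and the per-expert Python loop stands in for the fused dispatch kernels used in production systems.

```python
# Minimal sketch of top-k expert routing for one MoE layer, assuming the
# input is the attention output for a batch of tokens. All sizes are
# illustrative, not taken from any particular model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Gating network: scores each token against every expert.
        self.gate = nn.Linear(d_model, num_experts)
        # Each expert is an ordinary feed-forward subnetwork.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                      # x: [num_tokens, d_model]
        scores = self.gate(x)                  # [num_tokens, num_experts]
        weights, indices = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # normalize over the selected experts
        out = torch.zeros_like(x)
        # Only the selected experts run for each token (sparse activation).
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(16, 512)                  # e.g., attention outputs for 16 tokens
print(MoELayer()(tokens).shape)                # torch.Size([16, 512])
```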
During training, experts become increasingly specialized in the areas where they perform best, and the gating network (or router) learns the optimal routing of tokens. This process requires careful load balancing to ensure that all experts in the network receive comparable training exposure, rather than the router collapsing onto a small, overused subset.
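One widely used way to encourage this balance is an auxiliary load-balancing loss added to the training objective, in the style popularized by Switch Transformer-type MoEs. The sketch below illustrates the idea; the token and expert counts and the use of top-1 routing are simplifying assumptions, and the loss would normally be scaled by a small coefficient before being added to the language-modeling loss.

```python
# Sketch of a Switch Transformer-style auxiliary load-balancing loss.
# It penalizes routing patterns where a few experts receive most tokens.
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits, top1_indices, num_experts):
    # Fraction of tokens actually dispatched to each expert.
    dispatch_fraction = F.one_hot(top1_indices, num_experts).float().mean(dim=0)
    # Average router probability assigned to each expert.
    router_prob = F.softmax(router_logits, dim=-1).mean(dim=0)
    # The product is minimized when both distributions are uniform (1 / num_experts).
    return num_experts * torch.sum(dispatch_fraction * router_prob)

logits = torch.randn(1024, 8)            # router scores for 1,024 tokens, 8 experts
loss = load_balancing_loss(logits, logits.argmax(dim=-1), num_experts=8)
print(loss)  # ~1.0 when load is balanced, larger when a few experts dominate
```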
During inference, a mixture-of-experts model can spread its experts across many GPUs, with each device hosting only a small subset of them. When combined with high-speed networking, this distribution improves throughput and reduces latency. Effective load balancing is critical to prevent hotspots, where a single GPU is overloaded with popular experts, so that compute, memory, and network resources are used efficiently across the entire system.
For every token processed by an MoE-based LLM, the router must select a subset of experts in a layer to handle that token. Once processed, the outputs from those experts must be aggregated before moving to the next layer. When experts are distributed across multiple GPUs, these token dispatch and aggregation operations place tremendous strain on memory and interconnect bandwidth, which is why high-performance scale-up networking is critical: it keeps data moving quickly and efficiently between GPUs to maintain throughput and minimize latency.
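The sketch below shows this dispatch, compute, and combine pattern on a single device. In a real multi-GPU deployment, the gather and scatter steps become all-to-all exchanges over the scale-up fabric; the top-1 routing and the plain linear "experts" here are simplifying assumptions for illustration.

```python
# Sketch of the dispatch/compute/combine pattern for one MoE layer, shown
# on a single device. Across GPUs, the gather and scatter steps turn into
# all-to-all communication between the devices that host the experts.
import torch

def dispatch_and_combine(x, expert_ids, experts):
    # x: [num_tokens, d_model]; expert_ids: [num_tokens] (top-1 routing for brevity)
    out = torch.empty_like(x)
    for e, expert in enumerate(experts):
        token_idx = (expert_ids == e).nonzero(as_tuple=True)[0]
        if token_idx.numel() == 0:
            continue
        expert_input = x[token_idx]          # dispatch: gather this expert's tokens
        expert_output = expert(expert_input)  # compute: expert processes only its tokens
        out[token_idx] = expert_output        # combine: scatter results back in place
    return out

experts = [torch.nn.Linear(512, 512) for _ in range(4)]
x = torch.randn(32, 512)
expert_ids = torch.randint(0, 4, (32,))
print(dispatch_and_combine(x, expert_ids, experts).shape)  # torch.Size([32, 512])
```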
Most top-performing models today use MoE architectures because they deliver higher quality than dense models at similar computational cost, even as their total parameter counts grow much larger. By activating only a small subset of specialized experts for each token, MoE models achieve state-of-the-art results with inference cost and latency that rival or beat those of much smaller dense models. This is a fundamental reason why they dominate quality benchmarks without a proportional increase in compute or latency. The result is a better user experience, higher throughput, and significant total cost of ownership (TCO) advantages for large-scale deployments.
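A back-of-the-envelope calculation makes the compute argument concrete. The parameter counts below are illustrative assumptions rather than the figures of any particular model, and the rule of thumb of roughly two FLOPs per active parameter per token is only an approximation.

```python
# Illustrative comparison of per-token compute for a hypothetical MoE model
# versus a hypothetical dense model. Numbers are assumptions, not real specs.
total_params  = 120e9   # total parameters stored by the MoE model
active_params = 6e9     # parameters activated per token (selected experts only)
dense_params  = 70e9    # a dense model with fewer total parameters

# Forward-pass FLOPs per token scale roughly as 2 * (active parameters).
moe_flops   = 2 * active_params
dense_flops = 2 * dense_params

print(f"MoE per-token compute:   {moe_flops:.1e} FLOPs")
print(f"Dense per-token compute: {dense_flops:.1e} FLOPs")
print(f"The MoE holds {total_params / dense_params:.1f}x more parameters "
      f"while using {dense_flops / moe_flops:.1f}x less compute per token.")
```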
Leading AI models like DeepSeek-R1 increasingly use MoE architectures. MoE lets AI models divide work among specialized “experts,” so only the right experts are used for each task, making AI smarter, faster, and more efficient. For complex MoE models, like DeepSeek-R1, industry-leading performance is unlocked by extreme hardware-software codesign and novel techniques like wide-expert parallelism and disaggregated serving that squeeze out every ounce of MoE inference performance at chip, rack, and data center scales.
Discover how mixture of experts powers cutting‑edge applications across language, vision, data analytics, healthcare, and autonomous systems, unlocking higher accuracy and efficiency for domain‑specific workloads.
Extreme codesign refers to a collaborative approach that tightly integrates hardware and software to optimize the performance and scalability of advanced AI models like MoEs. This is exemplified by new platforms such as NVIDIA Blackwell and orchestration systems like NVIDIA Dynamo:
Deploying and training MoE architectures involves several unique challenges.