Mixture-of-Transformers (MoT) is a model architecture that combines multiple transformer blocks into a unified system, enabling it to use the appropriate processing strategy for the input, task, or generation objective while maintaining a shared representation space.
MoT works by combining multiple transformer computation paths within a single model and deciding which path should handle which part of the workload. The “mixture” can happen at the level of tokens, layers, generation objectives (such as next-token prediction for language and reasoning, or denoising for continuous signal generation), or generation modes.
First, input types such as text, images, video frames, audio chunks, actions, or latent diffusion tokens are converted into tokens. Even if the raw inputs are different, the model maps them into a common embedding space so they can be processed together.
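The mapping into a common embedding space can be sketched as follows. This is a minimal illustration, not any specific model's tokenizer: the modality names, raw feature sizes, embedding width `D_MODEL`, and the random matrices standing in for learned projections are all assumptions made for the example.

```python
import random

D_MODEL = 8  # hypothetical width of the shared embedding space

# Hypothetical raw feature sizes for each modality's tokens
RAW_DIMS = {"text": 4, "image": 16, "audio": 6}


def make_projection(in_dim, out_dim, seed):
    """Build an in_dim x out_dim matrix of small random values.

    In a trained model these weights would be learned; random values
    are used here only to keep the sketch self-contained.
    """
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(out_dim)] for _ in range(in_dim)]


def project(vec, matrix):
    """Multiply a raw feature vector by a projection matrix."""
    out_dim = len(matrix[0])
    return [sum(v * row[j] for v, row in zip(vec, matrix)) for j in range(out_dim)]


# One projection per modality, all targeting the same output width
PROJECTIONS = {
    modality: make_projection(dim, D_MODEL, seed=i)
    for i, (modality, dim) in enumerate(RAW_DIMS.items())
}


def embed(sequence):
    """Map (modality, raw_features) pairs to (modality, embedding) pairs."""
    return [(m, project(x, PROJECTIONS[m])) for m, x in sequence]


mixed = [("text", [1.0] * 4), ("image", [0.5] * 16), ("audio", [0.2] * 6)]
embedded = embed(mixed)
```

After `embed`, every token has the same width regardless of where it came from, and the modality tag travels with it so later stages can still tell the tokens apart.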
Next, instead of one transformer doing everything the same way, MoT can include different modality-aware transformer blocks, for example separate transformer paths for text and for images.
Similarly, the system may route based on modality, task, timestep, token type, prompt instruction, or generation objective. MoT makes transformer computation conditional: different inputs, tasks, or generation objectives can activate different transformer capabilities inside the same unified model.
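The conditional computation described above can be sketched with a small dispatcher. This is a simplified illustration under stated assumptions: routing here uses only a modality tag (the text also lists task, timestep, token type, and objective as possible routing signals), and `text_path` and `image_path` are hypothetical stand-ins for real transformer blocks.

```python
def text_path(vec):
    """Stand-in for a text-specific transformer block."""
    return [2.0 * v for v in vec]


def image_path(vec):
    """Stand-in for an image-specific transformer block."""
    return [v + 1.0 for v in vec]


# Hypothetical mapping from modality tag to computation path
PATHS = {"text": text_path, "image": image_path}


def route(tokens):
    """Send each token through the path for its modality.

    Tokens are grouped by modality, processed by the matching path,
    and written back to their original positions so the interleaved
    sequence order is preserved.
    """
    groups = {}
    for i, (modality, _) in enumerate(tokens):
        groups.setdefault(modality, []).append(i)

    out = [None] * len(tokens)
    for modality, indices in groups.items():
        path = PATHS[modality]
        for i in indices:
            out[i] = path(tokens[i][1])
    return out


tokens = [("text", [1.0, 2.0]), ("image", [0.0, 0.5]), ("text", [3.0, 4.0])]
routed = route(tokens)
```

Grouping by modality before dispatch mirrors how a batched implementation would gather tokens for each path; only the path matching a token's tag does any work for that token.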
MoT models are often trained with more than one learning objective, such as next-token prediction, denoising, and task-specific objectives.
The first two objectives define how the model builds language and continuous signals. Task-specific objectives like planning or simulation can be trained using either next-token prediction or denoising loss, depending on the generation mode.
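A combined training objective of this kind might look like the sketch below. It assumes cross-entropy for next-token prediction and a mean-squared-error noise-prediction loss for denoising; the exact losses and the `denoise_weight` parameter are illustrative assumptions, not details taken from any particular MoT model.

```python
import math


def next_token_loss(logits, target_index):
    """Cross-entropy of the target token under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target_index]


def denoising_loss(predicted_noise, true_noise):
    """Mean squared error between predicted and actual noise."""
    squared = [(p - t) ** 2 for p, t in zip(predicted_noise, true_noise)]
    return sum(squared) / len(squared)


def combined_loss(text_examples, signal_examples, denoise_weight=1.0):
    """Average each loss over its own examples, then sum them.

    text_examples:   list of (logits, target_index) pairs
    signal_examples: list of (predicted_noise, true_noise) pairs
    denoise_weight:  hypothetical knob balancing the two objectives
    """
    lm = sum(next_token_loss(l, t) for l, t in text_examples) / len(text_examples)
    dn = sum(denoising_loss(p, n) for p, n in signal_examples) / len(signal_examples)
    return lm + denoise_weight * dn


loss = combined_loss(
    text_examples=[([0.0, 0.0], 0)],
    signal_examples=[([1.0, 2.0], [1.0, 0.0])],
)
```

Each token contributes only to the loss that matches its generation mode, so discrete and continuous outputs can be trained in the same step.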
This teaches the model not only to process different data types, but to apply the right computational behavior to the right job.
The alternative is using separate specialist models per modality and workload, but this adds computational overhead and doesn’t scale well.
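MoT as a single unified system shares one representation space across modalities and activates only the components a given input needs, which makes it more scalable and efficient than maintaining a separate specialist for each modality and workload.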
MoT routes different input types through specialized transformer sub-networks, making it powerful for tasks that combine multiple modalities or require distinct computational behaviors at the same time.
Omni-models are models capable of understanding and generating multiple modalities, such as text, images, videos, audio, and actions. MoT is not a model but a model architecture for building omni-models.
A world model understands dynamics of the real world and can simulate how the world changes based on data like text, video, and action sequences. MoT enables world models by providing an architecture that understands these different data streams and generates future world states that are physically accurate, coherent across modalities, and useful for planning actions.
Both MoE and MoT are ways to make large AI models more efficient. Instead of activating the entire neural network for every input, they activate only the parts that are needed.
The two approaches are complementary and can be combined. For example, a MoT architecture may have separate transformer paths for text and images, and each transformer block inside those paths can still use MoE layers internally.
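The combination can be sketched as two nested selection steps: first a path chosen by modality (the MoT step), then an expert chosen inside that path (the MoE step). Everything here is a toy assumption made for illustration: the expert pools, the scaling functions standing in for expert networks, and especially the `gate` rule, which in a real MoE layer would be a learned scoring function.

```python
def make_expert(scale):
    """Return a toy expert that scales its input by a fixed factor."""
    return lambda vec: [scale * v for v in vec]


# Hypothetical: each modality path owns its own pool of experts
EXPERTS = {
    "text": [make_expert(1.0), make_expert(2.0)],
    "image": [make_expert(10.0), make_expert(20.0)],
}


def gate(vec, num_experts):
    """Toy top-1 gate: choose an expert from the sign of the feature sum."""
    return 0 if sum(vec) < 0 else min(1, num_experts - 1)


def forward(modality, vec):
    """Run one token through its modality path and one expert inside it."""
    experts = EXPERTS[modality]        # MoT step: pick the path by modality
    chosen = gate(vec, len(experts))   # MoE step: pick one expert in that path
    return chosen, experts[chosen](vec)


text_result = forward("text", [1.0, 1.0])
image_result = forward("image", [-1.0, 0.5])
```

The first choice depends only on what kind of token it is, while the second depends on the token's content, which is why the two mechanisms can be layered without conflict.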
Compared to a dense model of equivalent parameter count, MoT reduces active computation per request by using only the transformer components for the relevant modality. The model does less work per token than a dense model that runs every parameter for every input. This can reduce latency and improve throughput while maintaining the same model size.
The comparison changes if you measure against a dense model with fewer total parameters, sized so that it activates roughly as many parameters per request as MoT does. In that case, both models use similar amounts of compute per request, and latency would be roughly equivalent. MoT's advantage in that scenario is quality rather than lower compute cost: specializing components by modality tends to produce better results than a single dense path with the same active size.
In either case, realizing the latency benefit in production requires optimized serving, batching, and GPU scheduling, because the architecture's efficiency potential does not translate into real-world gains automatically without the right inference stack.
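The two dense-model comparisons above come down to simple arithmetic on total versus active parameters. The sketch below uses made-up parameter counts and assumes the model splits into a part shared by all tokens plus one modality-specific part per modality; it also treats parameter count as a rough proxy for compute, which ignores attention cost and serving overheads.

```python
def mot_param_counts(shared, per_modality):
    """Return (total parameters, active parameters per modality).

    shared:       parameters used by every token (assumed to exist)
    per_modality: dict of modality -> modality-specific parameters
    """
    total = shared + sum(per_modality.values())
    active = {m: shared + p for m, p in per_modality.items()}
    return total, active


# Hypothetical sizes, chosen only to make the arithmetic easy to follow
total, active = mot_param_counts(
    shared=2_000_000_000,
    per_modality={"text": 6_000_000_000, "image": 6_000_000_000},
)

# Fraction of the model that does work for a text token
text_fraction = active["text"] / total
```

With these numbers, a dense model matched on total size would run all 14 billion parameters for every token, while the MoT model runs 8 billion for a text token. A dense model matched on active size would have 8 billion parameters in total, so its compute per token would be about the same and the comparison shifts to quality.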
Next Steps to Get Started with Mixture-of-Transformers (MoT)
Learn how NVIDIA Cosmos 3 uses MoT for building world foundation models in the Cosmos 3 whitepaper.
An open platform for physical AI with world foundation models (WFMs), video data processing libraries, video evaluation, and post-training frameworks.
A comprehensive guide for working with the NVIDIA Cosmos ecosystem for real-world, domain-specific applications across robotics, simulation, autonomous systems, and physical scene understanding.