What Is Mixture-of-Transformers (MoT)?

Mixture-of-Transformers (MoT) is a model architecture that combines multiple transformer blocks into a unified system, enabling it to use the appropriate processing strategy for the input, task, or generation objective while maintaining a shared representation space.

How Does Mixture-of-Transformers (MoT) Work?

MoT works by combining multiple transformer computation paths within a single model and deciding which path should handle which part of the workload. The “mixture” can happen at the level of tokens, layers, generation objectives (such as next-token prediction for language and reasoning, or denoising for continuous signal generation), or generation modes.

First, inputs such as text, images, video frames, audio chunks, actions, or latent diffusion tokens are converted into tokens. Even when the raw inputs differ, the model maps them into a common embedding space so they can be processed together.
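
As a rough illustration of the shared embedding space, the sketch below projects feature vectors of different sizes into one common width. The width `D_MODEL`, the per-modality feature sizes in `RAW_DIMS`, and the random matrices standing in for learned projections are all assumptions made for this example, not details of any particular MoT model.

```python
import random

D_MODEL = 8  # hypothetical shared embedding width

def make_projection(in_dim, out_dim, seed):
    """Random in_dim x out_dim matrix standing in for a learned projection."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(out_dim)] for _ in range(in_dim)]

def project(features, weights):
    """Multiply one feature vector by a projection matrix."""
    out_dim = len(weights[0])
    return [sum(f * weights[i][j] for i, f in enumerate(features)) for j in range(out_dim)]

# Hypothetical raw feature sizes per modality (illustrative only).
RAW_DIMS = {"text": 4, "image": 16, "audio": 6}
PROJECTIONS = {
    modality: make_projection(dim, D_MODEL, seed=i)
    for i, (modality, dim) in enumerate(RAW_DIMS.items())
}

def embed(modality, features):
    """Map raw features of any supported modality to a D_MODEL-wide token."""
    return {"modality": modality, "vector": project(features, PROJECTIONS[modality])}

# Tokens from different modalities end up in one sequence with one width.
sequence = [
    embed("text", [1.0, 0.0, 0.0, 0.0]),
    embed("image", [0.5] * 16),
    embed("audio", [0.2] * 6),
]
```

Each token keeps its modality tag, which later stages could use to choose a processing path.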

Next, instead of one transformer doing everything the same way, MoT can include different modality-aware transformer blocks, for example:

  • Autoregressive transformer path for next-token prediction, reasoning, planning, or temporal sequence modeling.
  • Diffusion transformer path for denoising, refinement, and high-fidelity image, video, audio, or action generation.

The system can then route computation based on modality, task, timestep, token type, prompt instruction, or generation objective. In this way, MoT makes transformer computation conditional: different inputs, tasks, or generation objectives can activate different transformer capabilities inside the same unified model.
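
The routing idea can be sketched as a fixed lookup from token type to processing path. In the example below, the `Token` class, the `ROUTING_TABLE` entries, and the two toy path functions are hypothetical stand-ins; real paths would be full transformer blocks rather than one-line arithmetic.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Token:
    kind: str            # e.g. "text" or "image_latent" (hypothetical labels)
    value: List[float]   # embedding vector

def autoregressive_path(vec):
    """Toy stand-in for an autoregressive transformer block."""
    return [2.0 * x for x in vec]

def diffusion_path(vec):
    """Toy stand-in for a diffusion (denoising) transformer block."""
    return [x - 0.5 for x in vec]

PATHS = {"autoregressive": autoregressive_path, "diffusion": diffusion_path}

# Assumed mapping from token type to path; a real system may also
# condition on task, timestep, or prompt instruction.
ROUTING_TABLE = {
    "text": "autoregressive",
    "action": "autoregressive",
    "image_latent": "diffusion",
    "video_latent": "diffusion",
}

def route(tokens):
    """Send each token through the path its type maps to."""
    outputs = []
    for tok in tokens:
        path_name = ROUTING_TABLE[tok.kind]
        outputs.append((path_name, PATHS[path_name](tok.value)))
    return outputs

routed = route([Token("text", [1.0, 2.0]), Token("image_latent", [1.0, 2.0])])
```

Only the path selected for a token runs for that token, which is the sense in which the computation is conditional.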

Why Is Mixture-of-Transformers (MoT) Important?

MoT models are often trained with more than one learning objective, such as:

  • next-token prediction for language, action, or sequence reasoning
  • denoising loss for diffusion generation
  • task-specific losses for planning, simulation, or control

The first two objectives define how the model builds language and continuous signals. Task-specific objectives like planning or simulation can be trained using either next-token prediction or denoising loss, depending on the generation mode.

This teaches the model not only to process different data types, but to apply the right computational behavior to the right job.
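
A minimal sketch of how two of these objectives might be combined into one training signal follows. It assumes a cross-entropy loss for next-token prediction and a mean-squared-error loss for denoising, summed with weights; the function names and the `ar_weight` and `diffusion_weight` parameters are illustrative assumptions, not a documented training recipe.

```python
import math

def next_token_loss(logits, target_index):
    """Cross-entropy of a softmax over logits against the true next token."""
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum_exp - logits[target_index]

def denoising_loss(predicted_noise, true_noise):
    """Mean squared error between predicted and true noise."""
    squared = [(p - t) ** 2 for p, t in zip(predicted_noise, true_noise)]
    return sum(squared) / len(squared)

def combined_loss(ar_examples, diffusion_examples, ar_weight=1.0, diffusion_weight=1.0):
    """Weighted sum of the mean loss of each objective (weights are assumptions)."""
    ar = sum(next_token_loss(l, t) for l, t in ar_examples) / len(ar_examples)
    dn = sum(denoising_loss(p, t) for p, t in diffusion_examples) / len(diffusion_examples)
    return ar_weight * ar + diffusion_weight * dn

loss = combined_loss(
    ar_examples=[([0.0, 0.0], 0)],                      # (logits, target index)
    diffusion_examples=[([1.0, 2.0], [1.0, 0.0])],      # (predicted, true noise)
)
```

A task-specific objective such as planning could reuse either loss function, depending on whether its outputs are discrete tokens or continuous signals.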

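The alternative is to use separate specialist models for each modality and workload, but this adds computational overhead and doesn’t scale well.

A unified MoT system avoids that duplication: one model shares a representation space across modalities and activates only the paths a request needs, which can make it more scalable, more efficient, and higher-performing than a collection of separate specialists.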

Applications and Use Cases of Mixture-of-Transformers (MoT)

MoT routes different input types through specialized transformer sub-networks, making it powerful for tasks that combine multiple modalities or require distinct computational behaviors at the same time.

Physical AI

MoT architectures help world foundation models process the diverse data physical agents need, including video, language, sensors, maps and actions. Specialized transformer blocks can handle perception, motion, physics, planning or control, enabling robots and autonomous vehicles to reason more efficiently in complex real-world environments.

Enterprise or Agentic AI

For enterprise and agentic AI, MoT designs can route different tasks to specialized experts, such as retrieval, planning, coding, document understanding, analytics or tool use. This makes it possible to build larger, more capable AI agents that adapt to different business workflows without running every part of the model for every request.

Vision AI

MoT architectures can specialize across image types, camera views, resolutions, scenes and tasks such as classification, detection, segmentation, video understanding and visual search. This is especially useful for applications like retail analytics, medical imaging, industrial inspection, security and autonomous systems, where visual data varies widely and requires domain-specific interpretation.

Conversational AI

MoT designs in foundation models can support richer, faster and more personalized interactions across text, speech and multimodal inputs. Experts can specialize in languages, accents, dialogue memory, emotion, intent detection or real-time response generation, helping assistants deliver more accurate and natural conversations at lower serving cost.

What Are the Benefits of Mixture-of-Transformers (MoT)?

Efficient Conditional Computation

MoT improves efficiency by activating only the transformer path needed for a specific input type or generation objective. Instead of running the full model for every request, it uses conditional computation, helping reduce training and inference costs compared with dense multimodal models.
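
To make the saving concrete, the arithmetic below counts the parameters that are active for a single request. All figures (`SHARED_PARAMS`, `PATH_PARAMS`) are made-up round numbers chosen for illustration; they do not describe any released model.

```python
def active_parameters(shared_params, path_params, active_paths):
    """Parameters used by a request: shared components plus the selected paths."""
    return shared_params + sum(path_params[name] for name in active_paths)

# Hypothetical parameter counts, for illustration only.
SHARED_PARAMS = 1_000_000_000
PATH_PARAMS = {
    "text": 4_000_000_000,
    "image": 3_000_000_000,
    "audio": 2_000_000_000,
}

total_params = SHARED_PARAMS + sum(PATH_PARAMS.values())
text_only = active_parameters(SHARED_PARAMS, PATH_PARAMS, ["text"])
active_fraction = text_only / total_params

print(f"total: {total_params:,}  active for text: {text_only:,}  fraction: {active_fraction:.0%}")
```

Under these assumed numbers a text-only request touches half of the model's parameters, while a dense model of the same size would touch all of them.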

Enables Cross-Modal Understanding

MoT gives each modality, such as text, image, video, or speech, its own specialized processing path while still allowing the full sequence to share attention. This means the model can learn the unique structure of each data type and still reason across them together.
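
The combination of modality-specific processing and shared attention can be sketched in a few lines. This is a deliberately simplified toy: per-modality scale factors in `MODALITY_SCALE` stand in for modality-specific weights, and the same projected vector is reused as query, key, and value, which real models would not do.

```python
import math

# Hypothetical per-modality scale factors standing in for
# modality-specific projection weights.
MODALITY_SCALE = {"text": 1.0, "image": 0.5}

def project(token):
    """Modality-specific step: each modality uses its own parameters."""
    modality, vec = token
    scale = MODALITY_SCALE[modality]
    return [scale * x for x in vec]

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def shared_attention(tokens):
    """Global step: every token attends over the full mixed-modality sequence."""
    projected = [project(t) for t in tokens]
    dim = len(projected[0])
    outputs = []
    for query in projected:
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(dim) for key in projected]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, projected)) for j in range(dim)])
    return outputs

mixed = shared_attention([("text", [1.0, 0.0]), ("image", [0.0, 2.0])])
```

In the result, the text token's output has a nonzero second component that comes only from the image token, showing information crossing modalities through the shared attention step.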

Scalable Capability Support

MoT makes it easier to expand a model's capabilities over time. Because different paths can operate within a shared representation space, developers can add specialized functions, such as video generation, speech understanding or reasoning, without redesigning the entire architecture.

Challenges and Solutions

Data Requirements and Curation

Each modality or capability path needs enough high-quality training data. If one path has weaker data, the overall model may perform unevenly across tasks.

Solutions

  • Large-scale data curation is key here. Depending on the use case, developers need to identify high-quality data for each desired modality in the training dataset, monitor performance per objective, and adjust sampling when one expert lags behind.
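
One possible way to adjust sampling when a path lags is sketched below. The heuristic (upweight a data source in proportion to how far its loss exceeds a target, then renormalize) and the names `rebalance_sampling`, `targets`, and `step` are assumptions for illustration; other rebalancing schemes are equally plausible.

```python
def rebalance_sampling(weights, losses, targets, step=0.5):
    """Upweight data sources whose loss exceeds its target, then renormalize."""
    adjusted = {}
    for name, weight in weights.items():
        # Relative amount by which this objective lags its target (0 if on target).
        gap = max(0.0, losses[name] / targets[name] - 1.0)
        adjusted[name] = weight * (1.0 + step * gap)
    total = sum(adjusted.values())
    return {name: weight / total for name, weight in adjusted.items()}

# Hypothetical monitoring snapshot: the video path is lagging.
new_weights = rebalance_sampling(
    weights={"text": 0.5, "video": 0.5},
    losses={"text": 1.0, "video": 2.0},
    targets={"text": 1.0, "video": 1.0},
)
```

With these example inputs the video share rises from 0.5 to 0.6, so the lagging path sees more of its data in the next training phase.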

Training Complexity for MoT Models

MoT models are harder to train than dense transformers because each path or expert must learn useful specialization without becoming isolated. The model also needs enough shared learning so that modalities can still work together.

Solutions

  • Such models need training for both specialized paths and a shared representation space. Each transformer learns what it is best at, such as text, video, speech, action generation or reasoning, while shared attention or shared layers keep the model connected across modalities. NVIDIA Cosmos is an open platform with training code, optimization tools, and recipes to build MoT-based foundation models.

Inference and Serving Complexity

Sparse architectures can be more complicated to deploy than dense models. Different paths may have different memory, latency, and batching needs, which can make inference harder to optimize.

Solutions

  • Deploy MoT models through optimized runtimes that automatically manage the different paths, keep latency predictable, and make the model easier to run at scale. NVIDIA NIM microservices and TensorRT optimizations package model paths into production-ready microservices with tuned inference, batching, memory management and GPU acceleration.
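
To illustrate why path-aware batching matters, the sketch below groups incoming requests by the path they need before forming batches. It is a generic illustration with hypothetical request IDs and path names; it does not describe how NVIDIA NIM or TensorRT schedule work internally.

```python
from collections import defaultdict

def batch_by_path(requests, max_batch_size):
    """Group (request_id, path) pairs by path, then split each group into batches."""
    by_path = defaultdict(list)
    for request_id, path in requests:
        by_path[path].append(request_id)

    batches = []
    for path, ids in by_path.items():
        for start in range(0, len(ids), max_batch_size):
            batches.append((path, ids[start:start + max_batch_size]))
    return batches

# Hypothetical incoming requests tagged with the path they need.
incoming = [
    ("r1", "autoregressive"),
    ("r2", "diffusion"),
    ("r3", "autoregressive"),
    ("r4", "autoregressive"),
]
batches = batch_by_path(incoming, max_batch_size=2)
```

Keeping batches homogeneous by path lets each batch use the memory and latency settings suited to that path, at the cost of some batches being smaller than the maximum.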

FAQs

Omni-models are models capable of understanding and generating multiple modalities, such as text, images, videos, audio, and actions. MoT is not a model but a model architecture for building omni-models.

A world model understands dynamics of the real world and can simulate how the world changes based on data like text, video, and action sequences. MoT enables world models by providing an architecture that understands these different data streams and generates future world states that are physically accurate, coherent across modalities, and useful for planning actions.

Both Mixture-of-Experts (MoE) and MoT are ways to make large AI models more efficient. Instead of activating the entire neural network for every input, they activate only the parts that are needed.

  • MoE works at the layer level inside a transformer model. In a standard transformer, every token passes through the same feed-forward layer. With MoE, that single layer is replaced by several specialized versions called experts. A small component called a router decides which 1–2 experts should process each token. This allows the model to have many more parameters and specialized capabilities while keeping compute costs low because most experts remain inactive for any given token.
  • MoT operates at a broader transformer-block level, primarily for multimodal models that integrate text, images, audio, and other data types. Instead of using a learned router, the routing is fixed based on the modality. The main purpose of MoT is not just efficiency but also reducing modality interference, meaning the model can process text, images, and audio separately so they do not negatively affect each other during learning.

The two approaches are complementary and can be combined. For example, a MoT architecture may have separate transformer paths for text and images, and each transformer block inside those paths can still use MoE layers internally.
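
The difference in routing can be sketched side by side. In the toy example below, `moe_route` picks the top-k experts from per-token router scores (which a learned router would produce), while `mot_route` is a fixed lookup by modality; the score values and the `MODALITY_TO_PATH` table are invented for illustration.

```python
def moe_route(router_scores, k=2):
    """MoE-style routing: choose the k experts with the highest router score."""
    ranked = sorted(range(len(router_scores)), key=lambda i: router_scores[i], reverse=True)
    return ranked[:k]

# Hypothetical fixed assignment of modalities to transformer paths.
MODALITY_TO_PATH = {"text": 0, "image": 1, "audio": 2}

def mot_route(modality):
    """MoT-style routing: the path is determined by the token's modality alone."""
    return MODALITY_TO_PATH[modality]

# The MoE choice depends on the token's content via its scores;
# the MoT choice depends only on what kind of token it is.
chosen_experts = moe_route([0.10, 0.70, 0.05, 0.15], k=2)
chosen_path = mot_route("image")
```

In a combined design, `mot_route` would first select the modality path, and `moe_route` could then select experts inside that path's feed-forward layers.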

Compared to a dense model of equivalent parameter count, MoT reduces active computation per request by using only the transformer components for the relevant modality. The model does less work per token than a dense model that runs every parameter for every input. This can reduce latency and improve throughput while maintaining the same model size.

The comparison changes if you measure against a dense model with fewer total parameters. In that case, both models activate similar amounts of compute per request, and latency would be roughly equivalent. In that scenario, MoT's advantage is quality rather than lower compute cost: specializing components by modality tends to produce better results than a single dense path of the same size.

In either case, realizing the latency benefit in production requires optimized serving, batching, and GPU scheduling. The architecture's efficiency potential does not translate into lower latency automatically without the right inference stack.

Next Steps to Get Started with Mixture-of-Transformers (MoT)

Ready to Get Started?

Learn how NVIDIA Cosmos 3 uses MoT for building world foundation models in the Cosmos 3 whitepaper.

NVIDIA Cosmos

An open platform for physical AI with world foundation models (WFMs), video data processing libraries, video evaluation, and post-training frameworks.

Cosmos Cookbook

A comprehensive guide for working with the NVIDIA Cosmos ecosystem for real-world, domain-specific applications across robotics, simulation, autonomous systems, and physical scene understanding.