What Is Disaggregated Serving?

Disaggregated serving processes the prefill and decode phases of AI inference on dedicated GPUs, enabling targeted optimization of compute-bound and memory-bound resources.

How Does Disaggregated Serving Work?

AI inference consists of two distinct phases: the context phase, known as “prefill,” and the generation phase, known as “decode,” each placing fundamentally different demands on AI infrastructure. The context phase is compute-bound, requiring high-throughput processing to ingest and analyze large volumes of input data and produce the first output token. In contrast, the generation phase is memory bandwidth-bound, relying on fast memory transfers and high-speed interconnects, such as NVIDIA NVLink™, to sustain token-by-token output performance.
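
To make the split concrete, here is a minimal, framework-free sketch of the two phases. The toy model and cache are stand-ins invented for illustration; a real prefill pass runs attention over the full prompt, and a real decode loop reads a growing KV cache at every step.

```python
# Minimal sketch of the two inference phases (toy stand-ins, not a real model).
# Prefill processes the whole prompt in one compute-heavy pass and builds the
# KV cache; decode then generates tokens one at a time, repeatedly reading and
# extending that cache (memory-bandwidth-bound in practice).
from typing import List, Tuple


def prefill(prompt_tokens: List[int]) -> Tuple[List[int], int]:
    """Compute-bound phase: process all prompt tokens at once,
    returning the KV cache and the first output token."""
    kv_cache = [t * 2 for t in prompt_tokens]   # stand-in for attention K/V states
    first_token = sum(kv_cache) % 100           # stand-in for sampling the first token
    return kv_cache, first_token


def decode(kv_cache: List[int], first_token: int, max_new_tokens: int) -> List[int]:
    """Memory-bound phase: generate one token per step, reusing and
    growing the KV cache on every iteration."""
    output = [first_token]
    for _ in range(max_new_tokens - 1):
        next_token = (output[-1] + len(kv_cache)) % 100
        kv_cache.append(next_token * 2)         # cache grows with each generated token
        output.append(next_token)
    return output


if __name__ == "__main__":
    cache, first = prefill(list(range(1024)))       # long prompt -> heavy prefill
    print(decode(cache, first, max_new_tokens=8))   # short, sequential decode loop
```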

What is the difference between aggregated and disaggregated serving?

Traditional aggregated serving relies on a single GPU or GPU type to handle both stages: the compute-heavy prefill and the memory-heavy decode. Because both phases run on the same hardware, resource bottlenecks often emerge. This monolithic design offers little flexibility: scaling means adding more of the same GPUs, which drives up costs without addressing the underlying inefficiencies.

Disaggregated serving takes a fundamentally different approach. Instead of forcing one GPU or GPU type to handle everything, hardware is specialized for each task. Compute-optimized GPUs accelerate prefill, while memory-optimized GPUs focus on decode. At the same time, inference software orchestrates scheduling and the transfer of the KV cache between the GPUs, allowing far greater scalability. The system architecture itself is heterogeneous, with tasks routed to the most suitable hardware. This division of labor enables higher efficiency, lower cost per query, and the flexibility to efficiently scale inference workloads as demand grows.
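
As a rough illustration of this division of labor, the hypothetical sketch below separates a prefill worker from a decode worker and hands the KV cache from one to the other. The class and method names are invented for the example and are not any framework’s API.

```python
# Hypothetical sketch of disaggregated orchestration: a compute-optimized
# prefill worker produces the KV cache, which is handed off to a
# memory-optimized decode worker. All names here are illustrative.
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class KVCache:
    request_id: str
    blocks: List[int]                    # stand-in for cached attention states


class PrefillWorker:
    def run(self, request_id: str, prompt_tokens: List[int]) -> KVCache:
        # Compute-heavy: ingest the whole prompt, emit the KV cache.
        return KVCache(request_id, [t * 2 for t in prompt_tokens])


class DecodeWorker:
    def __init__(self) -> None:
        self.resident_caches: Dict[str, KVCache] = {}

    def receive_cache(self, cache: KVCache) -> None:
        # In a real deployment this is a GPU-to-GPU transfer over NVLink/RDMA.
        self.resident_caches[cache.request_id] = cache

    def run(self, request_id: str, max_new_tokens: int) -> List[int]:
        cache = self.resident_caches[request_id]
        token = sum(cache.blocks) % 100
        output = []
        for _ in range(max_new_tokens):
            output.append(token)
            cache.blocks.append(token * 2)   # cache keeps growing during decode
            token = (token + 1) % 100
        return output


def serve(prompt_tokens: List[int]) -> List[int]:
    prefill_gpu, decode_gpu = PrefillWorker(), DecodeWorker()
    cache = prefill_gpu.run("req-1", prompt_tokens)
    decode_gpu.receive_cache(cache)          # KV cache handoff between GPU pools
    return decode_gpu.run("req-1", max_new_tokens=8)


print(serve(list(range(256))))
```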

Applications and Use Cases of Disaggregated Serving

Disaggregated serving is especially valuable in workloads that demand large context windows, high concurrency, or heavy memory usage. For long-context language models—such as those used in code generation and video understanding—context lengths can extend into millions of tokens. Running these models on a traditional setup results in low GPU utilization, but disaggregation enables them to operate efficiently, reducing overall inference costs.

Video and multimodal AI present another challenge, as tasks like video generation, summarization, and retrieval require enormous memory resources. Here, disaggregating prefill (the compute-heavy stage) from decode (the memory-bound stage) ensures throughput remains both high and consistent, even for the most demanding media workloads.

AI factories aiming to balance cost efficiency with high user interactivity increasingly serve models at mid-latency, mid-concurrency operating points. Disaggregated serving is a strong fit here, delivering higher throughput in moderate- to high-concurrency scenarios where responsiveness matters but ultra-low latency isn’t essential. By decoupling compute-heavy prefill from memory-intensive decode, the architecture raises throughput without increasing latency while keeping infrastructure costs under control.

Code Generation

Code generation models often need to operate over very large codebases: when asked to refactor, debug, or extend existing code, they must ingest a large number of prior tokens, making the prefill phase especially heavy. Because prefill and decode have such different resource needs, code generation benefits greatly from disaggregated serving, which separates the two stages to improve efficiency and reduce latency.

Video Generation

Video generation models often need to process extensive input sequences—like detailed prompts, storyboards, or multimodal input—before producing even a single frame. This puts pressure on the prefill phase, which is compute intensive. Since the prefill (context encoding) and decode (frame generation) phases have different resource requirements, video generation models benefit from disaggregated serving.

Deep Research

Deep research in AI uses many agents, each powered by different models, working together through multiple rounds of interaction to solve complex problems. With each round, they build on previous results and share more information, making the context larger over time. Disaggregated serving helps manage this by separating the compute-heavy prefill processing from decode, allowing systems to scale more efficiently while maintaining high throughput.

Balanced Serving

Serving LLMs at scale requires balancing throughput (how many requests can be processed) with latency (how fast each response is returned). Hitting this balance is key to meeting service-level objectives (SLOs). Too much focus on throughput can cause delays, while prioritizing latency alone can waste resources. Disaggregated serving is especially effective in this middle ground, where both throughput and latency matter.
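
One way to reason about that balance is to track goodput: the rate of requests that meet both a time-to-first-token (TTFT) target, dominated by prefill, and an inter-token-latency (ITL) target, dominated by decode. The sketch below uses illustrative metric names and example SLO thresholds, not recommended values.

```python
# Illustrative goodput calculation: count only the requests that satisfy
# both latency SLOs. Thresholds are example values, not recommendations.
from dataclasses import dataclass
from typing import List


@dataclass
class RequestMetrics:
    ttft_ms: float   # time to first token (dominated by prefill)
    itl_ms: float    # average inter-token latency (dominated by decode)


def goodput(requests: List[RequestMetrics],
            ttft_slo_ms: float = 500.0,
            itl_slo_ms: float = 50.0,
            window_s: float = 60.0) -> float:
    """Requests per second that meet both the TTFT and ITL targets."""
    ok = sum(1 for r in requests
             if r.ttft_ms <= ttft_slo_ms and r.itl_ms <= itl_slo_ms)
    return ok / window_s


metrics = [RequestMetrics(420, 38), RequestMetrics(610, 35), RequestMetrics(480, 55)]
print(f"goodput: {goodput(metrics):.3f} req/s within SLO")   # only the first request qualifies
```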

What Are the Benefits of Disaggregated Serving?

Return on Investment

Delivers up to 15x ROI, allowing AI factories to serve more user requests per dollar invested.

Performance

Increases throughput by up to 15x without trading off latency, by optimizing prefill and decode separately.

Efficiency

Allocates the right type and number of GPUs to each inference phase, minimizing resource underutilization.

Scalability

Autoscales GPUs up or down, responding to changing input/output sequence lengths and SLA demands.
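
As a rough sketch of that idea, a scaler might size the prefill pool from incoming prompt lengths and the decode pool from output lengths and request rate. The per-GPU token-throughput constants below are placeholder assumptions for the example, not measured figures.

```python
# Illustrative autoscaling heuristic: longer inputs shift capacity toward the
# prefill pool, longer outputs and higher request rates toward the decode pool.
# Capacity constants are placeholder assumptions for the sketch.
from typing import Dict


def plan_gpu_counts(avg_input_tokens: float,
                    avg_output_tokens: float,
                    requests_per_s: float,
                    prefill_tokens_per_gpu_s: float = 50_000.0,
                    decode_tokens_per_gpu_s: float = 5_000.0) -> Dict[str, int]:
    """Estimate how many prefill and decode GPUs the current load needs."""
    prefill_load = avg_input_tokens * requests_per_s    # tokens/s to ingest
    decode_load = avg_output_tokens * requests_per_s    # tokens/s to generate
    return {
        "prefill_gpus": max(1, round(prefill_load / prefill_tokens_per_gpu_s)),
        "decode_gpus": max(1, round(decode_load / decode_tokens_per_gpu_s)),
    }


print(plan_gpu_counts(avg_input_tokens=8_000, avg_output_tokens=300, requests_per_s=20))
```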

Challenges and Solutions

Deploying disaggregated serving across multiple nodes is complex, requiring tight coordination across several layers of the stack. When separating context from generation, the system must support low-latency data transfer to efficiently move intermediate data—like the KV cache—between GPUs. It also needs an LLM-aware routing engine that can track where this data lives across large GPU clusters and route incoming requests intelligently to avoid redundant computation. Finally, a memory management layer is essential to offload unused data to more cost-effective storage and retrieve it when needed. These components are critical to making disaggregated serving practical and performant at scale.
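
As a hypothetical sketch of the routing piece, the function below sends a request to the decode worker that already holds the longest matching prompt prefix in its KV cache, falling back to the least-loaded worker. The names and scoring are invented for the example and do not reflect any particular routing engine.

```python
# Hypothetical KV-cache-aware routing: prefer the worker with the longest
# cached prompt prefix (to avoid recomputation), break ties by current load.
from typing import Dict, List, Tuple


def prefix_overlap(a: Tuple[int, ...], b: Tuple[int, ...]) -> int:
    """Length of the shared prefix between two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def route(prompt: Tuple[int, ...],
          cached_prefixes: Dict[str, List[Tuple[int, ...]]],
          load: Dict[str, int]) -> str:
    """Pick a worker id: maximize cache reuse, then minimize load."""
    best_worker, best_score = None, -1
    for worker, prefixes in cached_prefixes.items():
        score = max((prefix_overlap(prompt, p) for p in prefixes), default=0)
        better_tie = score == best_score and load[worker] < load.get(best_worker, 1 << 30)
        if score > best_score or better_tie:
            best_worker, best_score = worker, score
    return best_worker


caches = {"gpu-0": [(1, 2, 3, 4)], "gpu-1": [(9, 9)]}
print(route((1, 2, 3, 7), caches, load={"gpu-0": 3, "gpu-1": 1}))   # -> gpu-0 (cache hit)
```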

Solutions like NVIDIA Dynamo—a distributed inference serving framework built for data-center-scale deployments—simplify and automate the complexities of disaggregated serving architectures. Dynamo manages fast KV cache transfers between prefill and decode GPUs, intelligently routes requests to the right decode GPUs holding the relevant cache, and enables the system to scale to thousands of GPUs when user demand exceeds available capacity.

Next Steps

NVIDIA Dynamo

Scale your large language model deployments using disaggregated serving with NVIDIA Dynamo.

NVIDIA AI Inference

Learn about the NVIDIA Inference Platform, including NVIDIA Dynamo and NVIDIA TensorRT™-LLM, for a full-stack approach to disaggregated serving at scale.
