Disaggregated serving processes the prefill and decode phases of AI inference on dedicated GPUs, enabling targeted optimization of compute-bound and memory-bound resources.
AI inference consists of two distinct phases: the context phase, known as “prefill,” and the generation phase, known as “decode,” each placing fundamentally different demands on AI infrastructure. The context phase is compute-bound, requiring high-throughput processing to ingest and analyze large volumes of input data and produce the first output token. In contrast, the generation phase is memory bandwidth-bound, relying on fast memory transfers and high-speed interconnects, such as NVIDIA NVLink™, to sustain token-by-token output performance.
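To make the distinction concrete, here is a minimal, illustrative sketch of single-head attention with a KV cache. The model, shapes, and numbers are assumptions chosen only to show why prefill looks like one large batched matrix multiply over the whole prompt, while decode repeatedly re-reads a growing cache one token at a time.

```python
# Toy, single-head sketch (illustrative assumptions, not a real LLM) of why
# prefill is compute-bound and decode is memory-bandwidth-bound.
import numpy as np

d_model = 64          # hidden size (assumed)
prompt_len = 512      # tokens ingested during prefill (assumed)
gen_len = 4           # tokens produced during decode (assumed)

rng = np.random.default_rng(0)
W_q = rng.standard_normal((d_model, d_model))
W_k = rng.standard_normal((d_model, d_model))
W_v = rng.standard_normal((d_model, d_model))

# --- Prefill (compute-bound): one large batched pass over every prompt token.
prompt = rng.standard_normal((prompt_len, d_model))
K_cache = prompt @ W_k                     # large matmuls dominate the cost
V_cache = prompt @ W_v
Q = prompt @ W_q
attn = (Q @ K_cache.T) / np.sqrt(d_model)  # softmax/output projection omitted

# --- Decode (memory-bound): one token per step, but the entire KV cache must
# be read from memory at every step, so bandwidth, not FLOPs, is the limit.
for _ in range(gen_len):
    x = rng.standard_normal((1, d_model))       # stand-in for the newest token
    K_cache = np.vstack([K_cache, x @ W_k])     # cache grows each step
    V_cache = np.vstack([V_cache, x @ W_v])
    scores = (x @ W_q) @ K_cache.T / np.sqrt(d_model)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    _ = weights @ V_cache                        # small matmul, large memory read

print("KV cache rows after decode:", K_cache.shape[0])
```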
Traditional aggregated serving relies on a single GPU or GPU type to handle both stages: the compute-heavy prefill and the memory-heavy decode. Because both phases run on the same hardware, resource bottlenecks often emerge. This monolithic design offers little flexibility; scaling means simply adding more of the same GPUs, which drives up costs without addressing the underlying inefficiencies.
Disaggregated serving takes a fundamentally different approach. Instead of forcing one GPU or GPU type to handle everything, it assigns specialized hardware to each task: compute-optimized GPUs accelerate prefill, while memory-optimized GPUs focus on decode. Inference software orchestrates scheduling and the transfer of the KV cache between the two GPU pools, enabling far greater scalability. The system architecture itself is heterogeneous, with tasks routed to the most suitable hardware. This division of labor delivers higher efficiency, lower cost per query, and the flexibility to scale inference workloads efficiently as demand grows.
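The sketch below illustrates that hand-off at a conceptual level; it is not any specific framework’s API. A hypothetical prefill worker ingests the prompt and produces a stand-in “KV cache” plus the first token, a transfer step represents the fast GPU-to-GPU copy, and a separate decode worker continues token-by-token generation from that cache.

```python
# Conceptual sketch (all names and logic are illustrative assumptions) of the
# disaggregated flow: prefill worker -> KV cache transfer -> decode worker.
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt_tokens: list[int]
    max_new_tokens: int
    kv_cache: list[int] = field(default_factory=list)     # stand-in for real tensors
    output_tokens: list[int] = field(default_factory=list)

def prefill_worker(req: Request) -> Request:
    """Runs on the compute-optimized pool: ingest the full prompt in one pass."""
    req.kv_cache = list(req.prompt_tokens)                 # 'KV cache' for every prompt token
    req.output_tokens.append(sum(req.prompt_tokens) % 100) # toy 'first token'
    return req

def transfer_kv(req: Request) -> Request:
    """In a real deployment this is a fast GPU-to-GPU copy (e.g., NVLink/RDMA)."""
    return req                                             # simulated in-process hand-off

def decode_worker(req: Request) -> Request:
    """Runs on the memory-optimized pool: token-by-token generation."""
    while len(req.output_tokens) < req.max_new_tokens:
        nxt = (req.output_tokens[-1] + len(req.kv_cache)) % 100  # toy next token
        req.output_tokens.append(nxt)
        req.kv_cache.append(nxt)                           # cache grows each step
    return req

req = decode_worker(transfer_kv(prefill_worker(Request([3, 14, 15, 92], max_new_tokens=5))))
print(req.output_tokens)
```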
Disaggregated serving is especially valuable in workloads that demand large context windows, high concurrency, or heavy memory usage. For long-context language models—such as those used in code generation and video understanding—context lengths can extend into millions of tokens. Running these models on a traditional setup results in low GPU utilization, but disaggregation enables them to operate efficiently, reducing overall inference costs.
Video and multimodal AI present another challenge, as tasks like video generation, summarization, and retrieval require enormous memory resources. Here, disaggregating prefill (the compute-heavy stage) from decode (the memory-bound stage) ensures throughput remains both high and consistent, even for the most demanding media workloads.
AI factories aiming to balance cost efficiency with high user interactivity are increasingly serving models at mid-latency, mid-concurrency operating points. Disaggregated serving is a strong fit for this regime, and for applications where responsiveness is important but ultra-low latency isn’t essential. By decoupling compute-heavy prefill from memory-intensive decode operations, the architecture delivers greater throughput in moderate- to high-concurrency scenarios without increasing latency, while keeping infrastructure costs under control.
Code generation models often need to operate over very large codebases: when asked to refactor, debug, or extend existing code, they must ingest a large number of prior tokens, so their prefill phase can become very heavy. Since the prefill and decode phases have different resource needs, code generation benefits greatly from disaggregated serving, which separates these stages to improve efficiency and reduce latency.
Video generation models often need to process extensive input sequences—like detailed prompts, storyboards, or multimodal input—before producing even a single frame. This puts pressure on the prefill phase, which is compute intensive. Since the prefill (context encoding) and decode (frame generation) phases have different resource requirements, video generation models benefit from disaggregated serving.
Deep research in AI uses many agents, each powered by different models, working together through multiple rounds of interaction to solve complex problems. With each round, they build on previous results and share more information, making the context larger over time. Disaggregated serving helps manage this by separating the compute-heavy prefill processing from decode, allowing systems to scale more efficiently while maintaining high throughput.
Serving LLMs at scale requires balancing throughput (how many requests can be processed) with latency (how fast each response is returned). Hitting this balance is key to meeting service-level objectives (SLOs). Too much focus on throughput can cause delays, while prioritizing latency alone can waste resources. Disaggregated serving is especially effective in this middle ground, where both throughput and latency matter.
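A back-of-the-envelope way to see the balance uses the common approximation that per-request latency is roughly time-to-first-token plus (output tokens minus one) times inter-token latency, while throughput is total tokens served per second across concurrent requests. All numbers below are illustrative assumptions, not measurements.

```python
# Worked example (assumed figures) of the throughput/latency/SLO framing.
ttft_s = 0.4            # time to first token, prefill-dominated (assumed)
itl_s = 0.03            # inter-token latency, decode-dominated (assumed)
output_tokens = 200
concurrency = 64        # simultaneous requests per replica (assumed)

request_latency_s = ttft_s + (output_tokens - 1) * itl_s
tokens_per_second = concurrency * output_tokens / request_latency_s

slo_latency_s = 8.0     # example service-level objective
print(f"latency {request_latency_s:.1f}s, throughput {tokens_per_second:.0f} tok/s, "
      f"meets SLO: {request_latency_s <= slo_latency_s}")
```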
Delivers up to 15x ROI, allowing AI factories to serve more user requests per dollar invested.
Increases throughput by up to 15x without trading off latency by optimizing prefill and decode separately.
Allocates the right type and number of GPUs to each inference phase, minimizing resource underutilization.
Autoscales GPUs up or down, responding to changing input/output sequence lengths and SLA demands, as illustrated in the sizing sketch below.
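The sizing sketch referenced above is illustrative only (not a real autoscaler): it shows how the prefill and decode pools can be scaled independently from simple load signals. The capacity figures and the time-to-first-token budget are assumptions.

```python
# Illustrative autoscaling heuristic: size prefill and decode pools separately.
import math

def desired_replicas(pending_prefill_tokens: int,
                     active_decode_sequences: int,
                     prefill_tokens_per_gpu_s: int = 50_000,   # assumed capacity
                     decode_seqs_per_gpu: int = 128,           # assumed KV-cache headroom
                     ttft_budget_s: float = 1.0) -> tuple[int, int]:
    # Prefill pool sized so queued prompt tokens clear within the TTFT budget.
    prefill_gpus = max(1, math.ceil(pending_prefill_tokens /
                                    (prefill_tokens_per_gpu_s * ttft_budget_s)))
    # Decode pool sized by how many concurrent sequences fit per GPU's memory.
    decode_gpus = max(1, math.ceil(active_decode_sequences / decode_seqs_per_gpu))
    return prefill_gpus, decode_gpus

print(desired_replicas(pending_prefill_tokens=400_000, active_decode_sequences=900))
# -> (8, 8) with the assumed capacities above
```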
Deploying disaggregated serving across multiple nodes is complex, requiring tight coordination across several layers of the stack. When separating context from generation, the system must support low-latency data transfer to efficiently move intermediate data—like the KV cache—between GPUs. It also needs an LLM-aware routing engine that can track where this data lives across large GPU clusters and route incoming requests intelligently to avoid redundant computation. Finally, a memory management layer is essential to offload unused data to more cost-effective storage and retrieve it when needed. These components are critical to making disaggregated serving practical and performant at scale.
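The routing piece can be sketched as follows. This is a simplified, hypothetical KV-cache-aware router, not the design of any particular product: it hashes prompt prefixes in fixed-size blocks and prefers the worker that already holds the most matching blocks, falling back to the least loaded worker, so repeated prefixes avoid redundant prefill computation.

```python
# Minimal sketch of KV-cache-aware routing (hypothetical design, assumed block size).
import hashlib
from collections import defaultdict

BLOCK = 64  # tokens per cache block (assumed)

def block_hashes(tokens: list[int]) -> list[str]:
    """Hash each cumulative block of the prompt so prefixes can be matched."""
    return [hashlib.sha256(repr(tokens[:end]).encode()).hexdigest()
            for end in range(BLOCK, len(tokens) + 1, BLOCK)]

class Router:
    def __init__(self, workers: list[str]):
        self.load = {w: 0 for w in workers}
        self.cache_index = defaultdict(set)   # worker -> set of cached block hashes

    def route(self, prompt_tokens: list[int]) -> str:
        hashes = block_hashes(prompt_tokens)
        # Prefer the worker with the most already-cached prefix blocks,
        # breaking ties by choosing the least loaded worker.
        best = max(self.load, key=lambda w: (sum(h in self.cache_index[w] for h in hashes),
                                             -self.load[w]))
        self.cache_index[best].update(hashes)  # that worker now holds these blocks
        self.load[best] += 1
        return best

r = Router(["decode-0", "decode-1"])
print(r.route(list(range(256))))   # cold request: ties broken by load
print(r.route(list(range(256))))   # warm request: same prefix routes to the same worker
```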
Solutions like NVIDIA Dynamo—a distributed inference serving framework built for data-center-scale deployments—simplify and automate the complexities of disaggregated serving architectures. Dynamo manages fast KV cache transfers between prefill and decode GPUs, intelligently routes requests to the right decode GPUs holding the relevant cache, and enables the system to scale to thousands of GPUs when user demand exceeds available capacity.
NVIDIA Dynamo
Scale your large language model deployments using disaggregated serving with NVIDIA Dynamo.
NVIDIA AI Inference
Learn about the NVIDIA Inference Platform, including NVIDIA Dynamo and NVIDIA TensorRT™-LLM, for a full-stack approach to disaggregated serving at scale.