Baseten’s mission is to power the world’s AI-driven applications. As AI models have grown in size and complexity, especially with the rise of AI reasoning capabilities, Baseten adopted NVIDIA’s latest data center GPU architecture, NVIDIA Blackwell, on Google Cloud along with the NVIDIA Dynamo inference framework and NVIDIA TensorRT™-LLM to help its customers scale quickly and meet the growing demand for AI.
5× Higher Throughput for High-Traffic Endpoints
2× Better Price-Performance Serving Frontier Reasoning Models
Up to 38% Faster LLM Serving for Improved User Experience & Adoption
With AI model sizes increasing rapidly, and with new reasoning tasks requiring longer AI inference times due to the generation of “thinking” tokens, the demand for more cost-efficient compute performance and multinode inference serving has never been higher. To meet this challenge, Baseten turned to NVIDIA Blackwell GPUs, unlocking a new wave of performance and efficiency.
Founded in 2019, Baseten brings together GPUs from more than 10 cloud providers across dozens of global regions, creating a unified scalable GPU pool that supports the demanding AI workloads of some of the world’s fastest-growing AI companies.
To make this possible, Baseten built a sophisticated software orchestration layer that abstracts away the complexities of infrastructure management and the latency variations that arise from the geographic diversity of GPU instances in the cloud. The system is enabled by the NVIDIA CUDA platform, the parallel computing architecture that provides the software foundation for GPUs to efficiently execute AI workloads. It breaks down the silos between GPU clusters across different providers and regions, turning them into a single unified GPU pool: GPU nodes, no matter where they reside, become completely fungible and seamless for end users.
As a result, Baseten developed multi-cloud capacity management (MCM), which can provision thousands of GPUs in under five minutes by drawing from its global pool of compute resources across different cloud service providers.
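To make the unified-pool idea concrete, here is a minimal conceptual sketch of drawing GPU capacity from multiple clouds at once. The provider names, capacities, and the greedy lowest-latency-first scoring rule are all hypothetical illustrations, not Baseten's actual MCM implementation.

# Conceptual sketch only: filling a GPU capacity request from a unified
# multi-cloud pool. Providers, regions, and the scoring rule are hypothetical.
from dataclasses import dataclass

@dataclass
class GpuPool:
    provider: str      # a cloud provider
    region: str        # a region within that provider
    gpu_type: str      # e.g., "B200"
    available: int     # GPUs currently free in this pool
    latency_ms: float  # measured latency from the target user region

def provision(pools: list[GpuPool], gpu_type: str, needed: int) -> list[tuple[GpuPool, int]]:
    """Greedily fill a capacity request from the lowest-latency matching pools first."""
    plan = []
    for pool in sorted(pools, key=lambda p: p.latency_ms):
        if pool.gpu_type != gpu_type or needed == 0:
            continue
        take = min(pool.available, needed)
        if take:
            plan.append((pool, take))
            needed -= take
    if needed:
        raise RuntimeError(f"insufficient capacity: {needed} GPUs short")
    return plan

if __name__ == "__main__":
    pools = [
        GpuPool("cloud-a", "us-central", "B200", 512, 12.0),
        GpuPool("cloud-b", "eu-west", "B200", 1024, 28.0),
        GpuPool("cloud-c", "us-east", "B200", 2048, 18.0),
    ]
    for pool, count in provision(pools, "B200", 2000):
        print(f"{count} x B200 from {pool.provider}/{pool.region}")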
Delivering real-time, production-grade inference for state-of-the-art large language models, which demand exponentially more memory, compute, and support for massive context windows, requires a new approach: one that can efficiently manage the “think time” compute and intricate reasoning processes inherent to today’s most sophisticated AI workloads while maintaining speed, scalability, and cost-efficiency. Recognizing these demands, Baseten became the first company to adopt A4 VMs with NVIDIA Blackwell GPUs on Google Cloud to meet the scale and complexity of modern AI inference.
At the heart of Baseten’s new GPU cluster is NVIDIA Blackwell, NVIDIA’s most powerful GPU architecture yet. It features fifth-generation Tensor Cores, an ultra-low-latency NVIDIA NVLink™ fabric, and FP4 and FP6 precision support. With 208 billion transistors, more than 2.5x the transistor count of NVIDIA Hopper™ GPUs, and built on TSMC’s 4NP process tailored for NVIDIA, Blackwell is engineered to drive breakthroughs in reasoning, generative content, and real-time intelligence.
Before the move to NVIDIA HGX™ B200, Baseten had to make difficult tradeoffs between user latency and inference costs when serving large reasoning models like DeepSeek-R1. The company also faced challenges when serving Llama 4 Scout models due to their larger 10-million-token context windows that required massive amounts of GPU memory. The move to NVIDIA Blackwell allowed Baseten to serve these models while balancing inference cost, latency, and other tradeoffs, all while leveraging their full context window and intelligence capabilities.
Baseten is now able to serve four of the most popular open-source models—DeepSeek-V3, DeepSeek-R1, gpt-oss, and Llama 4 Maverick—directly on its model APIs, delivering over 225% better cost performance for high-throughput inference, and 25% better cost performance for latency-sensitive inference. In addition to model APIs, Baseten also provides B200-powered dedicated deployments for customers seeking to run their own custom LLMs with the same reliability and efficiency.
By combining the architectural innovations of NVIDIA Blackwell with Google Cloud’s AI Hypercomputer architecture, Baseten benefits from a tightly integrated stack of performance-optimized hardware, high-speed networking, and flexible consumption models to deliver the scale, availability, and cost-efficiency for AI in the enterprise.
Benchmarks show marked improvement in throughput for Blackwell vs. H200 GPUs on both Llama and DeepSeek models
Baseten’s approach to achieving peak inference performance is rooted in coupling the latest accelerated compute hardware with the most advanced software to extract maximum utilization from every chip. When it came time to deploy OpenAI’s latest gpt-oss-120b reasoning model, Baseten leveraged NVIDIA’s open-source inference stack, including NVIDIA Dynamo and TensorRT-LLM, to serve the model on the NVIDIA HGX B200 platform. This strategic choice enabled Baseten to achieve top performance rankings on a leading public LLM endpoint benchmarking platform on the model’s launch day.
At the core of this success was Baseten’s integration of NVIDIA Dynamo into its serving architecture. Dynamo is a low-latency distributed inference serving platform that supports advanced inference optimization techniques such as disaggregated serving, LLM-aware routing, and KV cache offloading to storage. Baseten paired it with NVIDIA TensorRT-LLM, an easy-to-use Python API that contains state-of-the-art model optimizations to perform inference efficiently on NVIDIA GPUs, which it used to compile the model.
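For readers unfamiliar with TensorRT-LLM’s high-level Python API, a minimal sketch following its public quick-start pattern is shown below. The model identifier, prompt, and sampling settings are illustrative placeholders, not Baseten’s production configuration.

# Minimal sketch of the TensorRT-LLM high-level (LLM) Python API, following its
# public quick-start pattern. Model name and sampling settings are placeholders.
from tensorrt_llm import LLM, SamplingParams

prompts = [
    "Explain the benefit of disaggregated prefill and decode in one sentence.",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Loading a Hugging Face checkpoint; TensorRT-LLM optimizes it for the local GPUs.
# Serving a model of this size requires appropriate multi-GPU hardware.
llm = LLM(model="openai/gpt-oss-120b")

for output in llm.generate(prompts, sampling_params):
    print(output.outputs[0].text)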
Beyond gpt-oss-120b, Baseten also uses Dynamo to serve other frontier reasoning models like DeepSeek-R1 and Llama 4 on Blackwell GPUs via public endpoints. This allowed Baseten to dramatically lower latency, increase throughput, and construct an entirely new cost-performance curve when serving frontier models at scale. Thanks to the openness of NVIDIA Dynamo and its support for different inference backends, Baseten was also able to incorporate inference optimizations from other open-source inference engines, such as SGLang, to run models at peak performance.
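To make “LLM-aware routing” concrete, the standalone sketch below illustrates the general idea of KV-cache-aware routing: send each request to the worker that already holds the longest matching token prefix in its KV cache, so less prefill work is recomputed. This is a conceptual illustration only and does not use NVIDIA Dynamo’s actual APIs or data structures.

# Conceptual illustration of KV-cache-aware (LLM-aware) routing. Not Dynamo's API.
from dataclasses import dataclass, field

@dataclass
class Worker:
    name: str
    cached_prefixes: list[tuple[int, ...]] = field(default_factory=list)
    queue_depth: int = 0  # outstanding requests, used as a tie-breaker

def matched_prefix_len(prompt: tuple[int, ...], prefix: tuple[int, ...]) -> int:
    """Number of leading tokens shared by the prompt and a cached prefix."""
    n = 0
    for a, b in zip(prompt, prefix):
        if a != b:
            break
        n += 1
    return n

def route(prompt_tokens: tuple[int, ...], workers: list[Worker]) -> Worker:
    """Prefer the worker with the most reusable KV cache, then the shortest queue."""
    def score(w: Worker) -> tuple[int, int]:
        best = max((matched_prefix_len(prompt_tokens, p) for p in w.cached_prefixes), default=0)
        return (best, -w.queue_depth)
    return max(workers, key=score)

if __name__ == "__main__":
    workers = [
        Worker("decode-0", cached_prefixes=[(1, 2, 3, 4)], queue_depth=2),
        Worker("decode-1", cached_prefixes=[(1, 2)], queue_depth=0),
    ]
    print(route((1, 2, 3, 4, 5), workers).name)  # -> decode-0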
Baseten further relies on TensorRT-LLM to optimize and compile custom LLMs, including for one of its largest and fastest-growing AI customers, Writer. These efforts have boosted throughput by over 60% for Writer’s Palmyra LLMs. The flexibility of TensorRT-LLM also enabled Baseten to extend its capabilities by developing a custom model builder that speeds up model compilation.
“Cost-effective scaling of reasoning mixture of expert models demands innovative inference techniques, such as disaggregated serving and context-aware routing. Baseten delivers industry-leading inference performance when running DeepSeek-R1 and Llama 4 on NVIDIA Blackwell accelerated by NVIDIA Dynamo, which is now live in production. The fifth-generation Blackwell Tensor cores, combined with low-latency NVLink bandwidth and NVIDIA Dynamo large-scale distributed inference optimizations, create a compounding effect allowing us to set new benchmarks for both throughput and latency.”
Pankaj Gupta,
Cofounder of Baseten
Baseten is accelerating its mission to deliver the world’s most advanced inference platform for mission-critical AI. Its Inference Stack is what makes every model on Baseten so fast, reliable, and cost-efficient. Baseten will continue to expand globally, bringing NVIDIA’s latest accelerated compute infrastructure and inference software closer to customers through region-aware deployments and local support.
In addition, Baseten will continue its tradition of contributing to open-source inference engines and frameworks by upstreaming its inference software optimizations, allowing others to benefit from Baseten’s work and creating a virtuous cycle for the AI community at large.
Deploy NVIDIA Dynamo, contribute to its growth, and seamlessly integrate it into your existing stack.