Cloud Services

Streamlined AI Inference Infrastructure in the Cloud


Baseten leverages NVIDIA GPUs and NVIDIA® TensorRT™-LLM to provide machine learning infrastructure that’s high-performance, scalable, and cost-effective.





Use Case

Generative AI / LLMs

Products

NVIDIA A100 Tensor Core GPU
NVIDIA A10 Tensor Core GPU

Baseten’s AI Inference Infrastructure

Baseten’s mission is simple: provide machine learning (ML) infrastructure that just works.

With Baseten, organizations have what they need to deploy and serve ML models performantly, scalably, and cost-effectively for real-time applications. Customers can come to Baseten with their own models or choose from a variety of pretrained models and deploy them in production, served on Baseten’s open-source Truss framework and managed on an easy-to-use dashboard.

Leveraging NVIDIA GPU-accelerated instances on AWS, such as Amazon EC2 P4d instances powered by NVIDIA A100 Tensor Core GPUs, and optimized NVIDIA software, such as NVIDIA TensorRT-LLM, Baseten can deliver on their mission from the cloud.

Image courtesy of Baseten

Inference Deployment Challenges

Baseten tackles several model-deployment challenges faced by their customers, specifically around scalability, cost efficiency, and expertise.

Scalability: Handling AI infrastructure that serves varying levels of demand, from sporadic individual requests to thousands of high-traffic requests, is a big challenge. The underlying infrastructure must be both dynamic and responsive, adapting to real-time demands without causing delays or needing manual oversight.

Cost Efficiency: Maximizing the utilization of the underlying NVIDIA GPUs is critical. AI inference infrastructure needs to deliver high performance without incurring unnecessary expenses during low- and high-traffic scenarios.

Expertise: The deployment of ML models requires specialized skills and a deep understanding of the underlying infrastructure. This expertise can be scarce and costly to acquire, presenting a challenge for organizations to maintain cutting-edge inference capabilities without a significant investment in skilled personnel.

Baseten Powered by NVIDIA on AWS

Baseten offers optimized inference infrastructure powered by NVIDIA’s hardware and software to help solve the challenges of deployment scalability, cost efficiency, and expertise.

With automatic scaling, Baseten lets customers dynamically adjust the number of model replicas based on incoming traffic and service-level agreements, ensuring that capacity meets demand without manual intervention. This also optimizes cost: deployments scale down to zero when there's no activity, so idle models cost customers nothing. Once a request arrives, Baseten's infrastructure, running on NVIDIA A100 Tensor Core GPU-powered Amazon EC2 instances, takes only 5–10 seconds to get the model up and running. Cold starts previously took up to five minutes, so this is a speedup of 30–60X. Customers can also choose from a variety of NVIDIA GPUs available on Baseten to accelerate their model inference, including but not limited to NVIDIA A100, A10G, T4, and V100 Tensor Core GPUs.
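The scale-to-zero behavior described above can be sketched as a simple replica-sizing rule. This is an illustrative model only; the function name, parameters, and capacity figures are hypothetical, not Baseten's actual autoscaler:

```python
import math

def desired_replicas(requests_per_sec: float,
                     capacity_per_replica: float,
                     max_replicas: int) -> int:
    """Illustrative autoscaling rule: scale to zero when idle,
    otherwise run just enough replicas to absorb current traffic."""
    if requests_per_sec <= 0:
        return 0  # scale to zero: an idle deployment costs nothing
    needed = math.ceil(requests_per_sec / capacity_per_replica)
    return min(max_replicas, needed)  # respect the configured ceiling

# Example: no traffic -> 0 replicas; 25 req/s at 10 req/s per replica -> 3
print(desired_replicas(0, 10, 8))   # 0
print(desired_replicas(25, 10, 8))  # 3
```

A real autoscaler would also smooth traffic over a time window and honor latency targets from the service-level agreement, but the core idea is the same: capacity tracks demand in both directions.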

On top of NVIDIA hardware, Baseten leverages optimized NVIDIA software. Using TensorRT-LLM's tensor parallelism on AWS, Baseten boosted inference performance for a customer's LLM deployment by 2X through their open-source Truss framework. Truss is Baseten's open-source packaging and deployment library, which lets users deploy models in production with ease.
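Truss packages a model as a Python class with separate load and serve hooks, so the framework can initialize weights once per replica and then handle requests. The sketch below assumes that general shape; the toy "model" (an upper-casing function) is a stand-in for a real checkpoint, not anything Baseten ships:

```python
class Model:
    """Minimal sketch of a Truss-style model class (illustrative)."""

    def __init__(self, **kwargs):
        self._model = None  # populated in load(), not at construction time

    def load(self):
        # Called once when a replica starts; load weights/tokenizers here
        # so per-request latency stays low. A stand-in replaces a real model.
        self._model = lambda text: text.upper()

    def predict(self, model_input):
        # Called per request with the deserialized request payload.
        return {"output": self._model(model_input["text"])}

# Simulate the serving lifecycle: construct, load once, then predict.
m = Model()
m.load()
print(m.predict({"text": "hello"}))  # {'output': 'HELLO'}
```

Separating `load` from `predict` is what makes the fast cold starts described above possible: the expensive one-time work is isolated, and each request pays only for inference.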

TensorRT-LLM is included as a part of NVIDIA AI Enterprise, which provides a production-grade, secure, end-to-end software platform for enterprises building and deploying accelerated AI software.

NVIDIA’s full-stack AI inference approach plays a crucial role in meeting the stringent demands of Baseten’s customers’ real-time applications. With NVIDIA A100 GPUs and TensorRT-LLM optimizations, the underlying infrastructure unlocks both performance gains and cost savings for developers.

Explore more about Baseten by watching a quick demo of their product.

NVIDIA Inception Program

Baseten is a member of NVIDIA Inception, a free program that nurtures startups revolutionizing industries with technological advancements. As a benefit of Inception, Baseten gained early access to TensorRT-LLM, presenting a significant opportunity to develop and deliver high-performance solutions.

What Is NVIDIA Inception?

  • NVIDIA Inception is a free program designed to help startups evolve faster through cutting-edge technology, opportunities to connect with venture capitalists, and access to the latest technical resources from NVIDIA.

NVIDIA Inception Program Benefits

  • Unlike traditional accelerators, NVIDIA Inception supports all stages of a startup’s life cycle. We work closely with members to provide the best technical tools, latest resources, and opportunities to connect with investors.

Join NVIDIA Inception’s global network of over 15,000 technology startups.