Cloud Services

Accelerating Large Language Model Inference with NVIDIA in the Cloud

Objective

Perplexity aims to make it easy for developers to integrate cutting-edge, open-source large language models (LLMs) into their projects with pplx-api, an efficient API powered by NVIDIA GPUs and optimized for fast inference with NVIDIA® TensorRT™-LLM.

Customer

Perplexity

Partner

AWS

Use Case

Generative AI / LLMs

Products

NVIDIA TensorRT-LLM
NVIDIA H100 Tensor Core GPUs
NVIDIA A100 Tensor Core GPUs

Perplexity’s Fast and Efficient API

Delivering fast and efficient LLM inference is critical for real-time applications. 

Perplexity offers pplx-api, an API designed to access popular LLMs with blazing-fast inference capabilities and robust infrastructure. Pplx-api is built for developers who want to integrate open-source LLMs into their projects and is designed to withstand production-level traffic. It's currently served on Amazon Elastic Compute Cloud (Amazon EC2) P4d instances powered by NVIDIA A100 Tensor Core GPUs and is further accelerated with NVIDIA TensorRT-LLM. Soon, Perplexity will transition fully to Amazon EC2 P5 instances powered by NVIDIA H100 Tensor Core GPUs.
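To give a concrete picture of the developer experience, the sketch below calls a chat-completions endpoint in the OpenAI-compatible style that pplx-api exposes. The endpoint URL and model name here are illustrative assumptions; check Perplexity's API documentation for the current values.

```python
# Minimal sketch of a pplx-api request. The endpoint URL and model name
# are assumptions for illustration -- consult Perplexity's API docs for
# the current endpoint and supported open-source models.
import os
import requests

API_URL = "https://api.perplexity.ai/chat/completions"  # assumed endpoint
API_KEY = os.environ["PPLX_API_KEY"]  # your pplx-api key

payload = {
    "model": "mistral-7b-instruct",  # placeholder open-source model name
    "messages": [
        {"role": "system", "content": "Be precise and concise."},
        {"role": "user", "content": "What is NVIDIA TensorRT-LLM?"},
    ],
}

resp = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```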

Inference Deployment Challenges

Perplexity faces several challenges when deploying LLMs for their core product, which relies on customized versions of various open-source models specialized for search. One significant challenge as a startup has been managing the escalating costs of LLM inference to support Perplexity’s rapid growth.

Since their LLM inference platform, pplx-api, was released in public beta in October 2023, Perplexity has been challenged to optimize their infrastructure to achieve massive scale at minimal cost while maintaining strict service-level agreement (SLA) requirements.

Additionally, community LLMs are growing at an explosive rate. Organizations of all sizes need to adapt quickly to these innovations and build on optimized infrastructure to deploy complex models efficiently. This is driving up the cost and complexity of deployment, so an optimized full-stack approach becomes essential for strong performance of LLM-powered applications.


Perplexity and NVIDIA on AWS

Perplexity harnesses NVIDIA hardware and software to solve these challenges. Serving results faster than one can read, pplx-api achieves up to 3.1X lower latency and up to 4.3X lower first-token latency than other deployment platforms. Perplexity lowered costs by 4X simply by switching their external inference-serving API references to call pplx-api, saving $600,000 per year.

Perplexity achieves this by deploying pplx-api on Amazon EC2 P4d instances. At the hardware level, the underlying NVIDIA A100 GPUs are a cost-effective, reliable option for scaling out GPUs with incredible performance. Perplexity has also shown that, by leveraging NVIDIA H100 GPUs and FP8 precision on Amazon EC2 P5 instances, they can cut latency in half and boost throughput by 200 percent compared to NVIDIA A100 GPUs in the same configuration.

Optimizing the software that runs on the GPU further maximizes performance. NVIDIA TensorRT-LLM, an open-source library that accelerates and optimizes LLM inference, provides optimizations such as FlashAttention and masked multi-head attention (MHA) for the context and generation phases of LLM execution. It also provides a flexible layer of customization for key parameters such as batch size, quantization, and tensor parallelism. TensorRT-LLM is included as part of NVIDIA AI Enterprise, which provides a production-grade, secure, end-to-end software platform for enterprises building and deploying accelerated AI software.
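As a rough sketch of that customization layer, the example below uses the high-level LLM API found in recent TensorRT-LLM releases to set tensor parallelism and FP8 quantization. The model name is a placeholder, and argument names vary across versions, so treat this as an illustration rather than Perplexity's actual configuration.

```python
# Minimal sketch of tuning TensorRT-LLM serving parameters via its
# high-level LLM API (recent releases). The model name and exact kwargs
# are illustrative assumptions, not Perplexity's production setup.
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import QuantConfig, QuantAlgo

llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # placeholder open model
    tensor_parallel_size=2,  # shard weights across two GPUs
    quant_config=QuantConfig(quant_algo=QuantAlgo.FP8),  # FP8 on H100-class GPUs
)

sampling = SamplingParams(temperature=0.2, max_tokens=64)
outputs = llm.generate(["Explain tensor parallelism in one sentence."], sampling)
for output in outputs:
    print(output.outputs[0].text)
```

Batch size, for its part, is typically handled dynamically through in-flight batching, with a configurable upper bound fixed when the engine is built.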

Finally, to tackle scalability, Perplexity uses AWS’s robust integration with Kubernetes to scale elastically beyond hundreds of GPUs and ultimately minimize downtime and network overhead. 
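The sketch below illustrates the underlying mechanism with the official Kubernetes Python client: programmatically resizing a GPU-backed Deployment. The Deployment name is hypothetical, and Perplexity's actual autoscaling setup is not public.

```python
# Minimal sketch of elastic scaling via the official `kubernetes` client.
# The "pplx-inference" Deployment name is hypothetical; this shows the
# mechanism, not Perplexity's actual autoscaling configuration.
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() inside the cluster
apps = client.AppsV1Api()

def scale_inference(replicas: int, namespace: str = "default") -> None:
    """Set the replica count of the GPU-backed inference Deployment."""
    apps.patch_namespaced_deployment_scale(
        name="pplx-inference",  # hypothetical Deployment name
        namespace=namespace,
        body={"spec": {"replicas": replicas}},
    )

scale_inference(8)  # e.g., scale out to eight GPU pods under load
```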

NVIDIA’s full-stack AI inference approach plays a crucial role in meeting the stringent demands of real-time applications. From NVIDIA H100 and A100 GPUs to the optimizations of NVIDIA TensorRT-LLM, the underlying infrastructure powering Perplexity’s pplx-api unlocks both performance gains and cost savings for developers. 

Explore more about Perplexity on AWS on Air, where they discuss their product in depth.

  • NVIDIA TensorRT-LLM is an open-source library that accelerates and optimizes inference performance of the latest LLMs on the NVIDIA AI platform.
  • Perplexity’s pplx-api platform optimizes high-performance computing (HPC) workloads with NVIDIA A100 Tensor Core GPUs.
  • Amazon instances with NVIDIA A100 GPUs offer scalable, high performance for machine learning training and HPC applications in the cloud.
  • pplx-api supercharges LLM inference with NVIDIA H100 Tensor Core GPUs.
  • Amazon instances with NVIDIA H100 GPUs deliver unprecedented performance for training large generative AI models at scale.

NVIDIA Inception Program

Perplexity is a member of NVIDIA Inception, a free program that nurtures startups revolutionizing industries with technological advancements.

What Is NVIDIA Inception?

  • NVIDIA Inception is a free program designed to help startups evolve faster through cutting-edge technology, opportunities to connect with venture capitalists, and access to the latest technical resources from NVIDIA.

NVIDIA Inception Program Benefits

  • Unlike traditional accelerators, NVIDIA Inception supports all stages of a startup’s life cycle. We work closely with members to provide the best technical tools, latest resources, and opportunities to connect with investors.

Join NVIDIA Inception’s global network of over 15,000 technology startups.