Amazon Accelerates Customer Satisfaction With NVIDIA Triton Inference Server and NVIDIA TensorRT


Amazon improves the customer experience with AI-driven, real-time spell check for product search.



Use Case

Real-time Search


Products

NVIDIA TensorRT, NVIDIA Triton Inference Server, T5, Triton Model Analyzer

Real-Time Spell Check for Enhanced Product Search

Amazon, one of the most visited e-commerce websites in the world, enables customers to shop more effortlessly using an AI model that automatically corrects misspelled words in search queries. Amazon measures the success of its accelerated search results on two metrics: latency, how fast the spell checker corrects a typo, and throughput, the number of successful sessions.
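Amazon's benchmarking harness isn't shown in the case study; as a minimal sketch, per-query latency and throughput numbers like the ones above can be collected with a timing loop such as the following (function names and the stub spell checker are illustrative, not Amazon's code):

```python
import statistics
import time

def measure_latency_ms(fn, queries, warmup=5):
    """Run fn over each query and return per-query latencies in milliseconds."""
    for q in queries[:warmup]:  # warm-up calls so lazy initialization isn't timed
        fn(q)
    latencies = []
    for q in queries:
        start = time.perf_counter()
        fn(q)
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies

def summarize(latencies):
    """Report median (p50) and tail (p99) latency plus throughput in queries/sec."""
    p50 = statistics.median(latencies)
    p99 = sorted(latencies)[max(0, int(len(latencies) * 0.99) - 1)]
    qps = 1000.0 * len(latencies) / sum(latencies)
    return {"p50_ms": p50, "p99_ms": p99, "qps": qps}

# Example with a stub corrector standing in for the real model:
stats = summarize(measure_latency_ms(lambda q: q.replace("teh", "the"),
                                     ["teh cat", "teh dog"] * 10))
```

A real evaluation would point `fn` at the deployed inference endpoint and check the tail latency (p99) against the real-time budget rather than the average.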

NVIDIA Solutions

To achieve the desired results, Amazon uses the Text-To-Text Transfer Transformer (T5) natural language processing (NLP) model for spelling correction. To accelerate text correction, the team uses NVIDIA AI inference software, including NVIDIA Triton™ Inference Server and NVIDIA® TensorRT™, an SDK for high-performance deep learning inference.
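The case study doesn't include deployment artifacts. As a rough sketch of how a TensorRT-optimized model is served by Triton, the engine sits in a model repository alongside a `config.pbtxt` like the one below (the model name, tensor names, shapes, and batching settings are illustrative assumptions; a real T5 encoder-decoder deployment involves more than one engine):

```
model_repository/
└── spell_check_t5/
    ├── config.pbtxt
    └── 1/
        └── model.plan        # serialized TensorRT engine

# config.pbtxt (illustrative)
name: "spell_check_t5"
platform: "tensorrt_plan"
max_batch_size: 32
input [
  { name: "input_ids", data_type: TYPE_INT32, dims: [ -1 ] }
]
output [
  { name: "output_ids", data_type: TYPE_INT32, dims: [ -1 ] }
]
dynamic_batching { max_queue_delay_microseconds: 100 }
```

Dynamic batching lets Triton group concurrent search queries into larger GPU batches, which is one common lever for raising throughput without blowing the latency budget.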

Amazon Results


  • 5X inference speedup with NVIDIA TensorRT and NVIDIA Triton Inference Server

  • Real-time (<50ms) inference

Amazon successfully deployed the T5 NLP model for automatic spelling correction, accelerated by Triton Inference Server and TensorRT. Together, the NVIDIA solutions delivered inference latency under 50ms and 5X higher throughput for the T5 model, using NVIDIA GPUs on Amazon Web Services (AWS). Triton Model Analyzer also cut the time needed to find the optimal inference configuration from weeks to hours. With AI, online shoppers can now find the products they’re looking for more quickly and easily, boosting Amazon’s overall customer satisfaction.
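Model Analyzer automates the configuration sweep mentioned above. Its search is typically driven by a small YAML config; a hypothetical setup for sweeping request concurrency and instance counts under a 50ms latency constraint might look like this (the model name and limits are assumptions, not Amazon's actual settings):

```
# config.yml, passed as: model-analyzer profile -f config.yml  (illustrative)
model_repository: /models
profile_models:
  - spell_check_t5
triton_launch_mode: local
run_config_search_max_concurrency: 64
run_config_search_max_instance_count: 4
constraints:
  perf_latency_p99:
    max: 50      # reject configurations that miss the real-time budget (ms)
```

Rather than hand-testing each combination of batch size, instance count, and concurrency, the tool profiles candidates automatically and reports the configurations that satisfy the constraint, which is how a weeks-long manual search can shrink to hours.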

About Amazon

Amazon, Inc. is an American multinational technology company that focuses on e-commerce.

“It’s all about the customer experience, and the search bar is the point of entry for our customers all over the world. With Model Analyzer, what used to take us two or three weeks we can do in less than a day. We demonstrated generative models work best on NVIDIA GPUs, that was clear. If I can bring millisecond latency to bigger models, I can make more customers happy. NVIDIA focuses on the right thing: optimizing for performance, and they are excellent partners, fast and responsive on features.”

Senior Machine Learning Developer, Amazon