The inference accelerator for NVIDIA Vera Rubin.
Overview
In the past, AI inference architectures delivered either interactivity and intelligence at the cost of throughput, or throughput and intelligence at the cost of interactivity. You couldn’t have all three. Agentic systems demand more.
NVIDIA Groq 3 LPX is the inference accelerator for NVIDIA Vera Rubin, designed to meet the low-latency and large-context demands of agentic systems. Vera Rubin and LPX unite the extreme performance of NVIDIA Rubin GPUs and LPUs through a co-designed architecture.
Inference Performance
By combining the high-bandwidth memory (HBM) of Rubin GPUs with the static random-access memory (SRAM) of LPUs, NVIDIA Vera Rubin with LPX delivers a new class of inference performance for trillion-parameter models and million-token contexts. Deployed with Vera Rubin NVL72, Rubin GPUs and LPUs accelerate decode by jointly computing every layer of the AI model for every output token.
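To build intuition for why this pairing targets decode, here is a rough, illustrative sketch of a bandwidth-bound decode model. It is not NVIDIA's published methodology: the per-token byte count, the FP8 weight format, and the pooled HBM bandwidth figure are all assumptions, and only the ~40 PB/s per-rack SRAM bandwidth comes from the figures quoted below.

```python
# Illustrative roofline-style model: in bandwidth-bound decoding, per-token
# latency is roughly (bytes streamed per token) / (effective memory bandwidth).
# All inputs are assumptions for intuition, not published NVIDIA specifications.

def decode_step_seconds(bytes_per_token: float, bandwidth_bytes_per_s: float) -> float:
    """Per-token decode latency when memory bandwidth is the bottleneck."""
    return bytes_per_token / bandwidth_bytes_per_s

# Assumption: a dense 1-trillion-parameter model at FP8 (1 byte per parameter)
# streams ~1e12 bytes of weights for every output token.
weights_bytes_per_token = 1.0e12

hbm_bandwidth = 1.0e15    # hypothetical ~1 PB/s of pooled rack-level HBM bandwidth
sram_bandwidth = 40.0e15  # the ~40 PB/s per-rack SRAM bandwidth quoted below

print(f"HBM-bound decode:  {decode_step_seconds(weights_bytes_per_token, hbm_bandwidth) * 1e3:.2f} ms/token")
print(f"SRAM-bound decode: {decode_step_seconds(weights_bytes_per_token, sram_bandwidth) * 1e6:.1f} us/token")
```

Under these toy numbers, the per-token decode step drops from about 1 ms to 25 microseconds, which is the class of gap the GPU-plus-LPU split is designed to close; real deployments also spend bandwidth on KV-cache reads and overlap layers across devices, so actual gains will differ.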
Agentic systems consume up to 15x more tokens than traditional AI applications. AI factories must deliver massive token volumes and context windows with low latency and efficient economics. When paired with LPX, Vera Rubin delivers up to 35x higher throughput per megawatt for trillion-parameter models.
Agents are units of intelligence, and inference is their fuel. To deliver real-world impact, agentic systems need tokens that are fast and smart. When LPX is paired with Vera Rubin, the additional throughput per watt and token performance unlock a new tier of ultra-premium, trillion-parameter, million-token-context inference, expanding revenue opportunity for all AI providers.
The NVIDIA Groq 3 LPU is the next generation of Groq’s innovative language processing unit. Each LPX rack features 256 interconnected LPU accelerators that, together with the NVIDIA Vera Rubin platform, supercharge inference. Each LPU accelerator delivers 500 megabytes (MB) of SRAM, 150 terabytes per second (TB/s) of SRAM bandwidth, and 2.5 TB/s scale-up bandwidth.
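The rack-level figures in the next section follow directly from these per-accelerator numbers. A minimal arithmetic cross-check (note that 256 x 150 TB/s works out to 38.4 PB/s, so the 40 PB/s quoted below appears to be a rounded figure):

```python
# Cross-check: per-rack totals derived from the per-LPU figures above.
LPUS_PER_RACK = 256

sram_per_lpu_mb = 500         # MB of SRAM per LPU
sram_bw_per_lpu_tbs = 150     # TB/s of SRAM bandwidth per LPU
scaleup_bw_per_lpu_tbs = 2.5  # TB/s of scale-up bandwidth per LPU

print(f"SRAM per rack:           {LPUS_PER_RACK * sram_per_lpu_mb / 1000:.0f} GB")        # 128 GB
print(f"SRAM bandwidth per rack: {LPUS_PER_RACK * sram_bw_per_lpu_tbs / 1000:.1f} PB/s")  # 38.4 PB/s (~40 PB/s quoted)
print(f"Scale-up bandwidth:      {LPUS_PER_RACK * scaleup_bw_per_lpu_tbs:.0f} TB/s")      # 640 TB/s
```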
Technology Breakthroughs
Built through extreme co-design, the NVIDIA Vera Rubin NVL72 unifies seven purpose-built chips into a single AI supercomputer.
In one LPX rack, 256 LPU chips come together to deliver extreme performance.
In each rack, LPX delivers 128 GB of SRAM for low-latency processing and 12 TB of DDR5 memory for large models and workloads.
40 petabytes per second (PB/s) of SRAM bandwidth per rack delivers low latency.
Direct chip-to-chip links deliver 640 TB/s of scale-up bandwidth across the LPX rack for low-latency chip communication.
LPX’s high-speed connections to NVL72 keep cross-rack communication latency near zero.
LPX leverages the NVIDIA MGX™ rack architecture, enabling token factories to plan for a single universal rack across their NVIDIA Vera Rubin platform deployments.