Inference is the process where a trained AI model generates new outputs by reasoning and making predictions on new data—classifying inputs and applying learned knowledge in real time.
AI inference helps solve advanced application deployment challenges by bringing machine learning and artificial intelligence technology to the real world. From voice-activated AI assistants and personalized shopping recommendations to robust fraud detection systems, inference is powering AI workloads everywhere.
AI training is the process where an AI model or neural network learns to perform a specific task by adjusting its weights based on a set of training data. This process involves multiple iterations to achieve high accuracy, especially when working with large datasets and changing parameters.
Inference is the application of the trained model to real-world data, generating new outputs through predictions or classifications. This phase is optimized for speed and efficiency, often using techniques like speculative decoding, quantization, pruning, and layer fusion to enhance performance while maintaining accuracy.
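As a concrete illustration of one such optimization, here is a minimal sketch of post-training dynamic quantization in PyTorch. The toy model is hypothetical, and production deployments typically rely on dedicated inference runtimes and more sophisticated, calibration-based quantization schemes.

```python
# Minimal sketch of post-training dynamic quantization in PyTorch.
# The toy model below is hypothetical; real deployments usually use
# dedicated inference runtimes and calibration-based quantization.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(512, 1024),
    nn.ReLU(),
    nn.Linear(1024, 256),
).eval()

# Quantize the Linear layers' weights to int8; activations are
# quantized dynamically at inference time.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
with torch.no_grad():
    y = quantized(x)
print(y.shape)  # torch.Size([1, 256])
```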
As models grow in complexity, especially with advanced AI reasoning models, they require more compute resources for inference. Enterprises must scale their accelerated computing resources to support the next generation of AI tools, enabling complex problem-solving, coding, and multistep planning.
AI inference, particularly in the context of large language models (LLMs), works by generating AI tokens; the rate at which tokens are produced determines the speed, cost, and user experience of the application. Specialized hardware such as high-performance GPUs and networking provides the compute and efficiency needed for this demanding workload, which is further optimized with full-stack software enabled by accelerated computing.
Figure description: The diagram illustrates the flow of model inference in LLMs, starting with tokenization of the user’s prompt and progressing through two GPU phases: Prefill (input token processing) and Decode (output token generation). The end-to-end request latency includes the time for tokenization, prefill, decoding, and de-tokenization into human-readable output.
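To make the prefill and decode phases concrete, the following is a minimal sketch using the Hugging Face Transformers API rather than a production inference stack; "gpt2" simply stands in for any causal LLM.

```python
# Sketch of the two inference phases for a causal LLM using Hugging Face
# Transformers (illustrative only; "gpt2" stands in for any causal model).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt = "AI inference turns a trained model into"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids  # tokenization

with torch.no_grad():
    # Prefill: process all input tokens at once and build the KV cache.
    out = model(input_ids, use_cache=True)
    past = out.past_key_values
    next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)

    generated = [next_id]
    # Decode: generate one output token per step, reusing the KV cache.
    for _ in range(16):
        out = model(next_id, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
        generated.append(next_id)

# De-tokenization back into human-readable text.
print(tokenizer.decode(torch.cat(generated, dim=-1)[0]))
```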
Cost per Token: The cost of AI inference is often measured in terms of the cost per token. This is because the computational resources required to process and generate tokens can be significant, especially for multimodal large language models.
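As a back-of-the-envelope sketch, cost per token can be estimated from GPU cost and throughput; the figures below are placeholder assumptions, not measured values.

```python
# Back-of-the-envelope cost-per-token estimate. The hourly GPU cost and
# token throughput below are placeholder assumptions, not measured values.
gpu_cost_per_hour = 4.00          # assumed $/GPU-hour
tokens_per_second_per_gpu = 5000  # assumed aggregate throughput

tokens_per_hour = tokens_per_second_per_gpu * 3600
cost_per_token = gpu_cost_per_hour / tokens_per_hour
print(f"~${cost_per_token * 1_000_000:.2f} per million tokens")
# $4.00 / 18,000,000 tokens  ->  ~$0.22 per million tokens
```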
| Inference Deployment Type | Description |
| --- | --- |
| Batch Inference | Combines multiple user requests to maximize GPU usage, providing high throughput for many users. |
| Real-Time Inference | Processes data instantly as it arrives, essential for applications needing immediate decisions, like autonomous driving or video analysis. |
| Distributed | Runs AI inference across multiple devices or nodes simultaneously to parallelize computations, enabling efficient scaling for large models and lower latency. |
| Disaggregated | Divides the AI thinking process into two distinct stages, initial analysis and response generation, and runs each stage on specialized computers to enhance efficiency. |
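To illustrate the batch inference pattern from the table above, here is a minimal pure-Python sketch of a request batcher; run_model is a hypothetical stand-in for a real batched inference call.

```python
# Minimal sketch of server-side batching: requests are queued and run
# together to improve GPU utilization. run_model is a hypothetical stand-in
# for a real inference call; everything here is illustrative.
from queue import Queue, Empty

MAX_BATCH_SIZE = 8
request_queue: Queue = Queue()

def run_model(prompts):
    # Hypothetical batched inference call; returns one output per prompt.
    return [f"output for: {p}" for p in prompts]

def serve_one_batch():
    batch = []
    while len(batch) < MAX_BATCH_SIZE:
        try:
            batch.append(request_queue.get_nowait())
        except Empty:
            break
    if batch:
        return run_model(batch)  # one GPU pass serves many users
    return []

request_queue.put("summarize this document")
request_queue.put("translate this sentence")
print(serve_one_batch())
```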
Large language model (LLM) inference is a critical component in generative AI applications, chatbots, and document summarization. These applications require a balance of high performance, low latency, and efficient resource utilization to deliver a seamless user experience and maintain cost-effectiveness.
Three primary metrics for evaluating LLM inference are Time to First Token (TTFT), Time Per Output Token (TPOT), and Goodput.
Time to First Token (TTFT): Measures the time it takes for the system to generate the first token, which is crucial for maintaining user engagement. A shorter TTFT ensures that users receive an initial response quickly, which is essential for keeping them engaged and satisfied.
Time Per Output Token (TPOT): Measures the average time taken to generate each subsequent token, impacting the overall speed and efficiency of the inference process. Reducing TPOT is vital for ensuring that the entire response is generated quickly, which is particularly important for real-time applications like chatbots and live translations.
Goodput: Balances latency, performance, and cost by measuring throughput while maintaining target TTFT and TPOT, optimizing AI inference for business goals.
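These metrics can be computed directly from request timestamps. The sketch below uses hypothetical timings and service-level targets to show how TTFT, TPOT, and goodput relate.

```python
# Computing TTFT, TPOT, and goodput from request timestamps.
# The timestamps and SLO targets below are hypothetical examples.
request_start = 0.00
token_times = [0.25, 0.29, 0.33, 0.38, 0.42]  # arrival time of each output token (s)

ttft = token_times[0] - request_start                               # time to first token
tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)  # avg time per output token

# Goodput: throughput counting only the requests that met their latency targets.
TTFT_TARGET, TPOT_TARGET = 0.30, 0.05
requests = [
    {"ttft": 0.25, "tpot": 0.04},
    {"ttft": 0.40, "tpot": 0.04},   # misses the TTFT target
    {"ttft": 0.28, "tpot": 0.06},   # misses the TPOT target
]
window_seconds = 1.0
good = sum(1 for r in requests
           if r["ttft"] <= TTFT_TARGET and r["tpot"] <= TPOT_TARGET)
goodput = good / window_seconds  # requests per second that met both targets

print(f"TTFT={ttft:.2f}s  TPOT={tpot:.3f}s  goodput={goodput:.1f} req/s")
```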
The key challenges in AI inference involve balancing latency, cost, and throughput. High performance often requires overprovisioning GPUs, which increases costs. Real-time latency demands either more AI infrastructure or smaller batch sizes, which can lower performance. Achieving both low latency and high throughput without additional costs is difficult, often leading to data center trade-offs.
Figure description: The core challenge of AI inference is balancing latency, cost, and throughput. By favoring one, you may need to trade off maximum value in another.
The following optimization techniques can be used to help overcome these challenges:
| Technique | Description |
| --- | --- |
| Advanced Batching | Techniques like dynamic, sequence, and inflight batching optimize GPU usage, balancing throughput and latency. |
| Chunked Prefill | Breaks input into smaller chunks to reduce processing time and cost (see the sketch following this table). |
| Multiblock Attention | Optimizes the attention mechanism to focus on relevant input parts, reducing computational load and cost. |
| Model Ensembles | Uses multiple algorithms to improve prediction accuracy and robustness. |
| Dynamic Scaling | Adjusts GPU resources in real time to optimize costs and maintain high performance during peak loads. |
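As a concrete example of one technique from the table, the following sketch splits a long prompt into chunks and prefills them sequentially while reusing the KV cache, using the Hugging Face Transformers API. Production chunked prefill is implemented inside the inference engine, so this is illustrative only, and "gpt2" is a stand-in model.

```python
# Illustrative chunked prefill: process a long prompt in fixed-size chunks
# while carrying the KV cache forward, instead of one large prefill pass.
# Uses Hugging Face Transformers; "gpt2" is a stand-in model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

long_prompt = "AI inference " * 200
input_ids = tokenizer(long_prompt, return_tensors="pt").input_ids

CHUNK = 128
past = None
with torch.no_grad():
    for start in range(0, input_ids.shape[1], CHUNK):
        chunk = input_ids[:, start:start + CHUNK]
        out = model(chunk, past_key_values=past, use_cache=True)
        past = out.past_key_values  # KV cache grows chunk by chunk

# After the chunked prefill, decoding proceeds as usual from the cache.
next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
print("first generated token:", tokenizer.decode(next_id[0]))
```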
By implementing these advanced techniques and best practices, businesses can ensure that their AI applications deliver high performance, low latency, and cost efficiency, ultimately driving better user experiences and business outcomes.
AI inference enables reasoning models to take advantage of a new scaling law known as test-time scaling, which allows them to perform a sequence of inference passes. The model iteratively “thinks” through the problem, producing more output tokens and longer generation cycles, which helps it generate higher-quality responses. Significant test-time compute is essential to support real-time inference and enhance the quality of responses from reasoning models.
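One simple form of test-time scaling is best-of-N sampling: spend more inference compute by generating several candidate responses and keeping the highest-scoring one. The sketch below uses the Hugging Face Transformers generate API with a hypothetical score_answer function standing in for a verifier or reward model; "gpt2" is again a stand-in model.

```python
# Best-of-N sampling as a simple form of test-time scaling: spend more
# inference compute to generate several candidates, then keep the best one.
# score_answer is a hypothetical verifier; "gpt2" is a stand-in model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt = "Step by step, plan a three-city trip:"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        do_sample=True,
        num_return_sequences=4,     # N candidate reasoning traces
        max_new_tokens=64,
        pad_token_id=tokenizer.eos_token_id,
    )

candidates = [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

def score_answer(text: str) -> float:
    # Hypothetical verifier / reward model; here just a length heuristic.
    return float(len(text.split()))

best = max(candidates, key=score_answer)
print(best)
```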
An AI factory is a large-scale computing infrastructure designed to automate the development, deployment, and continuous improvement of AI models. AI inference plays a crucial role in these systems, as it represents the final stage where trained models generate real-world predictions and decisions. Once a model is developed within an AI factory, it is optimized and deployed for inference to deliver high-performance, low-latency AI services across cloud, hybrid, or on-premises environments.
The AI factory also ensures inference remains efficient through constant optimization and management of accelerated AI infrastructure. Additionally, by setting up an AI data flywheel, the inference results feed back into the AI factory, allowing continuous learning and refinement of models based on real-world data. This feedback loop helps AI systems evolve, improving accuracy and efficiency over time. By tightly integrating AI inference into its workflow, an AI factory enables scalable, cost-effective AI deployments across industries.
NVIDIA provides full-stack libraries, software, and services to help you get started with AI inference. With the largest inference ecosystem, purpose-built acceleration software, advanced networking, and industry-leading performance per watt, NVIDIA is delivering the high throughput, low latency, and cost efficiency needed for this new era of AI computing.