Retrieval-augmented generation (RAG) is an AI technique that connects an external data source to a large language model (LLM), enabling it to generate domain-specific, up-to-date responses in real time.
LLMs are powerful, but their knowledge is limited to their pretraining data. This poses a challenge for businesses needing AI applications that rely on their own specific documents and data.
RAG addresses this limitation by supplementing LLMs with external data. The technique retrieves relevant information from diverse structured and unstructured sources, including text, images, and video, to ground LLM responses in a company's proprietary data, improving accuracy and reducing hallucinations. This active retrieval, often facilitated by vector databases for efficient semantic search, enables LLMs to provide more informed, contextually relevant answers than if they relied solely on their pretraining data.
In short, RAG works as follows: relevant information is retrieved from your data at query time and supplied to the LLM as added context for its response. This lets you integrate specialized knowledge without retraining the LLM, saving on compute resources.
Keyword search focuses on finding exact matches to the words or phrases a user enters, treating the query literally and with a limited understanding of synonyms or context. For example, a keyword search for "best running shoes for flat feet" may only return results containing that exact phrase.
Conversely, semantic search aims to understand the meaning and intent behind the query, analyzing context, the relationships between words, and sometimes user history to deliver more relevant results. A semantic search for "best running shoes for flat feet" may return results for "stability running shoes," "arch support running shoes," or even reviews of specific shoe models suitable for flat feet, even if those exact keywords never appear in the query.
Essentially, keyword search looks for the words, while semantic search looks for the meaning.
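To make this contrast concrete, here is a minimal Python sketch. It assumes the open source sentence-transformers package and its public all-MiniLM-L6-v2 checkpoint; any embedding model illustrates the same point.

```python
# Keyword vs. semantic matching on a toy corpus.
from sentence_transformers import SentenceTransformer, util

docs = [
    "Stability running shoes with firm arch support",
    "Best trail cameras for wildlife photography",
]
query = "best running shoes for flat feet"

# Keyword search: literal token matching. No document contains every
# query word, so nothing is returned.
keyword_hits = [d for d in docs if all(w in d.lower() for w in query.lower().split())]
print(keyword_hits)  # []

# Semantic search: embeddings place related meanings close together,
# so the arch-support shoe scores far higher than the camera listing.
model = SentenceTransformer("all-MiniLM-L6-v2")
scores = util.cos_sim(model.encode(query), model.encode(docs))
print(scores)  # shape (1, 2); index 0 clearly wins
```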
Information retrieval is the process of finding relevant documents or data based on a user query. It uses algorithms like BM25, TF-IDF, and vector search to return a list of sources, which users must then review manually for insights. Instead of just returning documents, RAG synthesizes a direct response using the retrieved data, reducing the need for manual interpretation. While information retrieval focuses on finding relevant content, RAG uses that content to generate context-aware, coherent answers in real time.
A typical RAG pipeline works in three steps: extraction (where data gets ingested and embedded), retrieval (where relevant information is found), and generation (where answers are created). Each phase is critical in ensuring a RAG pipeline retrieves precise, reliable, and relevant data.
Extraction
During the extraction phase, enterprise data is collected, transformed, indexed, and stored in a vector database. In this phase, an embedding model converts textual, audio, or visual content into high-dimensional vector representations, enabling similarity-based searches. Embedding vectors are indexed (e.g., as graphs) before storage for fast retrieval using approximate nearest neighbor (ANN) methods.
Retrieval
The retrieval phase identifies and fetches relevant data using vector and keyword search techniques. Many RAG systems also use a reranking model to determine which of the retrieved results are most relevant. Appending a reranking model to the retrieval phase improves the overall accuracy of the final generated response.
Generation
Finally, during the generation phase, LLMs combine the user's prompt with the retrieved data to craft answers that are both semantically rich and contextually precise. This combination gives RAG a distinct edge over LLM-only solutions.
A RAG architecture diagram showing three phases: data extraction, retrieval, and generation—powered by NVIDIA NeMo™ Retriever Extraction open library and Nemotron, accelerated with NVIDIA cuVS.
In a RAG pipeline, data extraction involves data collection, embedding generation, and indexing.
Data Collection
First, you must collect, parse, and clean your data: documents, PDFs, product catalogs, images, or even audio transcripts. Text is often chunked into paragraphs or sections so each retrieved piece fits the context window and retrieval stays accurate. High-quality extraction, with accurate metadata and minimal duplication, is crucial because even the most advanced LLMs will struggle if the underlying data is incomplete or disorganized.
Your data sources will dictate the ideal collection method: PDFs and scanned documents may need OCR or layout-aware parsing, product catalogs may come from database exports, and audio or video must first be transcribed.
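As a concrete illustration, the hedged sketch below chunks cleaned text into overlapping pieces with provenance metadata. The fixed character size, the overlap, and the handbook.txt input are illustrative assumptions; production pipelines often chunk by tokens, paragraphs, or document structure instead.

```python
# Minimal fixed-size chunking with overlap and provenance metadata.
def chunk_text(text: str, source: str, size: int = 500, overlap: int = 50):
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + size, len(text))
        chunks.append({
            "text": text[start:end],
            "source": source,   # provenance metadata for later citation
            "offset": start,    # character position within the source
        })
        # Overlap keeps sentences that straddle a boundary retrievable.
        start = end - overlap if end < len(text) else end
    return chunks

# Hypothetical input file; swap in your own documents.
chunks = chunk_text(open("handbook.txt").read(), source="handbook.txt")
```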
Embedding Generation
During the extraction phase, an embedding model transforms data into vector embeddings. Vector embeddings are numerical representations of data—such as text, images, or audio—mapped into a high-dimensional space. These embeddings capture the semantic meaning of content, enabling similarity-based retrieval in RAG. Items with related meanings are placed closer together in the vector space, allowing for fast and efficient searches.
For example, in a search system, a query like "fast GPU for deep learning" retrieves documents with similar embeddings, ensuring contextually relevant results. High-quality embeddings are critical for accurate and meaningful retrieval in AI applications.
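Building on the chunking sketch above, this example batch-embeds the chunks with the public all-MiniLM-L6-v2 checkpoint (an assumption standing in for whatever embedding model your pipeline uses) and shows that similarity then reduces to simple vector math.

```python
# Embed every chunk once at extraction time.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
chunk_texts = [c["text"] for c in chunks]  # `chunks` from the sketch above

# Normalized embeddings make cosine similarity a plain dot product.
vectors = model.encode(chunk_texts, normalize_embeddings=True)  # (n, 384)

sim = vectors @ vectors.T   # pairwise cosine similarities
print(np.round(sim[0], 2))  # chunk 0 vs. every other chunk
```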
Indexing
Finally, once embeddings are created, the system inserts and indexes them in a vector database. Vector databases are at the core of RAG systems: they efficiently store information as data chunks, each represented by a multidimensional vector produced by an embedding model. These databases are built for vector space operations, like cosine similarity, and offer key advantages such as efficient similarity search, handling of high-dimensional data, scalability, real-time processing, and enhanced search relevance.
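For illustration, the sketch below builds a graph-based ANN index with the open source FAISS library, used here as a lightweight stand-in for a full vector database (cuVS or a managed vector DB plays the same role at scale). The vectors array is assumed from the embedding sketch above.

```python
# Approximate nearest-neighbor indexing with FAISS.
import faiss

dim = vectors.shape[1]
index = faiss.IndexHNSWFlat(dim, 32)  # HNSW graph, 32 links per node
index.add(vectors)                    # insert all chunk embeddings

# Top-5 neighbors of the first vector; FAISS returns (distances, ids),
# and with L2 distance smaller means more similar.
dists, ids = index.search(vectors[:1], 5)
```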
Architecture diagram showing data being extracted in a RAG pipeline and a GPU-accelerated vector database—powered by NVIDIA NeMo Retriever Extraction open library and Nemotron.
Because RAG is not limited to text, it can also process images, audio, and video inputs by converting them into embeddings using computer vision and speech-processing models. This enables cross-modal retrieval, where users can query across data types. For instance, an ecommerce platform might embed both product descriptions and product images, so users can search visually (“Find images similar to this reference photo”) as well as textually. By supporting diverse data formats, RAG makes enterprise and consumer applications smarter.
Multilingual embedding models and LLMs used in RAG enable global accessibility, letting enterprise generative AI applications support queries and documents in different languages.
Retrieval identifies the most relevant data to enhance an LLM’s response. The process often begins with query rewriting, where the original search query is automatically refined. This can involve expanding it with synonyms, resolving ambiguities, or incorporating context from previous interactions to improve retrieval accuracy.
Next, the query is converted into an embedding—a numeric vector representation—using an embedding model. This transformation ensures compatibility with stored data embeddings, making it essential to maintain consistency between ingestion-time and query-time embeddings.
Finally, the system performs a similarity search, retrieving the top-k most relevant chunks by measuring vector distances using metrics such as cosine similarity, Euclidean distance, or dot product. ANN algorithms optimize this step by efficiently narrowing down potential matches. The retrieved content—whether text, image, or other data—then provides crucial context for the LLM’s final response.
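Tying the three steps together, here is a hedged query-time sketch. The query-rewrite step is a deliberately trivial placeholder, and the model, index, and chunks names are assumed from the earlier sketches.

```python
# End-to-end retrieval: rewrite -> embed -> ANN search.
def retrieve(query: str, k: int = 5):
    # 1. Query rewriting (placeholder; real systems expand synonyms or
    #    resolve ambiguity, often with an LLM).
    rewritten = query.lower().strip()

    # 2. Embed with the SAME model used at ingestion time -- mixing models
    #    puts query and document vectors in incompatible spaces.
    qvec = model.encode([rewritten], normalize_embeddings=True)

    # 3. ANN similarity search for the top-k chunks (-1 ids are padding).
    dists, ids = index.search(qvec, k)
    return [(chunks[i]["text"], float(d))
            for i, d in zip(ids[0], dists[0]) if i != -1]

for text, dist in retrieve("fast GPU for deep learning"):
    print(round(dist, 3), text[:60])
```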
Architecture diagram showing retrieval in a RAG pipeline with a GPU-accelerated vector database—powered by NVIDIA Nemotron and NVIDIA cuVS.
After retrieval, a reranking step refines the results by prioritizing the most relevant chunks before passing them to the LLM. Reranking models reorder the retrieved content based on relevance signals such as keyword frequency, semantic similarity, recency, or metadata alignment. They can be rule-based (heuristics like BM25), ML-driven (learned relevance models), or hybrid approaches that combine multiple factors. Effective reranking improves retrieval precision, ensuring that the LLM processes the most useful data first and improving response accuracy and efficiency in RAG systems.
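As one common ML-driven approach, the sketch below reranks candidates with a public cross-encoder checkpoint from sentence-transformers; the model ID is an assumption, and heuristic signals like recency could be blended into the score in the same way.

```python
# Cross-encoder reranking: score each (query, chunk) pair jointly.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_n: int = 3):
    # Joint scoring is more accurate than the bi-encoder used for
    # first-stage retrieval, but too slow to run on the whole corpus --
    # which is exactly why it comes after retrieval.
    scores = reranker.predict([(query, c) for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
    return ranked[:top_n]
```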
Advanced retrieval techniques can further optimize LLM responses, enhancing search precision by combining methods, managing large volumes of data, and adapting to query nuances.
| Technique | Description |
|---|---|
| Hybrid Retrieval | This approach combines vector search with traditional keyword techniques like BM25, capturing both semantic similarity and exact-match precision (see the fusion sketch after this table). |
| Long Context Retrieval | Some LLMs can process thousands of tokens in a single prompt, allowing them to consider large amounts of retrieved content. This is especially useful in research, legal, and technical domains where responses require multiple sources. However, longer prompts increase computational costs and memory usage. |
| Contextual Retrieval | Adding context to each chunk through metadata, like the document it belongs to, a summary of the surrounding content, or the date the chunk was indexed, increases the likelihood of the correct context being retrieved. This method, coined “contextual retrieval,” is especially useful for complex multi-source documents, like coding repositories, legal documents, and scientific research papers. |
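To illustrate hybrid retrieval, here is a minimal reciprocal rank fusion (RRF) sketch that merges a keyword ranking (for example, BM25 output) with a vector ranking. The document IDs and the conventional k = 60 constant are illustrative.

```python
# Reciprocal rank fusion: documents ranked highly by either method rise.
def rrf(keyword_ranked: list[str], vector_ranked: list[str], k: int = 60):
    scores: dict[str, float] = {}
    for ranking in (keyword_ranked, vector_ranked):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# "doc_b" wins by appearing near the top of BOTH rankings.
print(rrf(["doc_a", "doc_b", "doc_c"], ["doc_b", "doc_d", "doc_a"]))
```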
Once relevant information is retrieved, the generation phase in RAG synthesizes a final response using an LLM. The model combines the user’s prompt with the retrieved context, generates an answer grounded in that evidence, and can cite the sources it drew from.
By structuring responses around verified, retrieved information, RAG enhances the reliability and transparency of AI-generated content.
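As a minimal illustration of grounded generation, the sketch below packs retrieved chunks into the prompt and instructs the model to answer only from them. The OpenAI Python SDK and the model name are assumptions standing in for any chat-completion endpoint, such as a self-hosted NIM microservice.

```python
# Grounded generation: the LLM answers only from retrieved context.
from openai import OpenAI

client = OpenAI()  # assumes an API key in the environment

def generate(query: str, retrieved_chunks: list[str]) -> str:
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(retrieved_chunks))
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: swap in your deployed model
        messages=[
            {"role": "system",
             "content": "Answer using ONLY the context below. Cite sources "
                        "as [n]. Say 'I don't know' if the context is "
                        "insufficient.\n\n" + context},
            {"role": "user", "content": query},
        ],
    )
    return resp.choices[0].message.content
```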
Agentic RAG pipelines benefit complex applications that require contextually rich responses, like customer support tools, legal services, and enterprise knowledge management.
While simple retrieval works for some applications, consider agentic AI workflows in which multiple AI agents work together toward a common goal. For example, software design, IT automation, and code generation warrant extensive data extraction and processing to ensure accuracy. An advanced RAG pipeline can make multiple passes of retrieval, reasoning, and response refinement so the final output meets your business goals, as sketched below.
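For a rough sense of what a multi-pass pipeline can look like, here is a deliberately simplified sketch that loops retrieval and generation with a naive acceptance check. The retrieve and generate functions are assumed from the earlier sketches; a real agentic system would use an LLM judge and smarter query refinement.

```python
# Multi-pass (agentic-style) RAG: retrieve, draft, check, and retry.
def agentic_answer(query: str, max_passes: int = 3) -> str:
    q = query
    draft = "I don't know"
    for _ in range(max_passes):
        context = [text for text, _ in retrieve(q)]
        draft = generate(q, context)
        if "I don't know" not in draft:  # placeholder acceptance check
            return draft
        q = query + " (broaden the search)"  # naive query refinement
    return draft
```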
Quick Links
To get started building sample RAG applications, download the NVIDIA AI Blueprint for building enterprise-grade RAG pipelines. This reference architecture provides developers with a foundational starting point for building scalable and customizable retrieval pipelines that deliver high accuracy and throughput.
It integrates state-of-the-art NVIDIA technologies, including NVIDIA Nemotron for extraction, embedding, and reranking, and the NVIDIA cuVS library for accelerated data processing and cost-efficient, scalable RAG solutions. Nemotron RAG is a collection of open models, datasets, libraries, and training scripts for building and customizing information retrieval systems. This openness enables deep customization and helps organizations innovate faster with security and transparency.
To connect AI agents to large amounts of diverse data, build an AI query engine. Get started with AI-Q—the NVIDIA Blueprint for building AI agents—powered by the RAG blueprint and Nemotron. Additionally, developers can use the open source NVIDIA NeMo Agent Toolkit to efficiently connect teams of agents and optimize agentic AI systems.
Taking RAG applications to production presents challenges like data curation, governance, security, scalability, and deployment complexity. NVIDIA AI Enterprise simplifies development and deployment by offering powerful tools and technologies, including NVIDIA blueprints, NVIDIA NeMo, and NVIDIA NIM™. Sign up for a 90-day free trial to access enterprise-grade security and robust support needed to scale AI confidently.
Learn how to build RAG AI agents with NVIDIA Nemotron, a family of open source models with open weights, training data, and recipes, by reviewing this learning path.
Connect AI applications to enterprise data using industry-leading embedding and reranking models for information retrieval at scale.
Explore the RAG topic page for developers to access the latest RAG tools, technology, and additional RAG learning resources.