RAG Systems: A Complete Guide

What is Retrieval-Augmented Generation (RAG)?

Retrieval-augmented generation (RAG) is a technique that supplies a large language model (LLM) with relevant, domain-specific context retrieved from an external knowledge base so it can answer queries more accurately.

Why does RAG matter?

RAG combines the strengths of traditional information retrieval systems, such as databases and search engines, with the strengths of LLMs. The result is answers that are more accurate and specific to your domain, rather than the generic responses LLMs tend to give. A standard LLM relies largely on what it learned during training (sometimes with the option to browse the web), and that knowledge has a cutoff date.

How RAG Works

RAG systems work in four main stages:

Indexing & Embedding

This stage transforms your raw text (from Google Docs, PDFs, etc.) into something a machine can search semantically.

First, we chunk the data (split it into smaller sections). Each chunk is then passed through an embedding model (OpenAI's text-embedding-3-small/large, Cohere embed-v3, etc.), which converts the text into a vector: a list of numbers, typically a few hundred to a few thousand dimensions, that represents the chunk's semantic meaning in a way machines can compare mathematically. We chunk rather than embed the full document so that each vector captures one focused idea and specific terms don't get diluted.

This is what enables semantic search: you can query "troubleshooting login issues" and retrieve chunks about "authentication problems" even though the query terms don't match. If you also need exact keyword matching (like searching for a specific product ID or an error code), you'll need to combine this with a lexical search index like BM25 (Best Matching 25, discussed further below).

If we chunk too large, the embedding becomes too spread out. It's trying to represent too much information at once, making it less useful for specific queries. If we chunk too small, we lose the surrounding context needed to understand what the text is actually about.

The goal is to find chunks that are self-contained enough to answer questions on their own but focused enough to stay strongly relevant. This usually means 250-700 tokens per chunk, but it depends heavily on your content structure and use case.
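
To make the size trade-off concrete, here is a minimal sketch of fixed-size chunking. It assumes whitespace-separated words as a stand-in for real tokens, and it adds a small overlap between neighbouring chunks, which is a common option rather than something this guide prescribes. The function name and parameters are illustrative only.

```python
def chunk_tokens(tokens, chunk_size=400, overlap=50):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


# Whitespace splitting is only a stand-in for a real tokenizer (assumption).
words = ("reset your password from the login page " * 200).split()  # 1400 words
chunks = chunk_tokens(words, chunk_size=400, overlap=50)
print([len(c) for c in chunks])  # [400, 400, 400, 350]
```

In practice you would likely split on structural boundaries (headings, paragraphs) first and only fall back to fixed windows for long sections.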

Retrieval

When a user asks a question, that query goes through the exact same embedding process as your documents. So if someone asks "how do I reset my password?", that question becomes a vector too.

Now you need to find which document chunks are most similar to that query vector. In theory, you could calculate the distance between your query vector and every single vector in your database to find the closest matches. This is called exact search.
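
As a rough illustration of exact search, the sketch below scores a query vector against every stored vector with cosine similarity and keeps the best k. The three-dimensional vectors and chunk ids are made up for the example; real embeddings have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def exact_search(query_vec, index, k=3):
    """index: dict of chunk_id -> vector. Compares the query with every vector."""
    scored = [(cosine_similarity(query_vec, vec), chunk_id) for chunk_id, vec in index.items()]
    scored.sort(reverse=True)
    return [(chunk_id, score) for score, chunk_id in scored[:k]]


# Toy vectors (hypothetical); a real index would hold embedding-model output.
index = {
    "reset-password": [0.9, 0.1, 0.0],
    "billing-faq": [0.0, 0.2, 0.9],
    "login-errors": [0.7, 0.6, 0.1],
}
results = exact_search([1.0, 0.2, 0.0], index, k=2)
print([chunk_id for chunk_id, _ in results])  # ['reset-password', 'login-errors']
```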

But here's the problem: if you have a million document chunks, you're doing a million distance calculations for every single query. And distance calculations in high-dimensional space (remember, these vectors have hundreds or thousands of dimensions) are computationally expensive. As the collection grows, a search can take seconds or longer, which kills the user experience.

This is where approximate nearest neighbor (ANN) algorithms come in. These algorithms trade a tiny bit of accuracy for massive speed improvements: a search can take milliseconds instead of seconds. They work by organizing vectors into data structures that let you quickly narrow down the search space without checking everything.
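
To show the "narrow down the search space" idea, here is a toy sketch of one simple ANN strategy: partition the vectors into buckets around centroids and search only the bucket(s) nearest the query (the idea behind inverted-file indexes). This is not HNSW or DiskANN, and the hand-picked centroids are an assumption made for brevity; real systems learn them, for example with k-means.

```python
import math


def build_buckets(vectors, centroids):
    """Assign every vector id to its nearest centroid."""
    buckets = {i: [] for i in range(len(centroids))}
    for vec_id, vec in vectors.items():
        nearest = min(range(len(centroids)), key=lambda i: math.dist(vec, centroids[i]))
        buckets[nearest].append(vec_id)
    return buckets


def ann_search(query, vectors, centroids, buckets, k=2, n_probe=1):
    """Search only the n_probe buckets whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda i: math.dist(query, centroids[i]))[:n_probe]
    candidates = [vec_id for i in probe for vec_id in buckets[i]]
    candidates.sort(key=lambda vec_id: math.dist(query, vectors[vec_id]))
    return candidates[:k]


vectors = {
    "a": [0.1, 0.2], "b": [0.2, 0.1], "c": [0.15, 0.15],
    "d": [0.9, 0.8], "e": [0.8, 0.9],
}
centroids = [[0.15, 0.15], [0.85, 0.85]]  # hand-picked for the example
buckets = build_buckets(vectors, centroids)
nearest = ann_search([0.12, 0.18], vectors, centroids, buckets, k=2)
print(nearest)  # ['a', 'c'], found without comparing against 'd' or 'e'
```

The accuracy trade-off shows up when a true neighbour sits in a bucket that was not probed; raising `n_probe` recovers accuracy at the cost of speed.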

The specific algorithm you use depends on your vector database or library. Popular ones include:
- HNSW (Hierarchical Navigable Small World) - Used by Weaviate, Qdrant, Chroma
- DiskANN - Supported by Milvus for very large datasets

You don't usually need to choose these yourself; your vector database handles it. Still, it's good to know that you're trading perfect accuracy for speed, because that explains why the "right" chunk sometimes doesn't show up in your top results.

Once the search runs, you get the top k most similar chunks (k is usually 5 to 20). These retrieved chunks move on to the next stage.

Reranking

Now we have retrieved our top k chunks based on vector similarity. The issue is that those similarity scores aren't perfect.

Your embedding model was optimized to place semantically similar content close together in vector space, but it's trying to represent meaning in a compressed way. Sometimes a chunk scores high on similarity but isn't actually the best answer to the specific question. Maybe it uses similar vocabulary but misses the point. Maybe the query is nuanced and needs more in-depth understanding than a simple distance calculation can provide.

This is where reranking comes in. Think of it as a second opinion from a slower but smarter processor.

You take those initial k chunks and run them through a reranking model: a more sophisticated model that was specifically trained to score "how relevant is this passage to this specific query?" Unlike embeddings, which compress everything into fixed-size vectors, rerankers look at the actual query and chunk together as a pair and make a more nuanced judgment.

These models are typically cross-encoders (like models from Cohere or open-source options like bge-reranker or cross-encoder/ms-marco-MiniLM). They're more accurate than pure vector similarity but also slower and more expensive to run. That's why we don't run them over the entire database and instead use them to refine an already narrowed-down list.

After reranking, we take the top n chunks (say, 3), now reordered by relevance, and pass them to the LLM.

We do this because it balances speed and accuracy. Fast approximate search narrows down millions of chunks to 20 chunks (for example) in milliseconds. Then the slower, smarter reranker ensures those final 3 chunks are actually the most relevant ones.
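
The two-stage shape can be sketched as follows. The scorer here is a deliberately crude word-overlap function standing in for a real cross-encoder, purely so the example runs without a model; `overlap_score` and `rerank` are hypothetical names, and in a real pipeline `score_fn` would call your reranking model on each (query, passage) pair.

```python
def overlap_score(query, passage):
    """Stand-in scorer: fraction of query words that appear in the passage."""
    query_words = set(query.lower().split())
    passage_words = set(passage.lower().split())
    return len(query_words & passage_words) / len(query_words) if query_words else 0.0


def rerank(query, candidates, score_fn=overlap_score, top_n=3):
    """Score each (query, passage) pair and keep the top_n passages."""
    scored = sorted(candidates, key=lambda passage: score_fn(query, passage), reverse=True)
    return scored[:top_n]


# Pretend these came back from the fast vector search.
candidates = [
    "Our offices are closed on public holidays",
    "To reset your password open the login page and click forgot password",
    "Password rules require at least twelve characters",
]
top = rerank("how do I reset my password", candidates, top_n=2)
print(top[0])  # the passage that actually explains how to reset a password
```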

Augmented Generation

Now we have our top 3 reranked chunks. This is where the generation part happens. It's augmented because the LLM isn't working from memory alone.

You construct a prompt that includes both the user's original question and the retrieved context. A typical prompt looks something like this:
```
 You are a helpful assistant. Use the following context to answer the question.

 Context:
 [Chunk 1 content]
 [Chunk 2 content]
 [Chunk 3 content]
 ...

 Question: [User's original query]

 Answer based on the context provided. If the context doesn't contain enough information to answer, clearly state that rather than guessing.
```

The LLM then has to:
- Read through multiple chunks that might have overlapping or complementary information
- Piece together a coherent answer from fragments scattered across different sources
- Handle cases where chunks contradict each other (for example, when outdated content sits alongside newer updates)
- Ignore chunks that seem relevant by keywords but don't actually answer the question

If you put too much context into the prompt, the LLM might get distracted by irrelevant details. If you provide too little, it can't answer properly. The reranking step helps strike this balance by keeping only the most relevant chunks.

If none of the retrieved chunks actually answer the question, the LLM should say, "I don't have information about this in the knowledge base" rather than making up an answer from its own knowledge. This is why the instruction in the prompt matters. It gives the model permission to admit ignorance.

We can also include citations (links to the actual documents), showing which chunk each part of the answer came from. This builds trust and lets users verify the information themselves.
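
A minimal sketch of assembling such a prompt with numbered sources, so the model can cite them. The chunk shape (dicts with `text` and `source` keys), the file names, and the function name are assumptions for illustration; the wording follows the template shown earlier.

```python
def build_prompt(question, chunks):
    """chunks: list of dicts with 'text' and 'source' keys (hypothetical shape)."""
    context = "\n".join(
        f"[{i}] {chunk['text']} (source: {chunk['source']})"
        for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "You are a helpful assistant. Use the following context to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n\n"
        "Answer based on the context provided and cite the bracketed numbers you used. "
        "If the context doesn't contain enough information to answer, "
        "clearly state that rather than guessing."
    )


prompt = build_prompt(
    "How do I reset my password?",
    [
        {"text": "Open Settings, then Security, then choose Reset password.", "source": "account-guide.md"},
        {"text": "Reset links expire after 30 minutes.", "source": "security-faq.md"},
    ],
)
print(prompt)
```

Because each chunk carries its source, the bracketed numbers in the answer can be mapped back to document links when you render the response.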

RAG System Patterns For Effective Retrievals

The approach explained above (dense retrieval, or semantic search) works, but real-world systems often require more sophisticated retrieval approaches. These patterns address common failures like irrelevant retrievals, complex multi-hop questions, and queries that don't match the language of your documents.

Sparse Retrieval (BM25)

BM25 (Best Matching 25) is a probabilistic ranking function that scores documents based on term frequency and inverse document frequency, with a normalization for document length. Unlike dense retrieval, it operates on exact token matches. That makes it good for error codes, product IDs, technical terms, and acronyms. Its weakness is that it has no semantic understanding, e.g. "automobile" won't match "car".
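
For intuition, here is a compact sketch of BM25 scoring over a tiny corpus. It uses whitespace tokenization, the commonly used defaults `k1=1.5` and `b=0.75`, and the IDF variant with `+ 1` inside the logarithm (one of several in use); all of these are assumptions, and a production system would rely on a search engine or library instead.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one BM25 score per document for a whitespace-tokenized query."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for d in tokenized for term in set(d))
    scores = []
    for d in tokenized:
        term_freq = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue  # exact token match only
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            tf = term_freq[term]
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(d) / avg_len))
        scores.append(score)
    return scores


docs = [
    "error E4021 occurs when the session token expires",
    "the car would not start this morning",
    "automobile insurance renewal steps",
]
code_scores = bm25_scores("E4021", docs)  # only the first document scores above zero
car_scores = bm25_scores("car", docs)     # the "automobile" document scores 0.0
print(code_scores, car_scores)
```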

Hybrid Retrieval

Hybrid retrieval combines dense (semantic search) and sparse (BM25) methods to get the best of both worlds. It is a good fit for technical documentation full of codes and IDs, and for mixed query types (some semantic, some keyword-based).
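
One common way to combine the two signals is a weighted sum of normalized scores; rank-based fusion is another. The sketch below assumes you already have a dense score and a sparse score per document, and the document ids, scores, and `alpha` weight are illustrative values only.

```python
def min_max(scores):
    """Rescale a dict of scores to the 0..1 range so the two methods are comparable."""
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {doc_id: 0.0 for doc_id in scores}
    return {doc_id: (s - low) / (high - low) for doc_id, s in scores.items()}


def hybrid_scores(dense, sparse, alpha=0.5):
    """alpha=1.0 is pure dense, alpha=0.0 is pure sparse."""
    dense_n, sparse_n = min_max(dense), min_max(sparse)
    doc_ids = set(dense_n) | set(sparse_n)
    return {
        doc_id: alpha * dense_n.get(doc_id, 0.0) + (1 - alpha) * sparse_n.get(doc_id, 0.0)
        for doc_id in doc_ids
    }


# Hypothetical scores for the query "error E4021 on login".
dense = {"auth-guide": 0.82, "error-codes": 0.60, "pricing": 0.20}   # cosine similarity
sparse = {"error-codes": 7.1, "auth-guide": 1.3, "pricing": 0.0}     # BM25
combined = hybrid_scores(dense, sparse, alpha=0.5)
best = max(combined, key=combined.get)
print(best)  # error-codes: the exact code match outweighs the semantic near-miss
```

Normalization matters here because BM25 scores are unbounded while cosine similarity is not, so adding the raw numbers would let one method dominate.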

Query Transformation

The way users phrase questions often differs from how information is stored in your knowledge base. Query transformation techniques help resolve this.

An example is Query Decomposition. Complex questions often require information from multiple documents. Query decomposition breaks a question into simpler sub-queries that can be answered independently.
```
 Original: "Compare the authentication methods supported by AWS and GCP"

 Decomposed:
 1. "What authentication methods does AWS support?"
 2. "What authentication methods does GCP support?"
```

Graph RAG

Traditional RAG treats documents as independent chunks. Graph RAG adds relationship awareness by building a knowledge graph from your documents. It extracts entities (people, concepts) and their relationships during indexing. At query time, it can answer questions like “What projects use technology X?” by traversing those connections rather than hoping the right chunk is retrieved.

How it works:

1. Extract entities and relationships from documents during indexing.

2. Store both the text chunks and the graph structure.

3. At query time, retrieve relevant nodes and traverse relationships to gather connected context.
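
Step 3 can be sketched with a plain adjacency list and a breadth-first walk. The entities, relations, and chunk ids below are invented for the example, and a real system would populate them from an extraction step and store them in a graph database.

```python
from collections import deque

# Hypothetical graph: entity -> list of (relation, neighbouring entity).
graph = {
    "PostgreSQL": [("used_by", "Billing Service"), ("used_by", "Analytics API")],
    "Billing Service": [("owned_by", "Payments Team")],
    "Analytics API": [("owned_by", "Data Team")],
}
# Hypothetical mapping from entity to the chunk ids that mention it.
entity_chunks = {
    "PostgreSQL": ["chunk-12"],
    "Billing Service": ["chunk-3", "chunk-7"],
    "Analytics API": ["chunk-9"],
    "Payments Team": ["chunk-21"],
}


def gather_context(start, max_hops=1):
    """Breadth-first walk from `start`, collecting chunk ids of entities within max_hops."""
    seen = {start}
    queue = deque([(start, 0)])
    chunk_ids = []
    while queue:
        entity, hops = queue.popleft()
        chunk_ids.extend(entity_chunks.get(entity, []))
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for _relation, neighbour in graph.get(entity, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, hops + 1))
    return chunk_ids


# "What projects use PostgreSQL?" -> start at the PostgreSQL node, follow one hop.
print(gather_context("PostgreSQL", max_hops=1))
# ['chunk-12', 'chunk-3', 'chunk-7', 'chunk-9']
```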

Fusion RAG (RAG-Fusion)

Generate multiple variations of the user’s query and combine their retrieval results using reciprocal rank fusion (RRF). This increases recall by capturing documents that match the intent but use different terminology.
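
Reciprocal rank fusion itself is short: each document earns 1 / (k + rank) from every ranked list it appears in, and the sums decide the final order. The sketch below uses k = 60, the constant most often used with RRF; the query variations and document ids are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """rankings: list of ranked lists of doc ids. Returns ids sorted by fused score."""
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)


rankings = [
    ["doc-a", "doc-b", "doc-c"],  # results for "reset password"
    ["doc-b", "doc-d", "doc-a"],  # results for "password recovery steps"
    ["doc-b", "doc-a", "doc-e"],  # results for "forgot my login credentials"
]
fused = reciprocal_rank_fusion(rankings)
print(fused[:2])  # ['doc-b', 'doc-a']
```

Because RRF only looks at ranks, it needs no score normalization, which also makes it a convenient way to merge dense and sparse results.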

Things to consider when building a RAG System

Building a RAG proof of concept is straightforward. Building one that's fast, cost-effective, and reliable in production is another matter entirely. Below are things to consider.

Semantic Caching

Users rarely ask the exact same question twice (at least not in the same words), but they ask similar questions all the time. "How do I reset my password?" and "What's the process for password recovery?" are semantically the same.

Semantic caching stores responses and returns cached results when a similar enough query comes in. You embed the new query, check if it's close to any cached query (for example, cosine similarity above 0.95), and return the stored response if it matches.

Why it matters: On workloads with many repeated questions, this can cut your costs by as much as 30-60% and dramatically reduce latency, because a cache hit skips both the vector store and the LLM.

The catch: Choose your similarity threshold carefully. Too low and you return wrong answers. Too high and you never get cache hits. Also, add a time to live (TTL) so cached responses don't become stale when documents update, or purge the affected entries on update.
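
A minimal in-memory sketch of the idea, with a similarity threshold and a TTL. The `SemanticCache` class, the `embed` callable, and the toy two-dimensional vectors are all assumptions for illustration; a real deployment would use your embedding model and a vector-capable store rather than a linear scan.

```python
import math
import time


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


class SemanticCache:
    def __init__(self, embed, threshold=0.95, ttl_seconds=3600):
        self.embed = embed            # callable: text -> vector (supplied by the caller)
        self.threshold = threshold
        self.ttl_seconds = ttl_seconds
        self.entries = []             # list of (vector, response, stored_at)

    def get(self, query):
        now = time.time()
        # Drop expired entries so stale answers are never served.
        self.entries = [e for e in self.entries if now - e[2] < self.ttl_seconds]
        vec = self.embed(query)
        best = max(self.entries, key=lambda e: cosine_similarity(vec, e[0]), default=None)
        if best is not None and cosine_similarity(vec, best[0]) >= self.threshold:
            return best[1]
        return None  # cache miss: run the normal retrieval + LLM path

    def put(self, query, response):
        self.entries.append((self.embed(query), response, time.time()))


# Toy embeddings standing in for a real embedding model.
toy_vectors = {
    "How do I reset my password?": [0.9, 0.1],
    "What's the process for password recovery?": [0.88, 0.14],
    "What are your opening hours?": [0.1, 0.9],
}
cache = SemanticCache(embed=toy_vectors.get)
cache.put("How do I reset my password?", "Go to Settings > Security.")
hit = cache.get("What's the process for password recovery?")   # similar enough: cached answer
miss = cache.get("What are your opening hours?")               # unrelated: None
print(hit, miss)
```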

Cost Management

RAG systems have multiple cost centers that compound quickly.

Embedding costs occur at indexing (once per document version) and at every query. Batch your document embeddings instead of processing them one by one; it reduces request overhead and is often cheaper. For query embeddings, cache them when possible.

LLM costs are typically the biggest expense.

Smart cost optimization:
- Use semantic caching to avoid redundant LLM calls
- Route simple queries to cheaper models (save more capable models for complex questions)
- Better retrieval means fewer chunks needed, which means lower token counts
- Consider compressing or summarizing retrieved context before sending to the LLM

Scheduler System For Document Updates

Documents get added, modified, and deleted, so the index needs a way to stay in sync with them. You will need a scheduler that handles this on a regular basis.

Full re-indexing is simple but expensive. Prefer incremental updates that track document hashes and only re-embed what has changed.
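
A sketch of the hash comparison a scheduled job could run: it works out which documents need (re-)embedding and which should be removed from the index. The function names and the dict-based storage are assumptions; persisting the hashes and calling the embedding model are left out.

```python
import hashlib


def content_hash(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_update(documents, stored_hashes):
    """Compare current documents with the hashes recorded at the last indexing run.

    documents: dict of doc_id -> text; stored_hashes: dict of doc_id -> hash.
    Returns (to_embed, to_delete, new_hashes).
    """
    new_hashes = {doc_id: content_hash(text) for doc_id, text in documents.items()}
    to_embed = [doc_id for doc_id, h in new_hashes.items() if stored_hashes.get(doc_id) != h]
    to_delete = [doc_id for doc_id in stored_hashes if doc_id not in documents]
    return to_embed, to_delete, new_hashes


stored = {
    "faq": content_hash("old faq text"),
    "pricing": content_hash("pricing v1"),
    "legacy": content_hash("retired page"),
}
docs = {"faq": "new faq text", "pricing": "pricing v1", "onboarding": "welcome aboard"}
to_embed, to_delete, stored = plan_update(docs, stored)
print(to_embed, to_delete)  # ['faq', 'onboarding'] ['legacy']
```

Only `faq` (changed) and `onboarding` (new) are re-embedded; `pricing` is untouched, and `legacy` is flagged for removal from the vector store.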

Security

You need document-level permissions. Filter at retrieval time based on what each user is allowed to see. Don't retrieve everything and filter afterwards, or you run the risk of leaking documents.
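
The sketch below illustrates filtering before ranking, so restricted chunks never enter the candidate set. The `allowed_groups` metadata field, the group names, and the toy vectors are assumptions; most vector databases expose an equivalent metadata filter on the query itself.

```python
def allowed_chunks(chunks, user_groups):
    """Keep only chunks whose allowed_groups overlap the user's groups."""
    return [c for c in chunks if c["allowed_groups"] & user_groups]


def secure_search(query_vec, chunks, user_groups, k=3):
    # Apply permissions BEFORE ranking so restricted chunks are never candidates.
    visible = allowed_chunks(chunks, user_groups)
    visible.sort(
        key=lambda c: sum(q * v for q, v in zip(query_vec, c["vector"])),  # dot product
        reverse=True,
    )
    return [c["id"] for c in visible[:k]]


chunks = [
    {"id": "salary-bands", "vector": [0.9, 0.1], "allowed_groups": {"hr"}},
    {"id": "leave-policy", "vector": [0.8, 0.3], "allowed_groups": {"hr", "staff"}},
    {"id": "office-map", "vector": [0.1, 0.9], "allowed_groups": {"staff"}},
]
staff_view = secure_search([1.0, 0.0], chunks, {"staff"}, k=2)
hr_view = secure_search([1.0, 0.0], chunks, {"hr"}, k=2)
print(staff_view)  # ['leave-policy', 'office-map']
print(hr_view)     # ['salary-bands', 'leave-policy']
```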
