TECHNICAL DEEP DIVE

Advanced Retrieval Strategies: Beyond Basic RAG for AI Products

By Institute of AI PM · 15 min read · May 3, 2026

TL;DR

Basic RAG — embed a query, find similar chunks, stuff them into a prompt — works for demos but fails in production. The retrieval step is almost always the weakest link: wrong chunks retrieved means wrong answers generated, regardless of how capable the LLM is. Advanced retrieval strategies like hybrid search, cross-encoder reranking, query decomposition, contextual compression, and agentic retrieval can improve answer quality by 30–60% on real-world benchmarks. This guide covers each strategy, when to apply it, and how to evaluate retrieval quality independently from generation quality.

Why Basic RAG Isn't Enough for Production

Basic RAG follows a simple pipeline: embed the user query, find the top-k most similar chunks from a vector store, concatenate them into the prompt, and generate an answer. This works surprisingly well for narrow, well-structured knowledge bases. But in production, it breaks down in predictable ways.

1. The vocabulary mismatch problem

Embedding similarity measures semantic closeness, but users often describe problems using different vocabulary than the source documents use. A user asking 'why is my app crashing?' won't match a document titled 'Memory allocation errors in containerized deployments' even though the answer is there. Dense embeddings capture semantic meaning but miss lexical overlap that keyword search would catch.

Trade-off: This is the fundamental limitation of vector-only retrieval. Solutions include hybrid search (combining dense and sparse retrieval) and query expansion. Fixing this alone typically improves recall by 15-25%.

2. The ranking quality problem

Bi-encoder embeddings (used by most embedding models) compute query and document vectors independently, then compare them with cosine similarity. This is fast but crude. A query and document might be 'similar' in embedding space without the document actually answering the query. The document might discuss the same topic but from an irrelevant angle, contain outdated information, or address a different aspect entirely.

Trade-off: Cross-encoder rerankers solve this by jointly encoding the query-document pair, producing much more accurate relevance scores. The cost is latency: reranking adds 50-200ms per query depending on the number of candidates.

3. The multi-hop reasoning problem

Many real questions require information from multiple documents that individually don't contain the answer. 'What's our churn rate for enterprise customers who onboarded after the pricing change?' requires joining information across pricing change dates, customer segments, and churn metrics. Basic top-k retrieval finds documents similar to the query, not documents that together compose an answer.

Trade-off: Query decomposition and agentic retrieval address this by breaking complex queries into sub-queries or letting the LLM iteratively search for the information it needs. These approaches are slower but dramatically better for analytical questions.

4. The context dilution problem

Retrieving more chunks increases the chance of including the right information, but it also increases noise. Long contexts with irrelevant information degrade LLM answer quality — models get distracted by plausible-but-wrong context, especially content in the middle of the prompt (the 'lost in the middle' effect). Basic RAG has no mechanism to filter out low-value retrieved content.

Trade-off: Contextual compression solves this by summarizing or filtering retrieved chunks before injection, keeping only the information relevant to the specific query. This reduces prompt size and improves answer accuracy simultaneously.

The 5 Advanced Retrieval Strategies

1. Hybrid search (dense + sparse)

Combine vector similarity search (dense retrieval) with traditional keyword search (sparse retrieval, typically BM25). Dense retrieval excels at semantic matching — understanding that 'revenue growth' and 'top-line expansion' mean the same thing. Sparse retrieval excels at exact term matching — finding documents that contain specific product names, error codes, or technical terms. Run both searches in parallel, then merge results using Reciprocal Rank Fusion (RRF) or a learned linear combination. Most production RAG systems use a 0.7 dense / 0.3 sparse weighting as a starting point, then tune based on query logs.

Trade-off: Adds BM25 index maintenance and a fusion layer. Infrastructure cost increases modestly, but recall improvements of 15-30% are typical. Nearly every production RAG system should use hybrid search — the cost-benefit ratio is overwhelmingly positive.
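
To make the fusion step concrete, here is a minimal Reciprocal Rank Fusion sketch in Python. It assumes you already have two ranked lists of chunk IDs, one from your vector store and one from a BM25 index; the function name, the k constant of 60, and the way the 0.7/0.3 weighting is folded in are illustrative choices, not a fixed recipe.

    # Minimal weighted Reciprocal Rank Fusion (RRF) sketch.
    # Assumes two ranked lists of chunk IDs from a dense retriever and a BM25 index.
    def rrf_fuse(dense_ids: list[str], sparse_ids: list[str],
                 k: int = 60, dense_weight: float = 0.7) -> list[str]:
        scores: dict[str, float] = {}
        # Each list contributes weight / (k + rank); k = 60 is the common default.
        for rank, doc_id in enumerate(dense_ids, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + dense_weight / (k + rank)
        for rank, doc_id in enumerate(sparse_ids, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + (1 - dense_weight) / (k + rank)
        return sorted(scores, key=scores.get, reverse=True)

    # Usage: fused = rrf_fuse(vector_store_ids, bm25_ids)[:20]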

2. Cross-encoder reranking

After initial retrieval returns the top 20-50 candidates, pass each query-document pair through a cross-encoder model (like Cohere Rerank, BGE-reranker, or a fine-tuned model) that jointly attends to both query and document. Cross-encoders produce far more accurate relevance scores than bi-encoder cosine similarity because they can model fine-grained interactions between query terms and document content. Rerank the candidates, then take the top 3-5 for the final prompt. This two-stage pipeline (fast retrieval, then accurate reranking) is the standard architecture for high-quality RAG.

Trade-off: Adds 50-200ms of latency per query (depending on the number of candidates and reranker model size). For user-facing products where answer quality matters more than sub-second latency, this is almost always worth it. For high-throughput, latency-sensitive applications, consider caching reranked results for common queries.
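
A minimal sketch of the reranking stage, assuming the open-source sentence-transformers library and a public BGE reranker checkpoint; a hosted reranking API slots into the same place.

    # Rerank retrieved candidates with a cross-encoder.
    # Assumes the sentence-transformers library; the model name is one public option.
    from sentence_transformers import CrossEncoder

    reranker = CrossEncoder("BAAI/bge-reranker-base")

    def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
        # The cross-encoder scores each (query, document) pair jointly.
        scores = reranker.predict([(query, doc) for doc in candidates])
        ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
        return [doc for doc, _ in ranked[:top_n]]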

3. Query decomposition

Use the LLM to break a complex query into simpler sub-queries before retrieval. 'Compare our Q1 and Q2 customer acquisition costs across enterprise and SMB segments' becomes four separate retrieval queries: Q1 enterprise CAC, Q1 SMB CAC, Q2 enterprise CAC, Q2 SMB CAC. Each sub-query retrieves more targeted chunks than the original compound query would. The sub-results are then combined and passed to the LLM for synthesis. This is critical for analytical, comparative, or multi-entity queries that a single embedding can't represent well.

Trade-off: Multiplies the number of retrieval calls (and LLM calls for decomposition). Latency increases proportionally. Best applied selectively — use a query classifier to route simple queries directly and only decompose complex ones. The decomposition step itself can fail if the LLM misunderstands the query structure.
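
The sketch below shows one way to wire decomposition in. The prompt wording, the line-per-sub-query output format, and the llm and search callables are assumptions you would replace with your own client and retriever.

    # Query decomposition sketch. `llm` is any callable that takes a prompt and
    # returns the model's text completion; wire in your own client.
    from typing import Callable

    def decompose_query(query: str, llm: Callable[[str], str]) -> list[str]:
        prompt = (
            "Break the following question into the smallest set of standalone "
            "search queries needed to answer it, one per line:\n" + query
        )
        lines = llm(prompt).splitlines()
        subs = [line.strip("-• ").strip() for line in lines if line.strip()]
        return subs or [query]  # fall back to the original query if parsing fails

    def retrieve_decomposed(query, llm, search, top_k=3):
        # One retrieval call per sub-query; results are deduplicated before synthesis.
        chunks = []
        for sub in decompose_query(query, llm):
            chunks.extend(search(sub, top_k))
        return list(dict.fromkeys(chunks))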

4. Contextual compression

After retrieving chunks, use a smaller LLM or a trained extractor to compress each chunk, keeping only the sentences relevant to the specific query. A 500-token chunk about employee benefits might contain one paragraph about parental leave policy — if the query is about parental leave, contextual compression extracts just that paragraph. This reduces the total context size passed to the generator LLM, lowers cost, and eliminates the 'lost in the middle' effect by removing distracting content.

Trade-off: Adds an LLM call per retrieved chunk (or a batch call for all chunks). The compression model can occasionally remove relevant information, so evaluation is essential. For cost optimization, use a small, fast model for compression — this is a low-complexity task that doesn't need a frontier model.
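
A minimal compression sketch under the same assumptions: small_llm stands for whatever cheap model you route this task to, and the extract-verbatim prompt is illustrative.

    # Contextual compression sketch: a small model keeps only the sentences
    # relevant to the query. `small_llm` is a placeholder for your own client.
    from typing import Callable

    def compress_chunk(query: str, chunk: str, small_llm: Callable[[str], str]) -> str:
        prompt = (
            "Copy, verbatim, only the sentences from the passage below that help "
            f"answer this question: {query}\n"
            "If nothing is relevant, reply with NONE.\n\nPassage:\n" + chunk
        )
        extracted = small_llm(prompt).strip()
        return "" if extracted == "NONE" else extracted

    def compress_all(query, chunks, small_llm):
        # Drop chunks that compress to nothing; the rest shrink the final prompt.
        return [c for c in (compress_chunk(query, ch, small_llm) for ch in chunks) if c]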

5. Agentic retrieval

Instead of a fixed retrieve-then-generate pipeline, give the LLM retrieval tools and let it decide what to search for, evaluate the results, and iteratively refine its searches. The LLM might search for 'parental leave policy,' find a reference to 'Policy Document 2024-HR-12,' then search for that specific document, extract the relevant section, and then answer. This is the most powerful retrieval strategy because the LLM applies reasoning to the retrieval process itself — but it's also the most expensive and hardest to control.

Trade-off: Latency is unpredictable (1-10+ seconds depending on how many retrieval iterations the agent performs). Cost scales with iterations. Requires careful guardrails to prevent infinite loops or irrelevant searches. Best for complex, high-value queries where correctness matters more than speed — internal knowledge bases, legal research, technical support escalation.
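
A stripped-down agentic loop, assuming an llm callable and a search tool. The SEARCH:/ANSWER: protocol and the four-step cap are illustrative guardrails, not a standard.

    # Agentic retrieval sketch: the model chooses between searching and answering,
    # with a hard iteration cap as a guardrail against runaway loops.
    from typing import Callable

    def agentic_answer(question: str, llm: Callable[[str], str],
                       search: Callable[[str], list[str]], max_steps: int = 4) -> str:
        notes: list[str] = []
        for _ in range(max_steps):
            prompt = (
                "You answer questions using a search tool.\n"
                f"Question: {question}\n"
                "Notes so far:\n" + "\n".join(notes) +
                "\nReply with either 'SEARCH: <query>' or 'ANSWER: <final answer>'."
            )
            reply = llm(prompt).strip()
            if reply.startswith("ANSWER:"):
                return reply[len("ANSWER:"):].strip()
            query = reply[len("SEARCH:"):].strip() if reply.startswith("SEARCH:") else question
            notes.extend(search(query))  # append retrieved chunks to the working notes
        # Cap reached: force a final answer from whatever has been gathered.
        return llm("Answer from these notes only:\n" + "\n".join(notes) +
                   f"\nQuestion: {question}")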

How to Choose the Right Retrieval Strategy

FAQ and support bots

Start with hybrid search + reranking. Queries are typically simple and direct, but users phrase the same question in many different ways. Hybrid search catches both semantic and keyword matches; reranking ensures the best answer surfaces first. Agentic retrieval is overkill here and adds unnecessary latency.

Document Q&A over large corpora

Hybrid search + reranking + contextual compression. When your knowledge base has thousands of documents, retrieved chunks often contain surrounding noise. Compression extracts the precise answer from noisy chunks. Add query decomposition if users ask comparative or multi-document questions.

Internal knowledge management

Agentic retrieval shines here. Internal queries are often complex and context-dependent ('What did the engineering team decide about the migration timeline in Q3 reviews?'). The agentic approach can follow references, search across document types, and synthesize answers from scattered sources.

Real-time product features

Hybrid search + reranking with strict latency budgets. For features like in-app search or autocomplete suggestions, latency matters more than exhaustive accuracy. Use a small, fast reranker and cap the number of candidates. Consider pre-computing and caching results for common queries.

Build Production-Grade RAG Systems

Retrieval architecture, vector databases, and production AI pipelines are covered in depth in the AI PM Masterclass. Taught by a Salesforce Sr. Director PM.

Retrieval Quality Evaluation Methods

Most teams evaluate their RAG system end-to-end: did the final answer match the expected answer? This is necessary but insufficient. You must also evaluate retrieval quality independently — if the right chunks aren't retrieved, even a perfect LLM can't generate the right answer.

Recall@k

Of all the relevant chunks in your knowledge base for a given query, what fraction appears in the top-k retrieved results? This is the most important retrieval metric. If recall@5 is 0.6, it means 40% of the relevant information is being missed before the LLM ever sees it. Build a labeled evaluation set of 100+ queries paired with their known-relevant chunks and measure this weekly. Aim for recall@5 above 0.85 for production systems.
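
A minimal recall@k sketch over such a labeled set; eval_set, retrieve, and the example chunk IDs are hypothetical names for your own evaluation data and pipeline.

    # Recall@k over a labeled evaluation set. Each example pairs a query with the
    # IDs of the chunks a human marked relevant; `retrieve` is your own pipeline.
    def recall_at_k(eval_set, retrieve, k=5):
        scores = []
        for query, relevant_ids in eval_set:
            retrieved = set(retrieve(query, k))
            scores.append(len(retrieved & set(relevant_ids)) / len(relevant_ids))
        return sum(scores) / len(scores)

    # Usage (hypothetical IDs): recall_at_k([("parental leave policy", ["hr-12#3"])], retrieve)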

Precision@k

Of the k chunks retrieved, what fraction is actually relevant to the query? Low precision means you're stuffing irrelevant context into the prompt, wasting tokens and potentially confusing the LLM. Precision and recall are in tension — retrieving more chunks improves recall but often hurts precision. Reranking is the primary tool for improving precision without sacrificing recall.
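
Precision@k can be computed from the same labeled set; again, eval_set and retrieve are placeholders for your own data and pipeline.

    # Precision@k against the same labeled evaluation set used for recall@k.
    def precision_at_k(eval_set, retrieve, k=5):
        scores = []
        for query, relevant_ids in eval_set:
            retrieved = retrieve(query, k)
            hits = sum(1 for doc_id in retrieved if doc_id in set(relevant_ids))
            scores.append(hits / max(len(retrieved), 1))
        return sum(scores) / len(scores)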

Mean Reciprocal Rank (MRR)

Where does the first relevant chunk appear in the ranked results? MRR rewards systems that put the most relevant chunk at position 1 rather than position 5. This matters because LLMs weight information at the beginning of the context more heavily. If your MRR is low but recall is high, your chunks are being retrieved but poorly ranked — reranking will help significantly.
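
A minimal MRR sketch using the same labeled evaluation set.

    # Mean Reciprocal Rank: position of the first relevant chunk, averaged over queries.
    def mean_reciprocal_rank(eval_set, retrieve, k=10):
        total = 0.0
        for query, relevant_ids in eval_set:
            relevant = set(relevant_ids)
            for rank, doc_id in enumerate(retrieve(query, k), start=1):
                if doc_id in relevant:
                    total += 1.0 / rank
                    break  # only the first relevant hit counts
        return total / len(eval_set)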

Retrieval-generation alignment

For each query in your evaluation set, run retrieval and generation separately. Score the retrieved chunks for relevance (human judgment or LLM-as-judge), then score the generated answer for correctness. When generation quality is low and retrieval quality is also low, your problem is retrieval. When retrieval quality is high but generation quality is low, your problem is the prompt or model. This decomposition prevents you from optimizing the wrong component.
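
One way to operationalize that decomposition is a simple bucketing rule per query; the 0.7 threshold is an arbitrary illustration and should be calibrated against your own score distributions.

    # Diagnosis sketch: bucket each evaluation query by where it failed.
    # Scores can come from human labels or an LLM-as-judge; threshold is illustrative.
    def diagnose(retrieval_score: float, answer_score: float,
                 threshold: float = 0.7) -> str:
        if retrieval_score < threshold:
            return "fix retrieval"        # wrong chunks, so generation cannot recover
        if answer_score < threshold:
            return "fix prompt or model"  # right chunks, wrong answer
        return "ok"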

Retrieval Architecture Patterns for Production

1. The two-stage pipeline (retriever + reranker)

The most common production pattern. Stage 1: fast retrieval (hybrid search) returns 20-50 candidates in under 100ms. Stage 2: cross-encoder reranker scores each candidate against the query and returns the top 3-5. This architecture separates speed from accuracy — the retriever optimizes for recall (don't miss relevant chunks), the reranker optimizes for precision (only pass the best chunks to the LLM). Start here unless you have a specific reason not to.
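
A sketch of how the two stages compose, reusing hybrid_search and rerank as stand-ins for the components sketched earlier; the 40-candidate and 5-result cutoffs are illustrative defaults.

    # Two-stage pipeline sketch: hybrid retrieval for recall, reranking for precision.
    def retrieve_for_prompt(query: str, hybrid_search, rerank,
                            candidates: int = 40, final: int = 5) -> list[str]:
        pool = hybrid_search(query, candidates)  # stage 1: cast a wide net, fast
        return rerank(query, pool, final)        # stage 2: keep only the best few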

2. Query routing with strategy selection

Not all queries need the same retrieval strategy. Build a lightweight query classifier (rule-based or LLM-based) that categorizes incoming queries: simple factual queries go through hybrid search + reranking; complex multi-hop queries go through decomposition + parallel retrieval; ambiguous queries trigger a clarification response. This reduces average latency and cost while maintaining quality for complex queries.
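
A rule-based version can be only a few lines; the keyword heuristics below are illustrative and would normally be tuned on your query logs or replaced by a small LLM classifier.

    # Rule-based router sketch: route queries to a strategy by surface features.
    def route(query: str) -> str:
        q = query.lower()
        if len(q.split()) <= 2:
            return "clarify"        # too little signal to retrieve on
        if any(w in q for w in ("compare", " vs ", "difference between", "across")):
            return "decompose"      # comparative or multi-hop phrasing
        return "hybrid_rerank"      # default fast path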

3. Hierarchical retrieval with document-then-chunk

Instead of searching a flat pool of chunks, first identify the most relevant documents, then search within those documents for the best chunks. This prevents cross-contamination where chunks from irrelevant documents score high due to surface-level similarity. Particularly important for large knowledge bases with documents across different domains or time periods.
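
A sketch of the document-then-chunk flow; doc_index and chunk_index, and their search/filter interface, are assumed placeholders rather than the API of any specific vector database.

    # Hierarchical retrieval sketch: narrow to documents first, then search chunks
    # only within those documents. The index objects and their interface are assumed.
    def hierarchical_retrieve(query, doc_index, chunk_index,
                              top_docs: int = 5, top_chunks: int = 8):
        doc_ids = doc_index.search(query, top_docs)
        # Restrict the chunk search to chunks whose parent document made the cut.
        return chunk_index.search(query, top_chunks, filter={"doc_id": doc_ids})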

4. Retrieval with fallback chains

Define a chain of retrieval strategies with increasing capability and cost. First try cached results, then hybrid search, then hybrid search + reranking, then query decomposition. Only escalate to the next strategy if the current one returns results below a confidence threshold. This optimizes the cost-quality trade-off dynamically — simple queries are answered cheaply and quickly, complex queries get the full pipeline.
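
A sketch of such a chain, assuming each strategy returns its chunks plus a confidence score (for example, the top reranker score); the threshold and the ordering of strategies are up to you.

    # Fallback chain sketch: escalate to costlier strategies only when the current
    # one falls below a confidence threshold. Each strategy returns (chunks, confidence).
    def retrieve_with_fallback(query, strategies, threshold: float = 0.7):
        best = ([], 0.0)
        for strategy in strategies:        # ordered cheapest to most capable
            chunks, confidence = strategy(query)
            if confidence >= threshold:
                return chunks
            if confidence > best[1]:
                best = (chunks, confidence)
        return best[0]                     # all fell short: return the best seen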

Master RAG Architecture in the AI PM Masterclass

Retrieval strategies, vector databases, and production AI pipelines are core topics in the AI PM Masterclass. Taught by a Salesforce Sr. Director PM.