Semantic Search vs Keyword Search: Which One Your AI Product Actually Needs
TL;DR
Keyword search (BM25) is the 30-year-old algorithm that still beats embeddings on exact-match queries: product SKUs, error codes, legal citations, person names. Semantic search (dense embeddings) wins on paraphrase, intent, and conceptual queries: "how do I cancel" finding a doc titled "Subscription Termination." Real benchmarks show neither wins outright — hybrid search using Reciprocal Rank Fusion (RRF) typically beats both by 5-15 points on NDCG@10. Semantic adds 50-200ms latency and storage cost; BM25 is essentially free. Most production AI products that look pure-semantic are actually hybrid under the hood. Default to BM25, add semantic where it earns its cost, fuse with RRF.
BM25: The Keyword Baseline That Refuses to Die
BM25 (Best Match 25) is a probabilistic ranking function from 1994. It scores documents using term frequency, inverse document frequency, and length normalization. It does not understand meaning. It does not know that "car" and "automobile" are related. And yet 30 years later it's still the default relevance function in most search systems, including the under-the-hood retrieval in Elasticsearch, OpenSearch, and anything else built on Lucene. (Postgres FTS is the keyword-search option in the Postgres world, though its default ts_rank scoring is not BM25.)
How it works (intuition)
Documents that contain rare query terms many times, but aren't bloated with them, score highest. The math weights term frequency sublinearly (saturates fast) and penalizes very long documents — so a 50-page doc with one match doesn't beat a one-paragraph doc with two.
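To make the saturation and length effects concrete, here's a minimal, illustrative Python sketch of the classic Okapi scoring function. The whitespace tokenizer and brute-force document-frequency loop are toy simplifications, not how Lucene does it; k1 and b are the standard parameter names, and the defaults shown are typical but tunable.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    """Score one document against a query with classic BM25.

    corpus: list of tokenized documents, used for IDF and average length.
    k1 controls how fast term frequency saturates; b controls length penalty.
    """
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)         # document frequency
        if df == 0:
            continue
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)  # rare terms weigh more
        freq = tf[term]
        # TF saturates: the 2nd occurrence adds less than the 1st, and long
        # documents are normalized by the b * (doc_len / avgdl) term below.
        score += idf * (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * len(doc_terms) / avgdl))
    return score

corpus = [doc.split() for doc in [
    "cancel subscription billing",
    "performance tuning guide for the app",
]]
print(bm25_score("cancel subscription".split(), corpus[0], corpus))
```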
Where it dominates
Exact identifiers (SKU-7421, NSError -50, 28 USC § 1331), rare proper nouns, code search, log search, structured product catalogs. If users type the term they actually want, BM25 wins because there's no semantic gap to bridge.
Where it fails hard
Synonyms ("cancel subscription" vs "end membership"), paraphrasing, intent ("how do I make my app faster" vs a doc titled "Performance Tuning Guide"), and short conversational queries. Recall drops because the query and document share zero terms.
Cost profile
Indexing: cheap, ~ms per document. Query: 1-10ms typical. Storage: ~1-2x raw text size. No GPUs, no embedding API. That's why every search system has a BM25 layer somewhere — it's nearly free.
Variants worth knowing
BM25F (per-field weighting — title vs body), BM25+ (adds a lower bound to term-frequency normalization so very long documents aren't unfairly penalized), and SPLADE (learned sparse — a neural model that emits term-weight pairs, retrieved with a BM25-style inverted index). SPLADE bridges sparse and dense and is increasingly used in hybrid stacks.
Semantic Search: Embeddings and Their Failure Modes
Semantic search encodes documents and queries into dense vectors (typically 256-3072 dimensions) using a neural embedding model (OpenAI text-embedding-3, Cohere embed-v3, Voyage, Jina). Retrieval finds nearest neighbors by cosine similarity. It catches paraphrase and intent — and creates new failure modes most teams don't plan for.
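As a sketch of what "nearest neighbors by cosine similarity" means in practice, here is the brute-force version with numpy. Production systems use an approximate index (HNSW and friends) instead of scanning every vector, and the toy arrays below stand in for whatever embedding model you actually call.

```python
import numpy as np

def cosine_top_k(query_vec, doc_vecs, k=5):
    """Return indices of the k nearest documents by cosine similarity.

    Brute force for clarity; production uses an ANN index (HNSW etc.).
    """
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = d @ q                  # cosine similarity against every document
    return np.argsort(-sims)[:k]  # best matches first

# Toy 4-dim "embeddings"; real models emit 256-3072 dims.
docs = np.array([[0.1, 0.9, 0.0, 0.2],
                 [0.8, 0.1, 0.4, 0.0]])
print(cosine_top_k(np.array([0.2, 0.8, 0.1, 0.1]), docs, k=1))  # -> [0]
```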
Where it shines
Conversational queries, intent matching, multilingual retrieval, and cross-vocabulary matching (a query about "cooling" finds a doc that only says "heat dissipation"). On paraphrase-heavy benchmarks like MS MARCO, dense retrievers beat BM25 by 5-15 points of NDCG@10.
Failure mode 1: exact-match drift
User searches for product SKU "XR-7421-B". BM25 nails it instantly. Embedding model thinks "XR-7421-B" is similar to "XR-7420-B" and "XR-7422-B". Returns wrong product, customer churns. Pure semantic search on identifiers is a known anti-pattern.
Failure mode 2: domain shift
Generic embedding models are trained on web data. On highly domain-specific corpora (legal, medical, code, internal jargon), retrieval quality drops 10-30 points vs BM25. Either fine-tune the embedder or stay sparse.
Failure mode 3: long-document drift
Embeddings struggle to represent long documents — meaning gets averaged out. Standard fix: chunk into 200-800 tokens. But then you have a chunking strategy problem (boundaries split context, retrieval returns mid-paragraph fragments). BM25 handles long documents natively.
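A minimal sketch of the standard fix, fixed-size chunks with overlap. The size and overlap defaults are illustrative values within the 200-800 token range above, and the plain token list stands in for real tokenizer output.

```python
def chunk(tokens, size=500, overlap=50):
    """Split a token list into fixed-size chunks with overlap.

    Overlap softens the boundary problem: text split across a chunk
    boundary still appears intact in at least one chunk.
    """
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]

pieces = chunk(open("doc.txt").read().split())  # tune size/overlap per corpus
```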
Hybrid Search: Why RRF Beats Both
The strongest result on most public benchmarks (BEIR, MTEB retrieval, MS MARCO) is hybrid: run BM25 and dense retrieval in parallel, then fuse the rankings. Reciprocal Rank Fusion (RRF) is the standard combiner because it doesn't require score calibration and works out of the box.
RRF math (60-second version)
What it is: Score(d) = Σ 1/(k + rank_i(d)) across retrievers, with k=60 by default. A document at rank 1 in BM25 and rank 5 in dense scores 1/61 + 1/65 ≈ 0.0318. A document at rank 1 in only one retriever scores 1/61 ≈ 0.0164. Documents endorsed by both retrievers float to the top.
PM Implication: No score normalization, no learned weights, no training data. Drop-in fusion that beats both individual retrievers on most workloads. Default to RRF before considering learned rerankers.
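The formula above is only a few lines of code. A direct, minimal implementation (the doc IDs and ranked lists are illustrative):

```python
from collections import defaultdict

def rrf(rankings, k=60):
    """Fuse ranked lists with Reciprocal Rank Fusion.

    rankings: one ranked list of doc IDs per retriever (BM25, dense, ...).
    Rank-based, so no score normalization across retrievers is needed.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["d1", "d2", "d3"]
dense_hits = ["d3", "d1", "d9"]
print(rrf([bm25_hits, dense_hits]))  # d1 and d3, endorsed by both, come first
```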
Real benchmark numbers (BEIR, 2024-2025)
What it is: Across 18 BEIR datasets, BM25 averages ~42 NDCG@10. State-of-the-art dense retrievers (e5-large-v2, BGE) average ~52. RRF fusion of BM25 + dense averages ~56. Hybrid + cross-encoder reranker pushes ~62.
PM Implication: The 4-point lift from RRF on top of dense is "free" — no extra ML, no extra training. Skipping it is the most common mistake in early-stage RAG products.
Add a reranker for the last 5 points
What it is: After hybrid retrieval narrows to top 50-100 candidates, run a cross-encoder reranker (Cohere Rerank, BGE reranker, Jina reranker) over the (query, doc) pairs. 50-200ms added latency, but lifts NDCG another 5-8 points on average.
PM Implication: The standard production stack in 2026: BM25 + dense → RRF → cross-encoder rerank → top 5-10 docs to LLM. Pretending you can skip stages is what makes RAG demos look good and prod look bad.
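One way to wire that final stage, sketched with sentence-transformers' CrossEncoder and a public MS MARCO model as the example; hosted rerankers like Cohere Rerank expose the same (query, doc) to score shape.

```python
from sentence_transformers import CrossEncoder

# A public MS MARCO cross-encoder, used here for illustration.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, candidates, top_n=10):
    """Re-score hybrid candidates with a cross-encoder, keep the best few."""
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, _ in ranked[:top_n]]  # these go to the LLM
```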
Build RAG That Actually Works in Production
The AI PM Masterclass walks through real RAG architecture decisions — including the retrieval-stack tradeoffs every AI PM has to defend. Live, taught by a Salesforce Sr. Director PM.
The Cost and Latency Tradeoffs
Picking a retrieval stack is partly a relevance question and largely an economics question. Real numbers below assume a 10M-document corpus on production-grade infra in 2026.
BM25-only
Indexing: ~$0.0001 per doc one-time. Query: 5-20ms p95. Storage: ~10 GB. Total infra: <$200/mo for moderate QPS. Best for: structured catalogs, log search, dev tools, code search.
Dense-only
Indexing: $1-10 per million tokens (embedding API) + vector DB storage ($30-100/GB-mo for managed). Query: 30-150ms p95 including embedding. Best for: small corpora, conversational queries, prototypes. Avoid for ID-heavy domains.
Hybrid (BM25 + dense + RRF)
Sum of the two indexing costs. Query: max(BM25, dense) + ~5ms RRF, so 50-200ms p95. Best for: most production search, customer support RAG, documentation Q&A. Default choice in 2026.
Hybrid + reranker
Adds 50-200ms reranker latency and ~$1 per 1K reranks. Total p95: 100-400ms. Best for: high-stakes retrieval (legal, medical, enterprise), high-precision use cases. Skip for sub-second latency budgets.
The PM's Retrieval Decision Framework
Walk through these five questions before picking an architecture. They translate directly into roadmap and infra spec.
1. Are queries conversational or keyword-shaped?
Look at 100 real queries. If >70% are exact terms / IDs / proper nouns, default to BM25. If >70% are full sentences or intent statements, you need semantic. Most B2B and consumer search is mixed — that's your hybrid signal.
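If you want to automate that audit over a query log, here is a rough heuristic sketch. The ID pattern and the token-count cutoff are illustrative guesses to tune against your own data, not established thresholds.

```python
import re

def looks_keyword_shaped(query):
    """Rough heuristic: IDs, codes, and very short queries are keyword-shaped."""
    tokens = query.split()
    has_id = any(re.search(r"\d", t) or "-" in t for t in tokens)  # e.g. SKU-7421
    return has_id or len(tokens) <= 4  # cutoff is a guess; tune on your log

queries = ["XR-7421-B", "how do I cancel my subscription"]
share = sum(looks_keyword_shaped(q) for q in queries) / len(queries)
print(f"{share:.0%} keyword-shaped")  # >70% either way points at a default
```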
2. Is the domain general or specialized?
Generic embedding models work well for general English. They fall apart on legal, medical, scientific, and proprietary internal jargon without fine-tuning. Budget for fine-tuning effort, or stay with BM25 plus a heavy synonym list.
3. What's the cost per query at target QPS?
10 QPS sustained on a hybrid + reranker stack: ~$200-500/mo infra + reranker fees. 1000 QPS: $5K-15K/mo. Project the bill at 12-month traffic before committing to dense-heavy stacks.
4. What's the freshness requirement?
BM25 indexes update in seconds. Vector indexes can take minutes-to-hours to fully rebuild on large corpora. If you need real-time updates (live chat, news, marketplace listings), test the re-index path before architecting.
5. Do you have offline eval data?
If you can't measure NDCG@10, MRR, or recall@k on a held-out set of (query, relevant doc) pairs, you can't pick between architectures. Build the eval set FIRST. 200 labeled queries beats six months of debate.
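Each of those metrics is a few lines of code. A minimal sketch, assuming qrels of the form {query: set of relevant doc IDs} and a run of ranked results per query:

```python
import math

def recall_at_k(ranked, relevant, k=10):
    """Fraction of the relevant docs that show up in the top k."""
    return len(set(ranked[:k]) & relevant) / len(relevant)

def mrr(ranked, relevant):
    """Reciprocal rank of the first relevant doc; 0 if none retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked, relevant, k=10):
    """Binary-relevance NDCG@k: discounted gain vs. the ideal ordering."""
    dcg = sum(1 / math.log2(i + 2) for i, d in enumerate(ranked[:k]) if d in relevant)
    ideal = sum(1 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal

def evaluate(run, qrels, k=10):
    """Average the metrics over every labeled query in the eval set."""
    n = len(qrels)
    return {
        f"recall@{k}": sum(recall_at_k(run[q], qrels[q], k) for q in qrels) / n,
        f"ndcg@{k}": sum(ndcg_at_k(run[q], qrels[q], k) for q in qrels) / n,
        "mrr": sum(mrr(run[q], qrels[q]) for q in qrels) / n,
    }
```

Run it once per candidate architecture (BM25-only, dense-only, hybrid, hybrid + rerank) on the same 200 labeled queries, and the decision usually makes itself.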