How to Evaluate RAG Retrieval (Before You Touch the Prompt)
Most teams rewrite the system prompt when answers are wrong. In production RAG, the problem is usually retrieval — and no prompt fix helps if the right chunk never reached the model.
This guide covers golden eval sets, hit@k, failure labels, and a weekly loop that actually moves answer quality.
Who is this for? Teams with a working RAG chatbot who want measurable retrieval quality instead of gut feel.
The prompt tuning trap
User says: "The bot gave a wrong answer." Default reaction: rewrite the system prompt.
Better reaction: ask, "Did the right document chunk appear in the top-k results?"
| If retrieval failed | If retrieval succeeded |
|---|---|
| Fix chunking, embeddings, top-k, rerank, or KB gaps | Then look at prompt, model, or context formatting |
Typical breakdown of "bad answers" in production RAG — retrieval misses dominate
What to measure
Focus on retrieval metrics first. Generation quality (LLM-as-judge, human rating) comes second.
| Metric | What it tells you | Target to start |
|---|---|---|
| hit@k | Expected source in top-k results | ≥ 85% at k=10 (after rerank) |
| MRR | How high the first relevant chunk ranks | ≥ 0.7 |
| Rerank lift | hit@k after rerank minus before | +10–25 points |
| Empty retrieval rate | Queries with zero usable chunks | < 5% |
| Latency p95 | End-to-end retrieval time | Track; optimize later |
Do not start with BLEU or ROUGE — they hide retrieval failures behind fluent wrong answers.
Priority order: nail hit@k before investing in answer-level metrics
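As a rough illustration of two rows in the table above, here is a minimal sketch of MRR and empty retrieval rate. The function names are hypothetical, and "usable" is simplified here to "at least one chunk came back"; a real pipeline might also apply a similarity threshold.

```python
def reciprocal_rank(retrieved: list[str], gold: str) -> float:
    """1 / rank of the gold doc (1-based), or 0.0 if it was not retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id == gold:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(results: list[list[str]], expected: list[str]) -> float:
    """Average reciprocal rank over all questions (0.0 for an empty set)."""
    if not expected:
        return 0.0
    total = sum(reciprocal_rank(r, g) for r, g in zip(results, expected))
    return total / len(expected)


def empty_retrieval_rate(results: list[list[str]]) -> float:
    """Fraction of queries that returned no chunks at all (simplifying assumption)."""
    if not results:
        return 0.0
    return sum(1 for r in results if not r) / len(results)


# Toy data: gold doc at rank 1, at rank 2, and one query with nothing retrieved.
results = [["a", "b"], ["c", "d"], []]
expected = ["a", "d", "e"]
print(mean_reciprocal_rank(results, expected))  # (1 + 0.5 + 0) / 3 = 0.5
print(empty_retrieval_rate(results))
```

Keeping these as small pure functions over lists of doc IDs makes them easy to reuse for both the vector-only and the reranked result lists.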
Build a golden eval set
A golden eval set is a fixed list of questions with known correct source documents. Store it in git. Version it. Never edit rows silently.
How many questions?
- Minimum: 30 (main topics)
- Good: 50–100 (edge cases included)
- Always growing: add every user-reported failure
Where questions come from
- Real support tickets / chat logs (anonymized)
- FAQ and onboarding ("how do I…")
- Error codes and SKU lookups
- Follow-ups ("what about step 2?", "explain more")
- Adversarial: typos, abbreviations, wrong product names
{
  "id": "eval-014",
  "question": "How do I reset my password?",
  "expected_doc_id": "account-settings",
  "expected_section": "Password reset",
  "tags": ["account", "password"],
  "notes": "Users often say 'forgot login'"
}
Mix question sources — prod failures are the highest-signal additions
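Because the golden set lives in git and rows should never change silently, a small validation step before each run can catch accidental damage. The sketch below assumes the file is a list of objects shaped like the example above; `validate_golden_set` and the choice of required fields are illustrative, not a fixed schema.

```python
# Fields every row needs for a retrieval eval (an assumption based on the example row).
REQUIRED_FIELDS = {"id", "question", "expected_doc_id"}


def validate_golden_set(cases: list[dict]) -> list[str]:
    """Return human-readable problems; an empty list means the set looks sane."""
    problems = []
    seen_ids = set()
    for i, case in enumerate(cases):
        missing = REQUIRED_FIELDS - set(case)
        if missing:
            problems.append(f"row {i}: missing fields {sorted(missing)}")
        case_id = case.get("id")
        if case_id is not None:
            if case_id in seen_ids:
                problems.append(f"row {i}: duplicate id {case_id!r}")
            seen_ids.add(case_id)
    return problems


# Toy rows: the second reuses an id, the third lacks expected_doc_id.
cases = [
    {"id": "eval-014", "question": "How do I reset my password?",
     "expected_doc_id": "account-settings"},
    {"id": "eval-014", "question": "Where is my invoice?",
     "expected_doc_id": "billing"},
    {"id": "eval-015", "question": "What does error E42 mean?"},
]
for problem in validate_golden_set(cases):
    print(problem)
```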
Run hit@k eval
hit@k = fraction of questions where the expected doc appears in the top-k retrieved results.
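def hit_at_k(results: list[list[str]], expected: list[str], k: int = 10) -> float:
    """Fraction of questions whose gold doc ID appears in the top-k retrieved IDs."""
    if len(results) != len(expected):
        raise ValueError("results and expected must have the same length")
    hits = 0
    for retrieved, gold in zip(results, expected):
        if gold in retrieved[:k]:
            hits += 1
    return hits / len(expected) if expected else 0.0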
Run eval after every KB sync, after chunk/embedding changes, and before vs after adding a reranker.
Typical hit@10 lift when adding a cross-encoder reranker after wide vector search
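To compare before and after a reranker, one option is to run the same golden questions through both pipelines and subtract hit@k at the same k. The sketch below repeats a compact `hit_at_k` so it runs on its own; `rerank_lift` is a hypothetical helper and the doc IDs are toy data.

```python
def hit_at_k(results: list[list[str]], expected: list[str], k: int = 10) -> float:
    """Fraction of questions whose gold doc ID is in the top-k results."""
    if not expected:
        return 0.0
    hits = sum(1 for retrieved, gold in zip(results, expected) if gold in retrieved[:k])
    return hits / len(expected)


def rerank_lift(before: list[list[str]], after: list[list[str]],
                expected: list[str], k: int = 10) -> float:
    """hit@k after rerank minus hit@k before, in percentage points."""
    return 100 * (hit_at_k(after, expected, k) - hit_at_k(before, expected, k))


expected = ["d1", "d2", "d3", "d4"]
# Vector search only: d2 sits at rank 3, so it misses at k=2; d4 is never retrieved.
before = [["x", "d1"], ["x", "y", "d2"], ["d3"], ["x", "y", "z"]]
# After reranking the same candidates, d2 moves to rank 1.
after = [["d1", "x"], ["d2", "x", "y"], ["d3"], ["x", "y", "z"]]
print(rerank_lift(before, after, expected, k=2))  # 100 * (0.75 - 0.5) = 25.0
```

Note that a reranker can only promote documents the first stage already returned: the fourth question stays a miss in both runs, which points to chunking or KB coverage rather than ranking.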
Failure taxonomy
Label every failed question. After 20 rows, patterns become obvious.
| Label | Symptom | Fix |
|---|---|---|
| Retrieval miss | Right doc not in top-k | Chunking, top-k, rerank, hybrid search |
| Wrong chunk | Right doc, wrong section | Smaller chunks, heading-aware split |
| Ambiguous query | Multiple valid topics tied | Disambiguation, clarifying question |
| KB gap | Answer not in docs | Add content — don't prompt-engineer |
| Generation error | Good context, wrong answer | Prompt, model tier, template |
| Out of scope | Not in KB domain | Router or polite refusal |
Teams that log failure labels spend less time on prompt rewrites that don't help
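The first two labels in the table can be pre-filled mechanically, which leaves human triage time for the rest. This is a sketch under the assumption that each retrieved chunk carries a `(doc_id, section)` pair; the `prelabel` name and label strings are illustrative. Ambiguous query, KB gap, generation error, and out of scope still need a person to judge.

```python
from collections import Counter


def prelabel(retrieved: list[tuple[str, str]], gold_doc: str,
             gold_section: str, k: int = 10) -> str:
    """Return 'retrieval_miss', 'wrong_chunk', or 'hit' for one question."""
    top = retrieved[:k]
    if not any(doc == gold_doc for doc, _ in top):
        return "retrieval_miss"  # right doc not in top-k
    if not any(doc == gold_doc and sec == gold_section for doc, sec in top):
        return "wrong_chunk"     # right doc, wrong section
    return "hit"


# Toy rows: (retrieved chunks, expected doc, expected section)
rows = [
    ([("billing", "Refunds")], "account-settings", "Password reset"),
    ([("account-settings", "Two-factor auth")], "account-settings", "Password reset"),
    ([("account-settings", "Password reset")], "account-settings", "Password reset"),
]
labels = [prelabel(chunks, doc, section) for chunks, doc, section in rows]
print(Counter(labels).most_common())
```

Tallying labels with `Counter` after each run gives a quick view of which failure type to attack first.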
Fix order that works
Random tuning wastes weeks. Work the levers in order of their impact on hit@k: retrieval-side fixes such as chunking and reranking first, prompt tweaks last.
Relative impact on hit@k — early levers (chunking, rerank) beat prompt tweaks
Weekly eval loop
- Monday: Run golden set → export hit@k + failure list
- Tuesday: Triage top 5 retrieval misses
- Wednesday: Fix + re-index if needed
- Thursday: Re-run eval
- Friday: Add 3 new questions from prod failures
Example: hit@10 climbing over four weekly eval cycles after chunk + rerank fixes
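For the Friday step, a small helper can append production failures without touching existing rows. This sketch assumes every ID follows the `eval-NNN` pattern from the example row and treats questions as duplicates when they match after trimming and lowercasing; `add_cases` is a hypothetical name.

```python
def add_cases(golden: list[dict], new_cases: list[dict]) -> list[dict]:
    """Return a new list with new_cases appended; existing rows are left unmodified."""
    known = {c["question"].strip().lower() for c in golden}
    # Continue numbering after the highest existing eval-NNN id.
    next_num = max((int(c["id"].split("-")[1]) for c in golden), default=0) + 1
    merged = list(golden)
    for case in new_cases:
        key = case["question"].strip().lower()
        if key in known:
            continue  # already covered; skip rather than overwrite
        merged.append({**case, "id": f"eval-{next_num:03d}"})
        known.add(key)
        next_num += 1
    return merged


golden = [{"id": "eval-014", "question": "How do I reset my password?",
           "expected_doc_id": "account-settings"}]
from_prod = [
    {"question": "how do I reset my password? ", "expected_doc_id": "account-settings"},
    {"question": "Where is my invoice?", "expected_doc_id": "billing"},
]
merged = add_cases(golden, from_prod)
print([c["id"] for c in merged])  # ['eval-014', 'eval-015']
```

Returning a new list instead of mutating in place keeps the change visible as a plain addition in the git diff.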
Sample eval script
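import json

from your_app.retrieve import retrieve  # placeholder: swap in your own retrieval function


def eval_retrieval(cases: list[dict], k: int = 10) -> dict:
    """Compute hit@k and MRR over the golden set."""
    hits, mrr_sum = 0, 0.0
    for case in cases:
        docs = retrieve(case["question"], top_k=k)
        doc_ids = [d["metadata"]["doc_id"] for d in docs]
        gold = case["expected_doc_id"]
        if gold in doc_ids:
            hits += 1
            # Reciprocal rank of the first occurrence of the gold doc (1-based).
            mrr_sum += 1.0 / (doc_ids.index(gold) + 1)
    n = len(cases)
    if n == 0:
        # Avoid dividing by zero on an empty golden set.
        return {"hit@k": 0.0, "mrr": 0.0, "n": 0}
    return {"hit@k": hits / n, "mrr": mrr_sum / n, "n": n}


if __name__ == "__main__":
    with open("eval/golden.json") as f:
        cases = json.load(f)
    print(eval_retrieval(cases, k=10))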
Wire into CI: fail the build if hit@k drops more than 2 points vs baseline.
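One possible shape for that CI gate is sketched below. It assumes the eval script's output and the recorded baseline are each saved as JSON with a `"hit@k"` fraction, and that "2 points" means percentage points; the file layout and the `check_regression` and `gate` names are hypothetical.

```python
import json


def check_regression(current: float, baseline: float, max_drop_points: float = 2.0) -> bool:
    """True if hit@k (fractions in [0, 1]) dropped by no more than max_drop_points."""
    # Round to absorb floating-point noise right at the threshold.
    drop_points = round((baseline - current) * 100, 6)
    return drop_points <= max_drop_points


def gate(metrics_path: str, baseline_path: str) -> int:
    """Return a process exit code: 0 to pass the build, 1 to fail it."""
    with open(metrics_path) as f:
        current = json.load(f)["hit@k"]
    with open(baseline_path) as f:
        baseline = json.load(f)["hit@k"]
    ok = check_regression(current, baseline)
    print(f"hit@k current={current:.3f} baseline={baseline:.3f} -> {'ok' if ok else 'REGRESSION'}")
    return 0 if ok else 1


print(check_regression(0.83, 0.85))  # drop of exactly 2 points: True
print(check_regression(0.82, 0.85))  # drop of 3 points: False
```

In a CI job, the script would end with `sys.exit(gate(...))`, pointing at wherever the metrics and baseline files are stored, so that a non-zero exit code fails the build.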
Checklist
Must have
- 30+ golden questions
- hit@k in one command
- Baseline recorded
- Failure labels on misses
Should have
- Eval after KB sync
- Vector vs rerank metrics
- Weekly triage ritual
- Prod → golden set pipeline
Later
- LLM-as-judge on answers
- Per-topic hit@k
- CI regression alerts
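Per-topic hit@k can reuse the `tags` field from the golden set rows. The following is a minimal sketch, assuming each case may carry several tags and counts toward every one of them; `hit_at_k_by_tag` is a hypothetical helper.

```python
from collections import defaultdict


def hit_at_k_by_tag(cases: list[dict], results: list[list[str]], k: int = 10) -> dict[str, float]:
    """hit@k computed separately for each tag found in the golden set."""
    totals: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    for case, retrieved in zip(cases, results):
        hit = case["expected_doc_id"] in retrieved[:k]
        for tag in case.get("tags", []):
            totals[tag] += 1
            hits[tag] += int(hit)
    return {tag: hits[tag] / totals[tag] for tag in totals}


cases = [
    {"expected_doc_id": "account-settings", "tags": ["account", "password"]},
    {"expected_doc_id": "account-profile", "tags": ["account"]},
    {"expected_doc_id": "billing", "tags": ["billing"]},
]
results = [["account-settings"], ["faq"], ["billing", "faq"]]
print(hit_at_k_by_tag(cases, results, k=10))
# {'account': 0.5, 'password': 1.0, 'billing': 1.0}
```

A per-tag breakdown like this can show that one topic is dragging down the overall number even when the aggregate hit@k looks healthy.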
Glossary
| Term | Meaning |
|---|---|
| Golden eval set | Fixed test questions with known correct sources |
| hit@k | % of queries where gold doc appears in top-k |
| MRR | Mean reciprocal rank — rewards higher placement |
| Retrieval miss | Expected doc not returned in top-k |
| Rerank lift | hit@k improvement after cross-encoder rerank |
| Regression | hit@k drops after a change — roll back |
Measure retrieval first. Prompts are the last lever, not the first.