How to Evaluate RAG Retrieval (Before You Touch the Prompt)

June 2026 · Published by Amar Kumar

Most teams rewrite the system prompt when answers are wrong. In production RAG, the problem is usually retrieval — and no prompt fix helps if the right chunk never reached the model.

This guide covers golden eval sets, hit@k, failure labels, and a weekly loop that actually moves answer quality.

Who is this for? Teams with a working RAG chatbot who want measurable retrieval quality instead of gut feel.

The prompt tuning trap

User says: "The bot gave a wrong answer." Default reaction: rewrite the system prompt.

Better reaction: ask, "Did the right document chunk appear in the top-k results?"

If retrieval failed	If retrieval succeeded
Fix chunking, embeddings, top-k, rerank, or KB gaps	Then look at prompt, model, or context formatting

Typical breakdown of "bad answers" in production RAG — retrieval misses dominate

What to measure

Focus on retrieval metrics first. Generation quality (LLM-as-judge, human rating) comes second.

Metric	What it tells you	Target to start
hit@k	Expected source in top-k results	≥ 85% at k=10 (after rerank)
MRR	How high the first relevant chunk ranks	≥ 0.7
Rerank lift	hit@k after rerank minus before	+10–25 points
Empty retrieval rate	Queries with zero usable chunks	< 5%
Latency p95	End-to-end retrieval time	Track; optimize later
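To make the MRR and empty-retrieval rows concrete, here is a minimal sketch. It assumes each query's results are already reduced to a ranked list of doc IDs, and that "zero usable chunks" means the list is empty after whatever score threshold you apply. The function names are illustrative, not from any library.

```python
from statistics import mean


def reciprocal_rank(retrieved: list[str], gold: str) -> float:
    """1 / rank of the first occurrence of the gold doc, or 0.0 if it is absent."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id == gold:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(results: list[list[str]], expected: list[str]) -> float:
    """Average reciprocal rank over all queries; misses contribute 0."""
    if not expected:
        return 0.0
    return mean(reciprocal_rank(r, g) for r, g in zip(results, expected))


def empty_retrieval_rate(results: list[list[str]]) -> float:
    """Fraction of queries that came back with no chunks at all (assumed definition)."""
    if not results:
        return 0.0
    return sum(1 for r in results if not r) / len(results)
```

A gold doc at rank 1 scores 1.0, at rank 2 scores 0.5, and a miss scores 0, so an MRR near 0.7 roughly means the gold doc usually sits in the first or second position.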

Do not start with BLEU or ROUGE: they score overlap between the generated answer and a reference text, so a retrieval failure can hide behind a fluent wrong answer.

Priority order: nail hit@k before investing in answer-level metrics

Build a golden eval set

A golden eval set is a fixed list of questions with known correct source documents. Store it in git. Version it. Never edit rows silently.

How many questions?

Minimum: 30 (main topics)
Good: 50–100 (edge cases included)
Always growing: add every user-reported failure

Where questions come from

Real support tickets / chat logs (anonymized)
FAQ and onboarding ("how do I…")
Error codes and SKU lookups
Follow-ups ("what about step 2?", "explain more")
Adversarial: typos, abbreviations, wrong product names

{
  "id": "eval-014",
  "question": "How do I reset my password?",
  "expected_doc_id": "account-settings",
  "expected_section": "Password reset",
  "tags": ["account", "password"],
  "notes": "Users often say 'forgot login'"
}
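Since the set lives in git and rows should never change silently, a small validation pass before each run can catch broken rows early. The sketch below is one possible check. Treating `id`, `question`, and `expected_doc_id` as the required fields is an assumption based on the example row above.

```python
# Assumed minimum schema, based on the example row; adjust to your own set.
REQUIRED_FIELDS = {"id", "question", "expected_doc_id"}


def validate_golden_set(cases: list[dict]) -> list[str]:
    """Return human-readable problems; an empty list means the set looks sane."""
    problems = []
    seen_ids = set()
    for index, case in enumerate(cases):
        missing = REQUIRED_FIELDS - case.keys()
        if missing:
            problems.append(f"row {index}: missing fields {sorted(missing)}")
        case_id = case.get("id")
        if case_id is not None:
            if case_id in seen_ids:
                problems.append(f"row {index}: duplicate id {case_id!r}")
            seen_ids.add(case_id)
        # Only flag blank questions here; absent ones are reported above.
        if "question" in case and not str(case["question"]).strip():
            problems.append(f"row {index}: empty question")
    return problems
```

Running this in the same command as the eval keeps a malformed row from quietly shrinking or skewing the metric.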

Mix question sources — prod failures are the highest-signal additions

Run hit@k eval

hit@k = fraction of questions where the expected doc appears in the top-k retrieved results.

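def hit_at_k(results: list[list[str]], expected: list[str], k: int = 10) -> float:
    """Fraction of questions whose gold doc ID appears in the top-k retrieved IDs."""
    if len(results) != len(expected):
        raise ValueError("results and expected must have the same length")
    if not expected:
        return 0.0
    # Count a hit when the gold doc shows up anywhere in the first k results.
    hits = sum(gold in retrieved[:k] for retrieved, gold in zip(results, expected))
    return hits / len(expected)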

Run eval after every KB sync, after chunk/embedding changes, and before vs after adding a reranker.
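For the before-and-after reranker comparison, the rerank lift from the metrics table can be computed from two runs over the same golden set. This is a sketch under assumptions: `vector_only` and `reranked` are hypothetical names for the ranked doc-ID lists from each pipeline, and lift is reported in percentage points.

```python
def hit_rate(results: list[list[str]], expected: list[str], k: int) -> float:
    """Fraction of queries whose gold doc ID is within the first k results."""
    if len(results) != len(expected):
        raise ValueError("results and expected must have the same length")
    if not expected:
        return 0.0
    hits = sum(gold in retrieved[:k] for retrieved, gold in zip(results, expected))
    return hits / len(expected)


def rerank_lift(
    vector_only: list[list[str]],
    reranked: list[list[str]],
    expected: list[str],
    k: int = 10,
) -> float:
    """hit@k after rerank minus hit@k before, in percentage points."""
    before = hit_rate(vector_only, expected, k)
    after = hit_rate(reranked, expected, k)
    return round(100.0 * (after - before), 2)
```

Both runs must use the same questions and the same k, otherwise the difference mixes the reranker's effect with a change in the test itself.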

Typical hit@10 lift when adding a cross-encoder reranker after wide vector search

Failure taxonomy

Label every failed question. After 20 rows, patterns become obvious.

Label	Symptom	Fix
Retrieval miss	Right doc not in top-k	Chunking, top-k, rerank, hybrid search
Wrong chunk	Right doc, wrong section	Smaller chunks, heading-aware split
Ambiguous query	Multiple valid topics tied	Disambiguation, clarifying question
KB gap	Answer not in docs	Add content — don't prompt-engineer
Generation error	Good context, wrong answer	Prompt, model tier, template
Out of scope	Not in KB domain	Router or polite refusal
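Two of these labels can be suggested automatically from retrieval output, which speeds up triage. The sketch below assumes each retrieved chunk is a dict with `metadata.doc_id` and a `metadata.section` key (the `section` key is an assumption), and that golden rows carry `expected_section` as in the example row. The other labels still need a human to decide.

```python
from collections import Counter


def prelabel(case: dict, chunks: list[dict]) -> str:
    """Suggest a label from retrieval output alone; a human confirms it in triage."""
    doc_hits = [c for c in chunks if c["metadata"]["doc_id"] == case["expected_doc_id"]]
    if not doc_hits:
        # Could also turn out to be a KB gap or out of scope on inspection.
        return "retrieval_miss"
    wanted = case.get("expected_section")
    if wanted and all(c["metadata"].get("section") != wanted for c in doc_hits):
        return "wrong_chunk"
    # Right doc and section were retrieved; if the answer was still wrong,
    # suspect a generation error.
    return "retrieved_ok"


def label_counts(labels: list[str]) -> list[tuple[str, int]]:
    """Labels sorted by frequency, most common first."""
    return Counter(labels).most_common()
```

Sorting the counts is what surfaces the pattern: if one label dominates after 20 rows, its Fix column is where the week's effort should go.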

Teams that log failure labels spend less time on prompt rewrites that don't help

Fix order that works

Random tuning wastes weeks. This order matches what actually moves hit@k:

KB completeness → Chunking → Embedding parity → Higher top-k → Reranker → Query rewrite → Hybrid BM25 → Prompt / model

Relative impact on hit@k — early levers (chunking, rerank) beat prompt tweaks

Weekly eval loop

Monday: Run golden set → export hit@k + failure list
Tuesday: Triage top 5 retrieval misses
Wednesday: Fix + re-index if needed
Thursday: Re-run eval
Friday: Add 3 new questions from prod failures
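The Friday step is easy to skip unless adding a row is cheap. Here is one way it might look, assuming IDs follow the `eval-014` pattern from the example row. The `prod-failure` tag and the helper names are illustrative choices, not part of any existing tool.

```python
def next_eval_id(cases: list[dict], prefix: str = "eval-") -> str:
    """Next free ID, assuming IDs look like 'eval-014' (zero-padded counter)."""
    numbers = [
        int(c["id"][len(prefix):])
        for c in cases
        if c["id"].startswith(prefix) and c["id"][len(prefix):].isdigit()
    ]
    return f"{prefix}{max(numbers, default=0) + 1:03d}"


def add_prod_failure(
    cases: list[dict], question: str, expected_doc_id: str, notes: str = ""
) -> list[dict]:
    """Return a new list with the failure appended; existing rows stay untouched."""
    normalized = question.strip().lower()
    if any(c["question"].strip().lower() == normalized for c in cases):
        return list(cases)  # question already covered, nothing to add
    new_case = {
        "id": next_eval_id(cases),
        "question": question.strip(),
        "expected_doc_id": expected_doc_id,
        "tags": ["prod-failure"],  # hypothetical tag to track where rows came from
        "notes": notes,
    }
    return [*cases, new_case]
```

Because the function only appends, the git diff for the golden set shows new rows and nothing else, which matches the rule about never editing rows silently.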

Example: hit@10 climbing over four weekly eval cycles after chunk + rerank fixes

Sample eval script

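import json
from your_app.retrieve import retrieve  # placeholder: your own retrieval function

def eval_retrieval(cases: list[dict], k: int = 10) -> dict:
    """Compute document-level hit@k and MRR over the golden set."""
    hits, mrr_sum = 0, 0.0
    for case in cases:
        docs = retrieve(case["question"], top_k=k)
        doc_ids = [d["metadata"]["doc_id"] for d in docs]
        gold = case["expected_doc_id"]
        if gold in doc_ids:
            hits += 1
            # index() finds the first (highest-ranked) chunk from the gold doc
            mrr_sum += 1.0 / (doc_ids.index(gold) + 1)
    n = len(cases)
    if n == 0:
        # An empty golden set would otherwise raise ZeroDivisionError
        return {"hit@k": 0.0, "mrr": 0.0, "n": 0}
    return {"hit@k": hits / n, "mrr": mrr_sum / n, "n": n}

if __name__ == "__main__":
    with open("eval/golden.json", encoding="utf-8") as f:
        cases = json.load(f)
    print(eval_retrieval(cases, k=10))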

Wire into CI: fail the build if hit@k drops more than 2 percentage points against the recorded baseline.
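The gate itself can be a few lines. This sketch assumes hit@k is stored as a fraction between 0 and 1 and that "points" means percentage points. How the baseline is stored and loaded is left to your setup.

```python
def regression_points(baseline: float, current: float) -> float:
    """Drop in hit@k in percentage points; positive means the metric got worse."""
    # Rounding avoids float noise such as 2.0000000000000018 at the threshold.
    return round((baseline - current) * 100.0, 6)


def passes_gate(baseline: float, current: float, max_drop_points: float = 2.0) -> bool:
    """True if hit@k did not drop by more than the allowed number of points."""
    return regression_points(baseline, current) <= max_drop_points
```

In a CI job, something like `sys.exit(0 if passes_gate(baseline, current) else 1)` turns the check into a failing build. Improvements always pass, since the drop is then negative.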

Checklist

Must have

30+ golden questions
hit@k in one command
Baseline recorded
Failure labels on misses

Should have

Eval after KB sync
Vector vs rerank metrics
Weekly triage ritual
Prod → golden set pipeline

Later

LLM-as-judge on answers
Per-topic hit@k
CI regression alerts
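Per-topic hit@k can reuse the `tags` field from the golden rows once you get to it. The sketch below assumes tags are a reasonable stand-in for topics and that `results` holds one ranked doc-ID list per case, in the same order as the cases.

```python
from collections import defaultdict


def hit_at_k_by_tag(
    cases: list[dict], results: list[list[str]], k: int = 10
) -> dict[str, float]:
    """hit@k computed separately for every tag that appears in the golden set."""
    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for case, retrieved in zip(cases, results):
        found = case["expected_doc_id"] in retrieved[:k]
        # A case with several tags counts toward each of them.
        for tag in case.get("tags", []):
            totals[tag] += 1
            hits[tag] += int(found)
    return {tag: hits[tag] / totals[tag] for tag in totals}
```

A healthy overall number can hide one topic that fails most of the time, and a breakdown like this is a cheap way to spot it.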

Glossary

Term	Meaning
Golden eval set	Fixed test questions with known correct sources
hit@k	% of queries where gold doc appears in top-k
MRR	Mean reciprocal rank — rewards higher placement
Retrieval miss	Expected doc not returned in top-k
Rerank lift	hit@k improvement after cross-encoder rerank
Regression	hit@k drops after a change — roll back

Measure retrieval first. Prompts are the last lever, not the first.