How to Evaluate RAG Retrieval (Before You Touch the Prompt)

June 2026 · Published by Amar Kumar

Most teams rewrite the system prompt when answers are wrong. In production RAG, the problem is usually retrieval — and no prompt fix helps if the right chunk never reached the model.

This guide covers golden eval sets, hit@k, failure labels, and a weekly loop that actually moves answer quality.

Who is this for? Teams with a working RAG chatbot who want measurable retrieval quality instead of gut feel.

The prompt tuning trap

User says: "The bot gave a wrong answer." Default reaction: rewrite the system prompt.

Better reaction: ask whether the right document chunk appeared in the top-k results.

If retrieval failed                                 | If retrieval succeeded
Fix chunking, embeddings, top-k, rerank, or KB gaps | Then look at prompt, model, or context formatting
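The split above can be sketched as a one-question check. This is a minimal illustration, not the author's tooling: the function name `triage` and the idea of passing a ranked list of doc IDs are assumptions made for the example.

```python
def triage(retrieved_ids: list[str], expected_doc_id: str, k: int = 10) -> str:
    """Say which side of the pipeline to inspect for one bad answer.

    retrieved_ids is the ranked list of doc IDs the retriever returned
    for the question (hypothetical shape, adapt to your retriever).
    """
    if expected_doc_id in retrieved_ids[:k]:
        # The right chunk reached the model: look at prompt, model, formatting.
        return "generation"
    # The right chunk never arrived: look at chunking, embeddings, top-k, rerank, KB gaps.
    return "retrieval"


print(triage(["billing-faq", "account-settings"], "account-settings"))
print(triage(["billing-faq", "shipping"], "account-settings"))
```

Running this check on every reported bad answer before opening the prompt file keeps the two problem classes apart.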

[Figure: typical breakdown of "bad answers" in production RAG; retrieval misses dominate]

What to measure

Focus on retrieval metrics first. Generation quality (LLM-as-judge, human rating) comes second.

Metric               | What it tells you                       | Target to start
hit@k                | Expected source in top-k results        | ≥ 85% at k=10 (after rerank)
MRR                  | How high the first relevant chunk ranks | ≥ 0.7
Rerank lift          | hit@k after rerank minus before         | +10–25 points
Empty retrieval rate | Queries with zero usable chunks         | < 5%
Latency p95          | End-to-end retrieval time               | Track; optimize later
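Two of the metrics above, empty retrieval rate and latency p95, are not covered by the code later in this guide, so here is a rough sketch of both. It assumes that "zero usable chunks" shows up as an empty result list (for example after a score threshold) and uses a nearest-rank percentile; the function names are hypothetical.

```python
def empty_retrieval_rate(results: list[list[str]]) -> float:
    """Fraction of queries whose retrieval returned no chunks at all.

    Assumption: an empty list means "zero usable chunks".
    """
    if not results:
        return 0.0
    return sum(1 for retrieved in results if not retrieved) / len(results)


def p95(latencies_ms: list[float]) -> float:
    """Nearest-rank 95th percentile of end-to-end retrieval latencies."""
    if not latencies_ms:
        raise ValueError("no latency samples")
    ordered = sorted(latencies_ms)
    # ceil(0.95 * n) done in integer arithmetic to avoid float rounding.
    rank = max(1, (95 * len(ordered) + 99) // 100)
    return ordered[rank - 1]
```

If your retriever always returns k chunks, the empty rate needs a different definition of "usable", such as a minimum similarity score.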

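Do not start with BLEU or ROUGE. They score word overlap with a reference answer, so a fluent but wrong answer can still score well, and they say nothing about whether the right chunk was retrieved.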

Priority order: nail hit@k before investing in answer-level metrics

Build a golden eval set

A golden eval set is a fixed list of questions with known correct source documents. Store it in git. Version it. Never edit rows silently.

How many questions?

Start with at least 30, the minimum this guide's checklist asks for, and grow the set from there.

Where questions come from

  1. Real support tickets / chat logs (anonymized)
  2. FAQ and onboarding ("how do I…")
  3. Error codes and SKU lookups
  4. Follow-ups ("what about step 2?", "explain more")
  5. Adversarial: typos, abbreviations, wrong product names

Example row:

{
  "id": "eval-014",
  "question": "How do I reset my password?",
  "expected_doc_id": "account-settings",
  "expected_section": "Password reset",
  "tags": ["account", "password"],
  "notes": "Users often say 'forgot login'"
}
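Because the golden set lives in git and rows should never change silently, a small validation pass before each run can catch broken rows early. The sketch below is one possible check, assuming rows shaped like the example above; `validate_golden_set` and the choice of required fields are illustrative, not part of the original guide.

```python
# Fields the eval needs on every row (assumed minimum; tags and notes stay optional).
REQUIRED_FIELDS = ("id", "question", "expected_doc_id")


def validate_golden_set(cases: list[dict]) -> list[str]:
    """Return human-readable problems; an empty list means the set is usable."""
    problems: list[str] = []
    seen_ids: set[str] = set()
    for i, case in enumerate(cases):
        for field in REQUIRED_FIELDS:
            value = case.get(field)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"row {i}: missing or empty '{field}'")
        case_id = case.get("id")
        if isinstance(case_id, str):
            if case_id in seen_ids:
                problems.append(f"row {i}: duplicate id '{case_id}'")
            seen_ids.add(case_id)
    return problems
```

Failing the eval run when this list is non-empty keeps a malformed row from quietly skewing hit@k.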

Mix question sources — prod failures are the highest-signal additions

Run hit@k eval

hit@k = fraction of questions where the expected doc appears in the top-k retrieved results.

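def hit_at_k(results: list[list[str]], expected: list[str], k: int = 10) -> float:
    """Fraction of questions whose gold doc ID appears in the top-k retrieved IDs."""
    # A length mismatch would silently truncate in zip() and skew the score.
    if len(results) != len(expected):
        raise ValueError("results and expected must have the same length")
    if not expected:
        return 0.0
    # results[i] is the ranked list of doc IDs retrieved for question i.
    hits = sum(gold in retrieved[:k] for retrieved, gold in zip(results, expected))
    return hits / len(expected)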

Run eval after every KB sync, after chunk/embedding changes, and before vs after adding a reranker.
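For the before-vs-after reranker comparison, rerank lift is just the difference between two hit@k runs over the same questions, expressed in points. A minimal sketch, assuming you can capture the ranked doc IDs both before and after reranking; `rerank_lift` is a hypothetical helper, and the compact `hit_at_k` is repeated only so the block runs on its own.

```python
def hit_at_k(results: list[list[str]], expected: list[str], k: int = 10) -> float:
    """Same definition as above, repeated so this block is self-contained."""
    if not expected:
        return 0.0
    hits = sum(gold in retrieved[:k] for retrieved, gold in zip(results, expected))
    return hits / len(expected)


def rerank_lift(
    before: list[list[str]],
    after: list[list[str]],
    expected: list[str],
    k: int = 10,
) -> float:
    """hit@k after rerank minus hit@k before, in percentage points."""
    return 100 * (hit_at_k(after, expected, k) - hit_at_k(before, expected, k))
```

A negative value means the reranker pushed gold docs out of the top-k, which is worth investigating before shipping it.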

[Figure: typical hit@10 lift when adding a cross-encoder reranker after wide vector search]

Failure taxonomy

Label every failed question. After 20 rows, patterns become obvious.

Label            | Symptom                   | Fix
Retrieval miss   | Right doc not in top-k    | Chunking, top-k, rerank, hybrid search
Wrong chunk      | Right doc, wrong section  | Smaller chunks, heading-aware split
Ambiguous query  | Multiple valid topics tied | Disambiguation, clarifying question
KB gap           | Answer not in docs        | Add content; don't prompt-engineer
Generation error | Good context, wrong answer | Prompt, model tier, template
Out of scope     | Not in KB domain          | Router or polite refusal
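Some of these labels can be pre-filled mechanically before a human looks at the row. The sketch below is only a first pass under stated assumptions: each retrieved chunk is represented as a `(doc_id, section)` pair, and the golden row carries `expected_doc_id` and `expected_section`. It can separate "Retrieval miss" from "Wrong chunk"; the remaining labels (KB gap, ambiguous query, generation error, out of scope) still need human judgment. The function name is hypothetical.

```python
from collections import Counter


def first_pass_label(
    retrieved: list[tuple[str, str]],
    expected_doc_id: str,
    expected_section: str,
    k: int = 10,
) -> str:
    """Pre-label one failed question from its ranked (doc_id, section) pairs."""
    top = retrieved[:k]
    if not any(doc_id == expected_doc_id for doc_id, _ in top):
        return "Retrieval miss"
    if (expected_doc_id, expected_section) not in top:
        return "Wrong chunk"
    # Right chunk was retrieved, so the cause lies elsewhere: a person decides.
    return "Needs review"


def label_counts(labels: list[str]) -> Counter:
    """Tally labels so the dominant failure pattern is visible at a glance."""
    return Counter(labels)
```

Sorting the tally by count gives a quick view of which fix in the table is likely to pay off first.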

Teams that log failure labels spend less time on prompt rewrites that don't help

Fix order that works

Random tuning wastes weeks. This order matches what actually moves hit@k:

  1. KB completeness
  2. Chunking
  3. Embedding parity
  4. top-k ↑
  5. Reranker
  6. Query rewrite
  7. Hybrid BM25
  8. Prompt / model

[Figure: relative impact on hit@k; early levers (chunking, rerank) beat prompt tweaks]

Weekly eval loop

  1. Monday: Run golden set → export hit@k + failure list
  2. Tuesday: Triage top 5 retrieval misses
  3. Wednesday: Fix + re-index if needed
  4. Thursday: Re-run eval
  5. Friday: Add 3 new questions from prod failures

[Figure: example of hit@10 climbing over four weekly eval cycles after chunk + rerank fixes]

Sample eval script

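import json
from your_app.retrieve import retrieve  # replace with your own retrieval entry point

def eval_retrieval(cases: list[dict], k: int = 10) -> dict:
    """Compute hit@k and MRR over the golden set."""
    if not cases:
        raise ValueError("golden set is empty")
    hits, mrr_sum = 0, 0.0
    for case in cases:
        docs = retrieve(case["question"], top_k=k)
        # Several chunks can share a doc_id; index() below picks the best-ranked one.
        doc_ids = [d["metadata"]["doc_id"] for d in docs][:k]
        gold = case["expected_doc_id"]
        if gold in doc_ids:
            hits += 1
            mrr_sum += 1.0 / (doc_ids.index(gold) + 1)  # reciprocal of the 1-based rank
    n = len(cases)
    # MRR is truncated at k here: a miss contributes 0.
    return {"hit@k": hits / n, "mrr": mrr_sum / n, "n": n}

if __name__ == "__main__":
    # Use a context manager so the file handle is closed after loading.
    with open("eval/golden.json", encoding="utf-8") as f:
        cases = json.load(f)
    print(eval_retrieval(cases, k=10))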

Wire into CI: fail the build if hit@k drops more than 2 points vs baseline.
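One way to express that gate is a function that turns the current and baseline scores into a process exit code. This is a sketch under assumptions: scores are fractions between 0 and 1, "2 points" means 2 percentage points, and the names `within_budget` and `ci_exit_code` are made up for the example.

```python
def within_budget(current: float, baseline: float, max_drop_points: float = 2.0) -> bool:
    """True if hit@k has not dropped more than max_drop_points vs the baseline."""
    drop_points = (baseline - current) * 100
    # Small tolerance so a drop of exactly 2 points is not failed by float rounding.
    return drop_points <= max_drop_points + 1e-9


def ci_exit_code(current: float, baseline: float, max_drop_points: float = 2.0) -> int:
    """0 lets the build pass, 1 fails it."""
    return 0 if within_budget(current, baseline, max_drop_points) else 1
```

In a CI job, the eval script could end with `sys.exit(ci_exit_code(metrics["hit@k"], baseline))`, where `baseline` is read from wherever you recorded it.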

Checklist

Must have

  • 30+ golden questions
  • hit@k in one command
  • Baseline recorded
  • Failure labels on misses

Should have

  • Eval after KB sync
  • Vector vs rerank metrics
  • Weekly triage ritual
  • Prod → golden set pipeline

Later

  • LLM-as-judge on answers
  • Per-topic hit@k
  • CI regression alerts
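Per-topic hit@k can reuse the `tags` field from the golden set rows. A rough sketch, assuming each row has `expected_doc_id` and an optional `tags` list, and that `results[i]` holds the ranked doc IDs for `cases[i]`; the function name is hypothetical.

```python
from collections import defaultdict


def hit_at_k_by_tag(
    cases: list[dict], results: list[list[str]], k: int = 10
) -> dict[str, float]:
    """hit@k computed separately for every tag in the golden set."""
    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for case, retrieved in zip(cases, results):
        hit = case["expected_doc_id"] in retrieved[:k]
        # A question with several tags counts toward each of them.
        for tag in case.get("tags", []):
            totals[tag] += 1
            hits[tag] += int(hit)
    return {tag: hits[tag] / totals[tag] for tag in totals}
```

Tags with only a handful of questions will swing wildly, so read per-topic numbers alongside their counts.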

Glossary

Term            | Meaning
Golden eval set | Fixed test questions with known correct sources
hit@k           | % of queries where gold doc appears in top-k
MRR             | Mean reciprocal rank; rewards higher placement
Retrieval miss  | Expected doc not returned in top-k
Rerank lift     | hit@k improvement after cross-encoder rerank
Regression      | hit@k drops after a change; roll back

Measure retrieval first. Prompts are the last lever, not the first.
