Cohere Reranking & Production RAG Retrieval Optimization
How to tune retrieval in a production RAG system: wide vector recall, Cohere rerank-v3.5 for precision, conditional skip logic, MMR diversity, and cost-aware gates.
Vector search alone is fast but imprecise. A reranker fixes ranking — but calling it on every request adds ~200 ms and ~$0.002 per call. This guide covers common production patterns for rerank quality without burning budget on obvious hits.
Part two of our RAG series. Start with How to Build a Production RAG Chatbot if you need the foundations.
Who is this for? Engineers who already have a working RAG pipeline and want to improve answer quality, latency, and cost without rewriting everything.
Why rerank at all?
Embedding models optimize for recall — finding anything plausibly related. They compress meaning into a fixed vector, so subtle distinctions ("reset password" vs "change email") get blurred.
A cross-encoder reranker (Cohere rerank-v3.5) scores each query–document pair jointly. It's slower and more expensive per document, but far more accurate at ordering.
The standard production pattern: cast a wide net with vector search, then let the reranker pick the best handful.
| Stage | Goal | Typical k |
|---|---|---|
| Vector search | High recall — don't miss relevant chunks | 30–40 |
| Cohere rerank | High precision — best chunks first | 8–10 |
| MMR | Diversity — avoid duplicate sections | 8 |
| LLM prompt | Context window budget | 8 |
Retrieval funnel: wide vector recall narrows through rerank and MMR to final prompt context
Two-stage retrieval
Stage 1 pulls top-k ≈ 40 from Pinecone by cosine similarity. Stage 2 sends those 40 chunk texts to Cohere rerank and keeps the top 10 by relevance score.
Why 40 → 10 → 8? Vector top-40 catches chunks that embeddings rank poorly but rerankers love (common with procedural docs and rule tables). Rerank to 10 gives MMR room to swap near-duplicates. Final 8 fits most context budgets without drowning the LLM.
Typical latency
| Stage | ~ms |
|---|---|
| Embed query | 80 |
| Pinecone vector search | 120 |
| Cohere rerank (40 docs) | 200 |
| LLM response | 1800 |
Latency per request (ms) — LLM still dominates; rerank adds ~200 ms when invoked
Conditional rerank skip
Cohere rerank costs ~$0.002 per request. On a chatbot doing 50k queries/month, that's $100/month just for reranking — reasonable, but wasteful when vector search already found an obvious match.
Skip rerank when the top Pinecone hit has cosine similarity ≥ 0.85. At that threshold, the best chunk is almost always correct; reranking adds little value.
Always rerank when:
- The turn is a follow-up (conversation history is non-empty)
- Multi-query mode is active (query was rewritten or expanded)
- Top cosine score is below 0.85
Follow-ups like "what about step 2?" embed poorly on their own. Forcing rerank on those turns prevents silent quality drops.
Typical rerank invocation mix: ~35% skipped on high-confidence first turns, ~45% conditional rerank, ~20% forced on follow-ups
Query rewrite heuristics
Follow-up questions need context from prior turns. A cheap LLM (Gemini Flash Lite) rewrites "what about that?" into a standalone search query — but calling it on every turn adds latency and tokens.
Skip rewrite on first-turn, self-contained questions using simple heuristics:
- Question length ≥ 8 words
- No pronouns referring to prior context (
it,that,this,they) - No ordinal references (
step 2,option B,the second one) - Conversation history is empty
When rewrite runs, treat the result as multi-query — always rerank.
MMR selection after rerank
Rerankers love returning multiple chunks from the same document section — five paragraphs about "password reset" when you need one rule chunk and one exception chunk.
Maximal Marginal Relevance (MMR) balances relevance against diversity. After rerank, run MMR with:
| Parameter | Value | Why |
|---|---|---|
k | 8 | Final chunks in prompt |
lambda | 0.85 | Favor relevance; only swap for clearly redundant chunks |
λ = 0.85 keeps related rule chunks (they're semantically similar but content-distinct) while dropping near-duplicate paragraphs from the same section.
Soft no-results gate
Hard gates ("zero vector hits → return error") miss cases where vector search returns chunks but none are actually relevant.
After rerank, check the top rerank score. If it's below 0.05, skip the LLM call entirely and return a helpful "I couldn't find relevant documentation" message.
This saves an expensive LLM call (~$0.01–0.05) on queries the knowledge base can't answer. It's a soft gate — vector search ran, rerank ran (or was skipped on high cosine), but confidence is too low to hallucinate an answer.
Rerank score vs answer quality index — scores below 0.05 rarely produce useful answers
Auto model tier upgrade
Not every question needs your strongest model. Use rerank confidence to pick the LLM tier dynamically:
| Top rerank score | Model tier | Rationale |
|---|---|---|
| ≥ 0.30 | Standard (Flash / 4o-mini) | Context is clear; cheap model suffices |
| 0.10 – 0.30 | Mid (Flash thinking / 4o) | Ambiguous context needs better reasoning |
| < 0.10 (but ≥ 0.05) | Pro (Opus / 4o full) | Weak signal — strongest model extracts what it can |
| < 0.05 | Skip LLM | Soft no-results gate |
Request distribution across model tiers driven by rerank confidence zones
Per-request cost tracking
Log cost components on every request. You can't optimize what you don't measure.
| Component | Typical cost | Notes |
|---|---|---|
| LLM (input + output) | ~70% of total | Dominates; tier selection matters most |
| Embeddings | ~12% | Query embed every turn; doc embed at ingest |
| Pinecone queries | ~15% | Scales with index size and QPS |
| Cohere rerank | ~3% | $0.002/call; ~65% invoked after conditional skip |
Typical per-request cost breakdown for an optimized production RAG pipeline
Track these fields per request:
embedding_tokensand costpinecone_query_unitscohere_rerank_invoked(bool) and cost ($0.002 flat)llm_input_tokens,llm_output_tokens, model ID, and costrewrite_invoked(bool) and rewrite model costtotal_cost_usd
Vector-only vs vector + rerank
| Metric | Vector only (top-8) | Vector (40) + rerank + MMR (8) |
|---|---|---|
| Hit@8 (test set) | ~62% | ~89% |
| Answer quality (human eval) | 3.1 / 5 | 4.3 / 5 |
| Avg latency | ~200 ms retrieval | ~400 ms retrieval |
| Retrieval cost / request | ~$0.0003 | ~$0.0025 |
| Duplicate chunks in prompt | Common | Rare (MMR) |
| Follow-up quality | Poor | Good (forced rerank) |
The quality jump pays for itself in reduced support escalations and fewer "the bot gave wrong info" incidents.
Production code
app/rerank.py — Cohere v2 rerank API with conditional skip logic:
import os
import aiohttp
COHERE_RERANK_URL = "https://api.cohere.com/v2/rerank"
RERANK_MODEL = "rerank-v3.5"
VECTOR_TOP_K = 40
RERANK_TOP_N = 10
FINAL_TOP_K = 8
COSINE_SKIP_THRESHOLD = 0.85
RERANK_MIN_SCORE = 0.05
MMR_LAMBDA = 0.85
def should_run_cohere_rerank(
top_cosine: float,
is_follow_up: bool,
is_multi_query: bool,
) -> bool:
"""Skip Cohere when vector confidence is high on a first-turn query."""
if is_follow_up or is_multi_query:
return True
return top_cosine < COSINE_SKIP_THRESHOLD
async def cohere_rerank(
query: str,
documents: list[dict],
top_n: int = RERANK_TOP_N,
) -> list[dict]:
"""Call Cohere v2 rerank API; return docs sorted by relevance score."""
payload = {
"model": RERANK_MODEL,
"query": query,
"documents": [d["text"] for d in documents],
"top_n": top_n,
}
headers = {
"Authorization": f"Bearer {os.environ['COHERE_API_KEY']}",
"Content-Type": "application/json",
}
async with aiohttp.ClientSession() as session:
async with session.post(COHERE_RERANK_URL, json=payload, headers=headers) as resp:
resp.raise_for_status()
data = await resp.json()
ranked = []
for result in data["results"]:
doc = documents[result["index"]].copy()
doc["rerank_score"] = result["relevance_score"]
ranked.append(doc)
return ranked
def mmr_select(
docs: list[dict],
k: int = FINAL_TOP_K,
lambda_: float = MMR_LAMBDA,
) -> list[dict]:
"""Maximal Marginal Relevance — diversity without dropping related rules."""
if len(docs) <= k:
return docs
selected, remaining = [docs[0]], docs[1:]
while len(selected) < k and remaining:
best_idx, best_score = 0, -1
for i, candidate in enumerate(remaining):
relevance = candidate["rerank_score"]
max_sim = max(
_text_overlap(candidate["text"], s["text"]) for s in selected
)
score = lambda_ * relevance - (1 - lambda_) * max_sim
if score > best_score:
best_score, best_idx = score, i
selected.append(remaining.pop(best_idx))
return selected
def _text_overlap(a: str, b: str) -> float:
"""Jaccard similarity on word sets — fast proxy for chunk redundancy."""
sa, sb = set(a.lower().split()), set(b.lower().split())
if not sa or not sb:
return 0.0
return len(sa & sb) / len(sa | sb)
async def retrieve_and_rank(
query: str,
vector_hits: list[dict],
is_follow_up: bool = False,
is_multi_query: bool = False,
) -> dict:
top_cosine = vector_hits[0]["score"] if vector_hits else 0.0
cost = {"cohere_rerank": 0.0}
if should_run_cohere_rerank(top_cosine, is_follow_up, is_multi_query):
ranked = await cohere_rerank(query, vector_hits)
cost["cohere_rerank"] = 0.002
else:
ranked = vector_hits[:RERANK_TOP_N]
for d in ranked:
d["rerank_score"] = d["score"] # passthrough cosine as proxy
top_score = ranked[0]["rerank_score"] if ranked else 0.0
if top_score < RERANK_MIN_SCORE:
return {"chunks": [], "skip_llm": True, "cost": cost}
final = mmr_select(ranked) if cost["cohere_rerank"] > 0 else ranked[:FINAL_TOP_K]
return {"chunks": final, "skip_llm": False, "top_rerank_score": top_score, "cost": cost}
Query rewrite heuristics — skip Flash Lite on clear first turns:
import re
PRONOUNS = re.compile(r"\b(it|that|this|they|them|those|these|he|she)\b", re.I)
ORDINALS = re.compile(r"\b(step\s+\d|option\s+[a-z]|first|second|third|the\s+other)\b", re.I)
def needs_query_rewrite(message: str, history: list) -> bool:
if not history:
return False # first turn — never rewrite
if len(message.split()) >= 8 and not PRONOUNS.search(message) and not ORDINALS.search(message):
return False # self-contained despite history
return True
Tuning checklist
Must have
- Two-stage retrieval (wide vector → rerank)
- Cohere rerank-v3.5 on follow-ups
- Soft no-results gate (score < 0.05)
- Per-request cost logging
Should have
- Conditional rerank skip (cosine ≥ 0.85)
- MMR after rerank (λ = 0.85)
- Query rewrite with skip heuristics
- Model tier by rerank confidence
Measure
- Hit@8 on 30+ test questions
- Rerank skip rate vs quality delta
- Cost per request by component
- Latency p50 / p95 by stage
Glossary
| Term | Meaning |
|---|---|
| Cross-encoder reranker | Model that scores query–document pairs jointly (Cohere rerank-v3.5) |
| Bi-encoder | Embedding model that encodes query and docs separately (vector search) |
| MMR | Maximal Marginal Relevance — selects diverse yet relevant results |
| top-k | Number of chunks retrieved at each stage |
| Cosine similarity | Vector distance metric; 1.0 = identical, 0.0 = orthogonal |
| Soft no-results gate | Skip LLM when rerank confidence is too low, even if vector hits exist |
| Query rewrite | LLM expands a follow-up into a standalone search query |
| Conditional rerank | Skip rerank API call when vector confidence is already high |
Retrieval is where RAG quality is won or lost. Rerank for precision, skip for cost, MMR for diversity — and measure every dollar.