Cohere Reranking & Production RAG Retrieval Optimization

June 2026 · Published by Amar Kumar

How to tune retrieval in a production RAG system: wide vector recall, Cohere rerank-v3.5 for precision, conditional skip logic, MMR diversity, and cost-aware gates.

Vector search alone is fast but imprecise. A reranker fixes ranking — but calling it on every request adds ~200 ms and ~$0.002 per call. This guide covers common production patterns for rerank quality without burning budget on obvious hits.

Part two of our RAG series. Start with How to Build a Production RAG Chatbot if you need the foundations.

Who is this for? Engineers who already have a working RAG pipeline and want to improve answer quality, latency, and cost without rewriting everything.

Why rerank at all?

Embedding models optimize for recall — finding anything plausibly related. They compress meaning into a fixed vector, so subtle distinctions ("reset password" vs "change email") get blurred.

A cross-encoder reranker (Cohere rerank-v3.5) scores each query–document pair jointly. It's slower and more expensive per document, but far more accurate at ordering.

The standard production pattern: cast a wide net with vector search, then let the reranker pick the best handful.

Stage | Goal | Typical k
Vector search | High recall: don't miss relevant chunks | 30–40
Cohere rerank | High precision: best chunks first | 8–10
MMR | Diversity: avoid duplicate sections | 8
LLM prompt | Context window budget | 8

Retrieval funnel: wide vector recall narrows through rerank and MMR to final prompt context

Two-stage retrieval

Stage 1 pulls top-k ≈ 40 from Pinecone by cosine similarity. Stage 2 sends those 40 chunk texts to Cohere rerank and keeps the top 10 by relevance score.

Why 40 → 10 → 8? Vector top-40 catches chunks that embeddings rank poorly but rerankers love (common with procedural docs and rule tables). Rerank to 10 gives MMR room to swap near-duplicates. Final 8 fits most context budgets without drowning the LLM.

Typical latency

Stage | ~ms
Embed query | 80
Pinecone vector search | 120
Cohere rerank (40 docs) | 200
LLM response | 1800

Latency per request (ms) — LLM still dominates; rerank adds ~200 ms when invoked
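To put the rerank overhead in proportion, here is a small sketch that sums the stage latencies from the table. The function name and the `rerank_rate` parameter are illustrative assumptions; the millisecond figures are the typical values quoted above, not measurements.

```python
# Stage latencies in milliseconds, copied from the table above.
STAGE_MS = {
    "embed_query": 80,
    "vector_search": 120,
    "rerank": 200,
    "llm_response": 1800,
}


def latency_summary(stage_ms: dict, rerank_rate: float = 1.0) -> dict:
    """Total latency and rerank share.

    rerank_rate is the fraction of requests that actually call the reranker
    (1.0 means every request; a hypothetical knob for this sketch).
    """
    expected_rerank = stage_ms["rerank"] * rerank_rate
    other = sum(ms for name, ms in stage_ms.items() if name != "rerank")
    total = other + expected_rerank
    return {
        "total_ms": total,
        "rerank_ms": expected_rerank,
        "rerank_share": expected_rerank / total,
    }


full = latency_summary(STAGE_MS)
print(f"total={full['total_ms']:.0f} ms, rerank share={full['rerank_share']:.1%}")
```

With every request reranked, the total is 2200 ms and rerank accounts for roughly 9% of it, which is why tuning the LLM stage usually matters more for latency than skipping rerank.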

Conditional rerank skip

Cohere rerank costs ~$0.002 per request. On a chatbot doing 50k queries/month, that's $100/month just for reranking — reasonable, but wasteful when vector search already found an obvious match.

Skip rerank when the top Pinecone hit has cosine similarity ≥ 0.85. At that threshold, the best chunk is almost always correct; reranking adds little value.

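Always rerank when:

  • The message is a follow-up turn
  • The query was rewritten (rewritten queries are treated as multi-query)

Follow-ups like "what about step 2?" embed poorly on their own. Forcing rerank on those turns prevents silent quality drops.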

Typical rerank invocation mix: ~35% skipped on high-confidence first turns, ~45% conditional rerank, ~20% forced on follow-ups
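The budget effect of the skip is simple arithmetic. A minimal sketch, assuming the figures quoted above (50k queries/month, ~$0.002 per call, ~35% of requests skipped); the function name is hypothetical.

```python
def monthly_rerank_cost(
    queries_per_month: int,
    cost_per_call: float,
    invoke_rate: float,
) -> float:
    """Expected monthly rerank spend when only a fraction of requests rerank."""
    if not 0.0 <= invoke_rate <= 1.0:
        raise ValueError("invoke_rate must be between 0 and 1")
    return queries_per_month * invoke_rate * cost_per_call


always = monthly_rerank_cost(50_000, 0.002, invoke_rate=1.0)
conditional = monthly_rerank_cost(50_000, 0.002, invoke_rate=0.65)
print(f"always: ${always:.2f}  conditional: ${conditional:.2f}")
```

Under those assumptions the bill drops from about $100 to about $65 per month. The saving is modest in absolute terms, so check the skip rate against the quality delta before relying on it.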

Query rewrite heuristics

Follow-up questions need context from prior turns. A cheap LLM (Gemini Flash Lite) rewrites "what about that?" into a standalone search query — but calling it on every turn adds latency and tokens.

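Skip rewrite on first-turn, self-contained questions using simple heuristics:

  • No conversation history: it is the first turn, so there is nothing to resolve
  • The message has 8 or more words and contains no pronouns (it, that, this, they) and no ordinal references (step 2, option b, the other)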

When rewrite runs, treat the result as multi-query — always rerank.

MMR selection after rerank

Rerankers love returning multiple chunks from the same document section — five paragraphs about "password reset" when you need one rule chunk and one exception chunk.

Maximal Marginal Relevance (MMR) balances relevance against diversity. After rerank, run MMR with:

Parameter | Value | Why
k | 8 | Final chunks in prompt
lambda | 0.85 | Favor relevance; only swap for clearly redundant chunks

λ = 0.85 keeps related rule chunks (they're semantically similar but content-distinct) while dropping near-duplicate paragraphs from the same section.
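Written out, the greedy selection step looks like this. The notation is an assumption for illustration: R is the reranked candidate list, S the chunks already selected, rel(d) the rerank score, and sim(d, s) a redundancy measure between two chunks (the code in this article uses word-set Jaccard overlap).

```latex
d^{*} = \arg\max_{d \in R \setminus S}
  \Big[ \lambda \cdot \mathrm{rel}(d)
        - (1 - \lambda) \cdot \max_{s \in S} \mathrm{sim}(d, s) \Big],
\qquad \lambda = 0.85
```

A quick worked example: a candidate with rel = 0.60 that overlaps 0.90 with an already selected chunk scores 0.85 × 0.60 − 0.15 × 0.90 = 0.375, while a candidate with rel = 0.50 and overlap 0.10 scores 0.85 × 0.50 − 0.15 × 0.10 = 0.410. The less relevant but non-redundant chunk wins, and only because the first one is a near-duplicate.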

Soft no-results gate

Hard gates ("zero vector hits → return error") miss cases where vector search returns chunks but none are actually relevant.

After rerank, check the top rerank score. If it's below 0.05, skip the LLM call entirely and return a helpful "I couldn't find relevant documentation" message.

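This saves an expensive LLM call (~$0.01–0.05) on queries the knowledge base can't answer. It's a soft gate: vector search returned hits and rerank ran, but confidence is too low to risk a hallucinated answer. When rerank is skipped on high cosine, the cosine score stands in for the rerank score, so the gate does not fire on that path.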

Rerank score vs answer quality index — scores below 0.05 rarely produce useful answers

Auto model tier upgrade

Not every question needs your strongest model. Use rerank confidence to pick the LLM tier dynamically:

Top rerank score | Model tier | Rationale
≥ 0.30 | Standard (Flash / 4o-mini) | Context is clear; cheap model suffices
0.10 – 0.30 | Mid (Flash thinking / 4o) | Ambiguous context needs better reasoning
< 0.10 (but ≥ 0.05) | Pro (Opus / 4o full) | Weak signal: strongest model extracts what it can
< 0.05 | Skip LLM | Soft no-results gate

Request distribution across model tiers driven by rerank confidence zones
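The tier table maps directly onto a small lookup function. This is a sketch only: the function name, constant names, and tier labels are assumptions, and the boundaries follow the table (0.10 falls in the mid tier, 0.30 in the standard tier).

```python
from typing import Optional

# Thresholds taken from the tier table above.
SKIP_BELOW = 0.05  # soft no-results gate
PRO_BELOW = 0.10   # weak signal: use the strongest model
MID_BELOW = 0.30   # ambiguous context: use the mid tier


def select_model_tier(top_rerank_score: float) -> Optional[str]:
    """Return a tier label, or None when the LLM call should be skipped."""
    if top_rerank_score < SKIP_BELOW:
        return None
    if top_rerank_score < PRO_BELOW:
        return "pro"
    if top_rerank_score < MID_BELOW:
        return "mid"
    return "standard"


for score in (0.02, 0.07, 0.20, 0.45):
    print(score, select_model_tier(score))
```

Returning None for the skip case keeps the gate and the tier choice in one place, so the caller only has to branch once.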

Per-request cost tracking

Log cost components on every request. You can't optimize what you don't measure.

Component | Typical cost | Notes
LLM (input + output) | ~70% of total | Dominates; tier selection matters most
Embeddings | ~12% | Query embed every turn; doc embed at ingest
Pinecone queries | ~15% | Scales with index size and QPS
Cohere rerank | ~3% | $0.002/call; ~65% invoked after conditional skip

Typical per-request cost breakdown for an optimized production RAG pipeline

Track these fields per request:
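One possible shape for such a record, as a sketch. Every field name here is an assumption, chosen to mirror the four cost components in the table plus the retrieval signals that drive rerank and tier decisions; adapt it to whatever your logging pipeline expects.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class RequestCostRecord:
    """Per-request cost and retrieval signals (hypothetical field names)."""

    request_id: str
    llm_usd: float = 0.0
    embedding_usd: float = 0.0
    vector_query_usd: float = 0.0
    rerank_usd: float = 0.0
    rerank_invoked: bool = False
    top_rerank_score: Optional[float] = None
    model_tier: Optional[str] = None

    @property
    def total_usd(self) -> float:
        """Sum of the four cost components."""
        return (
            self.llm_usd
            + self.embedding_usd
            + self.vector_query_usd
            + self.rerank_usd
        )

    def to_log_line(self) -> str:
        """Serialize to one JSON line for structured logging."""
        payload = asdict(self)
        payload["total_usd"] = round(self.total_usd, 6)
        return json.dumps(payload, sort_keys=True)


record = RequestCostRecord(
    request_id="req-123",  # illustrative ID
    llm_usd=0.03,
    rerank_usd=0.002,
    rerank_invoked=True,
    top_rerank_score=0.42,
    model_tier="standard",
)
print(record.to_log_line())
```

One JSON line per request makes it straightforward to aggregate cost by component or by tier later.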

Vector-only vs vector + rerank

Metric | Vector only (top-8) | Vector (40) + rerank + MMR (8)
Hit@8 (test set) | ~62% | ~89%
Answer quality (human eval) | 3.1 / 5 | 4.3 / 5
Avg latency | ~200 ms retrieval | ~400 ms retrieval
Retrieval cost / request | ~$0.0003 | ~$0.0025
Duplicate chunks in prompt | Common | Rare (MMR)
Follow-up quality | Poor | Good (forced rerank)

The quality jump pays for itself in reduced support escalations and fewer "the bot gave wrong info" incidents.
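To reproduce a Hit@8 number on your own test set, a minimal sketch follows. It assumes the common definition (a query counts as a hit when at least one labeled-relevant chunk ID appears in the top k results); the function name and the chunk IDs in the example are made up.

```python
def hit_at_k(
    retrieved_ids: list[list[str]],
    relevant_ids: list[set[str]],
    k: int = 8,
) -> float:
    """Fraction of queries with at least one relevant chunk in the top k."""
    if len(retrieved_ids) != len(relevant_ids):
        raise ValueError("need one relevant-ID set per query")
    if not retrieved_ids:
        return 0.0
    hits = sum(
        1
        for got, want in zip(retrieved_ids, relevant_ids)
        if want & set(got[:k])  # any overlap within the top k counts
    )
    return hits / len(retrieved_ids)


# Three test questions: ranked chunk IDs from the pipeline vs. labeled answers.
retrieved = [["a", "b"], ["c", "d"], ["e"]]
relevant = [{"b"}, {"x"}, {"e"}]
print(hit_at_k(retrieved, relevant, k=8))
```

Run it once on vector-only output and once on the reranked output to get the two columns of the comparison for your own data.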

Production code

app/rerank.py — Cohere v2 rerank API with conditional skip logic:

import os
import aiohttp

COHERE_RERANK_URL = "https://api.cohere.com/v2/rerank"
RERANK_MODEL = "rerank-v3.5"
VECTOR_TOP_K = 40
RERANK_TOP_N = 10
FINAL_TOP_K = 8
COSINE_SKIP_THRESHOLD = 0.85
RERANK_MIN_SCORE = 0.05
MMR_LAMBDA = 0.85

def should_run_cohere_rerank(
    top_cosine: float,
    is_follow_up: bool,
    is_multi_query: bool,
) -> bool:
    """Skip Cohere when vector confidence is high on a first-turn query."""
    if is_follow_up or is_multi_query:
        return True
    return top_cosine < COSINE_SKIP_THRESHOLD

async def cohere_rerank(
    query: str,
    documents: list[dict],
    top_n: int = RERANK_TOP_N,
) -> list[dict]:
    """Call Cohere v2 rerank API; return docs sorted by relevance score."""
    payload = {
        "model": RERANK_MODEL,
        "query": query,
        "documents": [d["text"] for d in documents],
        "top_n": top_n,
    }
    headers = {
        "Authorization": f"Bearer {os.environ['COHERE_API_KEY']}",
        "Content-Type": "application/json",
    }
    async with aiohttp.ClientSession() as session:
        async with session.post(COHERE_RERANK_URL, json=payload, headers=headers) as resp:
            resp.raise_for_status()
            data = await resp.json()

    ranked = []
    for result in data["results"]:
        doc = documents[result["index"]].copy()
        doc["rerank_score"] = result["relevance_score"]
        ranked.append(doc)
    return ranked

def mmr_select(
    docs: list[dict],
    k: int = FINAL_TOP_K,
    lambda_: float = MMR_LAMBDA,
) -> list[dict]:
    """Maximal Marginal Relevance — diversity without dropping related rules."""
    if len(docs) <= k:
        return docs
    selected, remaining = [docs[0]], docs[1:]
    while len(selected) < k and remaining:
        best_idx, best_score = 0, -1
        for i, candidate in enumerate(remaining):
            relevance = candidate["rerank_score"]
            max_sim = max(
                _text_overlap(candidate["text"], s["text"]) for s in selected
            )
            score = lambda_ * relevance - (1 - lambda_) * max_sim
            if score > best_score:
                best_score, best_idx = score, i
        selected.append(remaining.pop(best_idx))
    return selected

def _text_overlap(a: str, b: str) -> float:
    """Jaccard similarity on word sets — fast proxy for chunk redundancy."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

async def retrieve_and_rank(
    query: str,
    vector_hits: list[dict],
    is_follow_up: bool = False,
    is_multi_query: bool = False,
) -> dict:
    """Rerank (or pass through) vector hits, apply the soft gate, pick final chunks."""
    cost = {"cohere_rerank": 0.0}
    if not vector_hits:
        # Nothing to rerank: skip both the Cohere call and the LLM call.
        return {"chunks": [], "skip_llm": True, "cost": cost}

    top_cosine = vector_hits[0]["score"]
    reranked = should_run_cohere_rerank(top_cosine, is_follow_up, is_multi_query)
    if reranked:
        ranked = await cohere_rerank(query, vector_hits)
        cost["cohere_rerank"] = 0.002
    else:
        # Copy each hit so the caller's dicts are not mutated;
        # the cosine score stands in for the rerank score.
        ranked = [{**d, "rerank_score": d["score"]} for d in vector_hits[:RERANK_TOP_N]]

    top_score = ranked[0]["rerank_score"] if ranked else 0.0
    if top_score < RERANK_MIN_SCORE:
        return {"chunks": [], "skip_llm": True, "cost": cost}

    # MMR runs only after a real rerank; the skip path keeps vector order.
    final = mmr_select(ranked) if reranked else ranked[:FINAL_TOP_K]
    return {"chunks": final, "skip_llm": False, "top_rerank_score": top_score, "cost": cost}

Query rewrite heuristics — skip Flash Lite on clear first turns:

import re

PRONOUNS = re.compile(r"\b(it|that|this|they|them|those|these|he|she)\b", re.I)
ORDINALS = re.compile(r"\b(step\s+\d|option\s+[a-z]|first|second|third|the\s+other)\b", re.I)

def needs_query_rewrite(message: str, history: list) -> bool:
    if not history:
        return False  # first turn — never rewrite
    if len(message.split()) >= 8 and not PRONOUNS.search(message) and not ORDINALS.search(message):
        return False  # self-contained despite history
    return True

Tuning checklist

Must have

  • Two-stage retrieval (wide vector → rerank)
  • Cohere rerank-v3.5 on follow-ups
  • Soft no-results gate (score < 0.05)
  • Per-request cost logging

Should have

  • Conditional rerank skip (cosine ≥ 0.85)
  • MMR after rerank (λ = 0.85)
  • Query rewrite with skip heuristics
  • Model tier by rerank confidence

Measure

  • Hit@8 on 30+ test questions
  • Rerank skip rate vs quality delta
  • Cost per request by component
  • Latency p50 / p95 by stage
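For the latency item, p50 and p95 can be computed from raw per-stage timings with the standard library. A minimal sketch: the function name and the sample values are assumptions, and `statistics.quantiles` needs Python 3.8 or newer.

```python
import statistics


def latency_percentiles(samples_ms: list[float]) -> dict[str, float]:
    """Return p50 and p95 for one stage's latency samples."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples")
    # n=100 yields 99 cut points; index 49 is the 50th percentile, 94 the 95th.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94]}


# Hypothetical rerank timings in milliseconds.
rerank_samples = [100, 200, 300, 400, 500]
stats = latency_percentiles(rerank_samples)
print(stats)
```

Collect one sample list per stage (embed, vector search, rerank, LLM) and compare p95 rather than averages; tail latency is what users notice.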

Glossary

Term | Meaning
Cross-encoder reranker | Model that scores query–document pairs jointly (Cohere rerank-v3.5)
Bi-encoder | Embedding model that encodes query and docs separately (vector search)
MMR | Maximal Marginal Relevance: selects diverse yet relevant results
top-k | Number of chunks retrieved at each stage
Cosine similarity | Vector similarity metric; 1.0 = same direction, 0.0 = orthogonal
Soft no-results gate | Skip LLM when rerank confidence is too low, even if vector hits exist
Query rewrite | LLM expands a follow-up into a standalone search query
Conditional rerank | Skip rerank API call when vector confidence is already high

Retrieval is where RAG quality is won or lost. Rerank for precision, skip for cost, MMR for diversity — and measure every dollar.
