Why not use an LLM to classify which model to route to?

An LLM classifier adds latency, tokens, and cost on every request, is non-deterministic and hard to test, and creates a failure point before the main answer. Heuristic routing is instant, free, and fully testable with pytest.

What is two-pass model routing?

Pass 1 picks a starting model tier from message heuristics before retrieval. Pass 2 bumps the tier after rerank if retrieval confidence is low — catching cases where a cheap model was chosen but the knowledge base returned weak matches.

How is model routing different from query rewrite in RAG?

Query rewrite expands follow-up questions into standalone search queries using a cheap LLM. Model routing selects which generator model produces the final answer. Rewrite runs before retrieval; routing Pass 1 runs before rewrite and Pass 2 runs after rerank.

Auto Model Routing Without Calling an LLM to Pick an LLM

July 2026 · Published by Amar Kumar

How to build Cursor-style Auto model routing for production RAG chat — heuristic tier selection, regex signals, post-retrieval upgrades, and pytest-tested routing with zero classifier LLM calls.

Most chat products offer an Auto model mode. The obvious implementation: call a cheap LLM to classify the user's question, then route to the right tier. That works — but it adds latency, tokens, and failure modes on every request.

We built a different approach for a production RAG chatbot: heuristic routing (regex + feature scoring) before retrieval, then a post-retrieval tier bump when the knowledge base returns weak matches. No classifier call. Deterministic. Testable with plain pytest.

Part of our RAG engineering series. See Best Economical LLM Models for RAG for pricing context and Cohere Reranking for Production RAG for the retrieval layer this router plugs into.

Who is this for? Teams running multi-provider RAG chat (Gemini, OpenRouter, OpenAI) who want Cursor-style Auto routing without paying for a routing LLM on every turn.

Why not use an LLM classifier?

The classifier pattern:

User message → Cheap LLM classify → Pick model → RAG → Answer

Issue	Impact
Extra LLM call every turn	+200–400 ms latency, +500–1500 input tokens
Classifier drift	Model updates change routing behavior silently
Hard to test	Non-deterministic; flaky CI
Double billing	You pay for routing and answering
Failure coupling	Classifier 503 blocks the whole chat

Heuristic routing accepts imperfect classification on edge cases in exchange for zero routing cost, instant routing, and full test coverage. Cursor's Auto mode uses a similar philosophy — feature-based routing, not a meta-LLM call.
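
To make the table concrete, here is a rough back-of-envelope sketch. The token and latency figures sit inside the ranges quoted above; the traffic volume and the per-million-token price are hypothetical placeholders, not measured numbers, so substitute your own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClassifierOverhead:
    extra_input_tokens: int      # routing tokens over the whole period
    extra_cost_usd: float        # routing spend over the whole period
    extra_latency_hours: float   # cumulative wait added across all turns


def classifier_overhead(
    turns: int,
    tokens_per_turn: int,
    latency_ms_per_turn: float,
    usd_per_million_input_tokens: float,
) -> ClassifierOverhead:
    """Estimate what a routing LLM call adds on top of the answer call."""
    tokens = turns * tokens_per_turn
    cost = tokens / 1_000_000 * usd_per_million_input_tokens
    latency_hours = turns * latency_ms_per_turn / 1000 / 3600
    return ClassifierOverhead(tokens, cost, latency_hours)


# Hypothetical volume and price; 1000 tokens and 300 ms are mid-range table values.
estimate = classifier_overhead(
    turns=1_000_000,
    tokens_per_turn=1000,
    latency_ms_per_turn=300,
    usd_per_million_input_tokens=0.10,
)
print(estimate)
```

The heuristic router's equivalent line items are all zero, which is the trade the section describes.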

Per-request routing overhead: heuristic routing adds ~0 ms and $0; an LLM classifier adds tokens and latency on every turn

Two-pass routing architecture

Routing happens in two passes:

User message → Pass 1: Heuristics → Retrieve + rerank → Pass 2: Upgrade → LLM answer

Pass 1 (pre-retrieval): Inspect message text, word count, conversation history, and regex patterns. Pick a starting model tier.

Pass 2 (post-retrieval): After vector search and optional Cohere rerank, check retrieval confidence. If the top rerank score is low, bump the model up one or two tiers on the ladder.

This split matters because a short FAQ ("how do I zoom in?") needs only a cheap model when retrieval is confident, while the same cheap model struggles when retrieval returns weak chunks for a troubleshooting question.
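
The control flow of the two passes can be sketched in a few lines. This is a minimal illustration, not the production code: the tier names, the injected callables, and the toy `pick` and `bumps` helpers are all assumptions made for the example, while the 0.08 and 0.15 cut-offs are the ones this article uses later.

```python
from typing import Callable, Sequence

LADDER: Sequence[str] = ("economy", "standard", "capable", "premium")  # placeholder names


def route_two_pass(
    message: str,
    pick_initial_tier: Callable[[str], int],
    retrieve_top_score: Callable[[str], float],
    bumps_for_score: Callable[[float], int],
) -> str:
    """Pass 1 picks a tier from the message; Pass 2 may raise it after retrieval."""
    tier = pick_initial_tier(message)            # Pass 1: no retrieval, no LLM call
    top_score = retrieve_top_score(message)      # vector search + rerank happen here
    tier = min(tier + bumps_for_score(top_score), len(LADDER) - 1)  # Pass 2: upward only
    return LADDER[tier]


# Toy stand-ins, for illustration only.
def pick(msg: str) -> int:
    return 0 if len(msg.split()) <= 8 else 1


def bumps(score: float) -> int:
    if score < 0.08:
        return 2
    return 1 if score < 0.15 else 0


confident = route_two_pass("how do I zoom in?", pick, lambda m: 0.42, bumps)
weak = route_two_pass("how do I zoom in?", pick, lambda m: 0.06, bumps)
print(confident, weak)  # economy capable
```

The same message lands on different tiers purely because of what retrieval returned, which is the point of keeping Pass 2 separate.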

The tier ladder

Define a fixed ladder of models from economy to premium. The router never jumps to an arbitrary model — it always moves along this ladder.

Tier	Role	Example model IDs
0 — Economy	Cheapest; high-volume FAQs	`gemini-2.5-flash-lite`, `deepseek-v4-flash`
1 — Standard	Routine support, moderate explain	`deepseek-chat-v3.1`, `mistral-small-2603`
2 — Capable	Deep guides, long questions, first-turn debug	`gemini-2.5-flash`, `qwen3-235b-a22b`
3 — Premium	Hard multi-turn troubleshoot	`gemini-3-flash-preview`

_TIER_LADDER = (
    "gemini-2.5-flash-lite",   # 0 economy
    "deepseek-chat-v3.1",      # 1 standard
    "gemini-2.5-flash",        # 2 capable
    "gemini-3-flash-preview",  # 3 premium
)

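def _bump_model_tier(model: str, steps: int = 1) -> str:
    # _ladder_index(model) returns the model's position on _TIER_LADDER (helper not shown).
    # min() clamps at the top rung, so a bump never runs past premium.
    new_idx = min(_ladder_index(model) + steps, len(_TIER_LADDER) - 1)
    return _TIER_LADDER[new_idx]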

Typical Auto resolution distribution — most traffic stays on economy/standard tiers; premium is reserved for hard troubleshoot follow-ups

Pre-retrieval heuristics

Before any vector search, resolve_auto_model() inspects the user message using regex signal patterns:

Pattern	Keywords	Routes to
Hard troubleshoot	`not working`, `error`, `crash`, `debug`	Capable or Premium
Compare / either-or	`compare`, `vs`, `pros and cons`, `A or B`	Mistral (comparison reasoning)
Deep / guide	`in detail`, `step by step`, `walk me through`	Capable Flash
Support / account	`license`, `activate`, `subscription`	Standard tier

Hard troubleshoot runs before support keywords — so "My license export keeps failing with an error" routes to debug tier, not routine support.

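import re

# Inflected forms (errors, crashing, crashed) are listed explicitly because the
# trailing \b would otherwise reject them.
_HARD_RE = re.compile(
    r"\b(not working|doesn't work|errors?|bugs?|broken|crash(?:e[sd]|ing)?|"
    r"troubleshoot|debug|fix this|still failing)\b",
    re.I,
)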

_COMPARE_RE = re.compile(
    r"\b(compare|comparison|difference|versus|vs\.?|"
    r"which one|pros and cons)\b|"
    r"\bor\b.+\b(or|local|cloud|vs)\b",
    re.I,
)

_DEEP_RE = re.compile(
    r"\b(in detail|step by step|walk me through|"
    r"full explanation|tell me more|elaborate)\b",
    re.I,
)
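
The "hard troubleshoot before support" ordering can be expressed as an ordered tuple of patterns. The sketch below uses abbreviated patterns built from the keyword table; `_SUPPORT_RE`, `_SIGNALS`, and `first_signal` are hypothetical names, since the article does not show the support pattern or the dispatch code.

```python
import re
from typing import Optional

# Abbreviated patterns taken from the keyword table; _SUPPORT_RE is an assumption.
_HARD_RE = re.compile(r"\b(not working|error|crash|debug)\b", re.I)
_SUPPORT_RE = re.compile(r"\b(license|activate|subscription)\b", re.I)

# Order encodes priority: the first pattern that matches decides the signal.
_SIGNALS = (
    ("hard_troubleshoot", _HARD_RE),
    ("routine_support", _SUPPORT_RE),
)


def first_signal(message: str) -> Optional[str]:
    """Return the name of the highest-priority pattern that matches, if any."""
    for name, pattern in _SIGNALS:
        if pattern.search(message):
            return name
    return None


print(first_signal("My license export keeps failing with an error"))  # hard_troubleshoot
print(first_signal("How do I activate my license?"))                  # routine_support
print(first_signal("how do I zoom in?"))                              # None
```

Because the first message contains both "license" and "error", only the ordering keeps it off the routine support tier.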

Length and structure signals

Signal	Threshold	Effect
Word count	> 45 words	+2 complexity; route to Capable
Multiple questions	> 1 `?` in message	+2 complexity
Long paste	> 2500 chars	Route to Capable
Follow-up turn	Prior assistant message exists	+1 complexity; may bump tier

Complexity scoring

A numeric score aggregates structural features before rule matching:

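def _complexity_score(lower, words, chars, prior_assistant, conversation_summary=None):
    """Sum structural signals into one integer; higher means a harder request.

    lower is the lowercased message, words and chars are its lengths, and
    prior_assistant counts earlier assistant turns in the conversation.
    """
    score = 0
    # Length thresholds are cumulative: a 46-word message earns 1 + 1 + 2 = 4.
    if words > 12:  score += 1
    if words > 28:  score += 1
    if words > 45:  score += 2
    if chars > 1200: score += 1
    if chars > 2500: score += 2
    if lower.count("?") > 1: score += 2
    if prior_assistant > 0: score += 1
    # _GUIDE_RE is a companion pattern that is not shown in this article.
    if _DEEP_RE.search(lower) or _GUIDE_RE.search(lower): score += 2
    if _HARD_RE.search(lower): score += 2
    if conversation_summary and len(conversation_summary.strip()) > 800:
        score += 1
    return score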
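
To see how the weights add up, the sketch below derives the inputs from a raw message and history and applies the same weights. It is a simplified stand-in: the patterns are abbreviated, the `_GUIDE_RE` and summary signals are left out, and `score_message` plus the history format (a list of role/content dicts) are assumptions.

```python
import re
from typing import Optional

# Abbreviated stand-ins for the article's patterns.
_HARD_RE = re.compile(r"\b(not working|error|crash|debug)\b", re.I)
_DEEP_RE = re.compile(r"\b(in detail|step by step|walk me through)\b", re.I)


def score_message(message: str, history: Optional[list] = None) -> int:
    """Derive structural features from the raw message, then sum their weights."""
    lower = message.lower()
    words = len(message.split())
    chars = len(message)
    prior_assistant = sum(1 for m in (history or []) if m.get("role") == "assistant")

    checks = [
        (words > 12, 1),
        (words > 28, 1),
        (words > 45, 2),
        (chars > 1200, 1),
        (chars > 2500, 2),
        (lower.count("?") > 1, 2),
        (prior_assistant > 0, 1),
        (bool(_DEEP_RE.search(lower)), 2),
        (bool(_HARD_RE.search(lower)), 2),
    ]
    return sum(weight for hit, weight in checks if hit)


hard_message = (
    "The export is not working and shows an error. "
    "Can you walk me through a fix step by step? Which log should I check?"
)
history = [{"role": "assistant", "content": "Try restarting."}]

print(score_message("how do I zoom in?"))   # 0
print(score_message(hard_message, history)) # 8
```

The second message scores 8: 24 words (+1), two question marks (+2), a follow-up turn (+1), a deep pattern (+2), and a hard-troubleshoot pattern (+2).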

Score ≥ 6 routes to the Capable tier unless a higher-priority rule in the table below matches first.

Rule priority and examples

Rules are evaluated in priority order — first match wins:

Priority	Condition	Model tier	Reason code
1	Hard troubleshoot and (follow-up or score ≥ 5)	Premium	`hard_troubleshoot_premium`
2	Hard troubleshoot (first turn)	Capable	`hard_troubleshoot`
3	Compare / either-or	Mistral	`compare_or_either_or`
4	Deep explanation or step-by-step	Capable	`deep_or_guide`
5	Very long message	Capable	`long_context`
6	Multi-question or score ≥ 6	Capable	`high_complexity`
7	Follow-up + elaboration	Capable	`follow_up_elaboration`
8	Follow-up (generic)	Standard	`follow_up`
9	Support keywords or 14–40 words	Standard	`routine_support`
10	Short FAQ (≤ 8 words)	Economy+	`short_faq`
11	Minimal (≤ 3 words)	Default economy	`minimal`
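
"First match wins" maps naturally onto an ordered tuple of rules. The sketch below is one way to structure it, not the article's actual `resolve_auto_model()`: the `Features` and `Rule` types are hypothetical, features are passed in precomputed, and rows 7, 10, and 11 are collapsed into a single economy fallthrough for brevity.

```python
from dataclasses import dataclass
from typing import Callable, NamedTuple, Tuple

ECONOMY = "gemini-2.5-flash-lite"
STANDARD = "deepseek-chat-v3.1"
CAPABLE = "gemini-2.5-flash"
PREMIUM = "gemini-3-flash-preview"
COMPARE_MODEL = "mistral-small-2603"


@dataclass(frozen=True)
class Features:
    words: int = 0
    chars: int = 0
    questions: int = 0
    score: int = 0
    follow_up: bool = False
    hard: bool = False
    compare: bool = False
    deep: bool = False
    support: bool = False


class Rule(NamedTuple):
    reason: str
    matches: Callable[[Features], bool]
    model: str


# Order is the priority: the first rule whose predicate holds wins.
RULES = (
    Rule("hard_troubleshoot_premium",
         lambda f: f.hard and (f.follow_up or f.score >= 5), PREMIUM),
    Rule("hard_troubleshoot", lambda f: f.hard, CAPABLE),
    Rule("compare_or_either_or", lambda f: f.compare, COMPARE_MODEL),
    Rule("deep_or_guide", lambda f: f.deep, CAPABLE),
    Rule("long_context", lambda f: f.words > 45 or f.chars > 2500, CAPABLE),
    Rule("high_complexity", lambda f: f.questions > 1 or f.score >= 6, CAPABLE),
    Rule("follow_up", lambda f: f.follow_up, STANDARD),
    Rule("routine_support", lambda f: f.support or 14 <= f.words <= 40, STANDARD),
)


def resolve(features: Features) -> Tuple[str, str]:
    """Return (model, reason_code) for the first matching rule."""
    for rule in RULES:
        if rule.matches(features):
            return rule.model, rule.reason
    return ECONOMY, "minimal"  # simplified fallthrough for short messages


print(resolve(Features(hard=True, support=True, words=9)))
print(resolve(Features(hard=True, follow_up=True)))
print(resolve(Features(compare=True, words=5)))
print(resolve(Features(support=True, words=6)))
```

Keeping the rules as data means a priority change is a one-line reorder, and the reason code travels with the decision.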

Worked examples

User message	Resolved model	Why
"how do I zoom in?"	`gemini-3.1-flash-lite`	Short FAQ
"How do I activate my license?"	`deepseek-chat-v3.1`	Support keyword
"Compare frameless vs framed cabinets"	`mistral-small-2603`	Compare pattern
"explain the export folder in detail"	`gemini-2.5-flash`	Deep pattern
"export keeps crashing" (turn 1)	`gemini-2.5-flash`	Hard troubleshoot
Same message (turn 2)	`gemini-3-flash-preview`	Hard + follow-up → premium

Log the reason code on every route — essential for tuning.
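
A minimal sketch of such a log line, using only the standard library. The logger name, the `log_route` helper, and the field names are placeholders rather than the article's actual schema.

```python
import json
import logging

logger = logging.getLogger("auto_router")  # placeholder logger name


def log_route(requested: str, resolved: str, reason: str, **features) -> str:
    """Emit one JSON line per routing decision and return it for inspection."""
    record = {
        "requested_model": requested,
        "resolved_model": resolved,
        "reason": reason,
        **features,
    }
    line = json.dumps(record, sort_keys=True)
    logger.info(line)
    return line


line = log_route("auto", "gemini-2.5-flash", "hard_troubleshoot", words=3, score=2)
print(line)
```

One JSON object per decision keeps the logs easy to aggregate by reason code when a threshold needs tuning.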

Post-retrieval tier upgrade

After retrieval and rerank, adjust_auto_model_for_retrieval() may bump the tier based on confidence signals:

Condition	Bump
Rerank ran, top score < 0.08	+2 tiers
Rerank ran, top score < 0.15	+1 tier
No rerank, top cosine < 0.72	+1 tier
Otherwise	No change

RERANK_UPGRADE_THRESHOLD = 0.15
RERANK_STRONG_UPGRADE_THRESHOLD = 0.08
COSINE_UPGRADE_THRESHOLD = 0.72

def adjust_auto_model_for_retrieval(
    model: str,
    *,
    top_rerank_score: float = 0.0,
    rerank_ran: bool = False,
    top_cosine: float = 0.0,
) -> str:
    bumps = 0
    if rerank_ran:
        # Rerank scores are the preferred confidence signal when available.
        if top_rerank_score < RERANK_STRONG_UPGRADE_THRESHOLD:
            bumps = 2
        elif top_rerank_score < RERANK_UPGRADE_THRESHOLD:
            bumps = 1
    elif 0 < top_cosine < COSINE_UPGRADE_THRESHOLD:
        # Fall back to cosine similarity; a value of 0 means no score was supplied.
        bumps = 1
    return _bump_model_tier(model, bumps) if bumps > 0 else model

Example: the economy model is selected for a short question, but the top rerank score is 0.06, so the router bumps +2 from tier 0 to tier 2. This covers the gray zone between the soft no-results gate (score < 0.05) and confident retrieval, where a stronger model can still synthesize an answer from weak context.

Rerank score zones: no upgrade above 0.15, +1 tier between 0.08–0.15, +2 tiers below 0.08

Auto pool vs manual picker

Auto mode routes only within a fixed pool of budget-to-mid models. Premium manual models (Claude Sonnet, Claude Haiku) stay in the UI picker but are never Auto-selected.

Design choice	Why
Exclude Claude from Auto	Cost control: Sonnet as the default would erase routing savings
Fixed Auto pool	Predictable cost envelope per request
Manual picker for frontier	Power users pick Claude explicitly when needed

AUTO_MODEL_POOL = frozenset({
    "gemini-2.5-flash-lite",
    "gemini-3.1-flash-lite",
    "deepseek-v4-flash",
    "deepseek-chat-v3.1",
    "mistral-small-2603",
    "gemini-2.5-flash",
    "qwen3-235b-a22b",
    "gemini-3-flash-preview",
})
# Claude models: manual picker only, never Auto
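
A small guard can enforce the pool at the point where a model is resolved. This is a sketch under assumptions: `ensure_auto_eligible` and the placeholder Claude ID are hypothetical, and falling back to the economy rung is one possible policy among several.

```python
AUTO_MODEL_POOL = frozenset({
    "gemini-2.5-flash-lite",
    "gemini-3.1-flash-lite",
    "deepseek-v4-flash",
    "deepseek-chat-v3.1",
    "mistral-small-2603",
    "gemini-2.5-flash",
    "qwen3-235b-a22b",
    "gemini-3-flash-preview",
})

_TIER_LADDER = (
    "gemini-2.5-flash-lite",
    "deepseek-chat-v3.1",
    "gemini-2.5-flash",
    "gemini-3-flash-preview",
)

# Every ladder rung must be Auto-eligible, otherwise a bump could escape the pool.
assert set(_TIER_LADDER) <= AUTO_MODEL_POOL


def ensure_auto_eligible(model: str, fallback: str = _TIER_LADDER[0]) -> str:
    """Return the model if Auto may select it, else fall back to the economy rung."""
    return model if model in AUTO_MODEL_POOL else fallback


print(ensure_auto_eligible("mistral-small-2603"))         # mistral-small-2603
print(ensure_auto_eligible("claude-sonnet-placeholder"))  # gemini-2.5-flash-lite
```

The import-time assertion turns a misconfigured ladder into a startup failure rather than a surprise on the invoice.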

Wiring into the chat pipeline

In the FastAPI chat handler, routing fits between request parsing and retrieval:

def _resolve_chat_model(request):
    requested = request.model or default_chat_model()
    if requested == "auto":
        resolved = resolve_auto_model(
            request.message,
            request.conversation_history,
            conversation_summary=request.conversation_summary,
        )
        return resolved, "auto"
    return requested, None

# ... after retrieval + rerank; model_name and auto_requested come from _resolve_chat_model() ...
if auto_requested == "auto":
    model_name = adjust_auto_model_for_retrieval(
        model_name,
        top_rerank_score=top_rerank_score,
        rerank_ran=rerank_scores is not None,
        top_cosine=top_cosine,
    )

Return both requested_model: "auto" and resolved_model in usage headers so billing dashboards show what Auto picked.
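
One way to build those headers is sketched below. The header names and the `routing_headers` helper are placeholders, since the article does not specify them; the returned dict could be attached to whatever response object the handler uses.

```python
from typing import Dict, Optional


def routing_headers(auto_requested: Optional[str], resolved: str) -> Dict[str, str]:
    """Describe the routing decision for billing and debugging.

    auto_requested is "auto" when Auto mode was used and None for a manual pick,
    mirroring the second value returned by _resolve_chat_model().
    """
    return {
        # Placeholder header names, not taken from the article.
        "X-Requested-Model": auto_requested or resolved,
        "X-Resolved-Model": resolved,
    }


print(routing_headers("auto", "gemini-2.5-flash"))
print(routing_headers(None, "gemini-2.5-flash"))
```

Build the headers after Pass 2 so the resolved model reflects any post-retrieval bump.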

Resolve Auto (Pass 1) → Query rewrite → Embed + search → Rerank → Upgrade (Pass 2) → Generate

Testing with pytest

Heuristic routers are fully deterministic — write table-driven tests and run them on every PR:

def test_heuristic_short_faq():
    assert resolve_auto_model("how do I zoom in?") in (
        "deepseek-v4-flash",
        "gemini-2.5-flash-lite",
        "gemini-3.1-flash-lite",
    )

def test_heuristic_compare_routes_to_mistral():
    assert resolve_auto_model("Compare option A vs option B") == "mistral-small-2603"

def test_heuristic_hard_troubleshoot_premium_on_follow_up():
    history = [{"role": "assistant", "content": "Try restarting."}]
    assert resolve_auto_model(
        "export keeps crashing when I click export", history
    ) == "gemini-3-flash-preview"

def test_retrieval_upgrade_bumps_on_low_rerank():
    assert adjust_auto_model_for_retrieval(
        "gemini-2.5-flash-lite",
        top_rerank_score=0.10,
        rerank_ran=True,
    ) == "deepseek-chat-v3.1"

def test_retrieval_upgrade_skips_when_confident():
    assert adjust_auto_model_for_retrieval(
        "gemini-2.5-flash-lite",
        top_rerank_score=0.42,
        rerank_ran=True,
    ) == "gemini-2.5-flash-lite"

When you tune a regex or threshold, update the test — the test suite is your routing spec.
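
Threshold boundaries are a natural fit for a case table. The sketch below restates the upgrade function so that it runs on its own, assuming only ladder models are passed in, and pins the behavior at each cut-off. The same `CASES` list could feed `pytest.mark.parametrize` instead of the plain loop used here.

```python
_TIER_LADDER = (
    "gemini-2.5-flash-lite",
    "deepseek-chat-v3.1",
    "gemini-2.5-flash",
    "gemini-3-flash-preview",
)
RERANK_UPGRADE_THRESHOLD = 0.15
RERANK_STRONG_UPGRADE_THRESHOLD = 0.08
COSINE_UPGRADE_THRESHOLD = 0.72


def adjust_auto_model_for_retrieval(
    model, *, top_rerank_score=0.0, rerank_ran=False, top_cosine=0.0
):
    bumps = 0
    if rerank_ran:
        if top_rerank_score < RERANK_STRONG_UPGRADE_THRESHOLD:
            bumps = 2
        elif top_rerank_score < RERANK_UPGRADE_THRESHOLD:
            bumps = 1
    elif 0 < top_cosine < COSINE_UPGRADE_THRESHOLD:
        bumps = 1
    # Assumption: only ladder models are passed in, so tuple.index is enough here.
    idx = min(_TIER_LADDER.index(model) + bumps, len(_TIER_LADDER) - 1)
    return _TIER_LADDER[idx]


LITE, STANDARD, FLASH, PREMIUM = _TIER_LADDER

# (start model, rerank_ran, top_rerank_score, top_cosine, expected model)
CASES = [
    (LITE, True, 0.04, 0.0, FLASH),      # below the strong threshold: +2
    (LITE, True, 0.08, 0.0, STANDARD),   # exactly 0.08 is not "strong": +1
    (LITE, True, 0.15, 0.0, LITE),       # exactly 0.15 counts as confident
    (LITE, False, 0.0, 0.60, STANDARD),  # no rerank, weak cosine: +1
    (LITE, False, 0.0, 0.0, LITE),       # no rerank and no cosine signal
    (FLASH, True, 0.01, 0.0, PREMIUM),   # +2 from tier 2 clamps at premium
]


def test_retrieval_upgrade_boundaries():
    for model, rerank_ran, score, cosine, expected in CASES:
        got = adjust_auto_model_for_retrieval(
            model, top_rerank_score=score, rerank_ran=rerank_ran, top_cosine=cosine
        )
        assert got == expected, (model, rerank_ran, score, cosine, got)
```

Pinning the exact boundary values (0.08 and 0.15) catches an accidental switch between `<` and `<=` during tuning.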

Tuning checklist

Must have

Fixed tier ladder (no arbitrary model jumps)
Reason codes logged on every route
Post-retrieval bump tied to rerank scores
pytest coverage for top message patterns
Usage header with requested + resolved model

Should have

Hard-troubleshoot priority over support keywords
Compare questions routed to comparison-capable model
Frontier models excluded from Auto pool
Conversation summary length as complexity signal

Measure

Auto resolution distribution by tier
Cost per request: Auto vs always-Sonnet baseline
Quality on hard-troubleshoot routes
How often Pass 2 bumps tier
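
Two of these metrics fall straight out of the routing logs. The sketch below assumes a hypothetical log schema with `pass1_model` and `final_model` fields and uses made-up records purely to show the arithmetic.

```python
from collections import Counter

TIER_OF = {
    "gemini-2.5-flash-lite": 0,
    "deepseek-chat-v3.1": 1,
    "gemini-2.5-flash": 2,
    "gemini-3-flash-preview": 3,
}


def summarize_routes(records):
    """Return the share of traffic per final tier and the Pass 2 bump rate."""
    records = list(records)
    total = len(records)
    if total == 0:
        return {"tier_share": {}, "pass2_bump_rate": 0.0}
    by_tier = Counter(TIER_OF[r["final_model"]] for r in records)
    bumped = sum(1 for r in records if r["final_model"] != r["pass1_model"])
    return {
        "tier_share": {tier: count / total for tier, count in sorted(by_tier.items())},
        "pass2_bump_rate": bumped / total,
    }


# Made-up records for illustration only.
sample = [
    {"pass1_model": "gemini-2.5-flash-lite", "final_model": "gemini-2.5-flash-lite"},
    {"pass1_model": "gemini-2.5-flash-lite", "final_model": "gemini-2.5-flash-lite"},
    {"pass1_model": "gemini-2.5-flash-lite", "final_model": "deepseek-chat-v3.1"},
    {"pass1_model": "gemini-2.5-flash", "final_model": "gemini-2.5-flash"},
]
summary = summarize_routes(sample)
print(summary)
```

A rising bump rate usually points at retrieval quality rather than at the heuristics, so it is worth tracking separately from the tier distribution.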

FAQ

Why not use token count alone?

Length alone misses intent, whether you count tokens or words. "Compare X vs Y" is 4 words but needs a comparison-capable model. Regex patterns catch intent; length catches paste dumps and multi-question messages.

Doesn't heuristic routing miss edge cases?

Yes — occasionally. Pass 2 (retrieval upgrade) catches the worst misses. Manual model picker covers power users. The goal is 90% cost savings on 95% of traffic, not perfect classification.

How is this different from query rewrite?

Query rewrite expands follow-ups into standalone search queries (an LLM call). Model routing picks which generator model to use. Rewrite runs before retrieval; routing Pass 1 runs before rewrite; Pass 2 runs after rerank.

Can I add an LLM classifier as a fallback?

You can, but we don't. If heuristics fail often enough to justify a classifier, fix the heuristics first — they're cheaper to iterate.

Auto routing is a unit economics feature. Skip the classifier, test the heuristics, and let retrieval confidence finish the job.
