Auto Model Routing Without Calling an LLM to Pick an LLM
How to build Cursor-style Auto model routing for production RAG chat — heuristic tier selection, regex signals, post-retrieval upgrades, and pytest-tested routing with zero classifier LLM calls.
Most chat products offer an Auto model mode. The obvious implementation: call a cheap LLM to classify the user's question, then route to the right tier. That works — but it adds latency, tokens, and failure modes on every request.
We built a different approach for a production RAG chatbot: heuristic routing (regex + feature scoring) before retrieval, then a post-retrieval tier bump when the knowledge base returns weak matches. No classifier call. Deterministic. Testable with plain pytest.
Part of our RAG engineering series. See Best Economical LLM Models for RAG for pricing context and Cohere Reranking for Production RAG for the retrieval layer this router plugs into.
Who is this for? Teams running multi-provider RAG chat (Gemini, OpenRouter, OpenAI) who want Cursor-style Auto routing without paying for a routing LLM on every turn.
Why not use an LLM classifier?
The classifier pattern has several costs:
| Issue | Impact |
|---|---|
| Extra LLM call every turn | +200–400 ms latency, +500–1500 input tokens |
| Classifier drift | Model updates change routing behavior silently |
| Hard to test | Non-deterministic; flaky CI |
| Double billing | You pay for routing and answering |
| Failure coupling | Classifier 503 blocks the whole chat |
Heuristic routing accepts imperfect classification on edge cases in exchange for zero routing cost, instant routing, and full test coverage. Cursor's Auto mode uses a similar philosophy — feature-based routing, not a meta-LLM call.
Per-request routing overhead: heuristic routing adds ~0 ms and $0; an LLM classifier adds tokens and latency on every turn
Two-pass routing architecture
Routing happens in two passes:
Pass 1 (pre-retrieval): Inspect message text, word count, conversation history, and regex patterns. Pick a starting model tier.
Pass 2 (post-retrieval): After vector search and optional Cohere rerank, check retrieval confidence. If the top rerank score is low, bump the model up one or two tiers on the ladder.
This split matters because a short FAQ ("how do I zoom in?") needs only a cheap model when retrieval is confident, but the same cheap model fails when retrieval returns weak chunks for a troubleshooting question.
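The control flow of the two passes can be sketched as follows. This is a minimal illustration, not the production router: `route_two_pass`, `toy_start_tier`, `toy_bumps`, and the tier names are hypothetical stand-ins, and the retrieval step is passed in as a callable that returns chunks plus a top score.

```python
from typing import Callable, List, Tuple

TIERS = ("economy", "standard", "capable", "premium")  # illustrative tier names


def route_two_pass(
    message: str,
    pick_start_tier: Callable[[str], int],
    retrieve: Callable[[str], Tuple[List[str], float]],
    bumps_for_score: Callable[[float], int],
) -> Tuple[str, List[str]]:
    """Pass 1 picks a starting tier; Pass 2 may raise it after retrieval."""
    tier = pick_start_tier(message)        # Pass 1: text-only signals
    chunks, top_score = retrieve(message)  # vector search plus optional rerank
    # Pass 2: climb the ladder on weak retrieval, clamped at the top tier.
    tier = min(tier + bumps_for_score(top_score), len(TIERS) - 1)
    return TIERS[tier], chunks


def toy_start_tier(message: str) -> int:
    """Toy Pass 1: short messages start on economy, longer ones on standard."""
    return 0 if len(message.split()) <= 8 else 1


def toy_bumps(score: float) -> int:
    """Toy Pass 2: lower retrieval confidence means more tiers to climb."""
    return 2 if score < 0.08 else 1 if score < 0.15 else 0
```

With a confident retriever the short FAQ stays on the economy tier; with a weak one the same message moves up a tier even though its text did not change.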
The tier ladder
Define a fixed ladder of models from economy to premium. The router never jumps to an arbitrary model — it always moves along this ladder.
| Tier | Role | Example model IDs |
|---|---|---|
| 0 — Economy | Cheapest; high-volume FAQs | gemini-2.5-flash-lite, deepseek-v4-flash |
| 1 — Standard | Routine support, moderate explain | deepseek-chat-v3.1, mistral-small-2603 |
| 2 — Capable | Deep guides, long questions, first-turn debug | gemini-2.5-flash, qwen3-235b-a22b |
| 3 — Premium | Hard multi-turn troubleshoot | gemini-3-flash-preview |
```python
_TIER_LADDER = (
    "gemini-2.5-flash-lite",   # 0 economy
    "deepseek-chat-v3.1",      # 1 standard
    "gemini-2.5-flash",        # 2 capable
    "gemini-3-flash-preview",  # 3 premium
)


def _bump_model_tier(model: str, steps: int = 1) -> str:
    new_idx = min(_ladder_index(model) + steps, len(_TIER_LADDER) - 1)
    return _TIER_LADDER[new_idx]
```
Typical Auto resolution distribution — most traffic stays on economy/standard tiers; premium is reserved for hard troubleshoot follow-ups
Pre-retrieval heuristics
Before any vector search, resolve_auto_model() inspects the user message using regex signal patterns:
| Pattern | Keywords | Routes to |
|---|---|---|
| Hard troubleshoot | not working, error, crash, debug | Capable or Premium |
| Compare / either-or | compare, vs, pros and cons, A or B | Mistral (comparison reasoning) |
| Deep / guide | in detail, step by step, walk me through | Capable Flash |
| Support / account | license, activate, subscription | Standard tier |
The hard-troubleshoot check runs before the support-keyword check, so "My license export keeps failing with an error" routes to the troubleshooting tier, not routine support.
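```python
import re

# Hard troubleshoot: failure and debugging language.
# crash(?:es|ed|ing)? is needed because a bare "crash" followed by \b
# does not match "crashing" or "crashes".
_HARD_RE = re.compile(
    r"\b(not working|doesn't work|error|bug|broken|crash(?:es|ed|ing)?|"
    r"troubleshoot|debug|fix this|still failing)\b",
    re.I,
)
# Compare / either-or: explicit comparison words, or an "or ..." construction.
_COMPARE_RE = re.compile(
    r"\b(compare|comparison|difference|versus|vs\.?|"
    r"which one|pros and cons)\b|"
    r"\bor\b.+\b(or|local|cloud|vs)\b",
    re.I,
)
# Deep / guide: requests for depth or a walkthrough.
_DEEP_RE = re.compile(
    r"\b(in detail|step by step|walk me through|"
    r"full explanation|tell me more|elaborate)\b",
    re.I,
)
```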
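To make the ordering concrete, here is a small sketch of checking the hard-troubleshoot pattern before the support pattern. The patterns are trimmed copies for illustration only, and `SUPPORT_RE` and `first_signal` are hypothetical: the article lists the support keywords but does not show that regex.

```python
import re

# Trimmed, illustrative copy of the hard-troubleshoot pattern.
HARD_RE = re.compile(
    r"\b(not working|error|crash(?:es|ed|ing)?|still failing)\b", re.I
)
# Hypothetical support pattern built from the keywords in the signal table.
SUPPORT_RE = re.compile(r"\b(license|activate|subscription)\b", re.I)


def first_signal(message: str) -> str:
    """Check hard-troubleshoot before support so debugging language wins."""
    if HARD_RE.search(message):
        return "hard_troubleshoot"
    if SUPPORT_RE.search(message):
        return "routine_support"
    return "none"
```

A message that mentions both a license and an error resolves to `hard_troubleshoot`, because the first matching check returns before the support pattern is consulted.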
Length and structure signals
| Signal | Threshold | Effect |
|---|---|---|
| Word count | > 45 words | +2 complexity; route to Capable |
| Multiple questions | > 1 ? in message | +2 complexity |
| Long paste | > 2500 chars | Route to Capable |
| Follow-up turn | Prior assistant message exists | +1 complexity; may bump tier |
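
The structural signals in this table are cheap to compute from the raw request. A minimal sketch, assuming the conversation history is a list of dicts with a `role` key (as in the test examples later in this article); `MessageFeatures` and `extract_features` are illustrative names, not the production API.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class MessageFeatures:
    words: int
    chars: int
    question_marks: int
    prior_assistant: int  # number of earlier assistant turns


def extract_features(
    message: str, history: Optional[Sequence[dict]] = None
) -> MessageFeatures:
    """Compute the structural signals from the table; no model calls involved."""
    history = history or []
    return MessageFeatures(
        words=len(message.split()),
        chars=len(message),
        question_marks=message.count("?"),
        prior_assistant=sum(
            1 for turn in history if turn.get("role") == "assistant"
        ),
    )
```

Whitespace splitting is a rough word count, which is usually good enough for thresholds this coarse.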
Complexity scoring
A numeric score aggregates structural features before rule matching:
```python
def _complexity_score(lower, words, chars, prior_assistant, conversation_summary=None):
    """Aggregate structural and regex signals into one integer score."""
    score = 0
    # Length thresholds are cumulative: a 50-word message earns 1 + 1 + 2 = 4.
    if words > 12: score += 1
    if words > 28: score += 1
    if words > 45: score += 2
    if chars > 1200: score += 1
    if chars > 2500: score += 2
    if lower.count("?") > 1: score += 2
    if prior_assistant > 0: score += 1
    # _GUIDE_RE is not shown in this article; it is assumed to match guide-style phrasing.
    if _DEEP_RE.search(lower) or _GUIDE_RE.search(lower): score += 2
    if _HARD_RE.search(lower): score += 2
    if conversation_summary and len(conversation_summary.strip()) > 800:
        score += 1
    return score
```
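A score ≥ 6 routes to the Capable tier unless a higher-priority rule in the table below matches first (for example, hard troubleshoot with score ≥ 5 goes to Premium).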
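When tuning thresholds it helps to see which features contributed to a score. The sketch below reuses the structural thresholds from `_complexity_score` and returns a breakdown alongside the total. It deliberately leaves out the regex and conversation-summary signals, and `explain_score` is a hypothetical helper rather than part of the router.

```python
from typing import List, Tuple


def explain_score(
    words: int, chars: int, question_marks: int, prior_assistant: int
) -> Tuple[int, List[Tuple[str, int]]]:
    """Return (total, contributions) for the structural signals only."""
    rules = [
        ("words>12", words > 12, 1),
        ("words>28", words > 28, 1),
        ("words>45", words > 45, 2),
        ("chars>1200", chars > 1200, 1),
        ("chars>2500", chars > 2500, 2),
        ("multi_question", question_marks > 1, 2),
        ("follow_up", prior_assistant > 0, 1),
    ]
    hits = [(name, points) for name, matched, points in rules if matched]
    return sum(points for _, points in hits), hits
```

For a 50-word follow-up containing two question marks, the length thresholds stack to 4, the multi-question signal adds 2, and the follow-up adds 1, for a total of 7.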
Rule priority and examples
Rules are evaluated in priority order — first match wins:
| Priority | Condition | Model tier | Reason code |
|---|---|---|---|
| 1 | Hard troubleshoot and (follow-up or score ≥ 5) | Premium | hard_troubleshoot_premium |
| 2 | Hard troubleshoot (first turn) | Capable | hard_troubleshoot |
| 3 | Compare / either-or | Mistral | compare_or_either_or |
| 4 | Deep explanation or step-by-step | Capable | deep_or_guide |
| 5 | Very long message | Capable | long_context |
| 6 | Multi-question or score ≥ 6 | Capable | high_complexity |
| 7 | Follow-up + elaboration | Capable | follow_up_elaboration |
| 8 | Follow-up (generic) | Standard | follow_up |
| 9 | Support keywords or 14–40 words | Standard | routine_support |
| 10 | Short FAQ (≤ 8 words) | Economy+ | short_faq |
| 11 | Minimal (≤ 3 words) | Default economy | minimal |
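
One way to keep "first match wins" explicit is an ordered tuple of rules, each carrying its reason code. This sketch covers only a subset of the table. The feature keys (`hard`, `follow_up`, `score`, `support`) and the `"default"` fallback reason are hypothetical names chosen for illustration.

```python
from typing import Callable, NamedTuple, Tuple


class Rule(NamedTuple):
    reason: str
    tier: str
    matches: Callable[[dict], bool]


# Ordered by priority; earlier rules shadow later ones.
RULES = (
    Rule("hard_troubleshoot_premium", "premium",
         lambda f: f["hard"] and (f["follow_up"] or f["score"] >= 5)),
    Rule("hard_troubleshoot", "capable", lambda f: f["hard"]),
    Rule("high_complexity", "capable", lambda f: f["score"] >= 6),
    Rule("follow_up", "standard", lambda f: f["follow_up"]),
    Rule("routine_support", "standard", lambda f: f["support"]),
)


def resolve(features: dict) -> Tuple[str, str]:
    """Return (tier, reason) for the first matching rule."""
    for rule in RULES:
        if rule.matches(features):
            return rule.tier, rule.reason
    # Hypothetical fallback when nothing matches.
    return "economy", "default"
```

Keeping the rules in one ordered structure makes priority changes a single reordering and keeps every route attributable to one reason code.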
Worked examples
| User message | Resolved model | Why |
|---|---|---|
| "how do I zoom in?" | gemini-3.1-flash-lite | Short FAQ |
| "How do I activate my license?" | deepseek-chat-v3.1 | Support keyword |
| "Compare frameless vs framed cabinets" | mistral-small-2603 | Compare pattern |
| "explain the export folder in detail" | gemini-2.5-flash | Deep pattern |
| "export keeps crashing" (turn 1) | gemini-2.5-flash | Hard troubleshoot |
| Same message (turn 2) | gemini-3-flash-preview | Hard + follow-up → premium |
Log the reason code on every route — essential for tuning.
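A minimal sketch of what that logging could look like, using only the standard library. The logger name, the `log_route` helper, and the field names are assumptions for illustration; adapt them to whatever your log pipeline expects.

```python
import json
import logging

logger = logging.getLogger("auto_router")  # hypothetical logger name


def log_route(requested: str, resolved: str, reason: str, score: int) -> str:
    """Emit one structured log line per routing decision and return it."""
    record = json.dumps(
        {
            "event": "auto_route",
            "requested_model": requested,
            "resolved_model": resolved,
            "reason": reason,
            "complexity_score": score,
        },
        sort_keys=True,
    )
    logger.info(record)
    return record
```

One JSON object per decision makes it straightforward to group routes by reason code later and spot rules that fire too often or never.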
Post-retrieval tier upgrade
After retrieval and rerank, adjust_auto_model_for_retrieval() may bump the tier based on confidence signals:
| Condition | Bump |
|---|---|
| Rerank ran, top score < 0.08 | +2 tiers |
| Rerank ran, top score < 0.15 | +1 tier |
| No rerank, top cosine < 0.72 | +1 tier |
| Otherwise | No change |
```python
RERANK_UPGRADE_THRESHOLD = 0.15
RERANK_STRONG_UPGRADE_THRESHOLD = 0.08
COSINE_UPGRADE_THRESHOLD = 0.72


def adjust_auto_model_for_retrieval(
    model: str,
    *,
    top_rerank_score: float = 0.0,
    rerank_ran: bool = False,
    top_cosine: float = 0.0,
) -> str:
    bumps = 0
    if rerank_ran:
        if top_rerank_score < RERANK_STRONG_UPGRADE_THRESHOLD:
            bumps = 2
        elif top_rerank_score < RERANK_UPGRADE_THRESHOLD:
            bumps = 1
    elif 0 < top_cosine < COSINE_UPGRADE_THRESHOLD:
        bumps = 1
    return _bump_model_tier(model, bumps) if bumps > 0 else model
```
Example: the economy model is selected for a short question, but the top rerank score is 0.06, so the router bumps +2 from tier 0 to tier 2. This covers the gray zone between the soft no-results gate (score < 0.05) and confident retrieval, where a stronger model can still synthesize an answer from weak context.
Rerank score zones: no upgrade above 0.15, +1 tier between 0.08–0.15, +2 tiers below 0.08
Auto pool vs manual picker
Auto mode routes only within a fixed pool of budget-to-mid models. Premium manual models (Claude Sonnet, Claude Haiku) stay in the UI picker but are never Auto-selected.
| Design choice | Why |
|---|---|
| Exclude Claude from Auto | Cost control: defaulting to Sonnet would erase the routing savings |
| Fixed Auto pool | Predictable cost envelope per request |
| Manual picker for frontier | Power users pick Claude explicitly when needed |
```python
AUTO_MODEL_POOL = frozenset({
    "gemini-2.5-flash-lite",
    "gemini-3.1-flash-lite",
    "deepseek-v4-flash",
    "deepseek-chat-v3.1",
    "mistral-small-2603",
    "gemini-2.5-flash",
    "qwen3-235b-a22b",
    "gemini-3-flash-preview",
})
# Claude models: manual picker only, never Auto
```
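A pool like this is most useful when something enforces it. The guard below is a sketch of one option: fall back to a default whenever a resolved model is outside the pool. `ensure_auto_pool` and `DEFAULT_AUTO_MODEL` are hypothetical, and the choice of the economy model as the fallback is an assumption.

```python
AUTO_MODEL_POOL = frozenset({
    "gemini-2.5-flash-lite",
    "gemini-3.1-flash-lite",
    "deepseek-v4-flash",
    "deepseek-chat-v3.1",
    "mistral-small-2603",
    "gemini-2.5-flash",
    "qwen3-235b-a22b",
    "gemini-3-flash-preview",
})

# Assumption: the economy model is the safe fallback for Auto.
DEFAULT_AUTO_MODEL = "gemini-2.5-flash-lite"


def ensure_auto_pool(model: str) -> str:
    """Return the model if Auto may select it, else the default."""
    return model if model in AUTO_MODEL_POOL else DEFAULT_AUTO_MODEL
```

Running the resolved model through a guard like this after both passes means a misconfigured rule cannot silently route Auto traffic to a manual-only model.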
Wiring into the chat pipeline
In the FastAPI chat handler, routing fits between request parsing and retrieval:
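```python
def _resolve_chat_model(request):
    """Return (model_name, "auto") for Auto requests, else (model_name, None)."""
    requested = request.model or default_chat_model()
    if requested == "auto":
        resolved = resolve_auto_model(
            request.message,
            request.conversation_history,
            conversation_summary=request.conversation_summary,
        )
        return resolved, "auto"
    return requested, None


# In the chat handler, before retrieval (Pass 1):
model_name, auto_requested = _resolve_chat_model(request)

# ... after retrieval + rerank ...
if auto_requested == "auto":
    # Pass 2: only Auto requests are eligible for a retrieval-based bump.
    model_name = adjust_auto_model_for_retrieval(
        model_name,
        top_rerank_score=top_rerank_score,
        rerank_ran=rerank_scores is not None,
        top_cosine=top_cosine,
    )
```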
Return both requested_model: "auto" and resolved_model in usage headers so billing dashboards show what Auto picked.
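As a sketch of that reporting step, the helper below builds a plain header dict from the two values the handler already has. The header names `X-Requested-Model` and `X-Resolved-Model` and the function `usage_headers` are hypothetical; use whatever naming your billing dashboard already parses.

```python
from typing import Dict, Optional


def usage_headers(resolved_model: str, auto_requested: Optional[str]) -> Dict[str, str]:
    """Build response headers exposing both the requested and resolved model."""
    return {
        # "auto" when Auto was requested, otherwise the explicitly chosen model.
        "X-Requested-Model": auto_requested or resolved_model,
        "X-Resolved-Model": resolved_model,
    }
```

For a manual pick both headers carry the same model ID, so dashboards can separate Auto traffic by checking whether the requested value is `auto`.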
Testing with pytest
Heuristic routers are fully deterministic — write table-driven tests and run them on every PR:
```python
def test_heuristic_short_faq():
    assert resolve_auto_model("how do I zoom in?") in (
        "deepseek-v4-flash",
        "gemini-2.5-flash-lite",
        "gemini-3.1-flash-lite",
    )


def test_heuristic_compare_routes_to_mistral():
    assert resolve_auto_model("Compare option A vs option B") == "mistral-small-2603"


def test_heuristic_hard_troubleshoot_premium_on_follow_up():
    history = [{"role": "assistant", "content": "Try restarting."}]
    assert resolve_auto_model(
        "export keeps crashing when I click export", history
    ) == "gemini-3-flash-preview"


def test_retrieval_upgrade_bumps_on_low_rerank():
    assert adjust_auto_model_for_retrieval(
        "gemini-2.5-flash-lite",
        top_rerank_score=0.10,
        rerank_ran=True,
    ) == "deepseek-chat-v3.1"


def test_retrieval_upgrade_skips_when_confident():
    assert adjust_auto_model_for_retrieval(
        "gemini-2.5-flash-lite",
        top_rerank_score=0.42,
        rerank_ran=True,
    ) == "gemini-2.5-flash-lite"
```
When you tune a regex or threshold, update the test — the test suite is your routing spec.
Tuning checklist
Must have
- Fixed tier ladder (no arbitrary model jumps)
- Reason codes logged on every route
- Post-retrieval bump tied to rerank scores
- pytest coverage for top message patterns
- Usage header with requested + resolved model
Should have
- Hard-troubleshoot priority over support keywords
- Compare questions routed to comparison-capable model
- Frontier models excluded from Auto pool
- Conversation summary length as complexity signal
Measure
- Auto resolution distribution by tier
- Cost per request: Auto vs always-Sonnet baseline
- Quality on hard-troubleshoot routes
- How often Pass 2 bumps tier
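
Two of these measurements, the resolution distribution and the Pass 2 bump rate, fall out of the routing logs directly. A minimal sketch, assuming each logged record carries a `pass1_model` and a `resolved_model` field; both field names and `routing_metrics` are hypothetical.

```python
from collections import Counter
from typing import Iterable, Mapping


def routing_metrics(records: Iterable[Mapping[str, str]]) -> dict:
    """Summarise logged routes: share per resolved model and Pass 2 bump rate."""
    rows = list(records)
    total = len(rows)
    if total == 0:
        return {"distribution": {}, "pass2_bump_rate": 0.0}
    counts = Counter(row["resolved_model"] for row in rows)
    # A route was bumped if Pass 2 changed the model chosen by Pass 1.
    bumped = sum(1 for row in rows if row["pass1_model"] != row["resolved_model"])
    return {
        "distribution": {model: n / total for model, n in counts.items()},
        "pass2_bump_rate": bumped / total,
    }
```

A bump rate that climbs over time is a hint that either retrieval quality has dropped or Pass 1 is starting too low for current traffic.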
FAQ
Why not use token count alone?
Length alone, whether counted in tokens or words, misses intent. "Compare X vs Y" is 4 words but needs a comparison-capable model. Regex patterns catch intent; length catches paste dumps and multi-question messages.
Doesn't heuristic routing miss edge cases?
Yes — occasionally. Pass 2 (retrieval upgrade) catches the worst misses. Manual model picker covers power users. The goal is 90% cost savings on 95% of traffic, not perfect classification.
How is this different from query rewrite?
Query rewrite expands follow-ups into standalone search queries (an LLM call). Model routing picks which generator model to use. In pipeline order: routing Pass 1, then query rewrite, then retrieval and rerank, then routing Pass 2.
Can I add an LLM classifier as a fallback?
You can, but we don't. If heuristics fail often enough to justify a classifier, fix the heuristics first — they're cheaper to iterate.
Auto routing is a unit economics feature. Skip the classifier, test the heuristics, and let retrieval confidence finish the job.