Best Economical LLM Models for RAG
Picking a chat model is not just a quality decision — it is a unit economics decision. In RAG, every turn sends system prompt + chat history + retrieved chunks + user query to the API. A model that costs 10× more per token can turn a $6/month prototype into a $200/month bill at the same traffic.
This guide compares OpenAI, Google Gemini 3 (plus budget 2.5 tiers), and Anthropic Claude — with list-price tables, a worked RAG cost model that includes chat history and context, and charts for monthly spend at 10,000 queries.
Pricing note: Figures use published API list prices as of June 2026. Google's current generation is the Gemini 3 family — see Gemini 3 docs and official pricing. Verify before budgeting; thinking tokens, caching, and batch tiers change effective rates.
Why model choice matters in RAG
Plain chat bills history + new message + reply. RAG also bills:
- Retrieved chunks — often 2,000–6,000 tokens per turn
- System instructions — citations, tone, guardrails
- Optional query rewrite — extra LLM call
- Embeddings + reranking — small but non-zero
Using GPT-5.4, Gemini 3.5 Flash, or Claude Sonnet on every turn is like overnight shipping every package — fine for escalations, expensive as a default.
Gemini 3 lineup (current)
Google's latest models are in the Gemini 3 series. Older 2.5 Flash-Lite remains the cheapest text option for many RAG workloads.
| Model | Model ID | Input / 1M | Output / 1M | Context | Role |
|---|---|---|---|---|---|
| Gemini 3.1 Flash-Lite | gemini-3.1-flash-lite | $0.25 | $1.50 | 1M | Cheapest 3-series — high-volume RAG |
| Gemini 3 Flash Preview | gemini-3-flash-preview | $0.50 | $3.00 | 1M | Pro-level intelligence at Flash speed |
| Gemini 3.5 Flash | gemini-3.5-flash | $1.50 | $9.00 | 1M | Latest fast flagship (Google I/O 2026) |
| Gemini 3.1 Pro Preview | gemini-3.1-pro-preview | $2 / $4* | $12 / $18* | 1M | Deepest reasoning in 3 family |
| Gemini 2.5 Flash-Lite | gemini-2.5-flash-lite | $0.10 | $0.40 | 1M | Still available — often cheapest overall |
* Gemini 3.1 Pro pricing doubles above 200k input tokens.
General chat comparison (OpenAI · Gemini · Anthropic)
| Model | Provider | Input / 1M | Output / 1M | Context | Best for |
|---|---|---|---|---|---|
| Gemini 2.5 Flash-Lite | $0.10 | $0.40 | 1M | Lowest $/token chat | |
| Gemini 3.1 Flash-Lite | $0.25 | $1.50 | 1M | Cheap 3-series chat | |
| GPT-5.4 nano | OpenAI | $0.20 | $1.25 | 400K | Cheapest OpenAI 5.x |
| Gemini 3 Flash Preview | $0.50 | $3.00 | 1M | Smart Gemini 3 default | |
| GPT-5.4 mini | OpenAI | $0.75 | $4.50 | 400K | OpenAI volume workhorse |
| Claude Haiku 4.5 | Anthropic | $1.00 | $5.00 | 200K | Careful answers + caching |
| Gemini 3.5 Flash | $1.50 | $9.00 | 1M | Latest fast flagship | |
| GPT-5.4 | OpenAI | $2.50 | $15.00 | 1M | OpenAI production default |
| Gemini 3.1 Pro Preview | $2.00 | $12.00 | 1M | Hard multimodal + agents | |
| Claude Sonnet 4.6 | Anthropic | $3.00 | $15.00 | 1M | Coding, agents, precision |
| GPT-5.5 | OpenAI | $5.00 | $30.00 | 1M | Frontier reasoning |
| Claude Opus 4.7 | Anthropic | $5.00 | $25.00 | 1M | Hardest tasks only |
Sources: Gemini API pricing, OpenAI pricing, Anthropic pricing.
API list price per 1M tokens — input vs output (budget + mid-tier models)
Budget tier by provider
| Tier | OpenAI | Google Gemini | Anthropic |
|---|---|---|---|
| Ultra-cheap | GPT-5.4 nano ($0.20 / $1.25) | 2.5 Flash-Lite ($0.10 / $0.40) | — |
| Cheap 3-series | — | 3.1 Flash-Lite ($0.25 / $1.50) | — |
| Mid-budget | GPT-5.4 mini ($0.75 / $4.50) | 3 Flash Preview ($0.50 / $3.00) | Haiku 4.5 ($1.00 / $5.00) |
| Fast flagship | GPT-5.4 ($2.50 / $15.00) | 3.5 Flash ($1.50 / $9.00) | Sonnet 4.6 ($3.00 / $15.00) |
| Frontier | GPT-5.5 ($5.00 / $30.00) | 3.1 Pro ($2.00 / $12.00) | Opus 4.7 ($5.00 / $25.00) |
What a RAG request actually bills
Typical single RAG turn token budget:
| Component | Tokens | Notes |
|---|---|---|
| System prompt | 400 | Citations, persona, safety |
| Chat history (last 8 turns) | 1,800 | Grows every turn — trim or summarize |
| Retrieved chunks (5 × ~450) | 2,250 | Main cost driver after history |
| User query | 80 | — |
| Total input | ~4,530 | Billed as input tokens |
| Assistant reply | 350 | Billed as output tokens |
Add-ons: embeddings (~$0.000002/query), optional rerank (~$0.001/search), optional query rewrite (cheap LLM call on Flash-Lite or nano).
Chat history tip: A 20-turn support thread can add 5,000+ tokens before retrieval even starts. Summarize older turns.
Cost per RAG turn (history + context included)
Formula: (input_tokens / 1M × input_price) + (output_tokens / 1M × output_price) using 4,530 in / 350 out:
| Model | Cost / RAG turn | vs cheapest |
|---|---|---|
| Gemini 2.5 Flash-Lite | ~$0.00059 | — |
| Gemini 3.1 Flash-Lite | ~$0.00166 | +181% |
| GPT-5.4 nano | ~$0.00134 | +127% |
| Gemini 3 Flash Preview | ~$0.00332 | +463% |
| GPT-5.4 mini | ~$0.00497 | +743% |
| Claude Haiku 4.5 | ~$0.00628 | +964% |
| Gemini 3.5 Flash | ~$0.00995 | +1,586% |
| GPT-5.4 | ~$0.0166 | +2,712% |
| Claude Sonnet 4.6 | ~$0.0188 | +3,089% |
Standard list prices. Gemini batch API is 50% off; context caching can cut repeated input by ~90%.
Cost per RAG turn — 4,530 input + 350 output tokens (history + 5 chunks)
Monthly cost at 10,000 RAG queries
| Model | ~$/month (10k queries) |
|---|---|
| Gemini 2.5 Flash-Lite | ~$5.90 |
| GPT-5.4 nano | ~$13.40 |
| Gemini 3.1 Flash-Lite | ~$16.60 |
| Gemini 3 Flash Preview | ~$33.20 |
| GPT-5.4 mini | ~$49.70 |
| Claude Haiku 4.5 | ~$62.80 |
| Gemini 3.5 Flash | ~$99.50 |
| GPT-5.4 | ~$166 |
| Claude Sonnet 4.6 | ~$188 |
At 100k queries/month, Sonnet-default RAG ≈ $1,880 vs ~$59 on 2.5 Flash-Lite.
Monthly generation cost at 10,000 RAG queries (same token assumptions)
Best economical picks for RAG
| Use case | Pick | Why |
|---|---|---|
| Cheapest RAG (any provider) | Gemini 2.5 Flash-Lite | $0.10/$0.40, 1M context, free dev tier |
| Cheapest Gemini 3 | Gemini 3.1 Flash-Lite | Current 3-series budget; better quality than 2.5 |
| Balanced Gemini 3 RAG | Gemini 3 Flash Preview | Stronger answers; 3× cheaper output than 3.5 Flash |
| OpenAI-only stack | GPT-5.4 nano → mini | Nano for volume; mini when reasoning matters |
| Precision / compliance | Claude Haiku 4.5 + caching | Instruction adherence; cache system + doc prefix |
| Hard queries only | 3.1 Pro / GPT-5.4 / Sonnet | Route <10% of traffic — not every RAG turn |
Production pattern: cheap model for answer generation, optional rerank gate, frontier model on escalation. See Cohere reranking in production.
Two-tier routing pattern
Keeps 90–95% of tokens on sub-$2/1M-output models while preserving quality on hard queries.
FAQ
What is the cheapest model for RAG in 2026?
Gemini 2.5 Flash-Lite ($0.10 / $0.40 per 1M) is typically lowest for text RAG. Among Gemini 3, use 3.1 Flash-Lite ($0.25 / $1.50).
Is Gemini 3 cheaper than Gemini 2.5?
Not for RAG volume. 3.1 Flash-Lite costs more than 2.5 Flash-Lite. Pick 3-series for newer reasoning; pick 2.5 Flash-Lite when minimizing $/query.
When should I use Gemini 3.5 Flash?
For agents, coding, and search-grounded chat where quality justifies $1.50/$9.00. For chunk-based RAG at scale, it is usually overkill — ~17× the per-turn cost of 2.5 Flash-Lite in our model.
Does chat history increase RAG cost?
Yes — linearly. Cap or summarize history; don't send the full thread on every retrieval.
Should I use Claude Sonnet for every RAG answer?
No. Sonnet is ~32× the per-turn cost of 2.5 Flash-Lite in this example. Route hard queries only.
Related guides
Default cheap. Escalate smart. Measure tokens per turn.