According to IDC's 2026 enterprise AI survey, 37% of enterprises now run five or more AI models in production. Most are not routing intelligently between them — they pick one model per use case and leave it there. That is the most expensive mistake in AI infrastructure today.

Smart model routing — sending each request to the cheapest model that can handle it reliably — cuts API costs 60–85% in documented production deployments, with one case study showing a simultaneous 28% faster resolution time and 19% improvement in user satisfaction scores. This guide covers the architecture, the implementation patterns, and the exact tier structure that works in 2026.

What Model Routing Is

Model routing is the practice of evaluating each incoming request and dispatching it to the most cost-effective model capable of handling it correctly. The alternative — sending every request to a frontier model — is equivalent to using a freight truck to deliver every package regardless of size.

The economic case is stark. Compare the cost of sending 1 million requests per month, each consuming 2K input tokens and 500 output tokens, to a single frontier model versus a routed stack:

Routing Strategy | Model Used | Monthly Cost
All requests → Claude Opus 4.6 | $15/$75 per 1M | $67,500
All requests → Gemini 3.1 Pro | $2/$12 per 1M | $10,000
All requests → Gemini 3.1 Flash | $0.75/$3 per 1M | $3,000
Routed (30% nano / 50% mid / 20% frontier) | Mixed | ≈$3,900–$15,400
Routed with Flash-Lite for simple tasks | Mixed | ≈$2,000–$3,000

(Routed figures assume Flash-Lite at the nano tier and Flash at the mid tier; the range spans Gemini 3.1 Pro versus Claude Opus 4.6 at the frontier tier.)

The routed stack does not just save money — it is faster on average, because simple requests that previously waited for a slow frontier model now complete on a lightweight model in milliseconds.
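The single-model rows in the table follow directly from the quoted prices. A minimal sketch of the arithmetic, assuming the 2K-input/500-output request shape described above:

```python
# Cost model for the comparison above: per-million-token prices applied to
# 1M requests/month, each with 2,000 input and 500 output tokens.

def monthly_cost(input_price: float, output_price: float,
                 requests: int = 1_000_000,
                 input_tokens: int = 2_000,
                 output_tokens: int = 500) -> float:
    """Monthly API cost in dollars, given $/1M-token prices."""
    input_millions = requests * input_tokens / 1_000_000
    output_millions = requests * output_tokens / 1_000_000
    return input_millions * input_price + output_millions * output_price

print(monthly_cost(15, 75))    # Claude Opus 4.6  -> 67500.0
print(monthly_cost(2, 12))     # Gemini 3.1 Pro   -> 10000.0
print(monthly_cost(0.75, 3))   # Gemini 3.1 Flash -> 3000.0
```

Note that output tokens account for more than half of the Opus figure despite being only a fifth of the volume, which is why output-heavy workloads benefit most from routing.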

The Three-Tier Model Stack for 2026

Effective routing in 2026 maps to three tiers, each with a clear cost and capability profile:

Tier 1: Nano (Classification and Routing)

Models: Gemini 3.1 Flash-Lite ($0.25/$1.50), GPT-5.4-mini, Claude Haiku 4.5

Task profile: Intent classification, entity extraction, yes/no decisions, simple lookups, routing decisions themselves, keyword tagging

Latency target: Under 300ms Time to First Token

Tier 1 handles the highest request volume at the lowest cost. A well-prompted Gemini 3.1 Flash-Lite at minimal thinking mode classifies intent with accuracy equivalent to larger models on well-defined taxonomies.

Tier 2: Mid-Tier (Standard Production Tasks)

Models: Gemini 3.1 Flash ($0.75/$3.00), Claude Sonnet 4.6 ($3/$15), GPT-5.4 ($2.50/$15)

Task profile: Code generation, summarization, document Q&A, multi-turn conversation, standard reasoning, data transformation

Latency target: Under 2 seconds Time to First Token

Tier 2 handles the bulk of substantive work — approximately 50% of requests in a typical enterprise pipeline. These tasks need real reasoning capability but do not require the maximum reasoning depth of frontier models.

Tier 3: Frontier (Complex and High-Stakes Tasks)

Models: Claude Opus 4.6 ($15/$75), Gemini 3.1 Pro ($2/$12), GPT-5.4 for complex tasks

Task profile: Long-horizon agentic coding, complex algorithm design, security analysis, multi-document synthesis, tasks explicitly requiring maximum reasoning depth

Latency target: Flexible — quality over speed

Tier 3 should handle no more than 15–20% of total request volume. If more than 20% of requests are reaching Tier 3, your classification logic is too conservative.
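The 20% ceiling is easy to check mechanically from routing logs. A sketch, assuming each log record carries a model_used field; the model-to-tier map is an illustrative assumption mirroring the tiers above:

```python
# Audit tier distribution from per-request routing logs and flag when the
# frontier share exceeds the 20% ceiling discussed above.
from collections import Counter
from typing import Dict, List

TIER_OF = {
    "gemini-3.1-flash-lite": "nano",
    "gemini-3.1-flash": "mid",
    "claude-sonnet-4-6": "mid",
    "gemini-3.1-pro": "frontier",
    "claude-opus-4-6": "frontier",
}

def tier_shares(records: List[Dict]) -> Dict[str, float]:
    """Fraction of requests served by each tier."""
    counts = Counter(TIER_OF.get(r["model_used"], "unknown") for r in records)
    total = sum(counts.values())
    return {tier: n / total for tier, n in counts.items()}

def frontier_overloaded(records: List[Dict], cap: float = 0.20) -> bool:
    """True when the frontier share exceeds the ceiling -- a signal that
    the classification logic is too conservative."""
    return tier_shares(records).get("frontier", 0.0) > cap
```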

Routing Architecture: Two Approaches

Approach 1: Rule-Based Router (50–100 Lines of Code)

For most teams, a rule-based router is sufficient and faster to deploy than an LLM-based classifier. It evaluates explicit signals — token count, keywords, task type flags — and dispatches accordingly.

from dataclasses import dataclass
from enum import Enum
from typing import Optional

class ModelTier(Enum):
    NANO = "nano"
    MID = "mid"
    FRONTIER = "frontier"

@dataclass
class RoutingDecision:
    tier: ModelTier
    model: str
    reason: str

# Explicit task-type overrides -- extend these sets to add new rules
NANO_TASKS = {"classify", "tag", "extract", "yes_no", "route", "intent"}
FRONTIER_TASKS = {"agentic_code", "security_audit", "algorithm_design", "long_synthesis"}

# Keyword signals for frontier escalation
FRONTIER_SIGNALS = [
    "security vulnerability", "cve", "cryptograph",
    "design an algorithm", "formal proof", "optimize for"
]

def route_request(
    prompt: str,
    task_type: Optional[str] = None,
    input_token_estimate: Optional[int] = None,
    force_tier: Optional[ModelTier] = None
) -> RoutingDecision:
    """
    Rule-based router. Returns the cheapest model tier for the request.
    """
    if force_tier:
        return _dispatch_to_tier(force_tier, "forced")

    if task_type in NANO_TASKS:
        return RoutingDecision(ModelTier.NANO, "gemini-3.1-flash-lite", f"task_type={task_type}")

    if task_type in FRONTIER_TASKS:
        return RoutingDecision(ModelTier.FRONTIER, "claude-opus-4-6", f"task_type={task_type}")

    # Content signals beat length signals: a short prompt can still need
    # frontier capability, so check keywords before any token heuristic
    prompt_lower = prompt.lower()
    if any(signal in prompt_lower for signal in FRONTIER_SIGNALS):
        return RoutingDecision(ModelTier.FRONTIER, "gemini-3.1-pro", "frontier_keyword")

    # Token-count heuristic: very large contexts go to frontier
    # (~1.3 tokens per whitespace-separated word as a rough estimate)
    token_count = input_token_estimate or len(prompt.split()) * 1.3
    if token_count > 50_000:
        return RoutingDecision(ModelTier.FRONTIER, "claude-opus-4-6", "large_context")

    # Default: mid-tier handles everything else. Short prompts are NOT
    # auto-downgraded to nano -- prompt length alone is a weak signal
    return RoutingDecision(ModelTier.MID, "claude-sonnet-4-6", "default_mid")

def _dispatch_to_tier(tier: ModelTier, reason: str) -> RoutingDecision:
    defaults = {
        ModelTier.NANO: "gemini-3.1-flash-lite",
        ModelTier.MID: "claude-sonnet-4-6",
        ModelTier.FRONTIER: "claude-opus-4-6",
    }
    return RoutingDecision(tier, defaults[tier], reason)

This router fits in under 80 lines, needs nothing beyond the Python standard library for the routing logic itself, and covers the routing decisions that handle 80% of real-world cases correctly. Add new rules by extending the NANO_TASKS and FRONTIER_TASKS sets and the keyword list.

Approach 2: LLM Classifier Router

For more nuanced routing — ambiguous task types, user-generated inputs with unpredictable complexity — use a nano-tier model as the classifier itself. The classifier reads the request and returns a structured routing decision:

import os
import json
import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])

CLASSIFIER_MODEL = genai.GenerativeModel("gemini-3.1-flash-lite")

CLASSIFIER_PROMPT = """You are a request classifier for an AI routing system.
Analyze the user request and respond ONLY with a JSON object, no other text:

{{
  "tier": "nano" | "mid" | "frontier",
  "task_type": "<one of: classify, summarize, code_gen, bug_fix, algorithm_design, security_audit, general>",
  "estimated_complexity": 1-10,
  "reason": "<one sentence>"
}}

Tier definitions:
- nano: simple classification, extraction, yes/no, single-fact lookup
- mid: code generation, summarization, standard Q&A, multi-step reasoning under 10 steps
- frontier: complex algorithm design, security analysis, 50+ step agentic tasks, novel reasoning

User request: {prompt}"""

def classify_request(prompt: str) -> dict:
    """
    Use Flash-Lite as a classifier to determine the appropriate routing tier.
    Cost: approximately $0.000025 per classification at Flash-Lite pricing.
    """
    response = CLASSIFIER_MODEL.generate_content(
        CLASSIFIER_PROMPT.format(prompt=prompt[:2000]),  # Truncate for the classifier
        generation_config=genai.GenerationConfig(
            thinking_config={"thinking_mode": "flash"},
            temperature=0.0  # Deterministic classification
        )
    )

    try:
        raw = response.text.strip().replace("```json", "").replace("```", "")
        return json.loads(raw)
    except json.JSONDecodeError:
        # Fall back to mid-tier on classification failure
        return {"tier": "mid", "task_type": "general", "reason": "classification_failed"}

The LLM classifier approach costs approximately $0.000025 per classification at Gemini 3.1 Flash-Lite pricing — negligible overhead that pays for itself if it correctly downgrades even a single frontier request to mid-tier per thousand classifications.
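The break-even claim is easy to verify. A back-of-envelope sketch, using the per-1M-token prices quoted earlier and the 2K-in/500-out request shape from the cost comparison at the top of the article:

```python
# One frontier request downgraded to mid-tier saves far more than the
# classifier costs to run across a thousand requests.

def request_cost(input_price: float, output_price: float,
                 input_tokens: int = 2_000, output_tokens: int = 500) -> float:
    return input_tokens * input_price / 1e6 + output_tokens * output_price / 1e6

opus = request_cost(15, 75)          # ~$0.0675 per frontier request
sonnet = request_cost(3, 15)         # ~$0.0135 per mid-tier request
saved_per_downgrade = opus - sonnet  # ~$0.054

# Classifier overhead for 1,000 requests at ~$0.000025 each
classifier_cost = 1_000 * 0.000025   # $0.025

print(saved_per_downgrade > classifier_cost)  # True: one downgrade covers it
```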

Putting It Together: A Complete Router

import os
from typing import Optional
import google.generativeai as genai
import anthropic

genai.configure(api_key=os.environ["GEMINI_API_KEY"])
anthropic_client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

MODEL_CONFIGS = {
    "gemini-3.1-flash-lite": {"provider": "google", "thinking_mode": "flash"},
    "gemini-3.1-flash": {"provider": "google", "thinking_mode": "standard"},
    "gemini-3.1-pro": {"provider": "google", "thinking_mode": "standard"},
    "claude-sonnet-4-6": {"provider": "anthropic", "max_tokens": 4096},
    "claude-opus-4-6": {"provider": "anthropic", "max_tokens": 8192},
}

def execute_request(prompt: str, model: str) -> str:
    """
    Execute a request against the specified model.
    """
    config = MODEL_CONFIGS[model]

    if config["provider"] == "google":
        m = genai.GenerativeModel(model)
        response = m.generate_content(
            prompt,
            generation_config=genai.GenerationConfig(
                thinking_config={"thinking_mode": config["thinking_mode"]}
            )
        )
        return response.text

    if config["provider"] == "anthropic":
        message = anthropic_client.messages.create(
            model=model,
            max_tokens=config["max_tokens"],
            messages=[{"role": "user", "content": prompt}]
        )
        return message.content[0].text

    raise ValueError(f"Unknown provider for model: {model}")

def route_and_execute(
    prompt: str,
    task_type: Optional[str] = None,
    use_llm_classifier: bool = False
) -> dict:
    """
    Full routing pipeline: classify → dispatch → execute → return with metadata.
    """
    if use_llm_classifier:
        classification = classify_request(prompt)
        model_map = {
            "nano": "gemini-3.1-flash-lite",
            "mid": "gemini-3.1-flash",
            "frontier": "gemini-3.1-pro",
        }
        selected_model = model_map[classification["tier"]]
        reason = classification["reason"]
    else:
        decision = route_request(prompt, task_type=task_type)
        selected_model = decision.model
        reason = decision.reason

    result = execute_request(prompt, selected_model)

    return {
        "response": result,
        "model_used": selected_model,
        "routing_reason": reason
    }

# Example usage
if __name__ == "__main__":
    # Should route to nano (explicit task type)
    r1 = route_and_execute("Is Python dynamically typed?", task_type="yes_no")
    print(f"[{r1['model_used']}] {r1['response'][:100]}")

    # Should route to mid (default tier)
    r2 = route_and_execute("Write a Python function to parse a CSV and return row count by category.")
    print(f"[{r2['model_used']}] {r2['response'][:100]}")

    # Should route to frontier ("design an algorithm" keyword signal)
    r3 = route_and_execute("Design an algorithm for distributed consensus under Byzantine fault conditions.")
    print(f"[{r3['model_used']}] {r3['response'][:100]}")

How Cursor Uses Routing Internally

Cursor's internal architecture is a production example of model routing at scale. When you use Cursor today, requests are automatically dispatched between models based on task type:

  • Autocomplete suggestions route to a fast, lightweight completion model
  • Composer sessions route to Cursor Composer 2 by default ($0.50/$2.50 per million tokens)
  • Tasks explicitly requiring frontier capability can be escalated to Claude Opus 4.6 or Gemini 3.1 Pro

This multi-tier dispatch is why Cursor can serve 1 million daily active users while keeping per-user costs manageable. The routing layer is the infrastructure that makes the economics work.

Measuring Routing Effectiveness

Before deploying a router to production, establish baseline metrics so you can measure impact:

Metric | How to Measure | Target
Cost per 1,000 requests | Total API spend ÷ request count | 60–85% reduction vs single frontier
Tier distribution | Log model_used per request | ~30% nano / ~50% mid / ~20% frontier
Quality regression rate | Human eval sample or LLM judge | Under 2% downgrade vs frontier-only
p95 latency | Request timing logs | Improved or neutral vs frontier-only
Misrouting rate | Flag requests where users retry | Under 5%

Log the model_used and routing_reason fields on every request from day one. Without this data, you cannot diagnose whether cost savings are coming from correct downgrades or from quality-degrading misroutes.
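A minimal sketch of that logging, assuming the result dict shape returned by route_and_execute above; the logger name and field names are illustrative choices, not a fixed schema:

```python
# Emit one JSON line per request so tier distribution, cost anomalies, and
# misroute rates can be aggregated later.
import json
import logging
import time

logging.basicConfig(format="%(message)s", level=logging.INFO)
log = logging.getLogger("routing")

def log_routing(prompt: str, result: dict, latency_s: float) -> dict:
    """Log the routing metadata for one request and return the record."""
    record = {
        "ts": time.time(),
        "model_used": result["model_used"],
        "routing_reason": result["routing_reason"],
        # Rough estimate: ~1.3 tokens per whitespace-separated word
        "prompt_tokens_est": int(len(prompt.split()) * 1.3),
        "latency_s": round(latency_s, 3),
    }
    log.info(json.dumps(record))
    return record
```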

Common Routing Mistakes

Routing on prompt length alone. A 50-token prompt asking "Design a lock-free concurrent hash map in Rust" needs frontier routing. Prompt length is a weak signal — use task type classification as the primary signal.

Setting the frontier threshold too low. If more than 20% of requests hit Tier 3, audit your classifier. Most production pipelines have a higher proportion of simple tasks than engineers initially assume.

Not logging routing decisions. Without per-request routing logs, cost anomalies are invisible and quality regressions take weeks to detect.

Ignoring output token asymmetry. At the prices quoted above, output tokens cost 4–5× more than input tokens, so a request that generates a long response is dominated by output cost. Route long-form generation tasks to mid-tier models where possible — the output token savings often exceed the input savings for verbose tasks.
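A worked example of the asymmetry, using the Opus and Sonnet prices quoted earlier and an assumed long-form task shape (500 input tokens, 4,000 output tokens):

```python
# For verbose tasks, output tokens dominate the bill, so the mid-tier
# discount applies to nearly the entire request cost.

def request_cost(input_price: float, output_price: float,
                 input_tokens: int, output_tokens: int) -> float:
    return input_tokens * input_price / 1e6 + output_tokens * output_price / 1e6

opus = request_cost(15, 75, 500, 4_000)    # $0.3075 -- output is ~98% of it
sonnet = request_cost(3, 15, 500, 4_000)   # $0.0615

print(round(opus / sonnet, 1))  # 5.0: the verbose task is 5x cheaper on mid-tier
```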

FAQ

How do I handle requests where I do not know the task type in advance?

Use the LLM classifier approach with Gemini 3.1 Flash-Lite at flash thinking mode. At $0.000025 per classification, you can afford to classify every request. The classifier's routing decision accuracy is typically 90–95% on well-defined tier boundaries.

What if a mid-tier model produces a wrong answer and needs escalation?

Implement a retry-with-escalation pattern: catch quality-failure signals (explicit error responses, user retry within 10 seconds, output length significantly shorter than expected) and re-run the same request at the next tier up. Cap escalation at one level to avoid cascading to frontier on every retry.
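A sketch of that pattern; the quality check and escalation map are illustrative assumptions, and the execute callable follows the (prompt, model) → str signature used elsewhere in this article:

```python
# Retry-with-escalation: on a quality-failure signal, re-run the request
# exactly one tier up. The map itself caps escalation at one level.
from typing import Callable, Tuple

ESCALATION = {
    "gemini-3.1-flash-lite": "gemini-3.1-flash",
    "gemini-3.1-flash": "gemini-3.1-pro",
    "claude-sonnet-4-6": "claude-opus-4-6",
}

def looks_low_quality(response: str) -> bool:
    """Crude quality-failure signal: empty or suspiciously short output."""
    return len(response.strip()) < 20

def execute_with_escalation(
    prompt: str,
    model: str,
    execute: Callable[[str, str], str],
) -> Tuple[str, str]:
    """Run once; on a quality failure, retry one tier up and stop there."""
    response = execute(prompt, model)
    if looks_low_quality(response) and model in ESCALATION:
        model = ESCALATION[model]
        response = execute(prompt, model)
    return response, model
```

In production the quality signal would also cover explicit error responses and rapid user retries; the single-level cap prevents every failed request from cascading to the frontier tier.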

Is 60–85% cost reduction realistic for all workloads?

No. The reduction depends entirely on your request distribution. Workloads that are inherently frontier-heavy — complex agentic coding, long-document synthesis — will see smaller savings. The 60–85% figure applies to mixed enterprise pipelines where a significant portion of requests are simple enough for nano or mid-tier models.

Should I build a router or use a managed routing service?

For under 500K requests/month, build your own — the rule-based router above covers most cases in under 100 lines. Above that volume, managed services like LiteLLM's proxy or Martian add observability, fallback handling, and semantic caching that become worth the overhead.

How does model routing interact with context caching?

They compound. Cache your large shared contexts (codebase, knowledge base, system prompts) in Gemini 3.1 Pro, then route the majority of queries against the cached context to mid-tier models that reference the same cache. You reduce both per-request token cost and re-transmission cost simultaneously.


Next step: Add routing metadata logging to your next API call today — just log the model name, token count, and task type alongside every response. After one week of data, you will have the request distribution you need to design a routing tier structure that actually fits your workload.