Gemini 3.1 Flash-Lite launched on March 3, 2026 as Google's lowest-cost, highest-throughput model in the Gemini 3 family. At $0.25 per million input tokens and $1.50 per million output tokens, it undercuts every comparable frontier-class model on price while posting benchmark numbers that would have been considered strong for a mid-tier model six months ago.

Pricing and Speed

Flash-Lite is positioned as the cost-optimized tier of the Gemini 3.1 lineup, below Gemini 3.1 Flash and Gemini 3.1 Pro.

Model                   Input (per 1M)   Output (per 1M)   TFAT vs 2.5 Flash
Gemini 3.1 Flash-Lite   $0.25            $1.50             2.5x faster
Gemini 3.1 Flash        $0.75            $3.00             baseline
Gemini 3.1 Pro          $2.00            $12.00            slower
Claude Sonnet 4.6       $3.00            $15.00            n/a
GPT-5.4                 $2.50            $15.00            n/a

The 2.5x Time to First Answer Token (TFAT) figure comes from Artificial Analysis benchmarking published in March 2026. TFAT measures latency from prompt submission to the first token appearing in the response — the metric that matters most for real-time, user-facing applications. Output speed increased 45% over Gemini 2.5 Flash on the same Artificial Analysis benchmark suite.
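The pricing gap compounds at volume. A minimal sketch, using the per-1M-token prices from the table above (the request volume and token counts are illustrative assumptions, not measured workloads):

```python
# Estimate monthly spend from the per-1M-token prices quoted above.
# Prices come from the pricing table; the traffic profile is invented
# purely for illustration.

PRICES = {  # model -> (input $/1M tokens, output $/1M tokens)
    "gemini-3.1-flash-lite": (0.25, 1.50),
    "gemini-3.1-flash": (0.75, 3.00),
    "gemini-3.1-pro": (2.00, 12.00),
}

def monthly_cost(model: str, requests: int, in_tok: int, out_tok: int) -> float:
    """USD cost for `requests` calls of in_tok input / out_tok output tokens each."""
    in_price, out_price = PRICES[model]
    return requests * (in_tok * in_price + out_tok * out_price) / 1_000_000

# Hypothetical workload: 10M requests/month, 800 input + 200 output tokens each.
lite = monthly_cost("gemini-3.1-flash-lite", 10_000_000, 800, 200)
pro = monthly_cost("gemini-3.1-pro", 10_000_000, 800, 200)
print(f"Flash-Lite: ${lite:,.0f}  Pro: ${pro:,.0f}")  # prints "Flash-Lite: $5,000  Pro: $40,000"
```

At that profile the same traffic costs 8x more on Pro, which is why the rest of this piece treats tier selection as a routing problem rather than a one-time choice.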

Benchmark Performance

For a model at this price point, Flash-Lite's benchmark numbers are notable:

Benchmark      Gemini 3.1 Flash-Lite   Category
GPQA Diamond   86.9%                   Graduate-level science reasoning
MMMU Pro       76.8%                   Multimodal understanding
Arena.ai Elo   1432                    Human preference (chatbot arena)

GPQA Diamond at 86.9% measures graduate-level reasoning across biology, chemistry, and physics. For context, human domain experts score around 65% on this benchmark. An 86.9% score from a model at $0.25/M input is a direct challenge to the assumption that high-accuracy science reasoning requires frontier-tier pricing.

MMMU Pro at 76.8% covers college-level multimodal tasks — interpreting charts, diagrams, and scientific figures alongside text. This confirms Flash-Lite inherits meaningful multimodal capability from the Gemini 3 Pro architecture.

Arena.ai Elo of 1432 reflects human preference ratings from blind A/B comparisons on the Chatbot Arena platform. The score places Flash-Lite competitively against models costing 3-5x more per token.

Architecture: Gemini 3 Pro MoE Base

Gemini 3.1 Flash-Lite is built on the Gemini 3 Pro architecture, which uses a Mixture-of-Experts (MoE) design. MoE models activate only a subset of parameters per forward pass, which is what allows Flash-Lite to achieve high throughput without proportionally higher inference cost.

Google has not published the exact number of total vs active parameters for Gemini 3.1 Flash-Lite. The MoE architecture explains both the speed advantage and why benchmark quality remains high despite the low price — the model routes each token through expert layers specialized for the relevant task domain rather than running the full parameter set.
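Since Google has not published Flash-Lite's expert configuration, the mechanism can only be illustrated in the abstract. The toy sketch below shows the core MoE idea: a gate scores every expert per token, but only the top-k experts actually run, so active compute stays a fraction of total parameters. Every size and score here is made up.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def moe_forward(gate_logits, experts, k=2):
    """Route one token: run only the top-k experts, weight by gate probability.

    gate_logits: raw gate scores, one per expert (assumed given here;
    a real model computes them from the token's hidden state).
    experts: list of callables standing in for expert feed-forward layers.
    """
    probs = softmax(gate_logits)
    top = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:k]
    z = sum(probs[i] for i in top)  # renormalize over selected experts only
    return sum((probs[i] / z) * experts[i](1.0) for i in top)

# 8 toy experts (trivial scaling functions); only 2 of 8 run per token,
# so "active" compute is 25% of the total in this made-up example.
experts = [lambda x, s=s: s * x for s in range(8)]
out = moe_forward([0.1, 2.0, 0.3, 5.0, 0.2, 0.1, 0.0, 0.4], experts, k=2)
```

The output blends only the two highest-gated experts; the other six never execute, which is the property the article credits for Flash-Lite's throughput-per-dollar.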

Configurable Thinking Levels

One of the more useful operational features in Flash-Lite is configurable thinking levels. Four settings are available:

Level     Best For                                           Latency Impact
minimal   Classification, routing, simple Q&A                Lowest
low       Summarization, extraction, short-form generation   Low
medium    Code generation, multi-step reasoning, analysis    Moderate
high      Complex reasoning, math, research tasks            Higher

Thinking levels control how much internal chain-of-thought reasoning the model performs before generating the response. Setting minimal for a classification task and high for a complex reasoning task on the same model means you can tune cost and latency per request type without switching models mid-pipeline.

This is directly useful for teams building tiered AI pipelines — route simple classification tasks to Flash-Lite at minimal thinking, and complex synthesis tasks to the same model at high thinking, rather than maintaining separate model integrations for each tier.
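That routing pattern can be sketched as a small dispatcher. Note the "thinking_level" key below is a hypothetical field name for illustration; check the Gemini API documentation for the actual parameter syntax.

```python
# Map task types to thinking levels per the table above, then build a
# per-request config. The "thinking_level" key is an assumed field name,
# not confirmed API syntax.

THINKING_BY_TASK = {
    "classify": "minimal",
    "route": "minimal",
    "summarize": "low",
    "extract": "low",
    "codegen": "medium",
    "analyze": "medium",
    "math": "high",
    "research": "high",
}

def request_config(task_type: str) -> dict:
    """One model, tuned per request: same model ID, different thinking level."""
    return {
        "model": "gemini-3.1-flash-lite",
        "thinking_level": THINKING_BY_TASK.get(task_type, "low"),  # safe default
    }

print(request_config("classify"))   # minimal thinking for cheap classification
print(request_config("research"))   # high thinking, same model, no re-integration
```

The point of the sketch is the shape, not the field name: one integration, one model ID, with cost and latency tuned per request type.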

Availability

Gemini 3.1 Flash-Lite is available across all three Google AI distribution channels as of launch:

  • Gemini API — direct REST and SDK access for developers
  • Google AI Studio — browser-based prototyping and prompt testing, free to use
  • Vertex AI — enterprise deployment with VPC-SC, audit logging, and regional endpoints

The model ID for API calls is gemini-3.1-flash-lite. For Vertex AI, the endpoint follows the standard regional format. Full quickstart documentation is at ai.google.dev/gemini-api/docs.
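A minimal request against that model ID can be sketched as below. The endpoint shape follows the Gemini API's published REST pattern for generateContent; verify it against ai.google.dev/gemini-api/docs before relying on it, since paths and versions can change.

```python
# Build a generateContent request for the Gemini API. The URL pattern
# mirrors the public REST convention (generativelanguage.googleapis.com,
# v1beta); confirm against current docs before production use.

BASE = "https://generativelanguage.googleapis.com/v1beta"

def build_request(model: str, prompt: str) -> tuple[str, dict]:
    """Return (url, json_body) for a single-turn text generateContent call."""
    url = f"{BASE}/models/{model}:generateContent"
    body = {"contents": [{"parts": [{"text": prompt}]}]}
    return url, body

url, body = build_request("gemini-3.1-flash-lite", "Classify this support ticket: ...")
# Send with any HTTP client, passing the API key in the x-goog-api-key
# header -- or skip the raw REST layer and use the official SDK.
```

The same payload shape works unchanged across the tiers, which is what makes the tier-routing approach in the previous section practical.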

When to Use Flash-Lite vs Flash vs Pro

The Gemini 3.1 family now covers a wide price-performance range. Choosing the right tier depends on task complexity and volume:

Use Flash-Lite when:

  • Request volume is high and cost per token is a primary constraint
  • Latency to first token matters more than maximum output quality
  • Tasks are well-scoped: classification, extraction, summarization, simple code generation
  • You want to use configurable thinking levels to tune per-request cost

Use Flash when:

  • You need slightly higher output quality than Flash-Lite with moderate cost
  • Tasks involve multi-turn conversation with moderate context length
  • You want a balance between Flash-Lite's speed and Pro's accuracy

Use Pro when:

  • Tasks require maximum reasoning depth: complex coding, long-document analysis, research synthesis
  • Benchmark accuracy on GPQA or MMMU-class tasks is a hard requirement
  • You are building agent workflows that require reliable multi-step planning

How Flash-Lite Fits Existing Pipelines

For teams already using Gemini 2.5 Flash in production, Flash-Lite is a near-direct upgrade path. The API interface is identical, and the 2.5x TFAT improvement and 45% output speed increase (Artificial Analysis, March 2026) mean existing latency budgets now have more headroom.

The configurable thinking levels are additive — existing API calls without a thinking parameter default to a baseline level and do not require code changes. Opt into thinking level control when you are ready to tune cost per request type.

For teams currently using Claude Sonnet 4.6 or GPT-5.4 for high-volume, latency-sensitive tasks, the pricing gap warrants a benchmark comparison on your specific task distribution. At $0.25 vs $3.00 input per million tokens, even a modest quality trade-off may be acceptable at scale.

FAQ

How does Flash-Lite compare to Gemini 2.5 Flash-Lite?

Gemini 3.1 Flash-Lite replaces 2.5 Flash-Lite entirely. It posts 45% higher output speed (measured against Gemini 2.5 Flash on the Artificial Analysis suite) and improved benchmark scores across GPQA Diamond and MMMU Pro. Pricing is comparable to the 2.5 generation at the Flash-Lite tier.

Is the $0.25/M price for all input types including images?

Google's published pricing of $0.25/M covers text input tokens. Image and video input tokens are billed separately at different rates. Check ai.google.dev/pricing for current multimodal token rates.

What context window does Flash-Lite support?

Gemini 3.1 Flash-Lite supports a 1 million token context window, consistent with the rest of the Gemini 3.1 family. Context caching is available for repeated large inputs to reduce cost on long-context workloads.

Does Flash-Lite support function calling and tool use?

Yes. Flash-Lite supports function calling, Google Search grounding, and code execution — the same tool suite available in Flash and Pro. Thinking level affects reasoning quality on tool use tasks but does not restrict tool availability.

Is Flash-Lite available for free during a trial period?

Google AI Studio access to Flash-Lite is free for prototyping within the free tier rate limits. Production usage via the Gemini API and Vertex AI is billed at published rates from day one of GA launch (March 3, 2026).


Next step: Open Google AI Studio, select gemini-3.1-flash-lite from the model dropdown, set thinking level to minimal, and run your highest-volume production prompt against it today to get a direct latency and quality comparison against your current model.