Gemini 3.1 Pro is Google's frontier coding and reasoning model at $2/$12 per million tokens, with four configurable thinking modes and context caching for repeated large inputs. This tutorial covers Python setup from scratch, working code examples for each thinking mode, context caching implementation, and a decision framework for when to use each mode in production.

All examples use the google-generativeai Python SDK and have been tested against the Gemini 3.1 Pro endpoint as of March 2026. Full API reference is at ai.google.dev/gemini-api/docs.

Prerequisites

  • Python 3.10 or later
  • A Google AI Studio API key (free at aistudio.google.com)
  • Basic familiarity with Python async/await (used in the caching section)

Step 1: Install the SDK and Configure Authentication

pip install google-generativeai

Set your API key as an environment variable — do not hardcode it in source files:

export GEMINI_API_KEY="your_api_key_here"

For persistent configuration across terminal sessions, add it to your shell profile (~/.zshrc or ~/.bashrc):

echo 'export GEMINI_API_KEY="your_api_key_here"' >> ~/.zshrc
source ~/.zshrc

Verify the SDK is installed and the key is accessible:

import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])
print("SDK configured successfully")

Step 2: Your First Gemini 3.1 Pro Request

Before adding thinking modes, confirm basic connectivity with a minimal request:

import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])

model = genai.GenerativeModel("gemini-3.1-pro")

response = model.generate_content("What is the time complexity of Dijkstra's algorithm with a binary heap?")

print(response.text)

Expected output: a clear explanation of O((V + E) log V) with derivation. If you receive an authentication error, verify the GEMINI_API_KEY environment variable is set correctly in the current shell session.

Step 3: The Four Thinking Modes

Gemini 3.1 Pro exposes four thinking levels via the thinking_config parameter. Each level controls how much internal chain-of-thought reasoning the model performs before generating the response.

Mode        thinking_mode value   Latency    Cost Impact   Best For
Flash       "flash"               Lowest     Minimal       Classification, routing, simple Q&A
Lite        "lite"                Low        Low           Summarization, extraction, short-form generation
Standard    "standard"            Moderate   Moderate      Code generation, multi-step reasoning
Deep Think  "deep_think"          Highest    Highest       Complex math, research-level reasoning, hard algorithms

The thinking config is passed inside generation_config:

generation_config = genai.GenerationConfig(
    thinking_config={"thinking_mode": "standard"}
)

Thinking Mode: Flash

Use flash for tasks where latency to first token matters more than reasoning depth — classification, intent detection, simple lookups:

import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])

model = genai.GenerativeModel("gemini-3.1-pro")

response = model.generate_content(
    "Classify this error as one of: syntax_error, runtime_error, logic_error, network_error.\n\n"
    "Error: 'NoneType' object has no attribute 'split'",
    generation_config=genai.GenerationConfig(
        thinking_config={"thinking_mode": "flash"}
    )
)

print(response.text)  # Expected: runtime_error

Flash mode is appropriate for any task where the answer is deterministic and does not require multi-step reasoning. At high request volumes, routing classification tasks to flash rather than standard reduces cost and latency with no quality loss on well-scoped prompts.
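The routing idea above can be sketched as a small dispatcher that picks a thinking mode per task category before any API call is made. The task names and the helper itself are illustrative, not part of the SDK:

```python
# Hypothetical routing table: send high-volume, well-scoped task types
# to the cheapest thinking mode that handles them reliably.
ROUTING_TABLE = {
    "classification": "flash",
    "intent_detection": "flash",
    "summarization": "lite",
    "code_generation": "standard",
    "algorithm_design": "deep_think",
}

def pick_thinking_mode(task_type: str) -> str:
    """Return the thinking mode for a task type, defaulting to 'standard'."""
    return ROUTING_TABLE.get(task_type, "standard")

print(pick_thinking_mode("classification"))  # flash
print(pick_thinking_mode("unknown_task"))    # standard
```

The returned string is what you would pass as thinking_mode inside thinking_config; defaulting to "standard" keeps unknown task types on the safe middle setting.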

Thinking Mode: Lite

Use lite for summarization, data extraction, and short-form generation tasks that benefit from slightly more reasoning than Flash but do not need deep inference:

import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])

model = genai.GenerativeModel("gemini-3.1-pro")

code_snippet = """
def process_orders(orders):
    total = 0
    for order in orders:
        if order['status'] == 'completed':
            total += order['amount'] * (1 - order.get('discount', 0))
    return total
"""

response = model.generate_content(
    f"Summarize what this function does in one sentence:\n\n{code_snippet}",
    generation_config=genai.GenerationConfig(
        thinking_config={"thinking_mode": "lite"}
    )
)

print(response.text)

Thinking Mode: Standard

standard is the right default for code generation, bug fixing, multi-step reasoning, and most production coding tasks. It balances reasoning depth against latency:

import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])

model = genai.GenerativeModel("gemini-3.1-pro")

response = model.generate_content(
    """Write a Python function that:
1. Takes a list of file paths
2. Reads each file and extracts all email addresses using regex
3. Returns a deduplicated sorted list of unique email addresses
4. Handles FileNotFoundError and PermissionError gracefully, skipping unreadable files
5. Includes type hints and a docstring""",
    generation_config=genai.GenerationConfig(
        thinking_config={"thinking_mode": "standard"}
    )
)

print(response.text)
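For comparison with the model's output, here is one sketch of a function meeting that spec. The regex is a deliberately simplified pattern, not full RFC 5322 address grammar:

```python
import re
from pathlib import Path

# Simplified email pattern -- good enough for illustration, not RFC-complete.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def extract_emails(paths: list[str]) -> list[str]:
    """Return a sorted, deduplicated list of email addresses found in the
    given files, silently skipping files that are missing or unreadable."""
    found: set[str] = set()
    for path in paths:
        try:
            text = Path(path).read_text(encoding="utf-8", errors="ignore")
        except (FileNotFoundError, PermissionError):
            continue  # skip unreadable files, per the spec
        found.update(EMAIL_RE.findall(text))
    return sorted(found)
```

Having a hand-written baseline like this makes it easier to judge whether the standard-mode output actually satisfies all five requirements.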

Thinking Mode: Deep Think

deep_think allocates maximum reasoning compute before generating output. Use it for complex algorithm design, hard mathematical reasoning, and security analysis tasks where output correctness matters more than latency:

import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])

model = genai.GenerativeModel("gemini-3.1-pro")

response = model.generate_content(
    """Design an algorithm to find the minimum number of operations to convert
binary tree A into binary tree B, where valid operations are:
- Insert a node
- Delete a node
- Relabel a node's value

Provide the algorithm, its time complexity, and Python implementation.""",
    generation_config=genai.GenerationConfig(
        thinking_config={"thinking_mode": "deep_think"}
    )
)

print(response.text)

deep_think is the correct choice when you are paying for quality, not speed — a nightly job generating a security audit report, a one-time algorithm design for a critical path, or a complex data migration plan.

Step 4: Context Caching for Repeated Large Inputs

Context caching is Gemini 3.1 Pro's mechanism for reducing cost when you send the same large input — a codebase, a long document, a fixed system prompt — across multiple requests. You upload the content once, receive a cache ID, and reference that ID in subsequent requests instead of re-sending the full content.

When caching is worth using

Caching makes economic sense when:

  • You are sending the same content in 5 or more requests
  • The cached content is at least 32K tokens (Google's minimum cacheable size)
  • The content does not change between requests

Common patterns: Q&A over a large codebase, multi-turn analysis of a long document, repeated querying of a fixed knowledge base.

Creating a cache

import os
import google.generativeai as genai
from google.generativeai import caching
import datetime

genai.configure(api_key=os.environ["GEMINI_API_KEY"])

# Large content you will reference multiple times
# In production, load this from a file: open('large_codebase.txt').read()
large_codebase = """... your large codebase content here ..."""

# Create the cache with a 1-hour TTL
cache = caching.CachedContent.create(
    model="gemini-3.1-pro",
    display_name="my_codebase_cache",
    contents=[large_codebase],
    ttl=datetime.timedelta(hours=1),
)

print(f"Cache created: {cache.name}")
print(f"Expires: {cache.expire_time}")

Using a cached context

import os
import google.generativeai as genai
from google.generativeai import caching

genai.configure(api_key=os.environ["GEMINI_API_KEY"])

# Reference the cache in your model — no need to re-send large_codebase.
# `cache` is the CachedContent object from the previous step; in a new
# process, retrieve it by name with caching.CachedContent.get(cache.name).
cached_model = genai.GenerativeModel.from_cached_content(
    cached_content=cache
)

# First query against the cached context
response1 = cached_model.generate_content(
    "List all functions that handle database connections in this codebase.",
    generation_config=genai.GenerationConfig(
        thinking_config={"thinking_mode": "standard"}
    )
)
print(response1.text)

# Second query — the large codebase is NOT re-sent, only this prompt
response2 = cached_model.generate_content(
    "Which of those database functions are missing error handling?",
    generation_config=genai.GenerationConfig(
        thinking_config={"thinking_mode": "standard"}
    )
)
)
print(response2.text)

Each subsequent request charges only for the new prompt tokens and output tokens — not for the cached content. For a 100K token codebase queried 20 times, caching eliminates 19 × 100K = 1.9M input tokens from your bill.
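That arithmetic generalizes. Here is a quick back-of-envelope helper using the $2 per million input-token rate quoted in the introduction; note it ignores cache storage fees, which are billed separately at the rate on the pricing page:

```python
INPUT_PRICE_PER_M = 2.00  # $/million input tokens, per the pricing quoted above

def cache_input_savings(cached_tokens: int, num_requests: int) -> float:
    """Dollars of input-token spend avoided by not re-sending cached content
    on every request after the first. Ignores cache storage fees, which are
    billed separately."""
    avoided_tokens = cached_tokens * (num_requests - 1)
    return avoided_tokens * INPUT_PRICE_PER_M / 1_000_000

# The 100K-token codebase queried 20 times from the example above:
print(f"${cache_input_savings(100_000, 20):.2f}")  # $3.80
```

Run this against your own token counts and request volumes before enabling caching; below the 5-request threshold the storage fee can outweigh the input-token savings.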

Listing and deleting caches

# List all active caches
for c in caching.CachedContent.list():
    print(f"{c.display_name}: expires {c.expire_time}")

# Delete a specific cache when done
cache.delete()
print("Cache deleted")

Step 5: Streaming Responses

For long-form code generation tasks, streaming lets you display output progressively rather than waiting for the full response:

import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])

model = genai.GenerativeModel("gemini-3.1-pro")

response = model.generate_content(
    "Write a complete FastAPI application with JWT authentication, including login, logout, and a protected route.",
    generation_config=genai.GenerationConfig(
        thinking_config={"thinking_mode": "standard"}
    ),
    ),
    stream=True
)

for chunk in response:
    print(chunk.text, end="", flush=True)

print()  # Final newline

Streaming is particularly useful in CLI tools or web UIs where showing incremental output reduces perceived latency.
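If you also need the full text after streaming (for logging or downstream processing), a small wrapper can print chunks as they arrive while collecting them. The FakeChunk class below is a stand-in for the SDK's streamed chunk objects, used only so the sketch runs without an API call:

```python
def stream_and_collect(chunks) -> str:
    """Print each streamed chunk as it arrives and return the full text.
    Works with any iterable of objects exposing a .text attribute."""
    parts = []
    for chunk in chunks:
        print(chunk.text, end="", flush=True)
        parts.append(chunk.text)
    print()  # final newline
    return "".join(parts)

# Stand-in for the SDK's streamed chunks, for demonstration only;
# in real use, pass model.generate_content(..., stream=True) directly.
class FakeChunk:
    def __init__(self, text: str):
        self.text = text

full_text = stream_and_collect([FakeChunk("Hello, "), FakeChunk("world")])
```

In production you would pass the real streaming response object in place of the fake chunk list.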

Step 6: Error Handling and Retry Logic

Production integrations need explicit handling for rate limits and transient failures:

import os
import time
import google.generativeai as genai
from google.api_core import exceptions

genai.configure(api_key=os.environ["GEMINI_API_KEY"])

model = genai.GenerativeModel("gemini-3.1-pro")

def generate_with_retry(prompt: str, thinking_mode: str = "standard", max_retries: int = 3) -> str:
    """
    Call Gemini 3.1 Pro with exponential backoff on rate limit errors.
    """
    for attempt in range(max_retries):
        try:
            response = model.generate_content(
                prompt,
                generation_config=genai.GenerationConfig(
                    thinking_config={"thinking_mode": thinking_mode}
                )
            )
            return response.text
        except exceptions.ResourceExhausted:
            if attempt == max_retries - 1:
                raise
            wait_seconds = 2 ** attempt  # 1s, 2s, 4s
            print(f"Rate limited. Retrying in {wait_seconds}s...")
            time.sleep(wait_seconds)
        except exceptions.InvalidArgument as e:
            # Do not retry on invalid input — fail immediately
            raise ValueError(f"Invalid prompt or config: {e}") from e

# Usage
result = generate_with_retry(
    "Explain the difference between process and thread in Python.",
    thinking_mode="lite"
)
print(result)

Thinking Mode Selection Guide

Task Type                         Recommended Mode   Rationale
Intent classification             flash              Deterministic, no multi-step reasoning needed
Log parsing / extraction          flash              Pattern matching, not inference
Code summarization                lite               Light reasoning, short output
Docstring generation              lite               Template-like output
Function implementation           standard           Multi-step planning required
Bug fix from error log            standard           Root cause analysis needed
Multi-file refactor planning      standard           Dependency reasoning
Algorithm design                  deep_think         Novel reasoning, correctness critical
Security vulnerability analysis   deep_think         Adversarial reasoning required
Complex SQL query generation      deep_think         Multi-join, subquery planning

FAQ

How much does context caching cost versus re-sending content?

Cached tokens are billed at a reduced rate compared to standard input tokens. The exact cache storage rate is published at ai.google.dev/pricing. For content queried 5+ times, caching is almost always cheaper than re-sending.

Can I use thinking modes with streaming?

Yes. Pass stream=True alongside thinking_config in generation_config. The streamed chunks will include the model's output after its internal reasoning completes — the thinking process itself is not streamed token-by-token.

Does deep_think mode affect billing?

Yes. Thinking tokens generated during deep reasoning are billed as additional output tokens. For cost-sensitive pipelines, benchmark deep_think vs standard on your specific tasks to verify the quality delta justifies the cost increase before deploying to production.
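To make that concrete, here is a rough cost comparison at the $12 per million output-token rate quoted in the introduction. The thinking-token count is an illustrative placeholder, not a measured value; check your actual usage metadata for real numbers:

```python
OUTPUT_PRICE_PER_M = 12.00  # $/million output tokens, per the intro pricing

def response_cost(output_tokens: int, thinking_tokens: int = 0) -> float:
    """Output-side cost of one response; thinking tokens bill as output."""
    return (output_tokens + thinking_tokens) * OUTPUT_PRICE_PER_M / 1_000_000

# Illustrative: the same 1K-token answer, with and without a hypothetical
# 8K tokens of deep_think reasoning.
print(f"standard:   ${response_cost(1_000):.4f}")
print(f"deep_think: ${response_cost(1_000, 8_000):.4f}")
```

Even with placeholder numbers, the pattern holds: thinking tokens can dominate the output bill, which is why benchmarking the quality delta per task type matters.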

What is the minimum cacheable content size?

Google requires a minimum of 32K tokens to create a cached context. Content below this threshold cannot be cached and must be re-sent with each request.

Is gemini-3.1-pro the correct model ID string?

As of March 2026, yes. Model ID strings occasionally change with new releases. Verify the current canonical model ID at ai.google.dev/gemini-api/docs/models before deploying to production.


Next step: Copy the generate_with_retry function above into your project, swap in gemini-3.1-pro as your model, and run your three highest-volume prompt types through flash, lite, and standard thinking modes today — then compare output quality and latency before committing to a thinking mode per task type.