What Is Mistral Small 4?
Mistral Small 4 launched on March 16, 2026 under an Apache 2.0 license, meaning you can download, self-host, fine-tune, and ship it in commercial products with no usage restrictions beyond the license's attribution and notice terms. It is the first model in Mistral's lineup to combine reasoning, vision, and agentic coding into a single endpoint. The API identifier is mistral-small-2603. Pricing via the Mistral API is $0.15 per million input tokens and $0.60 per million output tokens.
For developers evaluating open-weight models, this release changes the calculus: you no longer need to route between Magistral (reasoning), Pixtral (vision), and Devstral (coding). One deployment now handles all three.
Architecture: Why 119B Parameters Cost Less Than a 7B Dense Model
The headline numbers require some unpacking. Mistral Small 4 uses a Mixture of Experts (MoE) architecture with 128 expert networks in total. For each token processed, only 4 experts activate. The model therefore has 119 billion total parameters but uses only 6.5 billion per token — the knowledge capacity of a 119B model at roughly the compute cost of a 6.5B dense model.
The routing layer automatically directs each token to the most relevant experts: a code token goes to coding experts, a French-language token to language experts. The other 124 experts add no compute cost for that token, although all 128 must still be resident in GPU memory.
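The token-level routing described above is essentially a top-k selection over per-expert gate scores. A toy sketch (the 128/4 counts come from the article; the gating itself is illustrative, since real routers are learned linear layers):

```python
import math
import random

NUM_EXPERTS = 128  # total expert networks
TOP_K = 4          # experts activated per token

def route(gate_scores):
    """Pick the top-k scoring experts for one token and turn their
    scores into mixing weights via a softmax over the selected k."""
    chosen = sorted(range(NUM_EXPERTS), key=gate_scores.__getitem__, reverse=True)[:TOP_K]
    exps = [math.exp(gate_scores[i]) for i in chosen]
    total = sum(exps)
    return {i: e / total for i, e in zip(chosen, exps)}

random.seed(0)
scores = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
weights = route(scores)

assert len(weights) == TOP_K                    # only 4 of 128 experts run
assert abs(sum(weights.values()) - 1.0) < 1e-9  # mixing weights sum to 1
```

The selected experts' outputs are then combined using these weights; the other 124 experts are never evaluated for that token.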
For self-hosting, the full model requires approximately 4× H100 GPUs at minimum (the weights alone occupy roughly 242GB in BF16). That is still significantly cheaper than serving a 119B dense model, which would need 8–16 H100s for comparable throughput. Most teams will access it via the API at $0.15 per million input tokens.
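The memory figure can be sanity-checked with simple arithmetic (2 bytes per parameter in BF16; the small gap to the quoted 242GB presumably reflects a published parameter count slightly above a round 119B):

```python
import math

params_total = 119e9   # total parameters
bytes_bf16 = 2         # bytes per parameter in BF16
h100_gb = 80           # HBM per H100

weights_gb = params_total * bytes_bf16 / 1e9
print(f"weights: {weights_gb:.0f} GB")                 # → weights: 238 GB
print(f"min GPUs: {math.ceil(weights_gb / h100_gb)}")  # → min GPUs: 3
# Weights alone fit on 3 GPUs; the 4th covers KV cache and activations,
# which grow with batch size and the 256K context window.
```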
A second architectural decision worth noting: the reasoning_effort parameter. You don't pay for reasoning tokens on simple requests. A classification call with reasoning_effort="none" is pure speed. A math problem with reasoning_effort="high" takes longer but is far more likely to reach the correct answer — same model, same API; you just flip a parameter.
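A minimal sketch of that per-request switch, assuming an OpenAI-style chat-completions payload; reasoning_effort appears by that name in the release notes, but its exact placement in the request body is an assumption (check Mistral's API reference):

```python
import json

def build_request(prompt: str, effort: str) -> dict:
    """Chat-completions payload; only the effort flag changes per request."""
    return {
        "model": "mistral-small-2603",
        "messages": [{"role": "user", "content": prompt}],
        "reasoning_effort": effort,  # assumed top-level field name
    }

fast = build_request("Classify this ticket: billing or technical?", "none")
deep = build_request("How many primes are there below 100?", "high")

assert fast["model"] == deep["model"]  # same model, same endpoint
print(json.dumps(fast, indent=2))
```

No routing layer, no second deployment: the caller decides per request how much reasoning to pay for.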
What Small 4 Replaces: Three Models in One
Mistral Small 4 is the first Mistral model to unify the capabilities of three prior flagship models (Magistral for reasoning, Pixtral for multimodal understanding, and Devstral for agentic coding) into a single, versatile model.
For teams already running the Mistral stack, the infrastructure reduction is concrete:
| Previous Setup | Mistral Small 4 Equivalent |
|---|---|
| Mistral Small 3.2 (instruct) | reasoning_effort="none" |
| Magistral (step-by-step reasoning) | reasoning_effort="high" |
| Pixtral (image + text) | Native multimodal input |
| Devstral (agentic coding) | Built-in agentic coding support |
| 4 deployments, 4 routing pipelines | 1 endpoint, 1 API ID |
Performance benchmarks indicate a 40% reduction in end-to-end completion time in latency-optimized setups and a threefold increase in requests per second in throughput-optimized configurations compared to Mistral Small 3.
Benchmarks and Performance: Where It Stands
Mistral published chart-based comparisons rather than clean tables at launch, so third-party evaluations fill in some gaps.
Mistral Small 4 (non-reasoning mode) generates output at 142 tokens per second via the Mistral API, well above the median of 61.7 t/s for comparable open-weight models. Time to first token is 0.61 seconds, versus a median of 1.58 seconds for comparable models.
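Those two figures combine into a simple end-to-end latency estimate for a non-streaming response (time to first token plus decode time, using the measurements above):

```python
TTFT_S = 0.61  # time to first token, seconds
TPS = 142.0    # decode throughput, tokens per second

def latency_s(output_tokens: int) -> float:
    """Approximate wall-clock time for one non-streaming response."""
    return TTFT_S + output_tokens / TPS

for n in (100, 500, 1000):
    print(f"{n:>4} tokens: {latency_s(n):.2f}s")
# →  100 tokens: 1.31s
# →  500 tokens: 4.13s
# → 1000 tokens: 7.65s
```

For short responses the 0.61s TTFT dominates; for long ones, decode speed does, which is where the 142 t/s figure pays off.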
On LiveCodeBench and AIME 2025, Mistral Small 4 achieves comparable or superior scores to GPT-OSS 120B while generating significantly shorter outputs. On the coding benchmark specifically, Mistral notes the model achieved its score with only 1,600 characters of output, while comparable Qwen models needed 5,800–6,100 characters; since output tokens are what you are billed for, shorter answers translate directly into lower cost per task.
The clearest pricing comparison against competitors:
| Model | Input ($/1M) | Output ($/1M) | Open Weights |
|---|---|---|---|
| Mistral Small 4 | $0.15 | $0.60 | ✅ Apache 2.0 |
| GPT-5.4 Mini | ~$0.75 | ~$4.50 | ❌ |
| Gemini 2.0 Flash-Lite | $0.075 | $0.30 | ❌ |
| Claude Haiku 4.5 | ~$0.80 | ~$4.00 | ❌ |
At $0.15 per million input tokens, Mistral Small 4 is the cheapest multimodal model from a major provider that also combines configurable reasoning. The only cheaper options on input are Gemini 2.0 Flash-Lite and Mistral's own older Small 3.2, neither of which offers the unified capability set.
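At scale the per-token gap compounds. A sketch of monthly cost using the table's rates (competitor prices are the approximate figures quoted above; the workload shape is an arbitrary example):

```python
RATES = {  # ($ per 1M input tokens, $ per 1M output tokens)
    "Mistral Small 4": (0.15, 0.60),
    "GPT-5.4 Mini": (0.75, 4.50),
    "Gemini 2.0 Flash-Lite": (0.075, 0.30),
    "Claude Haiku 4.5": (0.80, 4.00),
}

def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    rate_in, rate_out = RATES[model]
    return (input_tokens * rate_in + output_tokens * rate_out) / 1e6

# 1M requests/month at 2,000 input + 500 output tokens each:
for name in RATES:
    total = cost_usd(name, 2_000_000 * 1000, 500_000 * 1000)
    print(f"{name:22s} ${total:,.0f}/month")
# Mistral Small 4 comes to $600/month vs $3,750 for GPT-5.4 Mini
# on this workload; only Gemini 2.0 Flash-Lite is cheaper.
```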
The Sovereign AI Angle: Why This Matters Outside the US
Mistral Small 4's most strategically differentiated feature for European deployments is not a benchmark number — it is the Apache 2.0 license combined with full self-hostability.
In January 2026, France's Ministry of the Armed Forces awarded Mistral AI a framework agreement to deploy its models across all military branches and affiliated agencies, specifying that models would run on French-controlled infrastructure. In 2026, Mistral also signed a framework agreement with France and Germany to deploy AI solutions for public administration.
The driver is regulatory, not preference. The EU's GDPR and AI Act require organizations to ensure data is stored and processed in compliance with local regulations. Regulated industries — banking, healthcare, defense — cannot risk relying on external providers that may change access rules or expose data to foreign jurisdictions. Mistral's open-source models allow customers to run inference on their own servers, avoiding vendor lock-in and ensuring compliance.
For companies needing to keep data in-house for GDPR compliance or data sovereignty, self-hosted deployment via vLLM is the recommended path. The Apache 2.0 license means zero API costs: only infrastructure is billable.
This positions Mistral Small 4 as the default evaluation candidate for any European enterprise or public-sector team that cannot send data to US cloud providers — a constraint that applies to most of the EU financial, healthcare, and government sectors.
Deployment Options
Mistral Small 4 is available on day one across four paths:
Mistral API — mistral-small-latest or mistral-small-2603. Simplest option. $0.15 input / $0.60 output per million tokens.
NVIDIA NIM — Available as an NVIDIA NIM container for H100, H200, and B200 GPUs. Free prototyping on build.nvidia.com. Optimized via NVFP4 checkpoint for NVIDIA hardware.
vLLM (self-hosted) — Recommended for GDPR-compliant on-premises deployments. Mistral provides a dedicated Docker image. Requires 4× H100s minimum.
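A self-hosted launch sketch using standard vLLM flags; the Hugging Face repo id is a placeholder (check the mistralai org for the real name), and Mistral's dedicated Docker image may wrap an equivalent invocation for you:

```shell
# Placeholder repo id; substitute the actual Mistral Small 4 weights repo.
MODEL="mistralai/<mistral-small-4-repo>"

# Tensor parallelism across 4x H100 splits the ~242GB of BF16 weights.
vllm serve "$MODEL" \
  --tensor-parallel-size 4 \
  --max-model-len 262144   # 256K context window
```

Reducing --max-model-len shrinks KV-cache reservations if you do not need the full context.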
Hugging Face — Weights available on the Mistral organization page.
One current limitation: llama.cpp support was not finalized at launch, meaning Ollama compatibility was not yet confirmed. A pull request was open on the official repository. Teams running local inference via Ollama should check the repository for current status before planning a deployment.
Who Should Test Mistral Small 4 Now
European enterprises under GDPR or EU AI Act constraints — This is the primary design target. Self-hostable, Apache 2.0, no data leaves your infrastructure.
Teams running multiple Mistral models — The consolidation from three separate deployments (Magistral + Pixtral + Devstral) to one endpoint is operationally significant. The reasoning_effort parameter gives per-request control without routing logic.
Cost-sensitive API workloads — At $0.15/1M input, it is 5× cheaper than GPT-5.4 Mini on input. For high-volume classification, summarization, or document processing where frontier-model quality is unnecessary, the cost difference is material.
Where it is not the right choice: Tasks requiring the deepest reasoning (Claude Mythos, Claude Opus 4.6), computer use, or a context window beyond 256K. For maximum coding performance on complex repositories, Claude Code's 1M context and 80.8% SWE-bench Verified score still lead.
FAQ
What is Mistral Small 4?
Mistral Small 4 is an open-weight model released March 16, 2026 under Apache 2.0. It uses a 119B-parameter MoE architecture with 6.5B active parameters per token, a 256K context window, native image input, and configurable reasoning depth — replacing Mistral's three prior specialized models in a single endpoint.
How does Mixture of Experts work in Mistral Small 4?
The model has 128 expert networks; for each token, only 4 activate. This gives 119B-class knowledge at roughly 6.5B-class inference compute, which means lower latency and lower API pricing. Memory requirements still track the full 119B parameters, however, so self-hosting needs about 242GB for weights in BF16 (roughly 4× H100s).
Is Mistral Small 4 truly open source?
The weights are open under Apache 2.0 — commercial use, fine-tuning, and self-hosting are all permitted with no user-count restrictions. The training data and methodology were not publicly disclosed at launch, which is standard for open-weight (as distinct from fully open-source) models.
How much does Mistral Small 4 cost via API?
$0.15 per million input tokens and $0.60 per million output tokens as of March 2026. That is approximately 5× cheaper than GPT-5.4 Mini on input.
What is the minimum hardware to self-host?
Four H100 GPUs (approximately 242GB in BF16). NVIDIA NIM deployment is available for teams with existing NVIDIA infrastructure. vLLM with Mistral's Docker image is the recommended path for air-gapped or GDPR-constrained environments.
Next step: Pull the Mistral Small 4 weights from the mistralai organization on Hugging Face or call mistral-small-latest via the Mistral API, then run it against your current open-weight model on a representative sample of your production workload — the reasoning_effort parameter is the first thing to benchmark across your task distribution.
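A minimal harness for that first benchmark, with the network call stubbed out; swap call_model for a real client (the two effort values are the ones named above, and the stub's token counts are invented):

```python
import time

EFFORTS = ("none", "high")  # the two values named in the article

def call_model(prompt: str, effort: str) -> dict:
    """Stub: replace with a real request to mistral-small-2603."""
    return {"answer": "stub", "output_tokens": 40 if effort == "none" else 200}

def sweep(prompts):
    """Per-effort wall-clock time and output-token totals over a task sample."""
    report = {}
    for effort in EFFORTS:
        start = time.perf_counter()
        tokens = sum(call_model(p, effort)["output_tokens"] for p in prompts)
        report[effort] = {
            "seconds": time.perf_counter() - start,
            "output_tokens": tokens,
        }
    return report

report = sweep(["sample task 1", "sample task 2"])
assert set(report) == {"none", "high"}
assert report["high"]["output_tokens"] > report["none"]["output_tokens"]
```

Comparing the two columns against your accuracy metric tells you which tasks actually need paid reasoning tokens.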
Sources: Mistral AI official release, March 16, 2026 (https://mistral.ai/news/mistral-small-4) · TokenCost benchmark analysis, March 23, 2026 (https://tokencost.app/blog/mistral-small-4-pricing)