What NVIDIA Just Released
On March 11, 2026 at GTC, NVIDIA released Nemotron 3 Super: a 120B-parameter open model with only 12B active parameters per token. It is the current top-performing open-weight model on several agentic benchmarks and is available free via Cloudflare Workers AI and at $0.30 per million input tokens on providers like DeepInfra.
This is not a research release. The weights, training data, and recipes are fully open under the NVIDIA Nemotron Open Model License (commercially usable for most organizations). You can run it today.
The Architecture: Why It Is Different
Nemotron 3 Super introduces three architectural choices that work together:
LatentMoE
Standard Mixture-of-Experts models route tokens to a small subset of expert layers. LatentMoE compresses tokens into a 1024-dimensional latent space (down from the full 4096 hidden dimension) before routing. This 4x compression allows 512 total experts with 22 active per token — at the same computational cost as a standard MoE with far fewer experts. The result: more specialization at the same inference cost.
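The cost argument can be sketched with toy arithmetic. The 4096 and 1024 dimensions come from the article; the expert FFN shape (two matmuls with 4x expansion) is an illustrative assumption, not the published architecture:

```python
# Toy FLOP comparison: an MoE expert operating in a compressed latent
# space vs. one operating in the full hidden dimension. Expert shape
# (two matmuls, 4x expansion) is an illustrative assumption.

def expert_flops(dim, expansion=4):
    """Approximate FLOPs for one token through one FFN expert:
    two matmuls of (dim x expansion*dim), ~2 FLOPs per MAC."""
    return 2 * dim * (expansion * dim) * 2

full = expert_flops(4096)    # expert in the full hidden space
latent = expert_flops(1024)  # expert in the 1024-dim latent space

# Cost scales with dim^2, so a 4x narrower expert is ~16x cheaper,
# which is why many more experts can be active per token.
print(full // latent)  # 16
```

Under this rough model, 22 active latent experts cost about as much as one or two full-width experts, which is the intuition behind "more specialization at the same inference cost."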
Hybrid Mamba-Transformer
Most of the sequence processing happens in Mamba-2 layers, which have linear-time complexity with respect to sequence length and carry a fixed-size recurrent state instead of a growing KV cache. This is what makes the 1M-token context window practical rather than theoretical: compute does not scale quadratically as it does in pure attention models, and memory does not grow with every token of history. Attention layers are used selectively.
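A back-of-envelope calculation shows why the KV cache dominates at 1M tokens. The layer count and head dimensions below are assumptions for a 120B-class model, not Nemotron 3 Super's published configuration:

```python
# Illustrative attention KV-cache memory vs. context length. Layer and
# head dimensions are assumed values for a ~120B-class model, not the
# published Nemotron 3 Super configuration.

def kv_cache_gb(seq_len, layers=60, kv_heads=8, head_dim=128, bytes_per=2):
    """BF16 KV cache: 2 tensors (K and V) per layer, grows with seq_len."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per / 1e9

# Pure-attention model: cache grows linearly with every token kept.
print(round(kv_cache_gb(1_000_000), 1))  # ~245.8 GB at 1M tokens
print(round(kv_cache_gb(8_000), 1))      # ~2.0 GB at 8K tokens

# Mamba-2 layers keep a fixed-size state regardless of sequence length,
# so replacing most attention layers removes most of this growth.
```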
Multi-Token Prediction (MTP)
The model is trained to predict multiple future tokens simultaneously. During inference, the MTP heads function as a built-in draft model for speculative decoding. On SPEED-Bench, Nemotron 3 Super achieves an average acceptance length of 3.45 tokens per verification step, versus 2.70 for DeepSeek-R1 — enabling up to 3x wall-clock speedups without a separate draft model.
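The relationship between acceptance length and wall-clock speedup can be estimated roughly: each verification step emits about `acceptance` tokens instead of one. The per-step overhead factor below is an illustrative assumption, not a measured value:

```python
# Rough speculative-decoding speedup from average acceptance length.
# The overhead factor (extra cost of draft heads + verification vs. a
# plain decode step) is an illustrative assumption.

def est_speedup(acceptance, step_overhead=1.15):
    """Tokens emitted per verification step, discounted by per-step
    overhead relative to ordinary one-token decoding."""
    return acceptance / step_overhead

nemotron = est_speedup(3.45)  # consistent with the article's "up to 3x"
deepseek = est_speedup(2.70)

print(round(nemotron, 1), round(deepseek, 1))  # 3.0 2.3
```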
Benchmark Results
| Benchmark | Nemotron 3 Super | GPT-OSS-120B | Qwen3.5-122B |
|---|---|---|---|
| SWE-Bench Verified (OpenHands) | 60.47% | 41.90% | 66.40% |
| RULER (1M context) | 91.75% | 22.30% | 91.33% |
| HMMT Feb 2025 (no tools) | 93.67% | — | 91.40% |
| GPQA Diamond (no tools) | 79.23% | — | — |
| PinchBench (agent brain) | 85.6% | — | — |
| Throughput (8K in / 64K out) | 449–478 tok/s | ~200 tok/s | ~60 tok/s |
SWE-Bench Verified is the standard measure for autonomous software engineering — fixing real GitHub issues. At 60.47%, Nemotron 3 Super is 18.5 points ahead of GPT-OSS-120B. It trails Qwen3.5-122B by ~6 points on that benchmark but delivers 7.5x higher throughput.
The RULER result at 1M context (91.75% vs GPT-OSS's 22.30%) is the most striking number. GPT-OSS-120B loses coherence as context grows — Nemotron 3 Super does not. This matters for agents that maintain long task histories.
Source: NVIDIA Technical Report and Artificial Analysis, verified March 2026.
Pricing and Access (March 2026)
| Provider | Input (per 1M tokens) | Output (per 1M tokens) | Notes |
|---|---|---|---|
| Cloudflare Workers AI | Free (within limits) | Free (within limits) | Via env.AI.run() |
| DeepInfra | $0.30 | $0.75 | REST API or OpenAI-compatible |
| NVIDIA NIM (build.nvidia.com) | Free tier available | — | Official NVIDIA endpoint |
| Hugging Face | Download weights | — | Self-hosted |
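The OpenAI-compatible route in the table above can be sketched as follows. The base URL, model id, and header shape are assumptions based on the usual OpenAI-compatible convention; check DeepInfra's documentation for the exact values before use:

```python
# Sketch of a chat request for the OpenAI-compatible route listed above.
# BASE_URL and MODEL are assumed values, not confirmed identifiers.
import json

BASE_URL = "https://api.deepinfra.com/v1/openai"  # assumed endpoint
MODEL = "nvidia/nemotron-3-super-120b-a12b"       # assumed model id

def build_chat_request(prompt, api_key):
    """Assemble URL, headers, and JSON body; send with any HTTP client."""
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    body = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 512,
    }
    return f"{BASE_URL}/chat/completions", headers, json.dumps(body)

url, headers, body = build_chat_request("Summarize this diff.", "sk-...")
print(url.endswith("/chat/completions"))  # True
```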
For context: Claude Sonnet 4.6 costs $3.00/$15.00 per million input/output tokens. Running Nemotron 3 Super via DeepInfra at $0.30/$0.75 is 10x cheaper on input and 20x cheaper on output.
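The pricing gap is easy to quantify for a realistic workload. The traffic mix below (50M input, 5M output tokens) is a made-up example; the per-token rates come from the table above:

```python
# Cost comparison using the per-1M-token rates quoted above:
# Nemotron 3 Super on DeepInfra ($0.30/$0.75) vs. Claude Sonnet 4.6
# ($3.00/$15.00). The 50M/5M traffic mix is an illustrative example.

def run_cost(input_toks, output_toks, in_rate, out_rate):
    """Total cost in dollars; rates are per 1M tokens."""
    return input_toks / 1e6 * in_rate + output_toks / 1e6 * out_rate

nemotron = run_cost(50e6, 5e6, 0.30, 0.75)
claude = run_cost(50e6, 5e6, 3.00, 15.00)

print(nemotron, claude)          # 18.75 225.0
print(round(claude / nemotron))  # 12x cheaper at this input-heavy mix
```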
What This Means for Agent Builders
Nemotron 3 Super is explicitly designed for multi-agent systems, not single-turn chat. The two architectural advantages that matter most in practice:
Throughput at scale: the ~2.2x throughput advantage over GPT-OSS-120B (449-478 vs. ~200 tok/s) compounds across parallel agents. Running 10 agents that each consume 8K input tokens turns that advantage into roughly 2.2x lower serving cost, or 2.2x more agents for the same budget.
Long-context coherence: Agent workflows accumulate context — tool outputs, prior reasoning traces, long instructions. GPT-OSS-120B drops from 52% to 22% accuracy between 256K and 1M tokens. Nemotron 3 Super loses under 5 points across that same 4x context increase.
The recommended pattern from NVIDIA: use Nemotron 3 Nano (3.2B active parameters) for simple subtasks and Nemotron 3 Super for the planning and reasoning layer. This tiered routing is cheaper than running a large model on every task.
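A minimal version of this tiered routing might look like the following. The classifier heuristic, thresholds, and model ids are illustrative assumptions, not an official NVIDIA API:

```python
# Sketch of tiered routing: cheap subtasks go to Nano, planning and
# reasoning go to Super. Model ids, task kinds, and the 8K threshold
# are illustrative assumptions.

NANO = "nemotron-3-nano"    # assumed id; 3.2B active params
SUPER = "nemotron-3-super"  # assumed id; planning/reasoning tier

SIMPLE_KINDS = {"extract", "format", "classify", "summarize_short"}

def pick_model(task_kind, context_tokens):
    """Send short, mechanical subtasks to Nano; anything needing
    planning, long context, or tool orchestration goes to Super."""
    if task_kind in SIMPLE_KINDS and context_tokens < 8_000:
        return NANO
    return SUPER

print(pick_model("extract", 2_000))        # nemotron-3-nano
print(pick_model("plan_refactor", 2_000))  # nemotron-3-super
print(pick_model("extract", 200_000))      # nemotron-3-super (long context)
```

The design choice worth noting: routing on task kind plus context size keeps the cheap path cheap, since long contexts defeat the purpose of a small model even for simple tasks.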
What It Does Not Beat
Qwen3.5-122B scores higher on SWE-Bench Verified (66.40% vs 60.47%). For pure software engineering quality, Qwen3.5 and Claude Opus 4.6 both edge ahead.
The Humanity's Last Exam score (18.26% vs Qwen3.5's 25.30%) reveals that raw scientific breadth is still an area where denser models hold an edge. Nemotron 3 Super is optimized for agentic throughput, not frontier scientific reasoning.
FAQ
Can I run Nemotron 3 Super locally?
The BF16 weights require ~240GB of VRAM. In practice that means four H100 80GB GPUs (three is exactly 240GB, with no headroom for activations or cache) or an H200 node. The NVFP4 quantized version needs roughly 64GB and fits on a single 80GB H100 SXM. Self-hosting is realistic for organizations with GPU infrastructure; for most developers, a hosted API is more practical.
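The VRAM figures follow directly from parameter count and precision. Runtime overheads (activations, cache, buffers) are excluded here, which is why real deployments need headroom beyond these numbers:

```python
# Back-of-envelope weight memory for a 120B-parameter model at two
# precisions. Excludes activations, KV/state cache, and runtime
# buffers, so real deployments need extra headroom.

def weights_gb(params_b, bits):
    """Weight memory in GB: params (billions) * bits per param / 8."""
    return params_b * 1e9 * bits / 8 / 1e9

print(weights_gb(120, 16))  # 240.0 GB in BF16 -> 3-4 H100 80GB GPUs
print(weights_gb(120, 4))   # 60.0 GB in NVFP4 -> fits one 80GB H100
```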
Is the license commercially usable?
Yes for most cases. The NVIDIA Nemotron Open Model License permits commercial use. There are standard restrictions for uses that could harm national security or violate laws. Verify the full terms at huggingface.co/nvidia/nemotron-3-super-120b-a12b before deploying in regulated industries.
How does it compare to DeepSeek-R1 for reasoning?
DeepSeek-R1 is stronger on pure mathematical reasoning (AIME, competition math). Nemotron 3 Super is stronger on agentic benchmarks (SWE-Bench, PinchBench) and outperforms DeepSeek-R1 on speculative decoding acceptance rates (3.45 vs 2.70 tokens per step). Different models for different tasks.
Where is Nemotron 3 Ultra?
NVIDIA announced three models: Nano (released), Super (released), and Ultra (forthcoming). Ultra is described as the highest-accuracy option in the family. No release date confirmed as of March 2026.
Next step: Test Nemotron 3 Super on your workload today via build.nvidia.com — free tier, no credit card. If your use case involves long agent traces or parallel subagents, run a cost comparison against your current model before committing to an architecture.