Cursor shipped Composer 2 on March 19, 2026 — a coding-only model that outperforms Claude Opus 4.6 on Terminal-Bench 2.0 (61.7% vs 58.0%) while costing 30x less at $0.50 per million input tokens. It is the first purpose-built coding model to beat a frontier general-purpose model on a rigorous coding benchmark.
What Is Cursor Composer 2?
Composer 2 is a coding-specific language model developed by Cursor and integrated directly into the Cursor IDE. Unlike general-purpose frontier models, it was trained exclusively on programming tasks using reinforcement learning optimized for long-horizon, multi-step coding workflows. It will not write poems or help with tax questions — that is an explicit design decision, not a limitation.
The model is available today in the Cursor model selector under two speed tiers:
| Tier | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| Standard | $0.50 | $2.50 |
| Fast | $1.50 | $7.50 |
For comparison, Claude Opus 4.6 costs $15/$75 per million tokens. Composer 2 Standard delivers superior coding performance at one-thirtieth the output cost.
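The per-token pricing translates directly into session costs. A minimal sketch, using the prices from the tables above; the token counts are an assumed example workload, not published figures:

```python
def session_cost(input_toks, output_toks, in_price, out_price):
    """Dollar cost of a session, given per-1M-token prices."""
    return (input_toks / 1e6) * in_price + (output_toks / 1e6) * out_price

# Assumed workload: 2M input tokens, 400K output tokens.
composer_std = session_cost(2_000_000, 400_000, 0.50, 2.50)   # $2.00
opus = session_cost(2_000_000, 400_000, 15.00, 75.00)         # $60.00
# Because both input and output prices differ by the same factor,
# the gap is 30x regardless of the input/output mix.
```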
Benchmark Results
Cursor published full benchmark results alongside the release. Composer 2 was evaluated on three benchmarks covering different dimensions of coding ability:
| Benchmark | Composer 2 | Claude Opus 4.6 |
|---|---|---|
| Terminal-Bench 2.0 | 61.7% | 58.0% |
| SWE-bench Multilingual | 73.7% | — |
| CursorBench | 61.3% | — |
Terminal-Bench 2.0 measures a model's ability to complete real engineering tasks in a terminal environment — file edits, test runs, shell commands, and multi-step debugging. Scoring 61.7% against Opus 4.6's 58.0% is significant because Opus 4.6 is Anthropic's most capable general model as of March 2026.
SWE-bench Multilingual at 73.7% covers code repair across multiple programming languages, not just Python. This is the most direct signal that Composer 2 generalizes across real-world codebases rather than overfitting to English-language Python repos.
CursorBench is Cursor's internal benchmark measuring performance on IDE-specific tasks: multi-file edits, context retrieval, diff application, and agent tool use.
The Self-Summarization Technique
One of the core technical contributions in Composer 2 is a self-summarization approach to long-context compression. When a coding session grows long — multiple files open, extensive edit history, large diffs — standard context compression degrades model performance by losing critical code state.
Cursor's approach trains the model to summarize its own prior context in a structured, code-aware format before compression. According to Cursor's release notes, this reduces errors by 50% compared to standard context compression in sessions exceeding 100K tokens.
The model supports a 200K token context window, which accommodates most real-world monorepo tasks without truncation.
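Cursor has not published the mechanism, but the shape of the idea can be sketched in a few lines. Everything below (the trigger threshold, the chars-per-token heuristic, and the summary format) is an illustrative assumption, not Cursor's implementation:

```python
# Hypothetical sketch of summarize-then-compress; not Cursor's actual code.
COMPRESS_AT = 100_000  # illustrative trigger, in tokens

def token_count(text: str) -> int:
    # Crude proxy: roughly 4 characters per token (assumption).
    return max(1, len(text) // 4)

def summarize(messages):
    # Stand-in for the model writing a structured, code-aware digest of
    # its own earlier context: files touched, edits applied, open tasks.
    files = sorted({m["file"] for m in messages if "file" in m})
    return {"role": "summary",
            "text": f"[STATE] files: {', '.join(files)}; "
                    f"{len(messages)} earlier steps condensed"}

def compress(history):
    """Replace the oldest half of the history with a structured summary
    whenever the running token count crosses the threshold."""
    while (len(history) > 3
           and sum(token_count(m["text"]) for m in history) > COMPRESS_AT):
        half = len(history) // 2
        history = [summarize(history[:half])] + history[half:]
    return history
```

The key contrast with naive truncation is that the dropped turns are replaced by a digest the model itself produced, so code state (which files were edited, what remains to be done) survives compression.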
Why Code-Only Models Make Sense Now
General-purpose frontier models optimize across a broad task distribution — reasoning, writing, math, multimodal understanding, code. That breadth has a cost: training compute and RLHF reward signals are spread thin across domains.
Cursor's bet is that a model trained exclusively on coding tasks, with RL reward signals tightly scoped to compilation success, test pass rates, and diff correctness, can outperform a much larger general model on the specific task distribution that matters for developers.
The Terminal-Bench 2.0 result validates this hypothesis in practice. Whether it holds at longer horizon tasks — full feature implementation across a codebase over 50+ steps — remains the more important open question for production use.
Context: Cursor's Scale
Cursor crossed 1 million daily active users in early 2026. That scale gives the team a large volume of real coding sessions to inform training data curation and RL reward modeling — a flywheel that pure API providers without an IDE product cannot replicate as directly.
The combination of distribution (1M DAU generating real coding signal) and a tightly scoped reward function (did the code work?) explains how a focused team can produce a model that competes with Anthropic's flagship on coding tasks.
What Composer 2 Is Not Built For
This is not a general assistant. Cursor has been explicit: Composer 2 is scoped to programming tasks. Prompts outside that domain — writing, analysis, general Q&A — will either be declined or produce noticeably weaker output than a frontier model. If you need a general-purpose model that handles coding alongside documentation drafting, architecture diagrams, and meeting notes, Claude Sonnet 4.6 or GPT-5.4 remain better fits.
Pricing Comparison
| Model | Input (per 1M) | Output (per 1M) | Terminal-Bench 2.0 |
|---|---|---|---|
| Cursor Composer 2 (Standard) | $0.50 | $2.50 | 61.7% |
| Cursor Composer 2 (Fast) | $1.50 | $7.50 | 61.7% |
| Claude Opus 4.6 | $15.00 | $75.00 | 58.0% |
| Claude Sonnet 4.6 | $3.00 | $15.00 | — |
For teams running high token volumes through an agentic coding pipeline, the cost difference is not marginal. A team spending $10,000/month on Claude Opus 4.6 for code generation could run equivalent or better workloads on Composer 2 Standard for under $340.
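As a sanity check on that figure, the savings follow directly from the 30x price ratio in the table:

```python
# Monthly cost scaled by the price ratio from the table above.
opus_monthly = 10_000.00
ratio = 0.50 / 15.00  # input-price ratio; output (2.50 / 75.00) is identical
composer_monthly = opus_monthly * ratio  # roughly $333 for the same volume
```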
Availability
Composer 2 is available now in the Cursor IDE model selector for all Cursor Pro and Business subscribers. No additional setup is required — select the model from the dropdown, choose Standard or Fast depending on latency requirements, and it is active for all Composer sessions.
Full benchmark methodology and release notes are available at cursor.com/blog/composer-2. The Terminal-Bench 2.0 leaderboard is maintained at terminal-bench.com.
FAQ
Is Cursor Composer 2 available outside the Cursor IDE?
No. As of March 2026, Composer 2 is only accessible through the Cursor IDE. There is no public API endpoint. If you need the model's capabilities in a custom pipeline, you would need to use the Cursor IDE's agent interface.
Can Composer 2 replace Claude Opus 4.6 for all coding tasks?
For terminal tasks, multi-file edits, and codebase-level bug fixes, yes — benchmarks favor Composer 2 at far lower cost. For tasks requiring broad reasoning, architecture decision analysis, or natural language documentation, Opus 4.6 remains stronger.
What is the difference between Standard and Fast tiers?
Both tiers run the same Composer 2 model weights. Fast ($1.50/$7.50 per million tokens) prioritizes lower latency, useful for interactive completions. Standard ($0.50/$2.50) is optimized for cost in batch or longer agentic tasks.
Does Composer 2 support all programming languages?
Yes. SWE-bench Multilingual at 73.7% covers Python, JavaScript, TypeScript, Java, Go, and Rust. Cursor has not published per-language breakdowns.
How does self-summarization affect token usage?
Self-summarization compresses earlier context into structured summaries rather than truncating it outright. It reduces errors by 50% in long sessions, but it is a quality feature rather than a cost-saving one: the tokens spent generating each summary are billed, and subsequent requests bill the summary tokens in place of the original context.
Next step: Open Cursor, go to Settings → Models, select Composer 2 Standard, and run it on your next multi-file refactor. Compare output quality and cost against your current model over one week of real usage.