Anthropic Launches Claude 4: Opus and Sonnet Set New Coding Benchmarks
Claude 4 introduces the first Opus-class model alongside an upgraded Sonnet, bringing extended thinking with tool use, parallel tool execution, and general availability for Claude Code.
Maya Johnson
Anthropic released the Claude 4 family on May 22, 2025, shipping both Opus 4 and Sonnet 4 simultaneously. Opus 4 scored 72.5% on SWE-bench Verified and 43.2% on Terminal-bench, making it the strongest coding model available at launch, according to Anthropic's announcement.
Opus 4: Built for Long-Running Agent Work
Claude Opus 4 is the first Claude model designed to work continuously for hours on complex agent workflows. It scores 72.5% on SWE-bench Verified, effectively matching Sonnet 4 on that benchmark, while pulling ahead on longer-horizon tasks where sustained reasoning matters.
Pricing is steep at $15/$75 per million input/output tokens, positioning it clearly as an enterprise tool. The 200K context window carries over from the Claude 3.5 generation; maximum output is 32K tokens.
Sonnet 4: The Practical Upgrade
Sonnet 4 is the bigger story for most developers. It scores 72.7% on SWE-bench Verified — technically edging out Opus 4 — at $3/$15 per million tokens, one-fifth the cost. Anthropic reports a 65% reduction in shortcut behaviors compared to Sonnet 3.7, meaning it follows complex instructions more faithfully instead of taking easy paths.
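The one-fifth claim is easy to verify from the quoted rates. A quick sketch, using the article's prices and an illustrative workload:

```python
# Rough cost comparison between Sonnet 4 ($3/$15) and Opus 4 ($15/$75)
# per million input/output tokens, as quoted in the article.

PRICES = {
    "sonnet-4": {"input": 3.00, "output": 15.00},   # USD per 1M tokens
    "opus-4":   {"input": 15.00, "output": 75.00},
}

def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated API cost for a single request, in dollars."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Example: a 50K-token prompt producing a 5K-token reply.
sonnet = cost_usd("sonnet-4", 50_000, 5_000)   # 0.225
opus = cost_usd("opus-4", 50_000, 5_000)       # 1.125
assert abs(opus / sonnet - 5.0) < 1e-9          # exactly 5x at these rates
```

Because both input and output rates scale by the same factor, the 5x gap holds regardless of the input/output mix.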
Both models support extended thinking with tool use — a first for Claude — enabling the model to reason step-by-step while calling external tools. Parallel tool execution lets Claude call multiple tools simultaneously, reducing latency in agentic workflows.
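In API terms, the combination above looks roughly like the request below, following the Messages API's documented shape. The tool names, schemas, and token budgets are illustrative placeholders, not part of the launch announcement:

```python
# Sketch of a Messages API request enabling extended thinking alongside
# tool use. The tools here ("run_tests", "read_file") are hypothetical.

request = {
    "model": "claude-opus-4-20250514",
    "max_tokens": 16_000,
    # Extended thinking: the model emits reasoning blocks before its answer.
    "thinking": {"type": "enabled", "budget_tokens": 8_000},
    # With multiple tools declared, Claude 4 may request several tool
    # calls in a single assistant turn (parallel tool execution).
    "tools": [
        {
            "name": "run_tests",
            "description": "Run the project's test suite and return results.",
            "input_schema": {"type": "object", "properties": {}},
        },
        {
            "name": "read_file",
            "description": "Read a file from the repository.",
            "input_schema": {
                "type": "object",
                "properties": {"path": {"type": "string"}},
                "required": ["path"],
            },
        },
    ],
    "messages": [{"role": "user", "content": "Fix the failing test in utils.py"}],
}

# The thinking budget must fit inside max_tokens.
assert request["thinking"]["budget_tokens"] < request["max_tokens"]
```

The key design point: thinking and tool use are no longer mutually exclusive modes, so the model can reason, call a tool, and keep reasoning over the result within one request.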
Claude Code Goes GA
Alongside the model launch, Anthropic made Claude Code generally available with VS Code and JetBrains IDE integrations. Four new API features shipped: a code execution tool, MCP connector, Files API, and prompt caching for up to one hour.
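The one-hour caching works through the same cache_control field as Anthropic's existing prompt caching. A minimal sketch, assuming the extended-TTL beta opt-in; the system-prompt text is a placeholder and the beta header name reflects our understanding of the opt-in at the time of writing:

```python
# Sketch of one-hour prompt caching via cache_control. The default
# ephemeral cache lasts about five minutes; "ttl": "1h" requests the
# extended one-hour lifetime (beta header name is an assumption).

headers = {"anthropic-beta": "extended-cache-ttl-2025-04-11"}

request = {
    "model": "claude-sonnet-4-20250514",
    "max_tokens": 1_024,
    "system": [
        {
            "type": "text",
            "text": "<large, stable system prompt or codebase context here>",
            "cache_control": {"type": "ephemeral", "ttl": "1h"},
        }
    ],
    "messages": [{"role": "user", "content": "Summarize the module layout."}],
}
```

Longer cache lifetimes matter most for agent sessions, where the same large context (a codebase, a spec) is re-sent across many turns.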
The MCP connector is particularly significant — it turns Claude into an interoperable agent that can plug into any tool ecosystem following the Model Context Protocol standard.
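Concretely, the connector lets a request point at remote MCP servers via an mcp_servers field, so Claude can discover and call their tools without custom glue code. A hedged sketch; the server URL and name are hypothetical, and the beta header reflects our understanding of the opt-in:

```python
# Sketch of attaching a remote MCP server through the MCP connector.
# "https://mcp.example.com/sse" and "example-tools" are placeholders;
# a real server must speak the Model Context Protocol over HTTP.

headers = {"anthropic-beta": "mcp-client-2025-04-04"}

request = {
    "model": "claude-sonnet-4-20250514",
    "max_tokens": 1_024,
    "mcp_servers": [
        {
            "type": "url",
            "url": "https://mcp.example.com/sse",
            "name": "example-tools",
        }
    ],
    "messages": [
        {"role": "user", "content": "List the tools you can see."}
    ],
}
```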
The Competitive Context
This launch came about five weeks after GPT-4.1 and o3/o4-mini. OpenAI had been shipping rapidly, but Claude 4 reclaimed the coding-benchmark crown. Google's Gemini 2.5 Pro, released two months earlier, held the reasoning lead but couldn't match Claude's coding performance.
The Claude 4 launch was also Anthropic's first release under ASL-3 safeguards, the company's most rigorous safety classification to date, applied to Opus 4. The classification added deployment constraints but signaled confidence in the model's capabilities.
Our Take
Claude 4 is Anthropic's declaration that it's the coding company now. The SWE-bench scores are impressive, but the real differentiator is the agent infrastructure — extended thinking with tools, parallel execution, MCP, and Claude Code going GA. Anthropic isn't just building models; it's building the stack around them. And at $3/$15, Sonnet 4 makes the case that you don't need to pay Opus prices for production-quality results.
FAQ
What's the difference between Claude Opus 4 and Sonnet 4? Opus 4 is optimized for long-running agent workflows lasting hours, with 32K max output. Sonnet 4 offers near-identical benchmark performance at one-fifth the cost ($3/$15 vs $15/$75 per million tokens), making it the better choice for most production workloads.
Does Claude 4 support extended thinking? Yes, both Opus 4 and Sonnet 4 support extended thinking with tool use, a first for Claude. This lets the model reason step-by-step while simultaneously calling external tools, enabling more complex agentic workflows.
What is Claude Code? Claude Code is Anthropic's CLI-based coding agent that went generally available with this launch. It integrates with VS Code and JetBrains IDEs, allowing Claude to read, write, and execute code directly in development environments.
How does Claude 4 compare to GPT-4.1? Claude 4 Sonnet scores 72.7% on SWE-bench Verified, well ahead of GPT-4.1's reported 54.6% on the same benchmark. However, GPT-4.1 offers a 1M-token context window while Claude 4 is limited to 200K tokens.