GPT-5.4 Sets Records on Computer-Use Benchmarks

OpenAI's GPT-5.4, released March 5, achieves record scores on OSWorld and WebArena while the company faces a changed competitive landscape.

Lisa Thoma

Thursday, March 5, 2026·2 min read

OpenAI released GPT-5.4 on March 5, and it leads the benchmark charts where it matters most: computer use. The model set records on OSWorld-Verified and WebArena Verified — the two benchmarks that test whether AI can actually operate software, not just talk about it. It also scored 83% on OpenAI's internal GDPval test.

Where It Wins

GPT-5.4's strength is practical computer operation. On tasks like navigating web applications, filling out forms, and executing multi-step workflows across real software interfaces, it outperforms both Gemini 3.1 Pro and Claude Opus 4.6, according to LLM Stats.

This matters because "computer use" is rapidly becoming the benchmark category that predicts real-world business value. A model that can reliably operate a browser, fill a CRM, and navigate a dashboard is more valuable to enterprises than one that scores 2% higher on reasoning puzzles.

Where It Doesn't

On pure reasoning, Gemini 3.1 Pro still leads — particularly on GPQA Diamond with 94.3%. And on coding benchmarks like SWE-bench Verified, Claude Opus 4.6 holds the top spot. GPT-5.4 is strong across the board, but it's not the best at everything anymore. That era is over.

The Bigger Picture

OpenAI is at an inflection point. The company has $25 billion in annualized revenue, but Anthropic recently surpassed that figure. Sora's shutdown in March cost OpenAI credibility. And Google's Gemini 3.1 lineup is more competitive than any previous Gemini generation.

GPT-5.4 is a solid model — arguably the best general-purpose choice for enterprises that need computer-use capabilities. But OpenAI can no longer ship a model and assume it's automatically the best. Every release from Anthropic and Google now requires a genuine response.

Our Take

GPT-5.4 is OpenAI's most practical release in months. The computer-use benchmarks point at a future where LLMs don't just answer questions but do work. That's the right focus. But the three-way competition between OpenAI, Anthropic, and Google is closer than it's ever been, and no single model dominates across all categories.
