AI News

Independent coverage of the latest AI tool updates, releases, and comparisons.

AI LLMs

GPT-5.4 Sets Records on Computer-Use Benchmarks

OpenAI's GPT-5.4, released March 5, achieves record scores on OSWorld and WebArena while the company faces a changed competitive landscape.

Maya Johnson · Thursday, March 5, 2026 · 2 min read

OpenAI released GPT-5.4 on March 5, and it leads the benchmark charts where it matters most: computer use. The model set records on OSWorld-Verified and WebArena Verified — the two benchmarks that test whether AI can actually operate software, not just talk about it. It also scored 83% on OpenAI's internal GDPval test.

Where It Wins

GPT-5.4's strength is practical computer operation. On tasks like navigating web applications, filling out forms, and executing multi-step workflows across real software interfaces, it outperforms both Gemini 3.1 Pro and Claude Opus 4.6, according to LLM Stats.

This matters because "computer use" is rapidly becoming the benchmark category that predicts real-world business value. A model that can reliably operate a browser, fill a CRM, and navigate a dashboard is more valuable to enterprises than one that scores 2% higher on reasoning puzzles.
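To make "computer use" concrete, here is a minimal, purely illustrative sketch of how a harness in the style of OSWorld or WebArena scores this kind of task: the model emits a sequence of UI actions, the environment applies them, and a checker verifies the final state. The names here (`FormEnv`, `Action`, `run_episode`) are invented for this example and are not the benchmarks' actual APIs.

```python
# Illustrative sketch only: a toy "fill a CRM form" task in the style of
# computer-use benchmarks. All names are hypothetical, not real benchmark APIs.
from dataclasses import dataclass, field


@dataclass
class Action:
    kind: str          # "fill" or "click"
    target: str        # field name or button id
    value: str = ""    # text for "fill" actions


@dataclass
class FormEnv:
    """A toy CRM form: fields to fill, then a submit button."""
    fields: dict = field(default_factory=dict)
    submitted: bool = False

    def step(self, action: Action) -> None:
        if action.kind == "fill":
            self.fields[action.target] = action.value
        elif action.kind == "click" and action.target == "submit":
            self.submitted = True


def run_episode(env: FormEnv, actions: list[Action], goal: dict) -> bool:
    """Apply the agent's actions, then check success the way a benchmark
    verifier would: every goal field has the right value AND the form
    was actually submitted."""
    for a in actions:
        env.step(a)
    return env.submitted and all(env.fields.get(k) == v for k, v in goal.items())


# The "model output": a multi-step fill-then-submit workflow.
plan = [
    Action("fill", "name", "Ada Lovelace"),
    Action("fill", "email", "ada@example.com"),
    Action("click", "submit"),
]
goal = {"name": "Ada Lovelace", "email": "ada@example.com"}
print(run_episode(FormEnv(), plan, goal))  # True: the task counts as solved
```

The point of the binary pass/fail check is why these benchmarks predict business value: partial credit for "almost filled the form" does not exist, just as it does not for a real back-office workflow.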

Where It Doesn't

On pure reasoning, Gemini 3.1 Pro still leads — particularly on GPQA Diamond with 94.3%. And on coding benchmarks like SWE-bench Verified, Claude Opus 4.6 holds the top spot. GPT-5.4 is strong across the board, but it's not the best at everything anymore. That era is over.

The Bigger Picture

OpenAI is at an inflection point. The company has $25 billion in annualized revenue, but Anthropic recently surpassed it on that measure. Sora's shutdown in March cost it credibility. And Google's Gemini 3.1 lineup is more competitive than any previous Gemini generation.

GPT-5.4 is a solid model — arguably the best general-purpose choice for enterprises that need computer-use capabilities. But OpenAI can no longer ship a model and assume it's automatically the best. Every release from Anthropic and Google now requires a genuine response.

Our Take

GPT-5.4 is OpenAI's most practical release in months. The computer-use benchmarks point at a future where LLMs don't just answer questions but do work. That's the right focus. But the three-way competition between OpenAI, Anthropic, and Google is closer than it's ever been, and no single model dominates across all categories.

Tools Mentioned

  • GPT (OpenAI): Industry-leading large language models powering ChatGPT. $20/mo (ChatGPT Plus)
  • Claude (Anthropic): Safe, helpful AI assistant with extended context and reasoning. $20/mo (Pro)
  • Gemini (Google): Google's multimodal AI model family. $19.99/mo (Advanced)

More in AI LLMs


Meta Launches Muse Spark — Its First Closed-Source Model Targets 'Personal Superintelligence'

Meta Superintelligence Labs unveils Muse Spark with dual modes, 58% on Humanity's Last Exam, and multimodal reasoning. Breaking with tradition, the model is not open-source.

Alex Chen · Apr 8, 2026

OpenAI, Anthropic, and Google Unite to Combat AI Model Copying From China

The three biggest Western AI labs are sharing information through the Frontier Model Forum to prevent Chinese competitors from extracting their models' capabilities.

Sarah Mueller · Apr 7, 2026