Gemini 3.1 Pro Tops Reasoning Benchmarks With 94.3% on GPQA Diamond

Google's Gemini 3.1 Pro scores highest on reasoning benchmarks, edges past GPT-5.4 and Claude Opus 4.6 on academic tasks, and brings a new thinking_level parameter for developers.

Lisa Thoma

Thursday, February 19, 2026·2 min read

Google released Gemini 3.1 Pro on February 19 as an update to the Gemini 3 Pro series launched in November. The headline number: 94.3% on GPQA Diamond, a benchmark of graduate-level science questions. That's the highest score any model has achieved on this benchmark.

Where Gemini 3.1 Pro Leads

GPQA Diamond is designed to be hard for non-experts: even PhD holders in adjacent fields struggle with it. Gemini 3.1 Pro's 94.3% places it above both GPT-5.4 and Claude Opus 4.6 on this benchmark.

The model also introduced a thinking_level parameter that lets developers control how much internal reasoning the model uses, and a media_resolution parameter for vision tasks. Function responses now support multimodal objects like images and PDFs.
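To make the two new parameters concrete, here is a minimal Python sketch of how a request carrying them might be assembled. It builds a plain dict and makes no network call. The function name build_request, the model id string, the payload layout, and the allowed value sets are all assumptions for illustration; the article names thinking_level and media_resolution but does not specify their values or the request shape, so check the official API reference before relying on any of this.

```python
from typing import Any, Dict, Optional

# Assumed value sets: the article names the parameters but not their
# allowed values, so these are placeholders for illustration only.
ASSUMED_THINKING_LEVELS = {"low", "high"}
ASSUMED_MEDIA_RESOLUTIONS = {"low", "medium", "high"}


def build_request(prompt: str,
                  thinking_level: str = "high",
                  media_resolution: Optional[str] = None) -> Dict[str, Any]:
    """Assemble a hypothetical request payload as a plain dict."""
    if thinking_level not in ASSUMED_THINKING_LEVELS:
        raise ValueError(f"unsupported thinking_level: {thinking_level!r}")

    config: Dict[str, Any] = {"thinking_level": thinking_level}

    # media_resolution only matters for vision inputs, so it is optional.
    if media_resolution is not None:
        if media_resolution not in ASSUMED_MEDIA_RESOLUTIONS:
            raise ValueError(f"unsupported media_resolution: {media_resolution!r}")
        config["media_resolution"] = media_resolution

    return {
        "model": "gemini-3.1-pro",  # hypothetical model id
        "contents": prompt,
        "config": config,
    }


if __name__ == "__main__":
    # Less internal reasoning for a simple lookup-style question.
    print(build_request("What is the boiling point of water?", thinking_level="low"))
    # More reasoning plus higher image fidelity for a vision task.
    print(build_request("Describe this chart.", media_resolution="high"))
```

The idea is that a lower thinking_level suits simple prompts and a higher one suits hard reasoning tasks, while media_resolution is set only when the request includes images or other visual input.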

Where It Doesn't Lead

On practical coding benchmarks like SWE-bench Verified, Claude Opus 4.6 still holds the top spot. On computer-use tasks — navigating real software interfaces — GPT-5.4 leads with record scores on OSWorld and WebArena.

The LLM leaderboard has fragmented: no single model wins everywhere. Gemini leads reasoning, Claude leads coding, GPT leads computer use. Companies choosing a model now need to match the benchmark category to their actual use case.
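As a rough illustration of matching the benchmark category to the use case, the sketch below encodes only the three category leaders reported in this article as a lookup table. The category keys and the pick_model helper are hypothetical names, and a real selection process would also weigh cost, latency, and in-house evaluation results.

```python
# Category leaders as reported in this article; not an exhaustive ranking.
LEADERS = {
    "reasoning": "Gemini 3.1 Pro",    # GPQA Diamond
    "coding": "Claude Opus 4.6",      # SWE-bench Verified
    "computer_use": "GPT-5.4",        # OSWorld, WebArena
}


def pick_model(use_case: str) -> str:
    """Return the reported leader for a use-case category (hypothetical helper)."""
    # Normalize input such as "Computer use" or "computer-use".
    key = use_case.strip().lower().replace(" ", "_").replace("-", "_")
    if key not in LEADERS:
        raise KeyError(f"no benchmark category recorded for {use_case!r}")
    return LEADERS[key]


if __name__ == "__main__":
    for case in ("reasoning", "coding", "computer use"):
        print(case, "->", pick_model(case))
```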

Pricing and Availability

Gemini 3.1 Pro is available through the Gemini API, Google AI Studio, and Vertex AI. It rolled out to Gemini app users across the AI Plus, Pro, and Ultra subscription tiers.

Our Take

Google has quietly built the best reasoning model available. Gemini 3.1 Pro's GPQA score is a genuine achievement, not benchmark gaming. But reasoning benchmarks don't directly translate to product quality, and Google's consumer AI products still trail ChatGPT and Claude in user experience and adoption. The model is excellent. The distribution challenge remains.

Anthropic Launches Claude Managed Agents in Public Beta — $0.08/Hour Runtime

Claude Managed Agents provides a fully managed infrastructure for running autonomous AI agents with sandboxing, tool execution, and SSE streaming. Available now to all API accounts.

Lisa Thoma·Apr 14, 2026

AI LLMs

DeepSeek V4 Confirmed on Huawei Ascend Chips — Late April Launch Expected

Reuters confirms DeepSeek V4 runs on Huawei's Ascend 950PR processors, not NVIDIA. The 1-trillion-parameter MoE model is expected in late April with an Apache 2.0 release.

Lisa Thoma·Apr 14, 2026
