Leaderboards

Testing the limits of AI.

Benchmarks for frontier, agentic, and safety capabilities.

Benchmarks: 20+, spanning agentic coding, frontier reasoning, and safety alignment.

Models evaluated: 100+, from leading AI labs including OpenAI, Anthropic, Google, Meta, and open-source contributors.

HiL-Bench (Human-in-Loop Benchmark)

HiL-Bench tests whether agents know when to ask for help, measuring if they recognize missing or ambiguous information and ask targeted clarifying questions instead of guessing.

1. Claude Opus 4.7: 27.67±5.32
1. Claude Opus 4.6: 24.33±5.16
1. GLM-5.1: 21.00±4.96

SWE Atlas — Test Writing

Evaluating an agent's ability to write production-grade tests

1. Gpt-5.4-xHigh (Codex CLI): 44.36±6.04
1. Gpt-5.4-xhigh (Mini-SWE): 40.00±6.00
1. Gpt-5.3-Codex-Xhigh (Codex): 38.98±6.12

SWE Atlas — Codebase QnA

Evaluating deep code comprehension and reasoning

1. Gpt 5.4 xHigh (Codex): 40.80±5.10
1. Gpt 5.4 xHigh (Mini-SWE-Agent): 36.30±4.90
1. Opus 4.6 (Claude Code): 33.30±5.00

MCP Atlas

Evaluating real-world tool use through the Model Context Protocol (MCP)

1. Muse Spark: 82.20±2.30
1. claude-opus-4-7 (max): 79.10±2.50
1. gemini-3.1-pro-preview (high): 78.20±2.50

SWE-Bench Pro (Public Dataset)

Evaluating long-horizon software engineering tasks in public open-source repositories

1. gpt-5.4 (xHigh)*: 59.10±3.56
1. Muse Spark* (New): 55.00±3.60
2. claude-opus-4-6 (thinking)*: 51.90±3.61

SWE-Bench Pro (Private Dataset)

Evaluating long-horizon software engineering tasks in commercial-grade private repositories

1. claude-opus-4-6 (thinking)*: 47.10±6.07
1. Muse Spark* (New): 44.70±6.05
1. gpt-5.4 (xHigh)*: 43.40±6.03

SciPredict

Forecasting scientific experiment outcomes

1. gemini-3-pro-preview: 25.27±1.92
1. claude-opus-4-5-20251101: 23.05±0.51
1. claude-opus-4-1-20250805: 22.22±1.48

Humanity's Last Exam

Challenging LLMs at the frontier of human knowledge

1. gemini-3.1-pro-preview (thinking high): 46.44±1.96
1. gpt-5.4-pro-2026-03-05: 44.32±1.95
3. Muse Spark (New): 40.56±1.92

Humanity's Last Exam (Text Only)

Challenging LLMs at the frontier of human knowledge

1. gemini-3.1-pro-preview (thinking high): 47.31±2.11
1. gpt-5.4-pro-2026-03-05: 45.32±2.10
3. Muse Spark (New): 40.92±2.07

AudioMultiChallenge

Evaluating spoken dialogue systems in multi-turn interaction

1. gemini-3-pro-preview (Thinking)*: 54.65±4.57
1. gemini-2.5-pro (Thinking)*: 46.90±4.58
2. gemini-2.5-flash (Thinking)*: 40.04±4.50

AudioMultiChallenge — Audio Output

Evaluating spoken dialogue systems in multi-turn interaction

1. gemini-3.1-flash-live-preview (Thinking) (New): 36.06±4.41
1. gpt-realtime-1.5: 34.73±4.38
2. gemini-3.1-flash-live-preview (New): 26.77±4.06

AudioMultiChallenge — Text Output

Evaluating spoken dialogue systems in multi-turn interaction

1. gemini-3-pro-preview (Thinking): 54.65±4.57
1. gemini-2.5-pro (Thinking): 46.90±4.58
2. gemini-2.5-flash (Thinking): 40.04±4.50

Professional Reasoning Benchmark — Finance

Evaluating professional reasoning in finance

1. claude-opus-4-6 (Non-Thinking): 53.28±0.18
2. Muse Spark (New): 52.44±0.06
3. gpt-5: 51.32±0.17

Professional Reasoning Benchmark — Legal

Evaluating professional reasoning in legal practice

1. Muse Spark (New): 52.29±0.06
1. claude-opus-4-6 (Non-Thinking): 52.27±0.66
3. gpt-5-pro: 49.89±0.36

Remote Labor Index (RLI)

Evaluating AI agents' ability to perform real-world, economically valuable remote work

1. claude-opus-4-6 (CoWork) (New): 4.17
2. claude-opus-4-5-20251101-thinking: 3.75
3. Manus_1.6 (Max) (New): 2.92

PropensityBench

Simulating real-world pressure to choose between safe and harmful behavior

1. o3-2025-04-16: 10.50±0.60
2. claude-sonnet-4-20250514: 12.20±0.20
3. o4-mini-2025-04-16: 15.80±0.40

VisualToolBench (VTB)

Evaluating how LLMs can dynamically interact with and reason about visual information

1. gpt-5.4-2026-03-05 (reasoning effort = high): 29.17±0.13
1. gemini-3.1-pro-preview: 28.97±0.91
2. claude-opus-4-6-thinking: 27.52±1.45

MultiNRC

Multilingual Native Reasoning Evaluation Benchmark for LLMs

1. gpt-5-pro-2025-10-06: 65.20±1.24
1. gemini-3.1-pro-preview: 64.74±2.88
1. gpt-5.4-pro-2026-03-05: 62.27±2.92

MultiChallenge

Assessing models across diverse, interdisciplinary challenges

1. Muse Spark (New): 75.52±4.05
1. gemini-3.1-pro-preview: 71.37±1.74
1. gpt-5.4-pro-2026-03-05: 69.23±3.05

Fortress

Frontier Risk Evaluation for National Security and Public Safety

1. gpt-oss-120b: 8.24±1.93
1. claude-opus-4-5-20251101-thinking: 9.63±2.11
2. claude-sonnet-4-5-20250929-thinking: 12.80±2.36

MASK

Evaluating model honesty when pressured to lie

1. claude-opus-4-6 (Non-Thinking): 96.28±0.41
1. claude-sonnet-4-5-20250929-thinking: 96.13±0.57
1. Claude Sonnet 4 (Thinking): 95.33±2.29

EnigmaEval

Evaluating model performance on complex, multi-step reasoning tasks

1. gpt-5.4-pro-2026-03-05 (New): 23.82±2.43
1. gemini-3.1-pro-preview (New): 19.76±2.27
2. gpt-5-pro-2025-10-06: 18.75±2.22

VISTA

Vision-Language Understanding benchmark for multimodal models

1. Gemini 2.5 Pro Experimental (March 2025): 54.65±1.46
1. gemini-2.5-pro-preview-06-05: 54.63±0.55
1. gpt-5.4-pro-2026-03-05 (New): 53.89±2.02

TutorBench

Evaluating model performance on common tutoring tasks for high school and AP-level subjects

1. Muse Spark (New): 68.55±0.95
1. gpt-5.4-pro-2026-03-05: 56.62±1.02
1. gemini-2.5-pro-preview-06-05: 55.65±1.11

Methodology

Frontier AI model evaluations & benchmarks.

We conduct high-complexity evaluations to expose model failures, prevent benchmark saturation, and push model capabilities — while continuously evaluating the latest frontier models.
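Each leaderboard above reports a score as a mean plus a ± interval. This page does not state how those intervals are computed; a common approach is a bootstrap estimate over per-task scores, sketched below. Everything in the snippet (the function name, resample count, and the sample data) is illustrative, not a description of our actual scoring pipeline.

```python
import random

def bootstrap_ci_halfwidth(task_scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Mean per-task score and the half-width of a (1 - alpha) bootstrap
    confidence interval, obtained by resampling tasks with replacement.
    Illustrative sketch only; not necessarily the procedure behind the
    numbers reported above."""
    rng = random.Random(seed)
    n = len(task_scores)
    means = sorted(
        sum(task_scores[rng.randrange(n)] for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    mean = sum(task_scores) / n
    return mean, (hi - lo) / 2

# Made-up per-task pass/fail results for a single model.
scores = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
mean, err = bootstrap_ci_halfwidth(scores)
print(f"{100 * mean:.2f}±{100 * err:.2f}")  # same "score±error" format as above
```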

Scaling with human expertise

Humans design complex evaluations and define precise criteria to assess models, while LLMs scale evaluations — ensuring efficiency and alignment with human judgment.
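A minimal sketch of what this loop can look like in code: human-authored criteria with explicit weights, and a pluggable grader that an LLM can fill at scale. The rubric, names, and the lambda grader below are hypothetical placeholders, not our actual criteria or grading model.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Criterion:
    """One human-authored grading criterion and its weight in the rubric."""
    description: str
    weight: float

# Human experts define precise, task-specific criteria up front (toy example).
RUBRIC = [
    Criterion("Identifies the missing information before answering", 0.5),
    Criterion("Asks a single, targeted clarifying question", 0.3),
    Criterion("Avoids fabricating an answer from guessed details", 0.2),
]

def score_response(response: str, judge: Callable[[str, str], bool]) -> float:
    """Weighted score in [0, 1]; `judge` decides whether each criterion is
    satisfied. In practice the judge would be a prompted grader model."""
    return sum(c.weight for c in RUBRIC if judge(c.description, response))

# Trivial stand-in grader so the sketch runs end to end; a real pipeline
# would call a grader model here instead.
print(score_response("Which Python version should the tests target?",
                     judge=lambda criterion, response: "?" in response))
```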

Robust datasets for reliable benchmarks

Our leaderboards are built on carefully curated evaluation sets, combining private datasets to prevent overfitting with open-source datasets for broad benchmarking and comparability.

Continuously evaluating frontier models

We continuously evaluate the latest releases from every major lab, so the leaderboards stay current with the state of the art — not frozen snapshots.

Evaluate your model

Want your model on the leaderboard?

If you'd like to add your model to this leaderboard or a future version, get in touch. To preserve leaderboard integrity, a model can only be featured the first time its organization encounters the evaluation prompts.

Vahue leaderboard newsletter

Research, benchmarks, and insights — delivered to your inbox.