evals.report · v0.1 · Reference · Built by inductive.ml

LLM eval scores with sources, run context, and caveats attached.

Track official benchmark scores, vendor-reported model launches, and clearly labeled community runs — each kept with its source, run context, and caveats. No composite ranking. No fake “best model.” Just the receipt.

82 benchmarks · 136 models · 33 labs · Data snapshot 2026·07·25

Index 04

01 Browse benchmarks · 82 official benchmarks
02 Compare models · 136 models · 82 benchmarks
03 View lab progress · 33 labs
04 Run a benchmark · 16 step-by-step guides

Benchmarks 06

SWE-bench Verified · Coding / % resolved / ↑ higher

A curated SWE-bench split for evaluating systems that resolve real software engineering issues.

github.com/SWE-bench/SWE-bench · 56 reported · guide available

95.0%

Top reported · Claude Fable 5

Terminal-Bench 2.1 · Agents / task success / ↑ higher

A command-line agent benchmark for completing terminal tasks in reproducible task environments.

github.com/laude-institute/terminal-bench · 31 reported · guide available

91.9%

Top reported · GPT-5.6 Sol Ultra

DeepSWE · Coding / % resolved / ↑ higher

A long-horizon software-engineering benchmark with original tasks, broad repository coverage, and behavioral verifiers.

github.com/datacurve-ai/deep-swe · 24 reported · guide available

73.0%

Top reported · GPT-5.6 Sol

SWE-bench Pro · Coding / % resolved / ↑ higher

A harder public software-engineering agent benchmark built around professional repository tasks.

github.com/scaleapi/SWE-bench_Pro-os · 50 reported · guide available

80.0%

Top reported · Claude Fable 5

Humanity's Last Exam · Reasoning / accuracy / ↑ higher

A broad expert-level academic question-answering benchmark for frontier reasoning systems.

48 reported · guide available

56.8%

Top reported · Claude Mythos Preview

GDPval · Agents / Elo / ↑ higher

GDPval evaluates AI models agentically, with shell and web access through a sandbox harness, on real-world, economically valuable knowledge-work deliverables (documents, spreadsheets, slides, diagrams) spanning 44 occupations across 9 major U.S. GDP industries. Outputs are scored by blind pairwise quality comparison. The Artificial Analysis GDPval-AA variant reports results as an Elo rating.

huggingface.co/datasets/openai/gdpval · 70 reported

1932

Top reported · Claude Fable 5
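
For readers new to Elo, here is a minimal sketch of how a rating can be built from blind pairwise comparisons using the standard logistic Elo update. It is illustrative only: the K-factor of 32, the starting rating of 1000, and all names are assumptions, and nothing here describes how GDPval-AA actually fits its ratings.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A is preferred over B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update(r_a: float, r_b: float, a_won: bool, k: float = 32.0) -> tuple[float, float]:
    """Apply one pairwise result and return the new (A, B) ratings."""
    delta = k * ((1.0 if a_won else 0.0) - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta


# Hypothetical models and judgments: (left, right, left_was_preferred).
ratings = {"model_a": 1000.0, "model_b": 1000.0}
comparisons = [("model_a", "model_b", True), ("model_a", "model_b", False)]

for left, right, left_won in comparisons:
    ratings[left], ratings[right] = update(ratings[left], ratings[right], left_won)

print(ratings)
```

Because each update adds to one rating exactly what it subtracts from the other, only rating differences carry meaning; a score such as 1932 is interpretable only relative to other models rated in the same pool.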

Compare

Benchmark	GPT-5.5 (OpenAI)	Claude Opus 4.8 (Anthropic)	Gemini 3.1 Pro Preview (Google DeepMind)	DeepSeek V4 Pro (DeepSeek)
SWE-bench Verified (% resolved)	80.6%	88.6%	75.6%	80.6%
DeepSWE (% resolved)	70.05%	58%	9.88%	5.3%
SWE-bench Pro (% resolved)	58.6%	69.2%	46.10%	55.4%

Run guides 04

SWE-bench Verified · Coding / % resolved

SWE-bench Verified is run locally with the official `swebench` harness (Docker-based).

github.com/SWE-bench/SWE-bench · 7 commands

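
As a rough illustration of the step before the harness runs, the sketch below writes model patches to a JSONL predictions file. It assumes the three-field record layout (`instance_id`, `model_name_or_path`, `model_patch`) used by the `swebench` harness as we understand it; the helper name, the model name, and the instance ID are hypothetical, and the repository README is the authority on the exact invocation.

```python
import json
import tempfile
from pathlib import Path


def write_predictions(patches: dict[str, str], path: Path, model_name: str) -> int:
    """Write one JSON record per line and return the number of records written."""
    with path.open("w", encoding="utf-8") as f:
        for instance_id, patch in patches.items():
            record = {
                "instance_id": instance_id,        # task identifier from the dataset
                "model_name_or_path": model_name,  # free-form label for the run
                "model_patch": patch,              # unified diff produced by the model
            }
            f.write(json.dumps(record) + "\n")
    return len(patches)


# Hypothetical instance ID and patch, written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "preds.jsonl"
    count = write_predictions({"example__repo-1": "diff --git a/x b/x\n"}, out, "my-model")
    first_record = json.loads(out.read_text(encoding="utf-8").splitlines()[0])

print(count, sorted(first_record))

# The file is then passed to the harness, typically along the lines of:
#   python -m swebench.harness.run_evaluation --dataset_name <dataset> \
#       --predictions_path preds.jsonl --max_workers <n> --run_id <id>
```

The reported "% resolved" is then the share of instances whose patch makes the designated tests pass.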

Terminal-Bench 2.1 · Agents / task success

Terminal-Bench evaluates AI agents on real terminal/command-line tasks inside sandboxed Docker containers.

github.com/laude-institute/terminal-bench · 10 commands


DeepSWE · Coding / % resolved

DeepSWE is a 113-task long-horizon SWE benchmark (TypeScript, Go, Python, JavaScript, Rust) using the Harbor task format with program-based behavioral verifiers.

github.com/datacurve-ai/deep-swe · 9 commands


GPQA Diamond · Reasoning / accuracy

GPQA Diamond is a 198-question graduate-level science multiple-choice subset of GPQA (the main GPQA set has 448 questions); the score is exact-match accuracy on the A/B/C/D answer.

github.com/idavidrein/gpqa · 5 commands

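
Exact-match accuracy on a lettered multiple-choice set comes down to a few lines. The sketch below assumes the model's final answer has already been reduced to a single letter; the function name and the sample answers are made up for illustration, and real harnesses differ in how they extract the letter from free-form output.

```python
def exact_match_accuracy(predictions: list[str], gold: list[str]) -> float:
    """Fraction of predictions whose letter equals the gold letter (case-insensitive)."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("cannot score an empty set")
    correct = sum(
        p.strip().upper() == g.strip().upper() for p, g in zip(predictions, gold)
    )
    return correct / len(gold)


# Made-up answers for four questions: three of the four match.
predicted = ["A", "c", "B", "D"]
answers = ["A", "C", "D", "D"]
print(f"{exact_match_accuracy(predicted, answers):.1%}")  # prints 75.0%
```

With four answer options, random guessing scores about 25%, which is the floor to keep in mind when reading reported numbers.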