SWE-bench Verified
Canonical software-engineering agent benchmark already in product scope.
- Category: Coding
- Owner: SWE-bench
- Data path: Official leaderboard rows and per-instance metadata can be shown with scaffold and tool context preserved.
Scores stay attached to their benchmark, source, and run context. No composite rankings. No “best model.”
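One way to picture "scores stay attached to their context" is a record type that carries the benchmark, metric, source, and run context next to the number, and a lookup that only groups rows within a single benchmark. The sketch below is illustrative only: `ScoreRecord`, `by_benchmark`, and every field name are hypothetical, the `source` strings are placeholders, and the values are the ones shown in the comparison table on this page.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class ScoreRecord:
    """One reported score plus the context it was produced in (hypothetical schema)."""
    benchmark: str
    metric: str
    model: str
    value: float
    source: str                      # placeholder: where the row came from
    scaffold: Optional[str] = None   # agent scaffold, if one was used
    tools: Tuple[str, ...] = ()      # tool context, if any


def by_benchmark(records: List[ScoreRecord]) -> Dict[str, List[ScoreRecord]]:
    """Group records per benchmark; nothing is averaged or merged across benchmarks."""
    grouped: Dict[str, List[ScoreRecord]] = defaultdict(list)
    for record in records:
        grouped[record.benchmark].append(record)
    return dict(grouped)


records = [
    ScoreRecord("SWE-bench Verified", "% resolved", "GPT-5 high", 73.6, "unspecified"),
    ScoreRecord("SWE-bench Verified", "% resolved", "Claude Sonnet 4.5", 71.3, "unspecified"),
    ScoreRecord("GPQA Diamond", "accuracy", "GPT-5 high", 86.2, "unspecified"),
    ScoreRecord("GPQA Diamond", "accuracy", "Gemini 2.5 Pro", 85.3, "unspecified"),
]

grouped = by_benchmark(records)
for name, rows in grouped.items():
    for row in rows:
        print(f"{name} ({row.metric}): {row.model} = {row.value}")
```

Because a model with no reported score simply has no record, a missing cell stays missing instead of being filled in or folded into an aggregate.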
A curated SWE-bench split for evaluating systems that resolve real software engineering issues.
A difficult subset of GPQA for graduate-level science question answering evaluation.
A live competitive-programming benchmark that rates LLMs with a Codeforces-style Elo on fresh contest problems.
A function-calling and tool-use benchmark covering single-turn, multi-turn, live, and agentic scenarios.
A frequently updated public benchmark suite spanning reasoning, coding, math, language, and instruction-following tasks.
A command-line agent benchmark for completing terminal tasks in reproducible task environments.
Strong public benchmark for function calling, multi-turn, live, and agentic tool categories.
Broad public eval with frequently updated releases across reasoning, coding, math, and instruction following.
| Benchmark | GPT-5 high (OpenAI) | Claude Sonnet 4.5 (Anthropic) | Gemini 2.5 Pro (Google DeepMind) | DeepSeek R1 (DeepSeek) |
|---|---|---|---|---|
| SWE-bench Verified (% resolved) | 73.6% | 71.3% | — | — |
| GPQA Diamond (accuracy) | 86.2% | — | 85.3% | — |
Run the verified SWE-bench split with a fixed agent scaffold, repository setup, and scoring harness.
Evaluate multiple-choice science questions from the GPQA Diamond subset with a fixed prompt and answer extractor.
Problems and tooling are published; ratings are computed from live Codeforces-style contests.
The official BFCL README documents installation, generation, evaluation, and score output.
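The "fixed prompt and answer extractor" setup described for GPQA Diamond can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual harness: the regular expression, the `extract_choice` and `accuracy` helpers, and the assumption that responses end with a phrase like "Answer: B" are all hypothetical.

```python
import re
from typing import List, Optional

# Assumed response convention: the model states "answer is X" or "Answer: X",
# where X is one of the four option letters, optionally in parentheses.
ANSWER_RE = re.compile(r"answer\s*(?:is|:)\s*\(?([ABCD])\)?\b", re.IGNORECASE)


def extract_choice(response: str) -> Optional[str]:
    """Return the last stated option letter, or None if no answer is found."""
    matches = ANSWER_RE.findall(response)
    return matches[-1].upper() if matches else None


def accuracy(responses: List[str], gold: List[str]) -> float:
    """Fraction of responses whose extracted letter equals the gold letter.

    A response with no extractable answer counts as incorrect.
    """
    if len(responses) != len(gold):
        raise ValueError("responses and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(extract_choice(r) == g for r, g in zip(responses, gold))
    return correct / len(gold)


responses = ["Answer: A", "The answer is (C).", "I am not sure."]
gold = ["A", "B", "C"]
print(accuracy(responses, gold))
```

Keeping the extractor fixed across models matters because a more lenient or stricter pattern changes which responses count as answered, and therefore changes the reported accuracy.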