AI Benchmark Leaderboard
State-of-the-art results across key evaluation benchmarks.
Updated 2026-05-03
Full leaderboards on Papers With Code
Massive Multitask Language Understanding (MMLU)
Four-option multiple-choice questions across 57 diverse subjects spanning STEM, humanities, social sciences, and professional domains. Tests breadth of knowledge and reasoning.
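As a rough illustration of how an MMLU-style score is computed, the sketch below tallies multiple-choice accuracy per subject and macro-averages across subjects. The input format (parallel lists of predicted letters, gold letters, and subject names) is a hypothetical simplification, and some reports micro-average over all questions instead.

```python
from collections import defaultdict

def mmlu_accuracy(predictions, gold_answers, subjects):
    """Overall and per-subject accuracy for multiple-choice predictions.

    predictions / gold_answers: lists of answer letters such as "A".."D".
    subjects: list of subject names, one per question (hypothetical format).
    """
    per_subject = defaultdict(lambda: [0, 0])  # subject -> [correct, total]
    for pred, gold, subj in zip(predictions, gold_answers, subjects):
        per_subject[subj][0] += int(pred.strip().upper() == gold.strip().upper())
        per_subject[subj][1] += 1
    subject_acc = {s: c / t for s, (c, t) in per_subject.items()}
    # Macro-average so small subjects count as much as large ones.
    overall = sum(subject_acc.values()) / len(subject_acc)
    return overall, subject_acc
```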
HumanEval (pass@1)
164 handwritten Python programming problems. Tests the ability to generate functionally correct code from a function signature and docstring, checked against unit tests. pass@1 = fraction of problems solved in a single attempt.
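The reported number is usually computed with the unbiased pass@k estimator introduced alongside HumanEval: generate n samples per problem, count the c that pass the unit tests, and estimate the probability that at least one of k drawn samples is correct. A minimal sketch (the `results` input format is a hypothetical simplification):

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that pass the unit tests, k: budget.
    Probability that at least one of k samples drawn without replacement
    from the n generations is correct.
    """
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def benchmark_pass_at_k(results, k=1):
    # results: list of (n, c) pairs, one per problem (hypothetical format).
    # With one greedy sample per problem (n=1, k=1) this reduces to the
    # plain fraction of problems solved on the first attempt.
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```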
MATH Benchmark
12,500 competition-level mathematics problems spanning prealgebra, algebra, geometry, counting & probability, number theory, and precalculus. Difficulty ranges from AMC to AIME level.
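MATH reference solutions put the final answer inside \boxed{...}, so grading typically reduces to extracting that span from the model's output and comparing it to the reference after normalization. A minimal brace-matching sketch, assuming well-formed LaTeX:

```python
def extract_boxed_answer(solution: str):
    """Return the contents of the last \\boxed{...} in a solution, or None."""
    start = solution.rfind(r"\boxed{")
    if start == -1:
        return None
    i = start + len(r"\boxed{")
    depth = 1
    out = []
    while i < len(solution) and depth:
        ch = solution[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                break
        out.append(ch)
        i += 1
    # Handles nested braces such as \boxed{\frac{1}{2}}.
    return "".join(out) if depth == 0 else None
```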
Grade School Math 8K
8,500 grade-school math word problems requiring multi-step reasoning. A standard measure of arithmetic reasoning capability.
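GSM8K reference answers end with a line of the form `#### <number>`, so a common scoring recipe is to take the last numeric token in the model's completion and check it against that final number. A sketch of that approach (the normalization rules here are simplified assumptions):

```python
import re

def extract_final_number(text: str):
    """Return the last number in a completion as a normalized string."""
    matches = re.findall(r"-?\$?\d[\d,]*\.?\d*", text)
    if not matches:
        return None
    return matches[-1].replace(",", "").replace("$", "").rstrip(".")

def gsm8k_exact_match(prediction: str, reference: str) -> bool:
    # Gold answers end with "#### <number>"; compare after stripping commas.
    gold = reference.split("####")[-1].strip().replace(",", "")
    pred = extract_final_number(prediction)
    return pred is not None and pred == gold
```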
Graduate-Level Google-Proof Q&A (Diamond)
198 expert-written multiple-choice questions in biology, chemistry, and physics (the hardest, most carefully validated subset of the 448-question GPQA set). Designed to be "Google-proof": answering correctly requires genuine domain expertise rather than web search.
SWE-bench Verified
500 real GitHub issues from popular Python repositories, human-validated to ensure each task is well-specified and solvable. Measures the ability of AI agents to autonomously resolve software engineering tasks end-to-end.
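Resolution on SWE-bench is judged by applying the agent's patch at the issue's base commit and running two test sets: FAIL_TO_PASS tests that demonstrate the bug and must now pass, and PASS_TO_PASS tests that must not regress. The sketch below mirrors that logic; `apply_patch` and `run_tests` are hypothetical stand-ins for the official Docker-based harness.

```python
from dataclasses import dataclass

@dataclass
class SWEBenchInstance:
    repo: str                 # e.g. "django/django"
    base_commit: str          # commit the agent starts from
    fail_to_pass: list[str]   # tests that must go from failing to passing
    pass_to_pass: list[str]   # tests that must keep passing

def is_resolved(instance: SWEBenchInstance, model_patch: str,
                apply_patch, run_tests) -> bool:
    """Conceptual resolution check using the benchmark's two test sets."""
    workdir = apply_patch(instance.repo, instance.base_commit, model_patch)
    results = run_tests(workdir, instance.fail_to_pass + instance.pass_to_pass)
    # The issue is fixed AND nothing else broke.
    return (all(results[t] for t in instance.fail_to_pass)
            and all(results[t] for t in instance.pass_to_pass))
```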