AI Benchmark Leaderboard

State-of-the-art results across key evaluation benchmarks.

Results reflect published evaluations, which may use different prompting strategies or few-shot settings, so compare scores across rows with caution. Scores auto-update weekly via the Papers With Code API.
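For readers curious how a weekly refresh from the Papers With Code API might look, here is a minimal parsing sketch. The endpoint path and the response shape (a `results` list with `model` and `score` keys) are assumptions for illustration; check the actual API documentation before relying on them.

```python
import json

# Hypothetical endpoint -- consult the Papers With Code API docs for the
# real evaluation-table routes before using this path.
API_URL = "https://paperswithcode.com/api/v1/evaluations/"

def top_results(payload: str, n: int = 5) -> list[tuple[str, float]]:
    """Return the n highest-scoring (model, score) pairs from a JSON payload.

    Assumes each row carries 'model' and 'score' keys; the real response
    schema may differ.
    """
    rows = json.loads(payload)["results"]
    ranked = sorted(rows, key=lambda r: r["score"], reverse=True)
    return [(r["model"], r["score"]) for r in ranked[:n]]

# Example payload in the assumed shape (illustrative values only).
sample = json.dumps({"results": [
    {"model": "GPT-4o", "score": 88.7},
    {"model": "Claude 3 Opus", "score": 86.8},
    {"model": "Llama 3.1 405B", "score": 88.6},
]})
print(top_results(sample, n=3))
```

In a real updater, the payload would come from an HTTP GET against the API rather than an inline sample, with the parsed rows written back into the table above.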

Massive Multitask Language Understanding (MMLU)

57 diverse subjects spanning STEM, humanities, social sciences, and professional domains. Tests breadth of knowledge and reasoning.

Metric: Accuracy (%)
Higher is better: Yes
| Rank | Model           | Organization | Accuracy | Date    |
|------|-----------------|--------------|----------|---------|
| 1    | GPT-4o          | OpenAI       | 88.7%    | 2024-05 |
| 2    | Llama 3.1 405B  | Meta         | 88.6%    | 2024-07 |
| 3    | Claude 3 Opus   | Anthropic    | 86.8%    | 2024-03 |
| 4    | Gemini 1.5 Pro  | Google       | 85.9%    | 2024-02 |
| 5    | Mistral Large 2 | Mistral      | 84.0%    | 2024-07 |
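The accuracy metric is the fraction of questions answered correctly, and MMLU-style results are commonly reported as an unweighted average over per-subject accuracies, so each of the 57 subjects counts equally regardless of question count. A minimal sketch, with illustrative subject names and counts rather than real benchmark data:

```python
def accuracy(correct: int, total: int) -> float:
    """Fraction of questions answered correctly, as a percentage."""
    return 100.0 * correct / total

def macro_average(per_subject: dict[str, tuple[int, int]]) -> float:
    """Unweighted mean of per-subject accuracies (each subject weighs equally)."""
    scores = [accuracy(c, t) for c, t in per_subject.values()]
    return sum(scores) / len(scores)

# Illustrative numbers only -- not real benchmark results.
results = {
    "abstract_algebra": (70, 100),
    "world_history": (90, 100),
    "anatomy": (80, 100),
}
print(round(macro_average(results), 1))  # 80.0
```

Note that a macro average over subjects can differ from pooling all questions together when subjects have unequal question counts; published scores should state which convention they use.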