Live comparison of leading AI models across major benchmarks. Last updated: 2026-03-04
| Model | Provider | MMLU | HumanEval | MATH | MT-Bench | Arena ELO | GPQA |
|---|---|---|---|---|---|---|---|
| GPT-5 | OpenAI | 92.3 | 96.1 | 89.7 | 9.6 | 1380 | 68.2 |
| Claude 4 Opus | Anthropic | 91.8 | 95.4 | 88.9 | 9.5 | 1372 | 67.8 |
| Gemini 2.5 Pro | Google | 91.5 | 94.2 | 87.4 | 9.4 | 1365 | 66.9 |
| DeepSeek R1 | DeepSeek | 90.8 | 92.8 | 97.3 | 9.1 | 1358 | 71.5 |
| Claude 4 Sonnet | Anthropic | 90.4 | 93.8 | 86.2 | 9.4 | 1355 | 65.3 |
| Grok 3 | xAI | 90.1 | 91.5 | 85.6 | 9.3 | 1340 | 63.4 |
| GPT-4o | OpenAI | 88.7 | 90.2 | 76.6 | 9.3 | 1310 | 53.6 |
| Claude 3.5 Sonnet | Anthropic | 88.7 | 92.0 | 78.3 | 9.2 | 1290 | 59.4 |
| Llama 4 405B | Meta | 89.2 | 89.0 | 73.8 | 9.0 | 1280 | 51.2 |
| Mistral Large 3 | Mistral | 86.8 | 88.4 | 72.1 | 8.9 | 1255 | 49.8 |
| Gemini 2.0 Flash | Google | 85.4 | 85.7 | 70.2 | 8.8 | 1245 | 48.1 |
| Qwen 3 72B | Alibaba | 85.9 | 86.4 | 74.5 | 8.7 | 1230 | 47.2 |
| Yi Lightning | 01.AI | 84.5 | 83.2 | 68.9 | 8.6 | 1210 | 44.1 |
| Phi-4 | Microsoft | 84.8 | 82.6 | 80.4 | 8.5 | 1200 | 56.1 |
| Command R+ | Cohere | 82.1 | 79.8 | 58.3 | 8.4 | 1175 | 38.6 |
**MMLU** — Massive Multitask Language Understanding: tests knowledge across 57 subjects, including math, history, law, and medicine. Max score: 100%.
**HumanEval** — measures code-generation ability on 164 hand-written Python programming problems. Max score: 100%.
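HumanEval results are conventionally reported as pass@k, the probability that at least one of k sampled completions passes the problem's unit tests. As a point of reference (not this table's scoring code), the unbiased estimator from the original HumanEval paper can be sketched as:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples drawn, c of them correct.

    pass@k = 1 - C(n - c, k) / C(n, k), i.e. one minus the probability
    that a random size-k subset contains no correct sample.
    """
    if n - c < k:
        # Fewer than k incorrect samples exist, so every size-k
        # subset must contain at least one correct sample.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Example: 2 samples, 1 correct -> pass@1 is 0.5.
print(pass_at_k(2, 1, 1))
```

Leaderboard numbers like the ones above are typically pass@1 scores expressed as percentages.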
**MATH** — competition-level mathematics problems covering algebra, geometry, number theory, and more. Max score: 100%.
**MT-Bench** — multi-turn conversation benchmark that evaluates instruction following across 8 categories. Max score: 10.0.
**Arena ELO** — Chatbot Arena Elo rating based on human preference votes in blind pairwise comparisons. Max score: open-ended.
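To make the Elo column concrete, here is a minimal sketch of the classic Elo update for one head-to-head vote. The K-factor and 400-point scale are illustrative textbook defaults, not LMSYS's exact parameters (Chatbot Arena actually fits ratings in bulk with a Bradley–Terry model):

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Elo's logistic model: probability that A beats B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a: float, rating_b: float, a_won: bool,
               k: float = 32.0) -> tuple[float, float]:
    """Return updated (rating_a, rating_b) after a single blind vote."""
    e_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - e_a)  # zero-sum: B moves by -delta
    return rating_a + delta, rating_b - delta

# Example: a 1380-rated model loses an upset to a 1310-rated one,
# so the favorite sheds more points than it would in an expected win.
new_hi, new_lo = elo_update(1380, 1310, a_won=False)
```

Because the expected-score curve is logistic, a rating gap of 70 points (as between the top two rows of the table) implies the higher-rated model is preferred roughly 60% of the time.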
**GPQA** — graduate-level questions in physics, biology, and chemistry, written by domain experts. Max score: 100%.
Scores come from official provider reports, published papers, and third-party evaluations. We update this table as new models and scores become available. Some scores are self-reported by providers and have not been independently verified. Arena ELO ratings come from the LMSYS Chatbot Arena crowdsourced blind-comparison platform.