Live comparison of leading AI models across major benchmarks. Last updated: 2026-03-04
| Model | Provider | MMLU | HumanEval | MATH | MT-Bench | Arena ELO | GPQA |
|---|---|---|---|---|---|---|---|
| GPT-5 | OpenAI | 92.3 | 96.1 | 89.7 | 9.6 | 1380 | 68.2 |
| Claude 4 Opus | Anthropic | 91.8 | 95.4 | 88.9 | 9.5 | 1372 | 67.8 |
| Gemini 2.5 Pro | Google | 91.5 | 94.2 | 87.4 | 9.4 | 1365 | 66.9 |
| DeepSeek R1 | DeepSeek | 90.8 | 92.8 | 97.3 | 9.1 | 1358 | 71.5 |
| Claude 4 Sonnet | Anthropic | 90.4 | 93.8 | 86.2 | 9.4 | 1355 | 65.3 |
| Grok 3 | xAI | 90.1 | 91.5 | 85.6 | 9.3 | 1340 | 63.4 |
| GPT-4o | OpenAI | 88.7 | 90.2 | 76.6 | 9.3 | 1310 | 53.6 |
| Claude 3.5 Sonnet | Anthropic | 88.7 | 92.0 | 78.3 | 9.2 | 1290 | 59.4 |
| Llama 4 405B | Meta | 89.2 | 89.0 | 73.8 | 9.0 | 1280 | 51.2 |
| Mistral Large 3 | Mistral | 86.8 | 88.4 | 72.1 | 8.9 | 1255 | 49.8 |
| Gemini 2.0 Flash | Google | 85.4 | 85.7 | 70.2 | 8.8 | 1245 | 48.1 |
| Qwen 3 72B | Alibaba | 85.9 | 86.4 | 74.5 | 8.7 | 1230 | 47.2 |
| Yi Lightning | 01.AI | 84.5 | 83.2 | 68.9 | 8.6 | 1210 | 44.1 |
| Phi-4 | Microsoft | 84.8 | 82.6 | 80.4 | 8.5 | 1200 | 56.1 |
| Command R+ | Cohere | 82.1 | 79.8 | 58.3 | 8.4 | 1175 | 38.6 |
**MMLU** — Massive Multitask Language Understanding: tests knowledge across 57 subjects, including math, history, law, and medicine. Max score: 100%.
**HumanEval** — measures code-generation ability on 164 hand-written Python programming problems. Max score: 100%.
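HumanEval results are conventionally reported as pass@k, the probability that at least one of k sampled completions passes the problem's unit tests. As a point of reference (not this table's scoring code), the unbiased estimator from the original HumanEval paper can be sketched as:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples drawn, c of them correct.

    pass@k = 1 - C(n - c, k) / C(n, k), i.e. one minus the probability
    that a random size-k subset contains no correct sample.
    """
    if n - c < k:
        # Fewer than k incorrect samples exist, so every size-k
        # subset must contain at least one correct sample.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Example: 2 samples, 1 correct -> pass@1 is 0.5.
print(pass_at_k(2, 1, 1))
```

Leaderboard numbers like the ones above are typically pass@1 scores expressed as percentages.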
**MATH** — competition-level mathematics problems covering algebra, geometry, number theory, and more. Max score: 100%.
**MT-Bench** — multi-turn conversation benchmark that evaluates instruction following across 8 categories. Max score: 10.0.
**Arena ELO** — Chatbot Arena Elo rating based on human preference votes in blind pairwise comparisons. Max score: open-ended.
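To make the Elo column concrete, here is a minimal sketch of the classic Elo update for one head-to-head vote. The K-factor and 400-point scale are illustrative textbook defaults, not LMSYS's exact parameters (Chatbot Arena actually fits ratings in bulk with a Bradley–Terry model):

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Elo's logistic model: probability that A beats B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a: float, rating_b: float, a_won: bool,
               k: float = 32.0) -> tuple[float, float]:
    """Return updated (rating_a, rating_b) after a single blind vote."""
    e_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - e_a)  # zero-sum: B moves by -delta
    return rating_a + delta, rating_b - delta

# Example: a 1380-rated model loses an upset to a 1310-rated one,
# so the favorite sheds more points than it would in an expected win.
new_hi, new_lo = elo_update(1380, 1310, a_won=False)
```

Because the expected-score curve is logistic, a rating gap of 70 points (as between the top two rows of the table) implies the higher-rated model is preferred roughly 60% of the time.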
**GPQA** — graduate-level questions in physics, biology, and chemistry, written by domain experts. Max score: 100%.
Scores come from official provider reports, published papers, and third-party evaluations. We update this table as new models and scores become available. Some scores are self-reported by providers and have not been independently verified. Arena ELO ratings come from the LMSYS Chatbot Arena crowdsourced blind-comparison platform.