Value ranking
Best value on HumanEval
HumanEval consists of 164 hand-written Python programming problems; a model is scored by the fraction of problems whose unit tests all pass (sketched below). The benchmark is saturated for frontier models.
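To make the scoring rule concrete, here is a minimal Python sketch of unit-test-based grading. It is an illustration only: the official HumanEval harness additionally sandboxes execution, runs all 164 problems, and estimates pass@k over multiple samples. The `add` problem below is a made-up stand-in, not an actual HumanEval task.

```python
# Minimal sketch of HumanEval-style grading: a completion passes a problem
# iff executing its unit tests raises no errors.
# NOTE: illustration only -- the real harness sandboxes untrusted code.

def passes_unit_tests(candidate_src: str, test_src: str) -> bool:
    """Return True iff the candidate completion passes all of its tests."""
    env: dict = {}
    try:
        exec(candidate_src, env)  # define the candidate function
        exec(test_src, env)       # asserts raise AssertionError on failure
        return True
    except Exception:
        return False

# One toy (hypothetical) problem in the prompt-plus-tests shape.
candidate = "def add(a, b):\n    return a + b\n"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"

problems = [(candidate, tests)]
score = 100.0 * sum(passes_unit_tests(c, t) for c, t in problems) / len(problems)
print(f"HumanEval-style score: {score:.1f} / 100")
```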
“Value” is the normalized benchmark score (0–100 for this leaderboard cohort) divided by the input price in dollars per million tokens, so higher means more capability per dollar on this axis only (see the arithmetic sketch after the table). Always sanity-check latency, context length, and your real workload before acting on this ranking.
| Rank | Model | Vendor | Value | HumanEval (0–100) | Input price |
|-----:|-------|--------|------:|------------------:|------------:|
| 1 | Gemini 2.0 Flash | Google | 680.20 | 68.0 | $0.10/M |
| 2 | Qwen3 235B | Alibaba (Qwen) | 414.40 | 82.9 | $0.20/M |
| 3 | GPT-4o mini | OpenAI | 387.40 | 58.1 | $0.15/M |
| 4 | Llama 4 Scout | Meta | 292.78 | 52.7 | $0.18/M |
| 5 | Llama 4 Maverick | Meta | 271.93 | 73.4 | $0.27/M |
| 6 | DeepSeek V3 | DeepSeek | 230.22 | 62.2 | $0.27/M |
| 7 | DeepSeek R1 | DeepSeek | 180.18 | 99.1 | $0.55/M |
| 8 | o3-mini | OpenAI | 78.63 | 86.5 | $1.10/M |
| 9 | Claude 3.5 Haiku | Anthropic | 77.70 | 62.2 | $0.80/M |
| 10 | Gemini 2.5 Pro | Google | 76.40 | 95.5 | $1.25/M |
| 11 | Llama 3.3 70B Instruct | Meta | 72.17 | 63.5 | $0.88/M |
| 12 | Qwen2.5 72B Instruct | Alibaba (Qwen) | 61.57 | 55.4 | $0.90/M |
| 13 | Gemini 1.5 Pro | Google | 35.31 | 44.1 | $1.25/M |
| 14 | Grok 2 | xAI | 31.75 | 63.5 | $2.00/M |
| 15 | Llama 3.1 70B Instruct | Meta | 31.74 | 27.9 | $0.88/M |
| 16 | Claude Sonnet 4 | Anthropic | 31.08 | 93.2 | $3.00/M |
| 17 | Claude 3.5 Sonnet | Anthropic | 29.13 | 87.4 | $3.00/M |
| 18 | Mistral Large 2 | Mistral | 29.05 | 58.1 | $2.00/M |
| 19 | GPT-4o | OpenAI | 28.65 | 71.6 | $2.50/M |
| 20 | o1-mini | OpenAI | 27.18 | 81.5 | $3.00/M |
| 21 | Grok 3 | xAI | 25.83 | 77.5 | $3.00/M |
| 22 | Llama 3.1 405B Instruct | Meta | 18.92 | 66.2 | $3.50/M |
| 23 | o3 | OpenAI | 9.37 | 93.7 | $10.00/M |
| 24 | Claude Opus 4 | Anthropic | 6.67 | 100.0 | $15.00/M |
| 25 | o1 | OpenAI | 5.44 | 81.5 | $15.00/M |
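As a check on the arithmetic, here is a minimal Python sketch of the value computation defined above, using a handful of rows copied from the table. Small discrepancies against the published Value column (e.g., 680.00 vs 680.20 for Gemini 2.0 Flash) come from rounding in the displayed scores.

```python
# Minimal sketch of the "value" metric: normalized benchmark score divided by
# input price in dollars per million tokens. Scores and prices are copied
# from the table above; results may differ slightly due to score rounding.

models = [
    # (model, humaneval_score, input_price_usd_per_mtok)
    ("Gemini 2.0 Flash", 68.0, 0.10),
    ("Qwen3 235B", 82.9, 0.20),
    ("GPT-4o mini", 58.1, 0.15),
    ("Claude Opus 4", 100.0, 15.00),
]

ranked = sorted(
    ((name, score / price, score, price) for name, score, price in models),
    key=lambda row: row[1],
    reverse=True,
)
for rank, (name, value, score, price) in enumerate(ranked, start=1):
    print(f"{rank}. {name}: value = {value:.2f} ({score} / ${price:.2f}/M)")
```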
AI Model Analyzer does not recommend specific vendors; rankings are derived from public data only.