Value ranking

Best value on Aider Polyglot

Aider Polyglot consists of real-world refactoring and bug-fix tasks across multiple programming languages, scored by whether the model produces a passing patch in Aider's edit format. It tests practical coding ability beyond single-file generation; it is harder than HumanEval and not yet saturated.

“Value” is the normalized benchmark score (0–100 for this leaderboard cohort) divided by the input price per million tokens. Higher means more capability per dollar on this axis only; always sanity-check latency, context length, and performance on your real workload.
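Under that definition, each model's value is simply its benchmark score divided by its input price. A minimal sketch, using figures from two of the leaderboard rows (GPT-5 and o3):

```python
def value(score: float, input_price_per_m: float) -> float:
    """Benchmark score (0-100) per dollar of input price per million tokens."""
    return score / input_price_per_m

# Figures taken from the leaderboard rows below:
print(round(value(100.0, 1.25), 2))   # GPT-5: 100.0 / $1.25/M -> 80.0
print(round(value(92.1, 10.00), 2))   # o3:     92.1 / $10.00/M -> 9.21
```

Note that this metric rewards cheap input tokens heavily: a mid-scoring model at a very low price can outrank a top-scoring model at a premium price.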

| Rank | Model | Vendor | Value | Score / Input price |
|------|-------|--------|-------|---------------------|
| 1 | Qwen3 235B | Alibaba (Qwen) | 331.75 | 66.3 / $0.20/M |
| 2 | DeepSeek V3 (Thinking) | DeepSeek | 309.81 | 83.7 / $0.27/M |
| 3 | DeepSeek V3 | DeepSeek | 292.26 | 78.9 / $0.27/M |
| 4 | DeepSeek R1 | DeepSeek | 152.09 | 83.7 / $0.55/M |
| 5 | Kimi K2 | Moonshot (Kimi) | 109.60 | 65.8 / $0.60/M |
| 6 | GPT-5 | OpenAI | 80.00 | 100.0 / $1.25/M |
| 7 | o4-mini | OpenAI | 73.67 | 81.0 / $1.10/M |
| 8 | o3-mini | OpenAI | 61.18 | 67.3 / $1.10/M |
| 9 | Llama 4 Maverick | Meta | 52.67 | 14.2 / $0.27/M |
| 10 | Claude 3.5 Haiku | Anthropic | 36.14 | 28.9 / $0.80/M |
| 11 | Claude 3.7 Sonnet | Anthropic | 24.21 | 72.6 / $3.00/M |
| 12 | Claude Sonnet 4 | Anthropic | 20.85 | 62.6 / $3.00/M |
| 13 | Grok 3 | xAI | 19.63 | 58.9 / $3.00/M |
| 14 | Claude 3.5 Sonnet | Anthropic | 18.96 | 56.9 / $3.00/M |
| 15 | Grok 4 | xAI | 18.01 | 90.0 / $5.00/M |
| 16 | GPT-4o | OpenAI | 9.24 | 23.1 / $2.50/M |
| 17 | o3 | OpenAI | 9.21 | 92.1 / $10.00/M |
| 18 | Claude Opus 4 | Anthropic | 5.30 | 79.5 / $15.00/M |
| 19 | GPT-4o mini | OpenAI | 0.00 | 0.0 / $0.15/M |

AI Model Analyzer does not recommend specific vendors; rankings are derived from public data only.