Value ranking

Best value on Rolling Contamination-Controlled Average

The score is a contamination-controlled average across seven rolling task categories (reasoning, coding, agentic coding, mathematics, data analysis, language, instruction following). Questions are rotated every six months, and answers are checked against objective ground truth, so no LLM-as-judge scoring is needed.

“Value” is the normalized benchmark score (0–100 within this leaderboard cohort) divided by the input price in dollars per million tokens. A higher value means more capability per dollar on this axis only; always sanity-check latency, context length, and your real workload.
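As a rough illustration of the definition above, the following minimal sketch computes value as score divided by input price and sorts entries from best to worst. The names `Entry`, `value`, and `rank_by_value` are hypothetical and not part of the leaderboard; the three sample rows are taken from the ranking. The listed values appear to be computed from unrounded scores, so dividing the displayed figures can differ in the second decimal (for example, 30.9 / 0.20 gives 154.50 against a listed 154.55).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    model: str
    score: float        # normalized benchmark score, 0-100 within the cohort
    input_price: float  # US dollars per million input tokens


def value(entry: Entry) -> float:
    """Score per dollar: normalized score divided by input price per million tokens."""
    if entry.input_price <= 0:
        raise ValueError("input price must be positive")
    return entry.score / entry.input_price


def rank_by_value(entries: list[Entry]) -> list[tuple[str, float]]:
    """Return (model, value) pairs sorted from best to worst value."""
    scored = [(e.model, round(value(e), 2)) for e in entries]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Sample rows copied from the ranking (displayed, rounded scores).
entries = [
    Entry("Claude Opus 4 (Thinking)", 97.1, 15.00),
    Entry("DeepSeek V3 (Thinking)", 100.0, 0.27),
    Entry("Claude Opus 4", 42.0, 15.00),
]

ranking = rank_by_value(entries)
for model, v in ranking:
    print(f"{model}: {v:.2f}")
```

For these three rows the sketch reproduces the listed values (370.37, 6.47, and 2.80). Because only the input price enters the ratio, a cheap model with a middling score can outrank a stronger but more expensive one.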

Rank  Model                          Vendor          Value   Score / input price
1     DeepSeek V3 (Thinking)         DeepSeek        370.37  100.0 / $0.27/M
2     Qwen3 235B (Thinking)          Alibaba (Qwen)  154.55  30.9 / $0.20/M
3     DeepSeek V3                    DeepSeek        83.19   22.5 / $0.27/M
4     Gemini 2.5 Pro (Max Thinking)  Google          56.82   71.0 / $1.25/M
5     Claude Sonnet 4 (Thinking)     Anthropic       31.01   93.0 / $3.00/M
6     Claude Opus 4 (Thinking)       Anthropic       6.47    97.1 / $15.00/M
7     Claude Sonnet 4                Anthropic       5.34    16.0 / $3.00/M
8     Claude Opus 4                  Anthropic       2.80    42.0 / $15.00/M
9     Qwen3 235B                     Alibaba (Qwen)  0.00    0.0 / $0.20/M

AI Model Analyzer does not recommend specific vendors; rankings are derived from public data only.