Value ranking
Best value on Rolling Contamination-Controlled Average
Contamination-controlled average across seven rolling task categories (reasoning, coding, agentic coding, mathematics, data analysis, language, instruction following). Questions are rotated every six months and ground-truth answers are objective, removing the need for LLM-as-judge scoring.
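“Value” is the normalized benchmark score (0–100 within this leaderboard cohort) divided by the input price per million tokens. A higher value means more capability per dollar on this axis only; always sanity-check latency, context length, and your real workload. Each entry below lists rank, model, vendor, value, and then the score / input price pair the value was computed from.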
- 1. DeepSeek V3 (Thinking), DeepSeek: value 370.37 (score 100.0 / $0.27/M)
- 2. Qwen3 235B (Thinking), Alibaba (Qwen): value 154.55 (score 30.9 / $0.20/M)
- 3. DeepSeek V3, DeepSeek: value 83.19 (score 22.5 / $0.27/M)
- 4. Gemini 2.5 Pro (Max Thinking), Google: value 56.82 (score 71.0 / $1.25/M)
- 5. Claude Sonnet 4 (Thinking), Anthropic: value 31.01 (score 93.0 / $3.00/M)
- 6. Claude Opus 4 (Thinking), Anthropic: value 6.47 (score 97.1 / $15.00/M)
- 7. Claude Sonnet 4, Anthropic: value 5.34 (score 16.0 / $3.00/M)
- 8. Claude Opus 4, Anthropic: value 2.80 (score 42.0 / $15.00/M)
- 9. Qwen3 235B, Alibaba (Qwen): value 0.00 (score 0.0 / $0.20/M)
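The value figures in the list can be approximately reproduced from the score and input price shown on each row. The Python sketch below is illustrative only: the `Entry` type, the `value` function, and the `ENTRIES` name are assumptions made for this example, not part of any AI Model Analyzer API. Because it starts from the rounded scores displayed above, some results differ slightly from the listed values (for example 83.33 rather than 83.19 for DeepSeek V3), presumably because the site divides unrounded scores.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    """One leaderboard row (hypothetical structure for illustration)."""
    model: str
    score: float        # normalized benchmark score, 0-100, as displayed
    input_price: float  # input price in dollars per million tokens


def value(entry: Entry) -> float:
    """Value = normalized score divided by input price per million tokens."""
    if entry.input_price <= 0:
        raise ValueError("input price must be positive")
    return entry.score / entry.input_price


# Scores and prices copied from the list above (rounded as displayed).
ENTRIES = [
    Entry("Claude Opus 4", 42.0, 15.00),
    Entry("Claude Opus 4 (Thinking)", 97.1, 15.00),
    Entry("Claude Sonnet 4", 16.0, 3.00),
    Entry("Claude Sonnet 4 (Thinking)", 93.0, 3.00),
    Entry("DeepSeek V3", 22.5, 0.27),
    Entry("DeepSeek V3 (Thinking)", 100.0, 0.27),
    Entry("Gemini 2.5 Pro (Max Thinking)", 71.0, 1.25),
    Entry("Qwen3 235B", 0.0, 0.20),
    Entry("Qwen3 235B (Thinking)", 30.9, 0.20),
]

# Rank from highest to lowest value.
ranked = sorted(ENTRIES, key=value, reverse=True)

for rank, entry in enumerate(ranked, start=1):
    print(f"{rank}. {entry.model}: {value(entry):.2f} "
          f"({entry.score} / ${entry.input_price:.2f}/M)")
```

Sorting by this ratio puts the rows in the same order as the list. It also shows how strongly price drives the result: Claude Opus 4 (Thinking) has the second-highest score (97.1) yet ranks sixth on value because of its $15.00/M input price.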
AI Model Analyzer does not recommend specific vendors; rankings are derived from public data only.