Value ranking
Best value on Aider Polyglot
Coding tasks across multiple programming languages, scored by whether the model produces a passing patch in Aider's edit format. The benchmark tests practical editing ability beyond single-file generation; it is harder than HumanEval and not yet saturated.
“Value” is normalized benchmark score (0–100 for this leaderboard cohort) divided by input price per million tokens. Higher means more capability per dollar on this axis only — always sanity-check latency, context length, and your real workload.
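To make the arithmetic concrete, here is a minimal sketch in Python. The `value_score` helper and the `models` dict are illustrative names only; the figures are copied from the table below.

```python
def value_score(benchmark_score: float, input_price_per_m: float) -> float:
    """Value = normalized benchmark score (0-100) / input price in $ per 1M input tokens."""
    return benchmark_score / input_price_per_m

# Example entries copied from the table below: (Aider Polyglot score, $ per 1M input tokens).
models = {
    "GPT-5": (100.0, 1.25),
    "DeepSeek R1": (83.7, 0.55),
    "Claude Opus 4": (79.5, 15.00),
}

for name, (score, price) in models.items():
    print(f"{name}: {value_score(score, price):.2f}")

# GPT-5: 80.00
# DeepSeek R1: 152.18  (table shows 152.09; the leaderboard presumably divides unrounded inputs)
# Claude Opus 4: 5.30
```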
| Rank | Model | Vendor | Value (score ÷ $/M) | Aider Polyglot score | Input price ($/M tokens) |
|---:|---|---|---:|---:|---:|
| 1 | Qwen3 235B | Alibaba (Qwen) | 331.75 | 66.3 | $0.20 |
| 2 | DeepSeek V3 (Thinking) | DeepSeek | 309.81 | 83.7 | $0.27 |
| 3 | DeepSeek V3 | DeepSeek | 292.26 | 78.9 | $0.27 |
| 4 | DeepSeek R1 | DeepSeek | 152.09 | 83.7 | $0.55 |
| 5 | Kimi K2 | Moonshot (Kimi) | 109.60 | 65.8 | $0.60 |
| 6 | GPT-5 | OpenAI | 80.00 | 100.0 | $1.25 |
| 7 | o4-mini | OpenAI | 73.67 | 81.0 | $1.10 |
| 8 | o3-mini | OpenAI | 61.18 | 67.3 | $1.10 |
| 9 | Llama 4 Maverick | Meta | 52.67 | 14.2 | $0.27 |
| 10 | Claude 3.5 Haiku | Anthropic | 36.14 | 28.9 | $0.80 |
| 11 | Claude 3.7 Sonnet | Anthropic | 24.21 | 72.6 | $3.00 |
| 12 | Claude Sonnet 4 | Anthropic | 20.85 | 62.6 | $3.00 |
| 13 | Grok 3 | xAI | 19.63 | 58.9 | $3.00 |
| 14 | Claude 3.5 Sonnet | Anthropic | 18.96 | 56.9 | $3.00 |
| 15 | Grok 4 | xAI | 18.01 | 90.0 | $5.00 |
| 16 | GPT-4o | OpenAI | 9.24 | 23.1 | $2.50 |
| 17 | o3 | OpenAI | 9.21 | 92.1 | $10.00 |
| 18 | Claude Opus 4 | Anthropic | 5.30 | 79.5 | $15.00 |
| 19 | GPT-4o mini | OpenAI | 0.00 | 0.0 | $0.15 |
AI Model Analyzer does not recommend specific vendors; rankings are derived from public data only.