Value ranking
Best value on HumanEval
HumanEval consists of 164 hand-written Python programming problems; a model is scored by the fraction of problems whose unit tests all pass (sketched below). The benchmark is saturated for frontier models.
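To make the scoring rule concrete, here is a minimal Python sketch of unit-test-based grading. It is an illustration only: the official HumanEval harness additionally sandboxes execution, runs all 164 problems, and estimates pass@k over multiple samples. The `add` problem below is a made-up stand-in, not an actual HumanEval task.

```python
# Minimal sketch of HumanEval-style grading: a completion passes a problem
# iff executing its unit tests raises no errors.
# NOTE: illustration only -- the real harness sandboxes untrusted code.

def passes_unit_tests(candidate_src: str, test_src: str) -> bool:
    """Return True iff the candidate completion passes all of its tests."""
    env: dict = {}
    try:
        exec(candidate_src, env)  # define the candidate function
        exec(test_src, env)       # asserts raise AssertionError on failure
        return True
    except Exception:
        return False

# One toy (hypothetical) problem in the prompt-plus-tests shape.
candidate = "def add(a, b):\n    return a + b\n"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"

problems = [(candidate, tests)]
score = 100.0 * sum(passes_unit_tests(c, t) for c, t in problems) / len(problems)
print(f"HumanEval-style score: {score:.1f} / 100")
```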
“Value” is the normalized benchmark score (0–100 for this leaderboard cohort) divided by the input price in dollars per million tokens, so higher means more capability per dollar on this axis only (see the arithmetic sketch after the table). Always sanity-check latency, context length, and your real workload before acting on this ranking.
| Rank | Model | Vendor | Value | HumanEval (0–100) | Input price |
|-----:|-------|--------|------:|------------------:|------------:|
| 1 | Gemini 2.0 Flash | Google | 680.20 | 68.0 | $0.10/M |
| 2 | Qwen3 235B | Alibaba (Qwen) | 414.40 | 82.9 | $0.20/M |
| 3 | GPT-4o mini | OpenAI | 387.40 | 58.1 | $0.15/M |
| 4 | Llama 4 Scout | Meta | 292.78 | 52.7 | $0.18/M |
| 5 | Llama 4 Maverick | Meta | 271.93 | 73.4 | $0.27/M |
| 6 | DeepSeek V3 | DeepSeek | 230.22 | 62.2 | $0.27/M |
| 7 | DeepSeek R1 | DeepSeek | 180.18 | 99.1 | $0.55/M |
| 8 | o3-mini | OpenAI | 78.63 | 86.5 | $1.10/M |
| 9 | Claude 3.5 Haiku | Anthropic | 77.70 | 62.2 | $0.80/M |
| 10 | Gemini 2.5 Pro | Google | 76.40 | 95.5 | $1.25/M |
| 11 | Llama 3.3 70B Instruct | Meta | 72.17 | 63.5 | $0.88/M |
| 12 | Qwen2.5 72B Instruct | Alibaba (Qwen) | 61.57 | 55.4 | $0.90/M |
| 13 | Gemini 1.5 Pro | Google | 35.31 | 44.1 | $1.25/M |
| 14 | Grok 2 | xAI | 31.75 | 63.5 | $2.00/M |
| 15 | Llama 3.1 70B Instruct | Meta | 31.74 | 27.9 | $0.88/M |
| 16 | Claude Sonnet 4 | Anthropic | 31.08 | 93.2 | $3.00/M |
| 17 | Claude 3.5 Sonnet | Anthropic | 29.13 | 87.4 | $3.00/M |
| 18 | Mistral Large 2 | Mistral | 29.05 | 58.1 | $2.00/M |
| 19 | GPT-4o | OpenAI | 28.65 | 71.6 | $2.50/M |
| 20 | o1-mini | OpenAI | 27.18 | 81.5 | $3.00/M |
| 21 | Grok 3 | xAI | 25.83 | 77.5 | $3.00/M |
| 22 | Llama 3.1 405B Instruct | Meta | 18.92 | 66.2 | $3.50/M |
| 23 | o3 | OpenAI | 9.37 | 93.7 | $10.00/M |
| 24 | Claude Opus 4 | Anthropic | 6.67 | 100.0 | $15.00/M |
| 25 | o1 | OpenAI | 5.44 | 81.5 | $15.00/M |
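As a check on the arithmetic, here is a minimal Python sketch of the value computation defined above, using a handful of rows copied from the table. Small discrepancies against the published Value column (e.g., 680.00 vs 680.20 for Gemini 2.0 Flash) come from rounding in the displayed scores.

```python
# Minimal sketch of the "value" metric: normalized benchmark score divided by
# input price in dollars per million tokens. Scores and prices are copied
# from the table above; results may differ slightly due to score rounding.

models = [
    # (model, humaneval_score, input_price_usd_per_mtok)
    ("Gemini 2.0 Flash", 68.0, 0.10),
    ("Qwen3 235B", 82.9, 0.20),
    ("GPT-4o mini", 58.1, 0.15),
    ("Claude Opus 4", 100.0, 15.00),
]

ranked = sorted(
    ((name, score / price, score, price) for name, score, price in models),
    key=lambda row: row[1],
    reverse=True,
)
for rank, (name, value, score, price) in enumerate(ranked, start=1):
    print(f"{rank}. {name}: value = {value:.2f} ({score} / ${price:.2f}/M)")
```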
AI Model Analyzer does not recommend specific vendors; rankings are derived from public data only.