Value ranking
Best value on GPQA Diamond
GPQA is a graduate-level, Google-proof Q&A benchmark in physics, chemistry, and biology. The Diamond subset is the hardest tier, with PhD-validated answers.
“Value” is the normalized benchmark score (0–100 for this leaderboard cohort) divided by the input price in US dollars per million tokens. A higher value means more capability per dollar on this axis only, so always sanity-check latency, context length, and your real workload.
- 1. Qwen2.5 72B Instruct, Alibaba (Qwen): 111.11 (100.0 / $0.90/M)
- 2. Mixtral 8x22B, Mistral: 80.30 (96.4 / $1.20/M)
- 3. Llama 3.1 70B Instruct, Meta: 68.18 (60.0 / $0.88/M)
- 4. Llama 3.3 70B Instruct, Meta: 0.00 (0.0 / $0.88/M)
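The value figures in the list above can be reproduced with a few lines of Python. This is a minimal sketch, not the site's actual implementation: the `Entry` type and `value` function are hypothetical names introduced here, and the scores and prices are copied from the list as displayed. It assumes the displayed scores are rounded, so a recomputed value may differ slightly from the listed one (Mixtral 8x22B comes out near 80.33 rather than 80.30).

```python
from typing import NamedTuple


class Entry(NamedTuple):
    """One leaderboard row (hypothetical structure, for illustration only)."""
    model: str
    score: float        # normalized benchmark score, 0-100 within this cohort
    price_per_m: float  # input price in USD per million tokens


def value(entry: Entry) -> float:
    """Normalized score divided by input price per million tokens."""
    if entry.price_per_m <= 0:
        raise ValueError("price must be positive")
    return entry.score / entry.price_per_m


# Figures copied from the list above, as displayed (possibly rounded).
entries = [
    Entry("Llama 3.3 70B Instruct", 0.0, 0.88),
    Entry("Llama 3.1 70B Instruct", 60.0, 0.88),
    Entry("Mixtral 8x22B", 96.4, 1.20),
    Entry("Qwen2.5 72B Instruct", 100.0, 0.90),
]

# Rank by value, highest first.
ranked = sorted(entries, key=value, reverse=True)

for rank, e in enumerate(ranked, start=1):
    print(f"{rank}. {e.model}: {value(e):.2f} ({e.score} / ${e.price_per_m:.2f}/M)")
```

Because the score is normalized within the cohort, a model priced identically to another can still land at opposite ends of the ranking, as the two Llama entries do here.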
AI Model Analyzer does not recommend specific vendors; rankings are derived from public data only.