AMA

Value ranking

Best value on Rolling Contamination-Controlled Average

Contamination-controlled average across seven rolling task categories (reasoning, coding, agentic coding, mathematics, data analysis, language, instruction following). Questions are rotated every six months and ground-truth answers are objective, removing the need for LLM-as-judge scoring.

“Value” is the normalized benchmark score (0–100 within this leaderboard cohort) divided by the input price in US dollars per million tokens. A higher value means more capability per dollar on this axis only. It ignores output-token pricing, so always sanity-check latency, context length, and your real workload.
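As a minimal sketch of the ratio described above, the following Python snippet computes value as score divided by input price and sorts entries by it. The `Entry` class and the `value` and `rank_by_value` functions are hypothetical names chosen for illustration, not part of any published AI Model Analyzer tooling; the three sample rows are copied from this leaderboard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """One leaderboard row (hypothetical structure, for illustration only)."""
    model: str
    score: float            # normalized benchmark score, 0-100
    input_price_usd: float  # USD per million input tokens


def value(entry: Entry) -> float:
    """Return score per dollar of input price, as defined in the text."""
    if entry.input_price_usd <= 0:
        raise ValueError("input price must be positive")
    return entry.score / entry.input_price_usd


def rank_by_value(entries: list[Entry]) -> list[Entry]:
    """Sort entries from highest to lowest value."""
    return sorted(entries, key=value, reverse=True)


# Sample rows taken from the leaderboard, deliberately out of order.
entries = [
    Entry("Claude Opus 4 (Thinking)", 97.1, 15.00),
    Entry("DeepSeek V3 (Thinking)", 100.0, 0.27),
    Entry("Qwen3 235B", 0.0, 0.20),
]

for rank, e in enumerate(rank_by_value(entries), start=1):
    print(f"{rank}. {e.model}: {value(e):.2f}")
```

Recomputing from the displayed figures will not always match the listed value to the last digit. For example, 22.5 / 0.27 gives about 83.33, while the leaderboard lists 83.19, presumably because the displayed scores are rounded before publication.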

  Rank. Model, Provider: value (score / input price per million tokens)

  1. DeepSeek V3 (Thinking), DeepSeek: 370.37 (100.0 / $0.27/M)
  2. Qwen3 235B (Thinking), Alibaba (Qwen): 154.55 (30.9 / $0.20/M)
  3. DeepSeek V3, DeepSeek: 83.19 (22.5 / $0.27/M)
  4. Gemini 2.5 Pro (Max Thinking), Google: 56.82 (71.0 / $1.25/M)
  5. Claude Sonnet 4 (Thinking), Anthropic: 31.01 (93.0 / $3.00/M)
  6. Claude Opus 4 (Thinking), Anthropic: 6.47 (97.1 / $15.00/M)
  7. Claude Sonnet 4, Anthropic: 5.34 (16.0 / $3.00/M)
  8. Claude Opus 4, Anthropic: 2.80 (42.0 / $15.00/M)
  9. Qwen3 235B, Alibaba (Qwen): 0.00 (0.0 / $0.20/M)

AI Model Analyzer does not recommend specific vendors; rankings are derived from public data only.