Value ranking

Best value on Aider Polyglot

Aider Polyglot consists of real-world refactoring and bug-fix tasks across multiple programming languages, scored by whether the model produces a passing patch in Aider's edit format. It tests practical coding ability beyond single-file generation; it is harder than HumanEval and not yet saturated.

“Value” is the normalized benchmark score (0–100 for this leaderboard cohort) divided by the input price per million tokens. Higher means more capability per dollar on this axis only; always sanity-check latency, context length, and performance on your real workload.
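Under that definition, each model's value is simply its benchmark score divided by its input price. A minimal sketch, using figures from two of the leaderboard rows (GPT-5 and o3):

```python
def value(score: float, input_price_per_m: float) -> float:
    """Benchmark score (0-100) per dollar of input price per million tokens."""
    return score / input_price_per_m

# Figures taken from the leaderboard rows below:
print(round(value(100.0, 1.25), 2))   # GPT-5: 100.0 / $1.25/M -> 80.0
print(round(value(92.1, 10.00), 2))   # o3:     92.1 / $10.00/M -> 9.21
```

Note that this metric rewards cheap input tokens heavily: a mid-scoring model at a very low price can outrank a top-scoring model at a premium price.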

| Rank | Model | Vendor | Value | Score / Input price |
|------|-------|--------|-------|---------------------|
| 1 | Qwen3 235B | Alibaba (Qwen) | 331.75 | 66.3 / $0.20/M |
| 2 | DeepSeek V3 (Thinking) | DeepSeek | 309.81 | 83.7 / $0.27/M |
| 3 | DeepSeek V3 | DeepSeek | 292.26 | 78.9 / $0.27/M |
| 4 | DeepSeek R1 | DeepSeek | 152.09 | 83.7 / $0.55/M |
| 5 | Kimi K2 | Moonshot (Kimi) | 109.60 | 65.8 / $0.60/M |
| 6 | GPT-5 | OpenAI | 80.00 | 100.0 / $1.25/M |
| 7 | o4-mini | OpenAI | 73.67 | 81.0 / $1.10/M |
| 8 | o3-mini | OpenAI | 61.18 | 67.3 / $1.10/M |
| 9 | Llama 4 Maverick | Meta | 52.67 | 14.2 / $0.27/M |
| 10 | Claude 3.5 Haiku | Anthropic | 36.14 | 28.9 / $0.80/M |
| 11 | Claude 3.7 Sonnet | Anthropic | 24.21 | 72.6 / $3.00/M |
| 12 | Claude Sonnet 4 | Anthropic | 20.85 | 62.6 / $3.00/M |
| 13 | Grok 3 | xAI | 19.63 | 58.9 / $3.00/M |
| 14 | Claude 3.5 Sonnet | Anthropic | 18.96 | 56.9 / $3.00/M |
| 15 | Grok 4 | xAI | 18.01 | 90.0 / $5.00/M |
| 16 | GPT-4o | OpenAI | 9.24 | 23.1 / $2.50/M |
| 17 | o3 | OpenAI | 9.21 | 92.1 / $10.00/M |
| 18 | Claude Opus 4 | Anthropic | 5.30 | 79.5 / $15.00/M |
| 19 | GPT-4o mini | OpenAI | 0.00 | 0.0 / $0.15/M |

AI Model Analyzer does not recommend specific vendors; rankings are derived from public data only.