Value ranking

Best value on Terminal-Bench 2

Terminal-Bench 2 consists of long-horizon shell-and-filesystem tasks executed in a sandboxed terminal and scored by whether the agent's final state matches a target state. It tests practical tool-using ability for everyday devops and data-wrangling work, and it is one of the hardest agentic benchmarks today.

“Value” is the normalized benchmark score (0–100 within this leaderboard cohort) divided by the input price in US dollars per million tokens. A higher value means more capability per dollar on this axis only; always sanity-check latency, context length, and your real workload.
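As a minimal sketch of the definition above, the ratio can be written as a small Python function. The function name `value_score` and its parameter names are illustrative assumptions, not part of the leaderboard; the sample inputs are the GPT-5.5 row (100.0 at $1.50/M).

```python
def value_score(normalized_score: float, input_price_per_million: float) -> float:
    """Return normalized score divided by input price per million tokens.

    Hypothetical helper: the name and signature are assumptions made for
    illustration only.
    """
    if input_price_per_million <= 0:
        # A zero or negative price would make the ratio meaningless.
        raise ValueError("input price must be positive")
    return normalized_score / input_price_per_million


if __name__ == "__main__":
    # GPT-5.5 row from the list: 100.0 / $1.50/M
    print(round(value_score(100.0, 1.50), 2))  # prints 66.67
```

Because only the input price appears in the denominator, this ratio does not reflect output-token pricing, which is one reason to check it against a real workload.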

Each entry lists model (vendor): value (normalized score / input price per million tokens).

  1. Gemini 3 Flash (Google): 249.63 (74.9 / $0.30/M)
  2. DeepSeek V3 (DeepSeek): 147.63 (39.9 / $0.27/M)
  3. GPT-5 mini (OpenAI): 132.20 (33.0 / $0.25/M)
  4. Gemini 3 Pro (Google): 77.96 (97.5 / $1.25/M)
  5. Kimi K2 (Moonshot (Kimi)): 74.93 (45.0 / $0.60/M)
  6. GPT-5.5 (OpenAI): 66.67 (100.0 / $1.50/M)
  7. GPT-5.4 (OpenAI): 66.48 (99.7 / $1.50/M)
  8. GLM-4.7 (Zhipu AI (GLM)): 62.12 (31.1 / $0.50/M)
  9. GPT-5.2 (OpenAI): 60.59 (75.7 / $1.25/M)
  10. GPT-5 (OpenAI): 43.23 (54.0 / $1.25/M)
  11. GPT-5.1 (OpenAI): 40.97 (51.2 / $1.25/M)
  12. GLM-4.6 (Zhipu AI (GLM)): 36.88 (18.4 / $0.50/M)
  13. Claude Haiku 4.5 (Anthropic): 34.04 (34.0 / $1.00/M)
  14. Gemini 2.5 Flash (Google): 26.47 (7.9 / $0.30/M)
  15. Gemini 2.5 Pro (Google): 23.94 (29.9 / $1.25/M)
  16. Claude Sonnet 4.6 (Anthropic): 16.55 (49.6 / $3.00/M)
  17. Claude Sonnet 4.5 (Anthropic): 14.80 (44.4 / $3.00/M)
  18. Claude Opus 4.6 (Anthropic): 6.46 (96.9 / $15.00/M)
  19. Claude Opus 4.5 (Anthropic): 4.88 (73.2 / $15.00/M)
  20. Grok 4 (xAI): 4.45 (22.3 / $5.00/M)
  21. Claude Opus 4 (Anthropic): 2.51 (37.6 / $15.00/M)
  22. GPT-5 nano (OpenAI): 0.00 (0.0 / $0.05/M)
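Dividing the displayed score by the displayed price does not always reproduce the listed value exactly (for example, 74.9 / 0.30 is about 249.67, while the list shows 249.63). One plausible explanation, which is an assumption here since the page does not state it, is that values are computed from unrounded scores and the scores are then shown to one decimal place. The sketch below checks the top five rows against that assumption; the `Row` type, the helper names, and the tolerance formula are all illustrative.

```python
from typing import NamedTuple


class Row(NamedTuple):
    model: str
    listed_value: float  # value as shown in the ranking
    score: float         # normalized score as shown (one decimal place)
    price: float         # input price in USD per million tokens


# Top five rows copied from the ranking above.
ROWS = [
    Row("Gemini 3 Flash", 249.63, 74.9, 0.30),
    Row("DeepSeek V3", 147.63, 39.9, 0.27),
    Row("GPT-5 mini", 132.20, 33.0, 0.25),
    Row("Gemini 3 Pro", 77.96, 97.5, 1.25),
    Row("GPT-5.5", 66.67, 100.0, 1.50),
]


def recomputed_value(row: Row) -> float:
    """Value recomputed from the displayed (rounded) score and price."""
    return row.score / row.price


def rounding_tolerance(row: Row) -> float:
    """Largest gap explainable by rounding alone (assumed model).

    A score shown to one decimal can be off by up to 0.05, which shifts the
    value by up to 0.05 / price; the listed value itself is rounded to two
    decimals, adding up to 0.005.
    """
    return 0.05 / row.price + 0.005


def within_rounding(row: Row) -> bool:
    return abs(recomputed_value(row) - row.listed_value) <= rounding_tolerance(row)


if __name__ == "__main__":
    for row in ROWS:
        print(row.model, round(recomputed_value(row), 2), row.listed_value,
              within_rounding(row))
```

Under this assumption the gaps are consistent with rounding, and the ordering of these rows is the same whether the listed or the recomputed values are used.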

AI Model Analyzer does not recommend specific vendors; rankings are derived from public data only.