Value ranking
Best value on Terminal-Bench 2
Long-horizon shell-and-filesystem tasks executed in a sandboxed terminal, scored by whether the agent's final state matches a target state. Tests practical tool-using ability for everyday devops and data-wrangling work; one of the hardest agentic benchmarks today.
“Value” is normalized benchmark score (0–100 for this leaderboard cohort) divided by input price per million tokens. Higher means more capability per dollar on this axis only — always sanity-check latency, context length, and your real workload.
- 1Gemini 3 FlashGoogle215.2364.6 / $0.30/M
- 2GPT-5 nanoOpenAI128.606.4 / $0.05/M
- 3DeepSeek V3DeepSeek114.0030.8 / $0.27/M
- 4GPT-5 miniOpenAI96.8424.2 / $0.25/M
- 5Gemini 3 ProGoogle69.0686.3 / $1.25/M
- 6GPT-5.5OpenAI61.6592.5 / $1.50/M
- 7Kimi K2Moonshot (Kimi)59.5035.7 / $0.60/M
- 8GPT-5.4OpenAI59.0188.5 / $1.50/M
- 9GPT-5.2OpenAI52.3165.4 / $1.25/M
- 10GLM-4.7Zhipu AI (GLM)44.6022.3 / $0.50/M
- 11GPT-5OpenAI35.5744.5 / $1.25/M
- 12GPT-5.1OpenAI33.3841.7 / $1.25/M
- 13Claude Haiku 4.5Anthropic25.1725.2 / $1.00/M
- 14GLM-4.6Zhipu AI (GLM)20.2410.1 / $0.50/M
- 15Gemini 2.5 ProGoogle16.9621.2 / $1.25/M
- 16Claude Sonnet 4.6Anthropic16.5549.7 / $3.00/M
- 17Claude Sonnet 4.5Anthropic11.7235.2 / $3.00/M
- 18Claude Opus 4.7Anthropic6.67100.0 / $15.00/M
- 19Claude Opus 4.6Anthropic5.7285.8 / $15.00/M
- 20Claude Opus 4.5Anthropic4.2062.9 / $15.00/M
- 21Grok 4xAI2.7613.8 / $5.00/M
- 22Claude Opus 4Anthropic1.9128.6 / $15.00/M
- 23Gemini 2.5 FlashGoogle0.000.0 / $0.30/M
AI Model Analyzer does not recommend specific vendors; rankings are derived from public data only.