Value ranking

Best value on Terminal-Bench 2

Long-horizon shell-and-filesystem tasks executed in a sandboxed terminal, scored by whether the agent's final state matches a target state. Tests practical tool-using ability for everyday devops and data-wrangling work; one of the hardest agentic benchmarks today.

“Value” is normalized benchmark score (0–100 for this leaderboard cohort) divided by input price per million tokens. Higher means more capability per dollar on this axis only — always sanity-check latency, context length, and your real workload.

1
Gemini 3 Flash
Google
249.63
74.9 / $0.30/M
2
DeepSeek V3
DeepSeek
147.63
39.9 / $0.27/M
3
GPT-5 mini
OpenAI
132.20
33.0 / $0.25/M
4
Gemini 3 Pro
Google
77.96
97.5 / $1.25/M
5
Kimi K2
Moonshot (Kimi)
74.93
45.0 / $0.60/M
6
GPT-5.5
OpenAI
66.67
100.0 / $1.50/M
7
GPT-5.4
OpenAI
66.48
99.7 / $1.50/M
8
GLM-4.7
Zhipu AI (GLM)
62.12
31.1 / $0.50/M
9
GPT-5.2
OpenAI
60.59
75.7 / $1.25/M
10
GPT-5
OpenAI
43.23
54.0 / $1.25/M
11
GPT-5.1
OpenAI
40.97
51.2 / $1.25/M
12
GLM-4.6
Zhipu AI (GLM)
36.88
18.4 / $0.50/M
13
Claude Haiku 4.5
Anthropic
34.04
34.0 / $1.00/M
14
Gemini 2.5 Flash
Google
26.47
7.9 / $0.30/M
15
Gemini 2.5 Pro
Google
23.94
29.9 / $1.25/M
16
Claude Sonnet 4.6
Anthropic
16.55
49.6 / $3.00/M
17
Claude Sonnet 4.5
Anthropic
14.80
44.4 / $3.00/M
18
Claude Opus 4.6
Anthropic
6.46
96.9 / $15.00/M
19
Claude Opus 4.5
Anthropic
4.88
73.2 / $15.00/M
20
Grok 4
xAI
4.45
22.3 / $5.00/M
21
Claude Opus 4
Anthropic
2.51
37.6 / $15.00/M
22
GPT-5 nano
OpenAI
0.00
0.0 / $0.05/M

Open full leaderboard

AI Model Analyzer does not recommend specific vendors; rankings are derived from public data only.