Value ranking
Best value on OTIS Mock AIME 2024-2025
AIME-style competition problems written specifically for the OTIS mock contest, then run as an evaluation by Epoch AI. Closer in spirit to the public AIME but with novel problems unlikely to appear in training data.
“Value” is the normalized benchmark score (0–100 for this leaderboard cohort) divided by the input price per million tokens. Higher means more capability per dollar on this axis only; always sanity-check latency, context length, and your real workload before choosing a model.
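The metric above is a simple ratio, sketched here as a one-line helper (the function name is illustrative, not from any vendor API):

```python
def value_score(benchmark_score: float, input_price_per_mtok: float) -> float:
    """Normalized benchmark score (0-100) divided by input price
    in dollars per million tokens, rounded to two decimals."""
    return round(benchmark_score / input_price_per_mtok, 2)

# Top-ranked row from the table: 80.4 at $0.05/M input tokens.
print(value_score(80.4, 0.05))  # → 1608.0
```

Small rounding differences against the table are expected, since the published scores are themselves rounded.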
| Rank | Model | Vendor | Value | Score (0–100) | Input price |
|---|---|---|---|---|---|
| 1 | GPT-5 nano | OpenAI | 1608.00 | 80.4 | $0.05/M |
| 2 | Qwen3 235B (Thinking) | Alibaba (Qwen) | 430.85 | 86.2 | $0.20/M |
| 3 | GPT-5 mini | OpenAI | 344.68 | 86.2 | $0.25/M |
| 4 | Gemini 3 Flash | Google | 308.37 | 92.5 | $0.30/M |
| 5 | Gemini 2.0 Flash | Google | 285.30 | 28.5 | $0.10/M |
| 6 | Gemini 1.5 Flash | Google | 174.80 | 13.1 | $0.08/M |
| 7 | GLM-4.7 | Zhipu AI (GLM) | 165.42 | 82.7 | $0.50/M |
| 8 | DeepSeek R1 | DeepSeek | 158.84 | 87.4 | $0.55/M |
| 9 | Kimi K2 | Moonshot (Kimi) | 153.18 | 91.9 | $0.60/M |
| 10 | DeepSeek V3 | DeepSeek | 131.30 | 35.5 | $0.27/M |
| 11 | GPT-5.2 | OpenAI | 76.78 | 96.0 | $1.25/M |
| 12 | Gemini 3 Pro | Google | 76.35 | 95.4 | $1.25/M |
| 13 | o4-mini | OpenAI | 73.62 | 81.0 | $1.10/M |
| 14 | GPT-5 | OpenAI | 72.86 | 91.1 | $1.25/M |
| 15 | GPT-5.1 | OpenAI | 70.54 | 88.2 | $1.25/M |
| 16 | o3-mini | OpenAI | 69.16 | 76.1 | $1.10/M |
| 17 | Gemini 2.5 Pro | Google | 66.86 | 83.6 | $1.25/M |
| 18 | GPT-5.5 | OpenAI | 66.67 | 100.0 | $1.50/M |
| 19 | Claude Haiku 4.5 | Anthropic | 65.42 | 65.4 | $1.00/M |
| 20 | GPT-5.4 | OpenAI | 63.41 | 95.1 | $1.50/M |
| 21 | Claude Sonnet 4.6 | Anthropic | 28.42 | 85.3 | $3.00/M |
| 22 | Claude Sonnet 4.5 | Anthropic | 25.65 | 77.0 | $3.00/M |
| 23 | Llama 4 Scout | Meta | 24.00 | 4.3 | $0.18/M |
| 24 | GPT-4o mini | OpenAI | 23.07 | 3.5 | $0.15/M |
| 25 | Claude 3.7 Sonnet | Anthropic | 18.73 | 56.2 | $3.00/M |
AI Model Analyzer does not recommend specific vendors; rankings are derived from public data only.