o1
Pricing verified 1 year ago
Benchmarks
preference
Crowdsourced pairwise human preference rankings of LLM responses. Higher Elo means more frequently preferred by users.
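For a concrete sense of how pairwise votes turn into an Elo number, here is a minimal sketch of the standard Elo update; the K-factor, starting rating, and model names are illustrative assumptions, not the leaderboard's actual parameters.

```python
# Minimal Elo update from pairwise preference votes.
# K_FACTOR and the 1000-point starting rating are illustrative assumptions,
# not the parameters used by any real leaderboard.

K_FACTOR = 32
START_RATING = 1000.0

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A is preferred over model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update(ratings: dict, winner: str, loser: str) -> None:
    """Apply one pairwise vote: `winner` was preferred over `loser`."""
    ra = ratings.setdefault(winner, START_RATING)
    rb = ratings.setdefault(loser, START_RATING)
    ea = expected_score(ra, rb)
    ratings[winner] = ra + K_FACTOR * (1.0 - ea)
    ratings[loser] = rb - K_FACTOR * (1.0 - ea)

# Example: three votes between two hypothetical models.
ratings = {}
for w, l in [("model-x", "model-y"), ("model-x", "model-y"), ("model-y", "model-x")]:
    update(ratings, w, l)
print(ratings)  # higher Elo = more frequently preferred
```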
math
American Invitational Mathematics Examination 2024 problems. Answers are integers from 0 to 999, graded by exact match; very hard for non-reasoning models.
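A minimal grading sketch, assuming a simple "take the last integer in the output" extraction rule (the regex is an assumption, not the benchmark's official harness):

```python
import re

def grade_aime_answer(model_output: str, reference: int) -> bool:
    """Exact-match grading for AIME-style answers (integers 0-999).
    Extraction rule is a simplifying assumption: take the last integer
    that appears in the model's output."""
    matches = re.findall(r"\d+", model_output)
    if not matches:
        return False
    predicted = int(matches[-1])
    return 0 <= predicted <= 999 and predicted == reference

print(grade_aime_answer("The answer is 073.", 73))   # True
print(grade_aime_answer("I think it's 1024.", 24))   # False
```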
coding
164 hand-written Python programming problems, each scored by whether the generated solution passes its unit tests. Saturated for frontier models.
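Benchmarks of this kind are usually reported as pass@k. A minimal sketch of the standard unbiased estimator (n samples per problem, c of which pass the tests):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate given n generated samples per problem,
    of which c pass the unit tests. Equals 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples per problem, 3 pass the tests.
print(round(pass_at_k(n=10, c=3, k=1), 3))  # 0.3
```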
vision
Massive Multi-discipline Multimodal Understanding; college-exam level questions with images across 30+ subjects.
Math reasoning over visual contexts (charts, figures, geometry).
long context
Long-context retrieval and reasoning suite. We report the 128k-token effective-context score.
performance
Median sustained output speed in tokens per second on the model's first-party API for medium-length prompts. Higher is faster.
Median time from request to first output chunk in milliseconds on the model's first-party API for medium-length prompts. Lower is snappier; reasoning models are penalised here because they think before talking.
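Both numbers can be measured from any streaming response. A minimal sketch, assuming a generic iterator of text chunks and a tokenizer callback supplied by whatever client library is in use; in practice the median is taken over many requests:

```python
import time
from typing import Callable, Iterable, Tuple

def measure_stream(chunks: Iterable[str],
                   count_tokens: Callable[[str], int]) -> Tuple[float, float]:
    """Measure time-to-first-chunk (ms) and sustained output speed (tokens/s)
    over one streamed response. `chunks` is any iterator of text deltas from a
    streaming API; `count_tokens` is a tokenizer callback. Both are assumptions
    here, supplied by whichever client library you use."""
    start = time.monotonic()
    first_chunk_ms = None
    total_tokens = 0
    for chunk in chunks:
        now = time.monotonic()
        if first_chunk_ms is None:
            first_chunk_ms = (now - start) * 1000.0
        total_tokens += count_tokens(chunk)
    if first_chunk_ms is None:
        first_chunk_ms = 0.0  # empty stream
    elapsed = time.monotonic() - start
    tokens_per_s = total_tokens / elapsed if elapsed > 0 else 0.0
    return first_chunk_ms, tokens_per_s

# Example with a fake in-memory stream and a whitespace "tokenizer".
fake_stream = iter(["Reasoning models ", "think before ", "they talk."])
ttft_ms, tps = measure_stream(fake_stream, lambda s: len(s.split()))
print(f"TTFT: {ttft_ms:.1f} ms, speed: {tps:.1f} tok/s")
```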
composite
Saturation-resistant composite capability score stitched together from ~40 underlying benchmarks using Item Response Theory. Each benchmark is weighted by its fitted difficulty and discriminative slope, so doing well on hard, contamination-resistant evals (FrontierMath, ARC-AGI 2, Humanity's Last Exam) moves the score the most, while saturated benchmarks contribute almost nothing. Imported per-model from Epoch AI's published index; we anchor it to the same min-max scale we use for every other benchmark so it's directly weightable in scenarios.
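The IRT fitting itself happens upstream in the published index; on our side the anchoring and scenario weighting are simple arithmetic. A minimal sketch with made-up numbers:

```python
def min_max_anchor(value: float, lo: float, hi: float) -> float:
    """Anchor an imported score onto the common 0-100 min-max scale.
    `lo` and `hi` are the min and max of the imported index across the
    comparison set (hypothetical values below)."""
    if hi == lo:
        return 0.0
    return 100.0 * (value - lo) / (hi - lo)

def weighted_composite(scores: dict, weights: dict) -> float:
    """Blend per-benchmark scores (already on the 0-100 scale) with
    scenario weights. These weights are user-chosen; the IRT weighting
    happens upstream when the composite index is built."""
    total_weight = sum(weights.values())
    return sum(scores[name] * w for name, w in weights.items()) / total_weight

# Hypothetical numbers, for illustration only.
anchored = min_max_anchor(value=0.62, lo=0.10, hi=0.85)  # imported index -> 0-100
blended = weighted_composite(
    {"composite": anchored, "coding": 91.0, "math": 79.0},
    {"composite": 0.5, "coding": 0.3, "math": 0.2},
)
print(round(anchored, 1), round(blended, 1))
```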
Hosted endpoints
| Host | Input ($/M tokens) | Output ($/M tokens) | Context | Quantization |
|---|---|---|---|---|
| Host B | $15.00 | $60.00 | 200k | unknown |
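Per-request cost at Host B's listed rates is simple arithmetic. A minimal sketch, assuming hidden reasoning tokens are billed as output tokens (as is typical for reasoning models), with made-up token counts:

```python
# Per-request cost from the Host B rates in the table above ($/M tokens).
INPUT_PER_M = 15.00
OUTPUT_PER_M = 60.00

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in dollars for one request. Hidden reasoning tokens are assumed
    to be billed as output tokens, so include them in output_tokens."""
    return (input_tokens / 1_000_000) * INPUT_PER_M + \
           (output_tokens / 1_000_000) * OUTPUT_PER_M

# Example: 2k prompt tokens, 10k output tokens (reasoning + visible answer).
print(f"${request_cost(2_000, 10_000):.2f}")  # $0.63
```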