
Claude Sonnet 4

Closed · Anthropic · Proprietary · text · vision
Claude 4 · Released 1y ago

Avg score: 56.7 / 100
Context: 200k
Output limit: 64k
Input price: $3.00 /M
Output price: $15.00 /M

Pricing verified 1y ago
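
At these rates, per-request cost is simple arithmetic. A minimal sketch, with illustrative token counts that are not taken from this page:

```python
# Cost of one request at the listed rates: $3.00 per million input tokens,
# $15.00 per million output tokens. Token counts below are illustrative.
INPUT_PER_M = 3.00
OUTPUT_PER_M = 15.00

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for a single request."""
    return input_tokens / 1e6 * INPUT_PER_M + output_tokens / 1e6 * OUTPUT_PER_M

print(f"${request_cost(2_000, 800):.4f}")  # $0.0180 for a 2k-in / 800-out call
```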

Benchmarks

preference

Chatbot Arena Elo (Fresh)
Unit: Elo

Crowdsourced pairwise human preference rankings of LLM responses. Higher Elo means more frequently preferred by users.
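
For intuition, the classic Elo update behind one pairwise vote looks like the sketch below. The K-factor and ratings are illustrative assumptions; Chatbot Arena's published methodology fits ratings statistically (Bradley-Terry style) rather than running online updates.

```python
# Classic online Elo update for one pairwise preference vote.
# K = 32 is an illustrative assumption, not Chatbot Arena's parameter.
def expected_win(r_a: float, r_b: float) -> float:
    """Probability that model A is preferred, given current ratings."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    delta = k * ((1.0 if a_won else 0.0) - expected_win(r_a, r_b))
    return r_a + delta, r_b - delta

r_a, r_b = elo_update(1200.0, 1150.0, a_won=True)  # A preferred: A gains, B loses
```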

math

AIME 2024 (High risk)
Unit: %

American Invitational Mathematics Examination 2024 problems. Answers are integers from 0 to 999; very hard for non-reasoning models.

FrontierMath Tiers 1-3 (Fresh)
Unit: %

Mathematical research problems spanning analysis, algebra, combinatorics and number theory. Tiers 1-3 are progressively harder; even frontier reasoning models only solve a small fraction. The hardest publicly reported benchmark for general mathematical reasoning.

OTIS Mock AIME 2024-2025 (Fresh)
Unit: %

AIME-style competition problems written specifically for the OTIS mock contest, then run as an evaluation by Epoch AI. Closer in spirit to the public AIME but with novel problems unlikely to appear in training data.

coding

HumanEval (Saturated)
Unit: % pass@1

164 hand-written Python programming problems scored by passing unit tests. Saturated for frontier models.
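
pass@1 is conventionally reported with the unbiased estimator from the original HumanEval paper: draw n samples per problem, of which c pass the tests, and compute pass@k = 1 - C(n-c, k)/C(n, k). A direct transcription:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k from the HumanEval paper: 1 - C(n-c, k) / C(n, k),
    where n samples were drawn and c of them passed all unit tests."""
    if n - c < k:          # every size-k draw contains at least one pass
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

print(pass_at_k(n=10, c=7, k=1))  # 0.7
```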

LiveCodeBench (Fresh)
Unit: % pass@1

Continuously refreshed coding benchmark drawing from LeetCode, AtCoder, and Codeforces; reduces benchmark contamination.

Aider Polyglot (Fresh)
Unit: %

Real-world refactoring and bug-fix tasks across multiple programming languages, scored by whether the model produces a passing patch in Aider's edit format. Tests practical coding ability beyond single-file generation; harder than HumanEval and not yet saturated.

agentic

SWE-bench Verified (Some risk)
Unit: % resolved

Real GitHub issues solved end-to-end. The Verified subset is a 500-task, human-validated slice of the original SWE-bench.

vision

MMMU (Some risk)
Unit: %

Massive Multi-discipline Multimodal Understanding: college-exam-level questions with images across 30+ subjects.

MathVista (Some risk)
Unit: %

Math reasoning over visual contexts (charts, figures, geometry).

long context

RULER 128k (Fresh)
Unit: %

Long-context retrieval and reasoning suite. We report the 128k token effective-context score.

performance

Output Speed (N/A)
Unit: tok/s

Median sustained output speed in tokens per second on the model's first-party API for medium-length prompts. Higher is faster.

Time to First Token (N/A)
Unit: ms

Median time from request to first output chunk in milliseconds on the model's first-party API for medium-length prompts. Lower is snappier; reasoning models are penalised here because they think before talking.
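
Roughly how both performance numbers fall out of a single streaming request is sketched below. `stream_completion` is a hypothetical client that yields per-chunk token counts; the site's actual harness, and its median aggregation over many runs, is not published here.

```python
import time

def measure(stream_completion, prompt: str):
    """One sample of TTFT (ms) and sustained output speed (tok/s)."""
    start = time.perf_counter()
    first = None
    tokens = 0
    for chunk_tokens in stream_completion(prompt):  # hypothetical streaming client
        if first is None:
            first = time.perf_counter()             # first output chunk arrives
        tokens += chunk_tokens
    end = time.perf_counter()
    ttft_ms = (first - start) * 1000.0
    tok_per_s = tokens / max(end - first, 1e-9)     # guard single-chunk replies
    return ttft_ms, tok_per_s                       # report medians over many runs
```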

general

Rolling Contamination-Controlled Average (Fresh)
Unit: /100

Contamination-controlled average across seven rolling task categories (reasoning, coding, agentic coding, mathematics, data analysis, language, instruction following). Questions are rotated every six months and ground-truth answers are objective, removing the need for LLM-as-judge scoring.

data analysis

Rolling Data Analysis (Fresh)
Unit: /100

Rolling contamination-controlled data-analysis evaluation covering table comprehension, CSV/spreadsheet reasoning, SQL-style joins, and chart interpretation. Refreshed every six months with new tables and questions to minimise contamination.

reasoning

Humanity's Last Exam (Fresh)
Unit: %

A challenging multi-disciplinary exam aggregating expert-written questions from across academic fields. Designed to discriminate at the very top of the capability range when MMLU-style tests saturate.

ARC-AGI 2 (Fresh)
Unit: %

Second-generation ARC challenge testing fluid reasoning over abstract visual puzzles. Resists training-data memorisation by construction: each puzzle is novel and solutions require multi-step pattern induction. Frontier models are only just starting to score above chance on the harder tier.

composite

Frontier Composite (Fresh)
Unit: ECI

Saturation-resistant composite capability score stitched together from ~40 underlying benchmarks using Item Response Theory. Each benchmark is weighted by its fitted difficulty and discriminative slope, so doing well on hard, contamination-resistant evals (FrontierMath, ARC-AGI 2, Humanity's Last Exam) moves the score, while saturated benchmarks contribute almost nothing. Imported per-model from Epoch AI's published index; we anchor it to the same min-max scale used for every other benchmark so it is directly weightable in scenarios.
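
A stylised sketch of the two ingredients, under loud assumptions: this is not Epoch AI's fitting code, and every parameter below is invented for illustration. A 2PL item-response curve links a latent ability to each benchmark's expected score, the ability is fitted to the observed scores, and the result is min-max anchored like every other column.

```python
import math

def two_pl(theta: float, a: float, b: float) -> float:
    """2PL IRT curve: expected score (0-1) on a benchmark with
    discrimination a and difficulty b, for a model of ability theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def fit_theta(scores: dict, params: dict) -> float:
    """Grid-search the ability that best reproduces observed scores (0-1).
    Hard, discriminative benchmarks (large a, large b) dominate the fit."""
    def sq_err(theta):
        return sum((two_pl(theta, *params[k]) - s) ** 2 for k, s in scores.items())
    return min((t / 100.0 for t in range(-400, 401)), key=sq_err)

def anchor(x: float, lo: float, hi: float) -> float:
    """Min-max anchor onto the same scale as every other benchmark."""
    return 100.0 * (x - lo) / (hi - lo)

# Invented (discrimination, difficulty) pairs, for illustration only.
params = {"FrontierMath": (1.8, 2.5), "ARC-AGI 2": (1.5, 2.2), "HumanEval": (0.6, -1.0)}
theta = fit_theta({"FrontierMath": 0.05, "ARC-AGI 2": 0.10, "HumanEval": 0.95}, params)
```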

reliability

Output Stability (N/A)
Unit: /100

How consistent the model's outputs are across repeated runs of the same task. Higher means lower variance and fewer occasional hallucinations under identical inputs. Useful for production loops that need reproducible behaviour.
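
The upstream formula is not published on this page; one plausible reading, sketched below, is mean pairwise exact-match agreement across repeated runs of the same prompt, scaled to /100.

```python
from itertools import combinations

def stability(outputs: list) -> float:
    """Mean pairwise exact-match agreement across repeated runs, as /100.
    A plausible proxy only; the upstream definition is not published here."""
    pairs = list(combinations(outputs, 2))
    return 100.0 * sum(a == b for a, b in pairs) / len(pairs)

print(stability(["42", "42", "42", "41"]))  # 50.0: 3 of 6 pairs agree
```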

Format Adherence (N/A)
Unit: /100

How reliably the model produces output in the requested format (JSON schemas, markdown structures, exact-string responses). Pairs well with IFEval but reflects how the deployed API is behaving day to day rather than how a frozen test set scores.
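
A plausible shape for such a check (again, not the upstream harness): request JSON against a fixed schema and count the share of responses that validate.

```python
import json
from jsonschema import ValidationError, validate  # pip install jsonschema

SCHEMA = {"type": "object",
          "properties": {"answer": {"type": "string"}},
          "required": ["answer"]}

def adheres(raw: str) -> bool:
    """True if the raw response is valid JSON matching the schema."""
    try:
        validate(json.loads(raw), SCHEMA)
        return True
    except (json.JSONDecodeError, ValidationError):
        return False

responses = ['{"answer": "yes"}', 'Sure! Here you go: {"answer": "yes"}']
print(100.0 * sum(map(adheres, responses)) / len(responses))  # 50.0
```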

Recovery Rate (N/A)
Unit: /100

How often the model self-corrects after producing an incorrect intermediate step (surfaced as a debugging axis in the upstream signal). Critical for agentic loops that depend on the model noticing and repairing its own mistakes rather than barrelling forward.

Safety Handling (N/A)
Unit: /100

How well the model handles safety-sensitive prompts without false-refusing benign requests or producing unsafe output. The upstream signal does not separate refusal counts from substantive content-safety behaviour, so this single axis covers both.

Hosted endpoints

Host     Input $/M   Output $/M   Context   Quant
Host G   $3.00       $15.00       1.0M      unknown
Host D   $3.00       $15.00       200k      unknown
Host H   $3.00       $15.00       1.0M      unknown
Host I   $3.00       $15.00       1.0M      unknown
Host E   $3.00       $15.00       1.0M      unknown
Anonymised third-party hosts. Sorted by lowest output price.

Effort variants

Same API model, different reasoning budget. Thinking / xHigh modes usually score better on reasoning benchmarks but emit many more output tokens per request.
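
As a concrete example of the knob being described, the Anthropic Messages API exposes an extended-thinking budget; the sketch below assumes that parameter, with an illustrative model id and budget values.

```python
# Same API model, larger reasoning budget. Assumes the Anthropic Python SDK's
# extended-thinking parameter; model id and budgets are illustrative.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=16_000,  # must exceed the thinking budget below
    thinking={"type": "enabled", "budget_tokens": 8_000},  # the "effort" knob
    messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
)
# Thinking tokens are billed as output tokens, which is why higher-effort
# variants emit many more output tokens per request.
```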
