
Llama 4 Maverick

Developer: Meta
License: Open source (restricted)
Modalities: text, vision
Family: Llama 4 · Released 1y ago
Avg score: 47.9 / 100
Context: 1.0M
Output limit: 8k
Input price: $0.27 /M
Output price: $0.85 /M

Pricing verified 1y ago.

Benchmarks

preference

Chatbot Arena Elo (Fresh) · Elo

Crowdsourced pairwise human preference rankings of LLM responses. Higher Elo means more frequently preferred by users.
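Turning pairwise votes into a rating can be sketched with the classic online Elo update. This is a simplified illustration with a hypothetical K-factor of 32; production arena leaderboards typically fit a Bradley-Terry model over all votes at once rather than updating sequentially.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update(r_a: float, r_b: float, outcome_a: float, k: float = 32.0):
    """Update both ratings after one pairwise comparison.

    outcome_a is 1.0 if A was preferred, 0.0 if B was, 0.5 for a tie.
    The update is zero-sum: whatever A gains, B loses.
    """
    e_a = expected_score(r_a, r_b)
    r_a_new = r_a + k * (outcome_a - e_a)
    r_b_new = r_b + k * ((1.0 - outcome_a) - (1.0 - e_a))
    return r_a_new, r_b_new

# An underdog at 1000 beats a model rated 1200 in one vote:
print(update(1000, 1200, 1.0))
```

Because the expected score is already 0.76 in the favourite's favour, an upset moves the ratings much more than a predictable win would.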

math

AIME 2024 (High risk) · %

American Invitational Mathematics Examination 2024 problems. Answers are integers from 0 to 999; very hard for non-reasoning models.

coding

HumanEval (Saturated) · % pass@1

164 hand-written Python programming problems scored by passing unit tests. Saturated for frontier models.
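pass@1 is the fraction of problems solved by a single sample. When several samples are drawn per problem, the standard unbiased estimator from the original HumanEval paper is pass@k = 1 − C(n−c, k)/C(n, k), where n samples were drawn and c of them passed the tests:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples drawn, c of them correct."""
    if n - c < k:
        return 1.0  # too few failures left to fill a k-sample draw
    return 1.0 - comb(n - c, k) / comb(n, k)

# 10 samples per problem, 3 passed the unit tests:
print(pass_at_k(10, 3, 1))  # for k=1 this reduces to c/n = 0.3
```

For k=1 the estimator collapses to the simple pass rate c/n, which is what a "% pass@1" column reports.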

Aider Polyglot (Fresh) · %

Real-world refactoring and bug-fix tasks across multiple programming languages, scored by whether the model produces a passing patch in Aider's edit format. Tests practical coding ability beyond single-file generation; harder than HumanEval and not yet saturated.

vision

MMMU (Some risk) · %

Massive Multi-discipline Multimodal Understanding; college-exam level questions with images across 30+ subjects.

MathVista (Some risk) · %

Math reasoning over visual contexts (charts, figures, geometry).

long context

RULER 128k (Fresh) · %

Long-context retrieval and reasoning suite. We report the 128k token effective-context score.

performance

Output Speed (N/A) · tok/s

Median sustained output speed in tokens per second on the model's first-party API for medium-length prompts. Higher is faster.

Time to First Token (N/A) · ms

Median time from request to first output chunk in milliseconds on the model's first-party API for medium-length prompts. Lower is snappier; reasoning models are penalised here because they think before talking.
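A minimal sketch of how both medians could be collected from a streaming endpoint. The stream here is simulated with sleeps; real measurement would iterate a provider's streaming response and count actual tokens. Sustained speed is measured from the first token to the last, so it excludes the think-before-talking delay that TTFT captures.

```python
import time
from statistics import median

def measure(stream):
    """Time one streaming response. Returns (ttft_ms, tok_per_s).
    `stream` is any iterable yielding one output token per chunk."""
    start = time.perf_counter()
    first = last = start
    n_tokens = 0
    for _chunk in stream:
        last = time.perf_counter()
        if n_tokens == 0:
            first = last  # first chunk arrived: this fixes TTFT
        n_tokens += 1
    ttft_ms = (first - start) * 1000.0
    # Sustained speed: tokens after the first, over the emission window.
    tok_per_s = (n_tokens - 1) / (last - first) if last > first else 0.0
    return ttft_ms, tok_per_s

def fake_stream(n=50):
    """Simulated endpoint: ~5 ms of 'thinking', then ~1 ms per token."""
    time.sleep(0.005)
    for _ in range(n):
        yield "tok"
        time.sleep(0.001)

runs = [measure(fake_stream()) for _ in range(5)]
print("median TTFT ms:", round(median(r[0] for r in runs), 2))
print("median tok/s:", round(median(r[1] for r in runs), 1))
```

Taking the median over several runs, as above, is what makes the reported numbers robust to one-off network or scheduling hiccups.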

reasoning

Humanity's Last Exam (Fresh) · %

A challenging multi-disciplinary exam aggregating expert-written questions from across academic fields. Designed to discriminate at the very top of the capability range when MMLU-style tests saturate.

ARC-AGI 2 (Fresh) · %

Second-generation ARC challenge testing fluid reasoning over abstract visual puzzles. Resists training-data memorisation by construction: each puzzle is novel and solutions require multi-step pattern induction. Frontier models are only just starting to score above chance on the harder tier.

composite

Frontier Composite (Fresh) · ECI

Saturation-resistant composite capability score stitched together from ~40 underlying benchmarks using Item Response Theory. Each benchmark is weighted by its fitted difficulty and discriminative slope, so strong results on hard, contamination-resistant evals (FrontierMath, ARC-AGI 2, Humanity's Last Exam) move the score, while saturated benchmarks contribute almost nothing. Imported per-model from Epoch AI's published index; we anchor it to the same min-max scale used for every other benchmark, so it is directly weightable in scenarios.
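The weighting described above follows from the two-parameter logistic (2PL) item-response curve: a benchmark with a steep slope near its difficulty tells you a lot about ability there, while a saturated one barely constrains it. A toy sketch with made-up item parameters, not Epoch AI's actual fitting procedure:

```python
import math

# Hypothetical fitted item parameters per benchmark:
# a = discriminative slope, b = difficulty, both on the latent ability scale.
BENCHMARKS = {
    "FrontierMath":         {"a": 2.0, "b": 2.5},
    "ARC-AGI 2":            {"a": 1.8, "b": 2.2},
    "Humanity's Last Exam": {"a": 1.6, "b": 2.0},
    "HumanEval":            {"a": 0.6, "b": -1.5},  # saturated: easy, flat curve
}

def predicted(theta: float, a: float, b: float) -> float:
    """2PL item response curve: expected score at ability theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def fit_theta(scores, lo=-4.0, hi=6.0, steps=2001) -> float:
    """Grid-search the ability that best matches observed scores (least squares)."""
    best, best_err = lo, float("inf")
    for i in range(steps):
        theta = lo + (hi - lo) * i / (steps - 1)
        err = sum((scores[name] - predicted(theta, p["a"], p["b"])) ** 2
                  for name, p in BENCHMARKS.items() if name in scores)
        if err < best_err:
            best, best_err = theta, err
    return best

def anchor(theta: float, theta_min=-2.0, theta_max=4.0) -> float:
    """Min-max rescale the latent ability onto a 0-100 index."""
    return 100.0 * (theta - theta_min) / (theta_max - theta_min)

strong = fit_theta({"FrontierMath": 0.6, "ARC-AGI 2": 0.55,
                    "Humanity's Last Exam": 0.7, "HumanEval": 0.99})
print(round(anchor(strong), 1))
```

Note how a model that only aces HumanEval gains almost nothing here: its flat curve is consistent with a wide range of abilities, so the hard benchmarks dominate the fit.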


Hosted endpoints

Host     Input $/M   Output $/M   Context   Quant
Host O   $0.15       $0.60        1.0M      fp8
Host 30  $0.27       $0.85        1.0M      fp8
Host 31  $0.35       $1.00        524k      fp8
Host 32  $0.63       $1.80        131k      unknown

Anonymised third-party hosts, sorted by lowest output price.
