
Llama 3.3 70B Instruct

Developer: Meta
License: open source (restricted)
Modality: text
Family: Llama 3.3
Released: 1y ago
Avg score: 48.4 / 100
Context: 128k
Output limit: 4k
Input price: $0.88 /M
Output price: $0.88 /M

Pricing verified 1y ago

Benchmarks

preference

Chatbot Arena Elo (Fresh)
Elo

Crowdsourced pairwise human preference rankings of LLM responses. Higher Elo means more frequently preferred by users.
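As an illustration of how pairwise votes become a rating, here is a minimal sketch of a sequential Elo update. The live Arena leaderboard instead fits a Bradley-Terry model over all votes at once, but the preference-to-rating intuition is the same; the K factor and starting ratings below are illustrative.

```python
K = 32  # step size per vote (illustrative)

def expected(r_a: float, r_b: float) -> float:
    """Modelled probability that model A is preferred over model B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(r_a: float, r_b: float, a_won: bool) -> tuple[float, float]:
    """Apply one preference vote and return the new ratings."""
    e_a = expected(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    # The loser's rating drops by exactly what the winner gains.
    return r_a + K * (s_a - e_a), r_b + K * ((1.0 - s_a) - (1.0 - e_a))

a, b = update(1200.0, 1200.0, a_won=True)
print(round(a), round(b))  # 1216 1184
```

With equal ratings the expected score is 0.5, so a single win moves each rating by K/2.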

knowledge

MMLU Pro (High risk)
%

Harder version of MMLU with ten answer options per question across 14 subject areas, which reduces the payoff of guessing.

reasoning

GPQA Diamond (Some risk)
%

Graduate-level Google-proof Q&A in physics, chemistry, and biology. Diamond subset is the hardest tier with PhD-validated answers.

math

MATH-500 (Saturated)
%

500 high-school competition math problems requiring multi-step solutions. Scored on final-answer correctness.

AIME 2024 (High risk)
%

American Invitational Mathematics Examination 2024 problems. Integer answers from 0 to 999; very hard for non-reasoning models.

OTIS Mock AIME 2024-2025 (Fresh)
%

AIME-style competition problems written specifically for the OTIS mock contest, then run as an evaluation by Epoch AI. Closer in spirit to the public AIME but with novel problems unlikely to appear in training data.

coding

HumanEval (Saturated)
% pass@1

164 hand-written Python programming problems scored by passing unit tests. Saturated for frontier models.
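Pass@1 scores like this are usually computed with the unbiased pass@k estimator from the original HumanEval paper: given n generated samples per problem of which c pass the tests, it estimates the chance that at least one of k samples would pass. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# 5 of 10 samples passed the tests -> pass@1 estimate of 0.5
print(pass_at_k(10, 5, 1))  # 0.5
```

For k=1 this reduces to c/n, but the combinatorial form stays unbiased for larger k.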

instruction following

IFEval (Some risk)
%

Verifiable instruction-following benchmark; 25 categories of strict formatting / structural directives.

long context

RULER 128k (Fresh)
%

Long-context retrieval and reasoning suite. We report the 128k token effective-context score.

performance

Output Speed (N/A)
tok/s

Median sustained output speed in tokens per second on the model's first-party API for medium-length prompts. Higher is faster.

Time to First Token (N/A)
ms

Median time from request to first output chunk in milliseconds on the model's first-party API for medium-length prompts. Lower is snappier; reasoning models are penalised here because they think before talking.
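A sketch of how both metrics can be measured for a single request over a token stream (the leaderboard reports medians over many such runs; the stream here is a stand-in for a real streaming API response, which is not shown):

```python
import time

def measure_stream(chunks):
    """Single-run time-to-first-token (s) and sustained speed (tok/s).
    `chunks` is any iterable yielding per-chunk token counts; in a real
    harness it would wrap a streaming API response (hypothetical here)."""
    start = time.perf_counter()
    ttft = None
    total_tokens = 0
    for n_tokens in chunks:
        if ttft is None:
            ttft = time.perf_counter() - start  # first chunk arrived
        total_tokens += n_tokens
    elapsed = time.perf_counter() - start
    return ttft, (total_tokens / elapsed if elapsed > 0 else 0.0)

def fake_stream():
    """Stand-in stream: 5 chunks of 4 tokens each with simulated delay."""
    for _ in range(5):
        time.sleep(0.01)
        yield 4

ttft, tps = measure_stream(fake_stream())
print(f"TTFT {ttft * 1000:.0f} ms, {tps:.0f} tok/s")
```

Because TTFT is taken at the first streamed chunk, a model that reasons before emitting visible tokens shows a long TTFT even when its sustained speed is high.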

composite

Frontier Composite (Fresh)
ECI

Saturation-resistant composite capability score stitched together from ~40 underlying benchmarks using Item Response Theory. Each benchmark is weighted by its fitted difficulty and discriminative slope, so doing well on hard, contamination-resistant evals (FrontierMath, ARC-AGI 2, Humanity's Last Exam) moves the score, while saturated benchmarks contribute almost nothing. Imported per-model from Epoch AI's published index; we anchor it to the same min-max scale we use for every other benchmark, so it's directly weightable in scenarios.
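The anchoring step described above can be sketched as a plain linear rescale onto the 0-100 range shared with the percentage benchmarks. The bounds below are illustrative placeholders, not the actual anchors the index uses, and the IRT fitting itself is far more involved:

```python
def min_max_anchor(x: float, lo: float, hi: float) -> float:
    """Map x from [lo, hi] onto [0, 100], clamping out-of-range values."""
    scaled = 100.0 * (x - lo) / (hi - lo)
    return max(0.0, min(100.0, scaled))

# Illustrative: an external index value of 0.5 on assumed bounds [0, 1]
print(min_max_anchor(0.5, 0.0, 1.0))  # 50.0
```

Clamping keeps a model that falls outside the anchor range from distorting scenario weights.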

Hosted endpoints

Host      Input $/M   Output $/M   Context   Quant
Host N    $0.10       $0.32        131k      fp8
Host Q    $0.12       $0.38        131k      fp8
Host R    $0.13       $0.40        131k      fp8
Host S    $0.13       $0.40        131k      fp8
Host T    $0.14       $0.40        131k      bf16
Host U    $0.22       $0.50        131k      int8
Host Y    $0.60       $0.60        131k      unknown
Host 27   $0.71       $0.71        128k      fp16
Host E    $0.72       $0.72        128k      unknown
Host X    $0.59       $0.79        131k      unknown
Host 28   $0.88       $0.88        131k      fp8
Host W    $0.45       $0.90        16k       bf16
Host Z    $0.60       $1.20        131k      bf16
Host V    $0.29       $2.25        24k       fp8
Anonymised third-party hosts, sorted by ascending output price.
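The lowest output price is not always the lowest overall cost once a traffic mix is factored in. A sketch of a blended-price comparison using a few rows from the table above (host labels as given; the 10M-in / 1M-out workload is illustrative):

```python
# (host, input $/M tokens, output $/M tokens) from the table above
hosts = [
    ("Host N", 0.10, 0.32),
    ("Host V", 0.29, 2.25),
    ("Host 28", 0.88, 0.88),
]

def blended(in_per_m: float, out_per_m: float,
            in_tokens: int, out_tokens: int) -> float:
    """Dollar cost of a workload at the given per-million-token prices."""
    return (in_per_m * in_tokens + out_per_m * out_tokens) / 1_000_000

# Illustrative workload: 10M input tokens, 1M output tokens
best = min(hosts, key=lambda h: blended(h[1], h[2], 10_000_000, 1_000_000))
print(best[0])  # Host N
```

For input-heavy workloads like this one, the input price dominates, so the ranking can differ from the table's output-price order.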
