
Qwen3 235B

Open source
Alibaba (Qwen)
Open license
text
Qwen 3 · Released 1y ago
Avg score: 58.3 / 100
Context: 131k
Output limit: 16k
Input price: $0.20 /M
Output price: $0.60 /M

Pricing verified 1y ago
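At the listed rates, per-request cost is a linear function of token counts. A minimal sketch with this page's first-party prices hard-coded as defaults (adjust for the third-party hosts below):

```python
def request_cost_usd(input_tokens: int, output_tokens: int,
                     input_price_per_m: float = 0.20,
                     output_price_per_m: float = 0.60) -> float:
    """Cost of one API call, given per-million-token prices in USD."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

# A 10k-token prompt with a 2k-token completion:
# 10_000 * 0.20/1e6 + 2_000 * 0.60/1e6 = 0.0032 USD
print(request_cost_usd(10_000, 2_000))
```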

Benchmarks

preference

Chatbot Arena Elo (Fresh)
Elo

Crowdsourced pairwise human preference rankings of LLM responses. Higher Elo means more frequently preferred by users.
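The pairwise-preference mechanics can be sketched with the classic online Elo update. This is illustrative only: the live leaderboard may fit ratings differently (e.g. with a Bradley–Terry model), but the intuition of rewarding upsets more than expected wins is the same:

```python
def elo_update(r_a: float, r_b: float, winner: str, k: float = 32.0):
    """One pairwise Elo update after a human picks a winner.

    winner: "a" if model A's response was preferred, else "b".
    Returns the two updated ratings.
    """
    # Expected win probability for A under the logistic Elo curve.
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    score_a = 1.0 if winner == "a" else 0.0
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a_new, r_b_new

# Two equally rated models; A is preferred, so A gains k/2 points.
print(elo_update(1200.0, 1200.0, "a"))
```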

math

AIME 2024 (High risk)
%

American Invitational Mathematics Examination 2024 problems. Integer answers from 0 to 999; very hard for non-reasoning models.

coding

HumanEval (Saturated)
% pass@1

164 hand-written Python programming problems scored by passing unit tests. Saturated for frontier models.
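The pass@1 metric reported here is conventionally computed with the unbiased pass@k estimator from the Codex paper (Chen et al., 2021): generate n samples per problem, count the c that pass the unit tests, and estimate the chance that at least one of k draws passes. Whether this page uses that exact estimator is an assumption; the sketch below follows the paper's n/c/k naming:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator.

    n: samples generated per problem
    c: samples that passed the unit tests
    k: evaluation budget
    """
    if n - c < k:
        # Fewer failures than the budget: some draw must pass.
        return 1.0
    # 1 - P(all k drawn samples are failures)
    return 1.0 - comb(n - c, k) / comb(n, k)

# One of two samples passes; pass@1 is 0.5.
print(pass_at_k(2, 1, 1))
```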

LiveCodeBench (Fresh)
% pass@1

Continuously refreshed coding benchmark drawing from LeetCode, AtCoder, and Codeforces; reduces benchmark contamination.

Aider Polyglot (Fresh)
%

Real-world refactoring and bug-fix tasks across multiple programming languages, scored by whether the model produces a passing patch in Aider's edit format. Tests practical coding ability beyond single-file generation; harder than HumanEval and not yet saturated.

long context

RULER 128k (Fresh)
%

Long-context retrieval and reasoning suite. We report the 128k token effective-context score.

performance

Output Speed (N/A)
tok/s

Median sustained output speed in tokens per second on the model's first-party API for medium-length prompts. Higher is faster.

Time to First Token (N/A)
ms

Median time from request to first output chunk in milliseconds on the model's first-party API for medium-length prompts. Lower is snappier; reasoning models are penalised here because they think before talking.
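Both performance figures reduce to medians over repeated probes of the API. A hypothetical measurement helper, assuming each probe records output token count and wall-clock timings; the sampling scheme is an assumption, not this site's exact methodology:

```python
import statistics

def median_output_speed(samples: list[tuple[int, float]]) -> float:
    """samples: (output_tokens, decode_seconds) per request.
    Returns the median sustained decode speed in tokens/second."""
    return statistics.median(tok / sec for tok, sec in samples)

def median_ttft_ms(first_chunk_delays_s: list[float]) -> float:
    """Median time-to-first-token, converted from seconds to ms."""
    return statistics.median(first_chunk_delays_s) * 1000.0

# Three probes: speeds of 50, 100, and 50 tok/s -> median 50.
print(median_output_speed([(100, 2.0), (300, 3.0), (50, 1.0)]))
print(median_ttft_ms([0.4, 0.25, 0.3]))
```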

general

Rolling Contamination-Controlled Average (Fresh)
/100

Contamination-controlled average across seven rolling task categories (reasoning, coding, agentic coding, mathematics, data analysis, language, instruction following). Questions are rotated every six months and ground-truth answers are objective, removing the need for LLM-as-judge scoring.

data analysis

Rolling Data Analysis (Fresh)
/100

Rolling contamination-controlled data-analysis evaluation covering table comprehension, CSV / spreadsheet reasoning, SQL-style joins, and chart interpretation. Refreshed every six months with new tables and questions to minimise contamination.

composite

Frontier Composite (Fresh)
ECI

Saturation-resistant composite capability score stitched together from ~40 underlying benchmarks using Item Response Theory. Each benchmark is weighted by its fitted difficulty and discriminative slope, so strong results on hard, contamination-resistant evals (FrontierMath, ARC-AGI 2, Humanity's Last Exam) move the score while saturated benchmarks contribute almost nothing. Imported per-model from Epoch AI's published index; we anchor it to the same min-max scale used for every other benchmark so it is directly weightable in scenarios.
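The two ingredients of that pipeline, a two-parameter-logistic (2PL) item-response curve and the min-max anchoring step, can be sketched as follows. Parameter names and anchoring bounds here are illustrative, not Epoch AI's actual fit:

```python
import math

def p_correct_2pl(theta: float, a: float, b: float) -> float:
    """2PL IRT: probability that a model of ability theta answers an
    item of difficulty b correctly, with discrimination (slope) a.
    High-a items separate models sharply around their difficulty."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def min_max_anchor(x: float, lo: float, hi: float) -> float:
    """Rescale a raw index onto the 0-100 scale used for every other
    benchmark on this page (bounds lo/hi are hypothetical anchors)."""
    return 100.0 * (x - lo) / (hi - lo)

# A model exactly at an item's difficulty has a 50% success chance.
print(p_correct_2pl(1.0, a=2.0, b=1.0))
print(min_max_anchor(5.0, 0.0, 10.0))
```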


Hosted endpoints

Host     Input $/M  Output $/M  Context  Quant
Host 29  $0.07      $0.10       262k     fp8
Host P   $0.10      $0.10       262k     bf16
Host 30  $0.09      $0.58       131k     fp8
Host 41  $0.15      $0.60       131k     unknown
Host 39  $0.09      $0.60       262k     fp8
Host 31  $0.10      $0.60       131k     fp8
Host 42  $0.20      $0.60       262k     unknown
Host Y   $0.20      $0.80       262k     unknown
Host 37  $0.20      $0.88       131k     fp8
Host E   $0.22      $0.88       262k     unknown
Host E   $0.25      $1.00       262k     unknown
Host 43  $0.60      $1.20       131k     fp16
Anonymised third-party hosts. Sorted by lowest output price.

Effort variants

Same API model, different reasoning budget. Thinking / xHigh modes usually score better on reasoning benchmarks but emit many more output tokens per request.
