Qwen3 235B
Pricing last verified 1 year ago
Benchmarks
preference
Crowdsourced pairwise human preference rankings of LLM responses. Higher Elo means more frequently preferred by users.
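For intuition, here is a minimal sketch of the Elo update behind such pairwise rankings. The K-factor and the online update rule are illustrative assumptions; leaderboards often fit ratings with Bradley-Terry-style models instead, and this page doesn't say which it uses.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A is preferred over model B under Elo."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    """Return both ratings after one human preference vote."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    delta = k * (s_a - e_a)
    return r_a + delta, r_b - delta

# Example: a 1200-rated model is preferred over a 1250-rated one,
# so the ratings converge by about 18 points each.
print(elo_update(1200, 1250, a_won=True))
```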
math
American Invitational Mathematics Examination 2024 problems. Three-digit integer answers; very hard for non-reasoning models.
coding
- 164 hand-written Python programming problems scored by passing unit tests. Saturated for frontier models.
- Continuously refreshed coding benchmark drawing from LeetCode, AtCoder, and Codeforces; reduces benchmark contamination.
- Real-world refactoring and bug-fix tasks across multiple programming languages, scored by whether the model produces a passing patch in Aider's edit format. Tests practical coding ability beyond single-file generation; harder than HumanEval and not yet saturated.
long context
Long-context retrieval and reasoning suite. We report the effective-context score at 128k tokens.
performance
Median sustained output speed in tokens per second on the model's first-party API for medium-length prompts. Higher is faster.
Median time from request to first output chunk in milliseconds on the model's first-party API for medium-length prompts. Lower is snappier; reasoning models are penalised here because they think before talking.
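As a rough illustration of how these two numbers are obtained, the sketch below times a streamed response. `stream` (a chunk iterator) and `count_tokens` (a tokenizer wrapper) are hypothetical stand-ins, not the actual harness behind these figures.

```python
import time

def measure_stream(stream, count_tokens):
    """Return (TTFT in ms, sustained tokens/s) for one streamed response."""
    t0 = time.monotonic()
    first = None
    tokens = 0
    for chunk in stream:
        if first is None:
            first = time.monotonic()  # first output chunk defines TTFT
        tokens += count_tokens(chunk)
    if first is None:
        return None  # endpoint produced no output
    ttft_ms = (first - t0) * 1000.0
    # Sustained speed is counted from the first chunk onward, so models
    # that think before talking pay on TTFT, not on tokens/s.
    elapsed = time.monotonic() - first
    tps = tokens / elapsed if elapsed > 0 else float("inf")
    return ttft_ms, tps
```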
general
Contamination-controlled average across seven rolling task categories (reasoning, coding, agentic coding, mathematics, data analysis, language, instruction following). Questions are rotated every six months and ground-truth answers are objective, removing the need for LLM-as-judge scoring.
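Assuming equal weights across the seven categories (the weighting is not stated here), the headline number reduces to a plain mean of objective pass rates, with no judge model in the loop:

```python
CATEGORIES = ["reasoning", "coding", "agentic coding", "mathematics",
              "data analysis", "language", "instruction following"]

def general_score(per_category: dict[str, float]) -> float:
    """Mean pass rate across the seven rolling categories (equal weights
    are an assumption; each score is an objective exact-match rate)."""
    return sum(per_category[c] for c in CATEGORIES) / len(CATEGORIES)
```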
data analysis
Rolling contamination-controlled data-analysis evaluation covering table comprehension, CSV/spreadsheet reasoning, SQL-style joins, and chart interpretation. Refreshed every six months with new tables and questions to minimise contamination.
composite
Saturation-resistant composite capability score stitched together from ~40 underlying benchmarks using Item Response Theory. Each benchmark is weighted by its fitted difficulty and discriminative slope, so doing well on hard, contamination-resistant evals (FrontierMath, ARC-AGI 2, Humanity's Last Exam) moves the score, while saturated benchmarks contribute almost nothing. Imported per-model from Epoch AI's published index; we anchor it to the same min-max scale used for every other benchmark, so it is directly weightable in scenarios.
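The sketch below shows how a two-parameter IRT aggregation of this shape can work: pick the ability value that best explains the model's per-benchmark pass rates, then min-max rescale it. The item parameters, the grid-search fit, and the anchoring bounds are illustrative assumptions, not Epoch AI's actual pipeline.

```python
import math

def p_correct(theta: float, a: float, b: float) -> float:
    """2PL item response: chance a model of ability theta passes an item
    with discriminative slope a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def fit_ability(scores, items):
    """Grid-search MLE for ability. scores[i] is the pass rate on
    benchmark i; items[i] is its fitted (a, b). Hard, discriminative
    items dominate the likelihood; saturated ones barely move it."""
    def loglik(theta):
        ll = 0.0
        for s, (a, b) in zip(scores, items):
            p = min(max(p_correct(theta, a, b), 1e-9), 1.0 - 1e-9)
            ll += s * math.log(p) + (1.0 - s) * math.log(1.0 - p)
        return ll
    grid = [g / 100 for g in range(-300, 301)]
    return max(grid, key=loglik)

def anchor(theta: float, lo: float, hi: float) -> float:
    """Min-max rescale so the composite is weightable like other scores."""
    return (theta - lo) / (hi - lo)
```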
Hosted endpoints
| Host | Input $/M tokens | Output $/M tokens | Context | Quantization |
|---|---|---|---|---|
| Host 29 | $0.07 | $0.10 | 262k | fp8 |
| Host P | $0.10 | $0.10 | 262k | bf16 |
| Host 30 | $0.09 | $0.58 | 131k | fp8 |
| Host 41 | $0.15 | $0.60 | 131k | unknown |
| Host 39 | $0.09 | $0.60 | 262k | fp8 |
| Host 31 | $0.10 | $0.60 | 131k | fp8 |
| Host 42 | $0.20 | $0.60 | 262k | unknown |
| Host Y | $0.20 | $0.80 | 262k | unknown |
| Host 37 | $0.20 | $0.88 | 131k | fp8 |
| Host E | $0.22 | $0.88 | 262k | unknown |
| Host E | $0.25 | $1.00 | 262k | unknown |
| Host 43 | $0.60 | $1.20 | 131k | fp16 |
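For back-of-envelope comparisons between rows, per-request cost is just token counts times the per-million rates. The token counts below are made-up example values:

```python
def request_cost(in_tok: int, out_tok: int,
                 in_per_m: float, out_per_m: float) -> float:
    """Dollar cost of one request at per-million-token prices."""
    return in_tok / 1e6 * in_per_m + out_tok / 1e6 * out_per_m

# e.g. a 4k-token prompt with a 1k-token completion on Host 29
# ($0.07 in / $0.10 out) costs about $0.00038.
print(f"${request_cost(4_000, 1_000, 0.07, 0.10):.5f}")
```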
Effort variants
Same API model, different reasoning budget. Thinking / xHigh modes usually score better on reasoning benchmarks but emit many more output tokens per request.
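To see what that means for cost, here is a hypothetical comparison at the Host-31-style rate of $0.60 per million output tokens, assuming thinking tokens are billed as output tokens (common, but not universal) and using made-up token counts:

```python
def variant_cost(out_tok: int, think_tok: int, out_per_m: float) -> float:
    """Output-side cost when thinking tokens bill as output tokens."""
    return (out_tok + think_tok) / 1e6 * out_per_m

base = variant_cost(800, 0, 0.60)        # non-thinking answer
high = variant_cost(800, 6_000, 0.60)    # same answer plus a long trace
print(f"{high / base:.1f}x output cost")  # 8.5x for this example
```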