Mixtral 8x22B
Pricing verified 1y ago · Median of hosted endpoints
Benchmarks
preference
Crowdsourced pairwise human preference rankings of LLM responses. Higher Elo means more frequently preferred by users.
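The pairwise Elo mechanic can be sketched as below; the K-factor of 32 and the starting ratings are illustrative assumptions, not the leaderboard's actual parameters.

```python
# Sketch of a pairwise Elo update, as used by crowdsourced preference
# leaderboards. K-factor and ratings here are illustrative only.
def elo_update(r_winner: float, r_loser: float, k: float = 32.0):
    # Expected win probability of the winner before the match.
    expected = 1.0 / (1.0 + 10.0 ** ((r_loser - r_winner) / 400.0))
    delta = k * (1.0 - expected)  # upsets move ratings more
    return r_winner + delta, r_loser - delta

# Expected win: the higher-rated model gains only a little.
a, b = elo_update(1200.0, 1000.0)
# Upset: the lower-rated model wins and gains much more.
u, v = elo_update(1000.0, 1200.0)
```

Note that total rating is conserved on every update, so a higher Elo directly reflects winning more of the crowdsourced comparisons.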
knowledge
Harder version of MMLU testing knowledge across 57 academic subjects; designed to reduce the payoff from guessing.
reasoning
Graduate-level Google-proof Q&A in physics, chemistry, and biology. Diamond subset is the hardest tier with PhD-validated answers.
math
500 high-school competition math problems requiring multi-step solutions. Scored on final-answer correctness.
American Invitational Mathematics Examination 2024 problems. Answers are integers from 000 to 999; very hard for non-reasoning models.
coding
164 hand-written Python programming problems; a solution counts only if it passes all unit tests. Saturated for frontier models.
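Unit-test scoring of this kind can be illustrated with a toy harness; the candidate solutions and tests below are invented, and real harnesses run candidates in a sandboxed subprocess with timeouts rather than a bare `exec`.

```python
# Toy illustration of unit-test-based scoring: a candidate solution
# counts as solved only if every test assertion passes.
def passes_tests(candidate_src: str, test_src: str) -> bool:
    env: dict = {}
    try:
        exec(candidate_src, env)  # define the candidate function
        exec(test_src, env)       # run the assertions against it
        return True
    except Exception:
        return False

# Invented example problem: implement add(a, b).
correct = "def add(a, b):\n    return a + b"
buggy = "def add(a, b):\n    return a - b"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
```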
instruction following
Verifiable instruction-following benchmark; 25 categories of strict formatting and structural directives, each checkable programmatically.
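A hypothetical example of one such programmatic check; the directive ("answer in exactly three bullet points") and the function are invented for illustration, not taken from the benchmark itself.

```python
# Hypothetical verifiable check for a formatting directive like
# "answer in exactly three bullet points, each starting with '- '".
def exactly_n_bullets(response: str, n: int) -> bool:
    bullets = [line for line in response.splitlines()
               if line.lstrip().startswith("- ")]
    return len(bullets) == n

ok = "- one\n- two\n- three"
bad = "- one\n- two"
```

Because the check is deterministic, compliance can be scored without any human or LLM judge.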
long context
Long-context retrieval and reasoning suite. We report the 128k token effective-context score.
composite
Saturation-resistant composite capability score stitched together from ~40 underlying benchmarks using Item Response Theory. Each benchmark is weighted by its fitted difficulty and discriminative slope, so strong results on hard, contamination-resistant evals (FrontierMath, ARC-AGI 2, Humanity's Last Exam) move the score, while saturated benchmarks contribute almost nothing. Imported per-model from Epoch AI's published index; we anchor it to the same min-max scale we use for every other benchmark, so it's directly weightable in scenarios.
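A much-simplified sketch of why such a composite favors hard, discriminative benchmarks: under the standard two-parameter logistic (2PL) IRT model, a saturated benchmark barely separates models, while a hard, high-slope one does. The items, parameters, and grid-search fit below are invented for illustration; the real index's fitting procedure differs.

```python
import math

# 2PL item response function: probability that a model with latent
# ability theta answers an item of the given difficulty and slope.
def p_correct(theta: float, difficulty: float, slope: float) -> float:
    return 1.0 / (1.0 + math.exp(-slope * (theta - difficulty)))

# Fit a single latent ability by grid search over a fractional-score
# Bernoulli log-likelihood. Real IRT fits use proper optimizers.
def fit_theta(scores: dict, items: dict) -> float:
    def loglik(theta: float) -> float:
        ll = 0.0
        for name, s in scores.items():
            difficulty, slope = items[name]
            p = p_correct(theta, difficulty, slope)
            ll += s * math.log(p) + (1.0 - s) * math.log(1.0 - p)
        return ll
    grid = [g / 50.0 for g in range(-200, 201)]  # theta in [-4, 4]
    return max(grid, key=loglik)

# Invented items: a saturated easy benchmark and a hard, high-slope one.
items = {"easy_eval": (-2.0, 1.0), "hard_eval": (2.0, 2.0)}
model_a = {"easy_eval": 0.95, "hard_eval": 0.60}  # strong on the hard eval
model_b = {"easy_eval": 0.95, "hard_eval": 0.05}  # only the easy eval
```

Both toy models ace the saturated benchmark, so only the hard, discriminative item separates their fitted abilities; min-max anchoring then maps that ability onto the same 0-1 scale as the other benchmarks.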
Hosted endpoints
| Host | Input $/M tokens | Output $/M tokens | Context | Quantization |
|---|---|---|---|---|
| Host 33 | $2.00 | $6.00 | 66k | unknown |