About
AI Model Analyzer is a free, ad-free tool for comparing AI models. There is no backend, no database, no tracking. The whole site is a static bundle that recomputes every ranking in your browser.
Methodology
For each benchmark, raw scores are min-max normalised to a 0–100 scale across the participating models. This makes scores from very different benchmarks (Elo points, % pass rate, % accuracy) directly comparable.
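As a rough sketch of that normalisation step, in illustrative TypeScript (the function and variable names are ours, not the actual bundle's):

```ts
// Min-max normalise one benchmark's raw scores to 0–100 across the models that report it.
// Models without a reported score stay undefined; they are handled later by weight renormalisation.
function normaliseBenchmark(
  raw: Record<string, number | undefined> // model id -> raw benchmark score
): Record<string, number | undefined> {
  const reported = Object.values(raw).filter((v): v is number => v !== undefined);
  if (reported.length === 0) return raw;
  const min = Math.min(...reported);
  const max = Math.max(...reported);
  const out: Record<string, number | undefined> = {};
  for (const [model, score] of Object.entries(raw)) {
    out[model] =
      score === undefined ? undefined
      : max === min ? 100 // degenerate case: every model reported the same score
      : ((score - min) / (max - min)) * 100;
  }
  return out;
}
```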
Scenario scores are a weighted average of the normalised benchmark scores. If a model is missing data for some benchmarks, the weights are renormalised over what's available — we never penalise a model just because a score hasn't been reported yet.
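The renormalisation is the only subtle part; a minimal sketch (same caveat about names applies):

```ts
// Scenario score: weighted mean over the benchmarks a model actually has data for.
// Weights are renormalised over the available subset, so a missing score never counts
// as a zero; it simply doesn't participate.
function scenarioScore(
  scores: Record<string, number | undefined>, // benchmark id -> normalised 0–100 score
  weights: Record<string, number>             // benchmark id -> scenario weight
): number | undefined {
  let weightedSum = 0;
  let weightTotal = 0;
  for (const [benchmark, weight] of Object.entries(weights)) {
    const score = scores[benchmark];
    if (score === undefined) continue; // unreported: skip, don't penalise
    weightedSum += weight * score;
    weightTotal += weight;
  }
  return weightTotal === 0 ? undefined : weightedSum / weightTotal;
}
```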
Cost is folded in via a separate cost-vs-quality slider. Cost is converted to a 0–100 score on a log scale (because pricing spans four orders of magnitude) and combined with the quality score to produce a composite ranking.
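In outline (the price bounds below are placeholders for illustration; the real bundle derives its range from the models it tracks):

```ts
// Map a blended $/1M-token price onto a 0–100 score using a log scale, then blend it
// with the quality score according to the slider position (0 = quality only, 1 = cost only).
function costScore(pricePerMTok: number, minPrice = 0.05, maxPrice = 500): number {
  const t =
    (Math.log10(pricePerMTok) - Math.log10(minPrice)) /
    (Math.log10(maxPrice) - Math.log10(minPrice));
  return 100 * (1 - Math.min(Math.max(t, 0), 1)); // cheaper models score higher
}

function compositeScore(quality: number, cost: number, costWeight: number): number {
  return (1 - costWeight) * quality + costWeight * cost;
}
```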
Benchmarks are flagged with a contamination-risk indicator: low means a live, continuously refreshed eval; high or saturated means the question set is fixed and well known.
Why composite benchmarks?
Individual benchmarks saturate. MATH and HumanEval used to spread the field; today every frontier model clears 90%, so the score stops discriminating. Composite indices fix this by stitching many benchmarks together and weighting harder ones more.
We surface one such composite as a single benchmark row, Frontier Composite. It is computed upstream using Item Response Theory — each benchmark gets a fitted difficulty and discrimination, and a model's capability is the value that best explains its observed pass rates across the whole battery. Doing well on hard, contamination-resistant evals (FrontierMath, ARC-AGI 2, Humanity's Last Exam) moves the score; doing well on a saturated benchmark barely does. We import the composite as a number per model and treat it like any other benchmark — you can weight it into scenarios from the wizard.
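For intuition, the two-parameter logistic form commonly used in IRT looks like this (a sketch of the general technique; Epoch AI's exact parameterisation may differ):

\[
P(\text{pass on benchmark } b \mid \theta_m) = \frac{1}{1 + e^{-a_b(\theta_m - d_b)}}
\]

where \(\theta_m\) is model \(m\)'s capability, \(d_b\) is the benchmark's fitted difficulty, and \(a_b\) its discrimination. The capability estimate is the \(\theta_m\) that best explains the model's observed pass rates across the whole battery; a steep, high-difficulty benchmark shifts that estimate far more than a saturated one.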
One caveat: the upstream composite uses an anchored scale (e.g. Claude 3.5 Sonnet at 130, GPT-5 at 150). Our pipeline min-max normalises every benchmark across the models we track, so that anchoring is squashed; raw values are still visible in tooltips.
What is reliability monitoring?
Quality benchmarks tell you how smart a model is on a frozen test set. Reliability metrics tell you how the deployed API is behaving this week — whether it's refusing more, drifting downward, or occasionally producing malformed output.
We surface four reliability axes as benchmarks: output stability (variance across re-runs), recovery rate (does it self-correct after a wrong step), format adherence (does it obey output formats), and safety handling (refuses unsafe prompts and is robust to jailbreaks). Coverage is partial — only the models the upstream monitor tracks — and the scenario engine re-weights gracefully for missing data.
A simple drift signal is computed at ingest time by z-scoring the most recent 7-day mean for each metric against the prior 21-day mean. Models that have visibly degraded on any axis are tagged in the leaderboard. AI Stupid Meter (the upstream monitor we ingest from) uses more sophisticated change-point detection (CUSUM, Mann-Whitney U); ours is the cheaper, more conservative cousin.
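A minimal version of that ingest-time check, in illustrative TypeScript (window lengths and the tagging threshold are ours):

```ts
// Drift signal: z-score the mean of the last 7 daily values against the prior 21-day window.
// Returns undefined when there isn't enough history to form both windows.
function driftZ(dailyValues: number[]): number | undefined {
  if (dailyValues.length < 28) return undefined;
  const recent = dailyValues.slice(-7);
  const baseline = dailyValues.slice(-28, -7);
  const mean = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;
  const mu = mean(baseline);
  const sd = Math.sqrt(mean(baseline.map((x) => (x - mu) ** 2)));
  return sd === 0 ? 0 : (mean(recent) - mu) / sd;
}

// A model would be tagged as degraded when any reliability axis drops well below
// its baseline, e.g. driftZ(...) < -2.
```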
Data pipeline
Data is refreshed nightly by an automated pipeline. Each ingester has a fallback dataset baked in so the site keeps rendering when an upstream is briefly unavailable.
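The fallback behaviour is nothing exotic; conceptually it is just the following (fetchUpstream and the fallback snapshot are hypothetical names for illustration):

```ts
// Try the upstream source; if it fails, fall back to the snapshot baked into the bundle
// so the site keeps rendering with slightly stale data.
async function ingest<T>(fetchUpstream: () => Promise<T>, fallbackSnapshot: T): Promise<T> {
  try {
    return await fetchUpstream();
  } catch {
    console.warn("upstream unavailable, serving baked-in fallback dataset");
    return fallbackSnapshot;
  }
}
```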
Tracked benchmarks
Crowdsourced pairwise human preference rankings of LLM responses. Higher Elo means more frequently preferred by users.
Harder version of MMLU: ten answer options per question instead of four and more reasoning-heavy items reduce guessing-friendly answers.
Graduate-level Google-proof Q&A in physics, chemistry, and biology. Diamond subset is the hardest tier with PhD-validated answers.
500 high-school competition math problems requiring multi-step solutions. Scored on final-answer correctness.
American Invitational Mathematics Examination 2024 problems. Three-digit integer answers; very hard for non-reasoning models.
164 hand-written Python programming problems scored by passing unit tests. Saturated for frontier models.
Continuously refreshed coding benchmark drawing from LeetCode, AtCoder, and Codeforces; reduces benchmark contamination.
Real GitHub issues solved end-to-end. Verified subset is a 500-task human-validated slice of SWE-bench.
Verifiable instruction-following benchmark; 25 categories of strict formatting / structural directives.
Massive Multi-discipline Multimodal Understanding; college-exam level questions with images across 30+ subjects.
Math reasoning over visual contexts (charts, figures, geometry).
Long-context retrieval and reasoning suite. We report the 128k token effective-context score.
Crowdsourced pairwise human preference for image generation models. Users vote on anonymised side-by-side generations; scores are a standard Elo over those votes.
How well the generated image matches the textual prompt as evaluated by human raters.
Median sustained output speed in tokens per second on the model's first-party API for medium-length prompts. Higher is faster.
Median time from request to first output chunk in milliseconds on the model's first-party API for medium-length prompts. Lower is snappier; reasoning models are penalised here because they think before talking.
Contamination-controlled average across seven rolling task categories (reasoning, coding, agentic coding, mathematics, data analysis, language, instruction following). Questions are rotated every six months and ground-truth answers are objective, removing the need for LLM-as-judge scoring.
Rolling contamination-controlled data-analysis evaluation. Table comprehension, CSV / spreadsheet reasoning, SQL-style joins, and chart interpretation. Refreshed every six months with new tables and questions to minimise contamination.
Mathematical research problems spanning analysis, algebra, combinatorics and number theory. Tiers 1-3 are progressively harder; even frontier reasoning models only solve a small fraction. The hardest publicly reported benchmark for general mathematical reasoning.
A human-validated factuality benchmark of short factual questions whose answers can be checked against a single ground truth. Penalises hallucinations by scoring confidently-wrong answers below abstentions.
AIME-style competition problems written specifically for the OTIS mock contest, then run as an evaluation by Epoch AI. Closer in spirit to the public AIME but with novel problems unlikely to appear in training data.
A challenging multi-disciplinary exam aggregating expert-written questions from across academic fields. Designed to discriminate at the very top of the capability range when MMLU-style tests saturate.
Second-generation ARC challenge testing fluid reasoning over abstract visual puzzles. Resists training-data memorisation by construction: each puzzle is novel and solutions require multi-step pattern induction. Frontier models are only just starting to score above chance on the harder tier.
Real-world refactoring and bug-fix tasks across multiple programming languages, scored by whether the model produces a passing patch in Aider's edit format. Tests practical coding ability beyond single-file generation; harder than HumanEval and not yet saturated.
Long-horizon shell-and-filesystem tasks executed in a sandboxed terminal, scored by whether the agent's final state matches a target state. Tests practical tool-using ability for everyday devops and data-wrangling work; one of the hardest agentic benchmarks today.
Saturation-resistant composite capability score stitched together from ~40 underlying benchmarks using Item Response Theory. Each benchmark is weighted by its fitted difficulty and discriminative slope, so doing well on hard, contamination-resistant evals (FrontierMath, ARC-AGI 2, Humanity's Last Exam) moves the score and saturated benchmarks contribute almost nothing. Imported per-model from Epoch AI's published index; we re-scale it with the same min-max normalisation we use for every other benchmark so it's directly weightable in scenarios.
How consistent the model's outputs are across repeated runs of the same task. Higher means lower variance and fewer one-off hallucinations under identical inputs. Useful for production loops that need reproducible behaviour.
How reliably the model produces output in the requested format (JSON schemas, markdown structures, exact-string responses). Pairs well with IFEval but reflects how the deployed API is behaving day to day rather than how a frozen test set scores.
How often the model self-corrects after producing an incorrect intermediate step (debugging axis upstream). Critical for agentic loops that depend on the model noticing and repairing its own mistakes rather than barrelling forward.
How well the model handles safety-sensitive prompts without falsely refusing benign requests or producing unsafe output. The upstream signal does not separate refusal counts from substantive content-safety behaviour, so this single axis covers both.
Data sources & licenses
We aggregate publicly available benchmark data from the projects below. Per-row attribution is intentionally omitted from the leaderboard so the site stays neutral, but the contributing projects are credited here. If you maintain one of these projects and would like a different attribution, please open an issue on our repository.
Crowdsourced human-preference Elo, pulled from the open Hugging Face dataset. Used with attribution under CC-BY 4.0.
Open-weights leaderboard. Source for IFEval, MMLU-Pro, and BBH on community models.
Continuously refreshed coding benchmark. Source of our coding pass-rate scores.
Contamination-controlled rolling benchmark. Source of our rolling-average and data-analysis scores.
Frontier benchmarks (FrontierMath, ARC-AGI 2, Humanity's Last Exam, SimpleQA Verified, OTIS Mock AIME) and the Epoch Capabilities Index used for our Frontier Composite row. Used under Creative Commons Attribution 4.0; full citation and modifications notice below.
Polyglot coding benchmark. Per-model pass rates re-published by Epoch AI; we credit the original Aider project here as the upstream of the questions and grading harness.
Real-world terminal-tool benchmark. Per-model accuracies re-published by Epoch AI; we credit the original Terminal-Bench authors here.
Source of our reliability metrics (output stability, recovery rate, format adherence, safety handling) and the time-series we use for drift detection. Code is MIT-licensed; data is used with attribution.
Output throughput (tok/s), time-to-first-token, and image-arena Elo are hand-maintained from publicly observable sources: provider documentation, OpenRouter, model cards, and community measurements. PRs welcome.
Citation for Epoch AI: Epoch AI, “AI Benchmarking Hub”. Published online at epoch.ai. Retrieved from https://epoch.ai/benchmarks/use-this-data. Used under the Creative Commons Attribution 4.0 International license.
Citation for LMArena: based on the open lmarena-ai/leaderboard-dataset published on Hugging Face under the CC-BY 4.0 license. We pull the latest snapshot of the “overall” category and re-publish per-model Elo scores with attribution.
Modifications notice: upstream values are transformed before being shown here. Specifically, every benchmark score is min-max normalised to 0–100 across the models we track, multi-source rows are deduplicated to one canonical model id, attribution and source URLs are stripped from the public bundle, and reliability scores are aggregated to daily means for drift detection. Raw values remain visible in tooltips.
License coverage: Epoch AI and LMArena data are used under CC-BY 4.0. Apache-2.0 sources (Aider Polyglot, Terminal-Bench) keep their permissive terms; MIT sources (LiveCodeBench, SWE-bench, AI Stupid Meter source code) are used per their license terms.
Speed and image-arena metrics: output speed (tok/s), time-to-first-token, and image-generation arena scores are hand-maintained from public provider documentation, model cards, OpenRouter, and community measurements. They are not derived from any single proprietary leaderboard; the YAML files backing them are released under CC-BY 4.0 and PRs are welcome.
Disclaimers
- Benchmark scores are summary statistics. They don't predict how a model will do on your task.
- Pricing is best-effort and changes constantly. Always confirm on the provider's page before relying on a number.
- Open-source pricing reflects a median of common hosted endpoints — your self-hosted cost will differ.
- We don't run any of these models ourselves and have no commercial relationship with any provider.