About
AI Model Analyzer is a free, ad-free tool for comparing AI models. There is no backend, no database, no tracking. The whole site is a static bundle that recomputes every ranking in your browser.
Methodology
For each benchmark, raw scores are min-max normalised to a 0–100 scale across the participating models. This makes scores from very different benchmarks (Elo points, % pass rate, % accuracy) directly comparable.
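As a rough sketch of that normalisation step, in illustrative TypeScript (the function and variable names are ours, not the actual bundle's):

```ts
// Min-max normalise one benchmark's raw scores to 0–100 across the models that report it.
// Models without a reported score stay undefined; they are handled later by weight renormalisation.
function normaliseBenchmark(
  raw: Record<string, number | undefined> // model id -> raw benchmark score
): Record<string, number | undefined> {
  const reported = Object.values(raw).filter((v): v is number => v !== undefined);
  if (reported.length === 0) return raw;
  const min = Math.min(...reported);
  const max = Math.max(...reported);
  const out: Record<string, number | undefined> = {};
  for (const [model, score] of Object.entries(raw)) {
    out[model] =
      score === undefined ? undefined
      : max === min ? 100 // degenerate case: every model reported the same score
      : ((score - min) / (max - min)) * 100;
  }
  return out;
}
```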
Scenario scores are a weighted average of the normalised benchmark scores. If a model is missing data for some benchmarks, the weights are renormalised over what's available — we never penalise a model just because a score hasn't been reported yet.
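The renormalisation is the only subtle part; a minimal sketch (same caveat about names applies):

```ts
// Scenario score: weighted mean over the benchmarks a model actually has data for.
// Weights are renormalised over the available subset, so a missing score never counts
// as a zero; it simply doesn't participate.
function scenarioScore(
  scores: Record<string, number | undefined>, // benchmark id -> normalised 0–100 score
  weights: Record<string, number>             // benchmark id -> scenario weight
): number | undefined {
  let weightedSum = 0;
  let weightTotal = 0;
  for (const [benchmark, weight] of Object.entries(weights)) {
    const score = scores[benchmark];
    if (score === undefined) continue; // unreported: skip, don't penalise
    weightedSum += weight * score;
    weightTotal += weight;
  }
  return weightTotal === 0 ? undefined : weightedSum / weightTotal;
}
```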
Cost is folded in via a separate cost-vs-quality slider. Cost is converted to a 0–100 score on a log scale (because pricing spans four orders of magnitude) and combined with the quality score to produce a composite ranking.
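In outline (the price bounds below are placeholders for illustration; the real bundle derives its range from the models it tracks):

```ts
// Map a blended $/1M-token price onto a 0–100 score using a log scale, then blend it
// with the quality score according to the slider position (0 = quality only, 1 = cost only).
function costScore(pricePerMTok: number, minPrice = 0.05, maxPrice = 500): number {
  const t =
    (Math.log10(pricePerMTok) - Math.log10(minPrice)) /
    (Math.log10(maxPrice) - Math.log10(minPrice));
  return 100 * (1 - Math.min(Math.max(t, 0), 1)); // cheaper models score higher
}

function compositeScore(quality: number, cost: number, costWeight: number): number {
  return (1 - costWeight) * quality + costWeight * cost;
}
```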
Benchmarks are flagged with a contamination-risk indicator: low means a live, continuously refreshed eval; high or saturated means the question set is fixed and well known.
Why composite benchmarks?
Individual benchmarks saturate. MATH and HumanEval used to spread the field; today every frontier model clears 90%, so the score stops discriminating. Composite indices fix this by stitching many benchmarks together and weighting harder ones more.
We surface one such composite as a single benchmark row, Frontier Composite. It is computed upstream using Item Response Theory — each benchmark gets a fitted difficulty and discrimination, and a model's capability is the value that best explains its observed pass rates across the whole battery. Doing well on hard, contamination-resistant evals (FrontierMath, ARC-AGI 2, Humanity's Last Exam) moves the score; doing well on a saturated benchmark barely does. We import the composite as a number per model and treat it like any other benchmark — you can weight it into scenarios from the wizard.
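For intuition, the two-parameter logistic form commonly used in IRT looks like this (a sketch of the general technique; Epoch AI's exact parameterisation may differ):

\[
P(\text{pass on benchmark } b \mid \theta_m) = \frac{1}{1 + e^{-a_b(\theta_m - d_b)}}
\]

where \(\theta_m\) is model \(m\)'s capability, \(d_b\) is the benchmark's fitted difficulty, and \(a_b\) its discrimination. The capability estimate is the \(\theta_m\) that best explains the model's observed pass rates across the whole battery; a steep, high-difficulty benchmark shifts that estimate far more than a saturated one.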
One caveat: the upstream composite uses an anchored scale (e.g. Claude 3.5 Sonnet at 130, GPT-5 at 150). Our pipeline min-max normalises every benchmark across the models we track, so that anchoring is squashed; raw values are still visible in tooltips.
What is reliability monitoring?
Quality benchmarks tell you how smart a model is on a frozen test set. Reliability metrics tell you how the deployed API is behaving this week — whether it's refusing more, drifting downward, or occasionally producing malformed output.
We surface four reliability axes as benchmarks: output stability (variance across re-runs), recovery rate (does it self-correct after a wrong step), format adherence (does it obey output formats), and safety handling (refuses unsafe prompts and is robust to jailbreaks). Coverage is partial — only the models the upstream monitor tracks — and the scenario engine re-weights gracefully for missing data.
A simple drift signal is computed at ingest time by z-scoring the most recent 7-day mean for each metric against the prior 21-day mean. Models that have visibly degraded on any axis are tagged in the leaderboard. AI Stupid Meter (the upstream monitor we ingest from) uses more sophisticated change-point detection (CUSUM, Mann-Whitney U); ours is the cheaper, more conservative cousin.
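A minimal version of that ingest-time check, in illustrative TypeScript (window lengths and the tagging threshold are ours):

```ts
// Drift signal: z-score the mean of the last 7 daily values against the prior 21-day window.
// Returns undefined when there isn't enough history to form both windows.
function driftZ(dailyValues: number[]): number | undefined {
  if (dailyValues.length < 28) return undefined;
  const recent = dailyValues.slice(-7);
  const baseline = dailyValues.slice(-28, -7);
  const mean = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;
  const mu = mean(baseline);
  const sd = Math.sqrt(mean(baseline.map((x) => (x - mu) ** 2)));
  return sd === 0 ? 0 : (mean(recent) - mu) / sd;
}

// A model would be tagged as degraded when any reliability axis drops well below
// its baseline, e.g. driftZ(...) < -2.
```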
Data pipeline
Data is refreshed nightly by an automated pipeline. Each ingester has a fallback dataset baked in so the site keeps rendering when an upstream is briefly unavailable.
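The fallback behaviour is nothing exotic; conceptually it is just the following (fetchUpstream and the fallback snapshot are hypothetical names for illustration):

```ts
// Try the upstream source; if it fails, fall back to the snapshot baked into the bundle
// so the site keeps rendering with slightly stale data.
async function ingest<T>(fetchUpstream: () => Promise<T>, fallbackSnapshot: T): Promise<T> {
  try {
    return await fetchUpstream();
  } catch {
    console.warn("upstream unavailable, serving baked-in fallback dataset");
    return fallbackSnapshot;
  }
}
```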
Tracked benchmarks
Crowdsourced pairwise human preference rankings of LLM responses. Higher Elo means more frequently preferred by users.
Harder version of MMLU: ten answer options per question instead of four and more reasoning-heavy items reduce guessing-friendly answers.
Graduate-level Google-proof Q&A in physics, chemistry, and biology. Diamond subset is the hardest tier with PhD-validated answers.
500 high-school competition math problems requiring multi-step solutions. Scored on final-answer correctness.
American Invitational Mathematics Examination 2024 problems. Three-digit integer answers; very hard for non-reasoning models.
164 hand-written Python programming problems scored by passing unit tests. Saturated for frontier models.
Continuously refreshed coding benchmark drawing from LeetCode, AtCoder, and Codeforces; reduces benchmark contamination.
Real GitHub issues solved end-to-end. Verified subset is a 500-task human-validated slice of SWE-bench.
Verifiable instruction-following benchmark; 25 categories of strict formatting / structural directives.
Massive Multi-discipline Multimodal Understanding; college-exam level questions with images across 30+ subjects.
Math reasoning over visual contexts (charts, figures, geometry).
Long-context retrieval and reasoning suite. We report the 128k token effective-context score.
Crowdsourced pairwise human preference for image generation models. Users vote on anonymised side-by-side generations; scores are a standard Elo over those votes.
How well the generated image matches the textual prompt as evaluated by human raters.
Median sustained output speed in tokens per second on the model's first-party API for medium-length prompts. Higher is faster.
Median time from request to first output chunk in milliseconds on the model's first-party API for medium-length prompts. Lower is snappier; reasoning models are penalised here because they think before talking.
Contamination-controlled average across seven rolling task categories (reasoning, coding, agentic coding, mathematics, data analysis, language, instruction following). Questions are rotated every six months and ground-truth answers are objective, removing the need for LLM-as-judge scoring.
Rolling contamination-controlled data-analysis evaluation. Table comprehension, CSV / spreadsheet reasoning, SQL-style joins, and chart interpretation. Refreshed every six months with new tables and questions to minimise contamination.
Mathematical research problems spanning analysis, algebra, combinatorics and number theory. Tiers 1-3 are progressively harder; even frontier reasoning models only solve a small fraction. The hardest publicly reported benchmark for general mathematical reasoning.
A human-validated factuality benchmark of short factual questions whose answers can be checked against a single ground truth. Penalises hallucinations by scoring confidently-wrong answers below abstentions.
AIME-style competition problems written specifically for the OTIS mock contest, then run as an evaluation by Epoch AI. Closer in spirit to the public AIME but with novel problems unlikely to appear in training data.
A challenging multi-disciplinary exam aggregating expert-written questions from across academic fields. Designed to discriminate at the very top of the capability range when MMLU-style tests saturate.
Second-generation ARC challenge testing fluid reasoning over abstract visual puzzles. Resists training-data memorisation by construction: each puzzle is novel and solutions require multi-step pattern induction. Frontier models are only just starting to score above chance on the harder tier.
Real-world refactoring and bug-fix tasks across multiple programming languages, scored by whether the model produces a passing patch in Aider's edit format. Tests practical coding ability beyond single-file generation; harder than HumanEval and not yet saturated.
Long-horizon shell-and-filesystem tasks executed in a sandboxed terminal, scored by whether the agent's final state matches a target state. Tests practical tool-using ability for everyday devops and data-wrangling work; one of the hardest agentic benchmarks today.
Saturation-resistant composite capability score stitched together from ~40 underlying benchmarks using Item Response Theory. Each benchmark is weighted by its fitted difficulty and discriminative slope, so doing well on hard, contamination-resistant evals (FrontierMath, ARC-AGI 2, Humanity's Last Exam) moves the score and saturated benchmarks contribute almost nothing. Imported per-model from Epoch AI's published index; we re-scale it with the same min-max normalisation we use for every other benchmark so it's directly weightable in scenarios.
How consistent the model's outputs are across repeated runs of the same task. Higher means lower variance and fewer one-off hallucinations under identical inputs. Useful for production loops that need reproducible behaviour.
How reliably the model produces output in the requested format (JSON schemas, markdown structures, exact-string responses). Pairs well with IFEval but reflects how the deployed API is behaving day to day rather than how a frozen test set scores.
How often the model self-corrects after producing an incorrect intermediate step (debugging axis upstream). Critical for agentic loops that depend on the model noticing and repairing its own mistakes rather than barrelling forward.
How well the model handles safety-sensitive prompts without falsely refusing benign requests or producing unsafe output. The upstream signal does not separate refusal counts from substantive content-safety behaviour, so this single axis covers both.
Data sources & licenses
We aggregate publicly available benchmark data from the projects below. Per-row attribution is intentionally omitted from the leaderboard so the site stays neutral, but the contributing projects are credited here. If you maintain one of these projects and would like a different attribution, please open an issue on our repository.
Crowdsourced human-preference Elo, pulled from the open Hugging Face dataset. Used with attribution under CC-BY 4.0.
Open-weights leaderboard. Source for IFEval, MMLU-Pro, and BBH on community models.
Continuously refreshed coding benchmark. Source of our coding pass-rate scores.
Contamination-controlled rolling benchmark. Source of our rolling-average and data-analysis scores.
Frontier benchmarks (FrontierMath, ARC-AGI 2, Humanity's Last Exam, SimpleQA Verified, OTIS Mock AIME) and the Epoch Capabilities Index used for our Frontier Composite row. Used under Creative Commons Attribution 4.0; full citation and modifications notice below.
Polyglot coding benchmark. Per-model pass rates re-published by Epoch AI; we credit the original Aider project here as the upstream of the questions and grading harness.
Real-world terminal-tool benchmark. Per-model accuracies re-published by Epoch AI; we credit the original Terminal-Bench authors here.
Source of our reliability metrics (output stability, recovery rate, format adherence, safety handling) and the time-series we use for drift detection. Code is MIT-licensed; data is used with attribution.
Output throughput (tok/s), time-to-first-token, and image-arena Elo are hand-maintained from publicly observable sources: provider documentation, OpenRouter, model cards, and community measurements. PRs welcome.
Citation for Epoch AI: Epoch AI, “AI Benchmarking Hub”. Published online at epoch.ai. Retrieved from https://epoch.ai/benchmarks/use-this-data. Used under the Creative Commons Attribution 4.0 International license.
Citation for LMArena: based on the open lmarena-ai/leaderboard-dataset published on Hugging Face under the CC-BY 4.0 license. We pull the latest snapshot of the “overall” category and re-publish per-model Elo scores with attribution.
Modifications notice: upstream values are transformed before being shown here. Specifically, every benchmark score is min-max normalised to 0–100 across the models we track, multi-source rows are deduplicated to one canonical model id, attribution and source URLs are stripped from the public bundle, and reliability scores are aggregated to daily means for drift detection. Raw values remain visible in tooltips.
License coverage: Epoch AI and LMArena data are used under CC-BY 4.0. Apache-2.0 sources (Aider Polyglot, Terminal-Bench) keep their permissive terms; MIT sources (LiveCodeBench, SWE-bench, AI Stupid Meter source code) are used per their license terms.
Speed and image-arena metrics: output speed (tok/s), time-to-first-token, and image-generation arena scores are hand-maintained from public provider documentation, model cards, OpenRouter, and community measurements. They are not derived from any single proprietary leaderboard; the YAML files backing them are released under CC-BY 4.0 and PRs are welcome.
Disclaimers
- Benchmark scores are summary statistics. They don't predict how a model will do on your task.
- Pricing is best-effort and changes constantly. Always confirm on the provider's page before relying on a number.
- Open-source pricing reflects a median of common hosted endpoints — your self-hosted cost will differ.
- We don't run any of these models ourselves and have no commercial relationship with any provider.