
Gemini 2.5 Flash

Closed
Google
Proprietary
text
vision
Gemini 2.5 · Released 1y ago
Avg score: 54.6 / 100
Context: 1.0M tokens
Output limit: 66k tokens
Input price: $0.30 / M tokens
Output price: $2.50 / M tokens

Pricing verified 2mo ago
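
As a rough illustration of what the listed prices mean per request, the sketch below multiplies token counts by the per-million rates shown above. The function name `request_cost` and the example token counts are hypothetical; the calculation assumes flat per-token pricing with no caching discounts or tiering, which this page does not specify.

```python
# Listed prices from this page, in US dollars per million tokens.
INPUT_PRICE_PER_M = 0.30
OUTPUT_PRICE_PER_M = 2.50


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the dollar cost of one request under flat per-token pricing."""
    total = input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical request: 10,000 prompt tokens and 2,000 generated tokens.
    print(f"${request_cost(10_000, 2_000):.4f}")
```

Because output tokens cost roughly eight times as much as input tokens here, long generations tend to dominate the bill even when prompts are large.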

Benchmarks

preference

Chatbot Arena Elo (Fresh)
Elo

Crowdsourced pairwise human preference rankings of LLM responses. Higher Elo means more frequently preferred by users.
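
To give a feel for what a rating gap means, here is a minimal sketch of the conventional Elo expected-score curve. It assumes the standard logistic form with a 400-point scale; this page does not state how its ratings are fitted, so treat the formula as an assumption, and `elo_expected_score` as a hypothetical helper name.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the classic Elo model.

    Assumes the conventional base-10 logistic curve with a 400-point scale.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


if __name__ == "__main__":
    # Hypothetical ratings, for illustration only.
    print(round(elo_expected_score(1300, 1250), 3))
```

Under this assumption, equal ratings give a 50% preference rate, and a 400-point lead corresponds to being preferred about ten times out of eleven.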

coding

LiveCodeBench (Fresh)
% pass@1

Continuously refreshed coding benchmark drawing from LeetCode, AtCoder, and Codeforces; reduces benchmark contamination.
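
For readers unfamiliar with the pass@1 unit, the sketch below shows the widely used unbiased pass@k estimator: given `n` sampled solutions to a problem of which `c` pass all tests, it estimates the chance that at least one of `k` random samples passes. Whether this leaderboard or LiveCodeBench computes the metric in exactly this way is an assumption; the sample counts are made up.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples with c correct ones."""
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(results: list[tuple[int, int]]) -> float:
    """Average pass@1 (as a percentage) over (n, c) pairs, one per problem."""
    return 100.0 * sum(pass_at_k(n, c, 1) for n, c in results) / len(results)


if __name__ == "__main__":
    # Hypothetical per-problem results: (samples drawn, samples that passed).
    print(benchmark_pass_at_1([(10, 3), (10, 0), (10, 10)]))
```

For `k = 1` the estimator reduces to the plain fraction `c / n`, so pass@1 is simply the share of single attempts that succeed, averaged over problems.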

agentic

SWE-bench Verified (Some risk)
% resolved

Real GitHub issues that the model must resolve end-to-end. The Verified subset is a 500-task, human-validated slice of SWE-bench.

Terminal-Bench 2 (Fresh)
%

Long-horizon shell-and-filesystem tasks executed in a sandboxed terminal, scored by whether the agent's final state matches a target state. Tests practical tool-using ability for everyday devops and data-wrangling work; one of the hardest agentic benchmarks today.

performance

Output Speed (N/A)
tok/s

Median sustained output speed in tokens per second on the model's first-party API for medium-length prompts. Higher is faster.

Time to First Token (N/A)
ms

Median time from request to first output chunk in milliseconds on the model's first-party API for medium-length prompts. Lower is snappier; reasoning models are penalised here because they think before talking.
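
The two performance metrics above can be sketched from per-request timestamps. This is a minimal illustration, not the leaderboard's measurement code: the `RequestTiming` record is hypothetical, and it assumes output speed is measured over the streaming window after the first chunk arrives, which this page does not specify.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class RequestTiming:
    """Hypothetical timing record for one streamed request (seconds)."""
    sent_s: float          # when the request was sent
    first_chunk_s: float   # when the first output chunk arrived
    last_chunk_s: float    # when the final output chunk arrived
    output_tokens: int     # tokens generated in the response


def median_ttft_ms(timings: list[RequestTiming]) -> float:
    """Median time to first token, in milliseconds."""
    return median((t.first_chunk_s - t.sent_s) * 1000.0 for t in timings)


def median_output_speed(timings: list[RequestTiming]) -> float:
    """Median sustained output speed, in tokens per second.

    Assumption: the clock starts at the first chunk, so time spent
    before any output (including reasoning) is excluded from the rate.
    """
    return median(
        t.output_tokens / (t.last_chunk_s - t.first_chunk_s) for t in timings
    )
```

Using the median rather than the mean keeps a single slow request from skewing either figure.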

math

FrontierMath Tiers 1-3 (Fresh)
%

Mathematical research problems spanning analysis, algebra, combinatorics, and number theory. Tiers 1-3 are progressively harder, and even frontier reasoning models solve only a small fraction. It is among the hardest publicly reported benchmarks for general mathematical reasoning.

composite

Frontier Composite (Fresh)
ECI

Saturation-resistant composite capability score stitched together from ~40 underlying benchmarks using Item Response Theory. Each benchmark is weighted by its fitted difficulty and discriminative slope, so doing well on hard, contamination-resistant evals (FrontierMath, ARC-AGI 2, Humanity's Last Exam) moves the score, while saturated benchmarks contribute almost nothing. The score is imported per model from Epoch AI's published index; we anchor it to the same min-max scale we use for every other benchmark so it is directly weightable in scenarios.
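
To make the two ideas in this description concrete, here is a hedged sketch of a two-parameter logistic (2PL) item response curve and of min-max rescaling. The 2PL form is a standard IRT model chosen for illustration; the exact model behind the imported index is not described on this page, and every number below is hypothetical, as is the 0-100 target range.

```python
import math


def irt_2pl(ability: float, difficulty: float, slope: float) -> float:
    """Expected success rate on a benchmark under a 2PL IRT model."""
    return 1.0 / (1.0 + math.exp(-slope * (ability - difficulty)))


def min_max_scale(value: float, lo: float, hi: float) -> float:
    """Rescale a raw score to 0-100 given the observed min and max."""
    return 100.0 * (value - lo) / (hi - lo)


if __name__ == "__main__":
    weak, strong = 1.0, 2.0  # hypothetical model abilities

    # A saturated (easy) benchmark barely separates the two models...
    easy_gap = irt_2pl(strong, -3.0, 1.0) - irt_2pl(weak, -3.0, 1.0)
    # ...while a hard benchmark near their ability level separates them well.
    hard_gap = irt_2pl(strong, 1.5, 1.0) - irt_2pl(weak, 1.5, 1.0)
    print(f"easy gap: {easy_gap:.3f}, hard gap: {hard_gap:.3f}")

    # Hypothetical raw composite of 130 on an observed 100-160 range.
    print(min_max_scale(130.0, 100.0, 160.0))
```

The gap comparison shows why saturated benchmarks carry little information in a fit like this: once every strong model scores near the ceiling, differences in ability barely change the expected result.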

Hosted endpoints

Host     Input $/M   Output $/M   Context   Quant
Host T   $0.30       $2.50        1.0M      unknown
Host H   $0.30       $2.50        1.0M      unknown
Host V   $0.30       $2.50        1.0M      unknown
Host K   $0.30       $2.50        1.0M      unknown
Anonymised third-party hosts. Sorted by lowest output price.
