Gemini 2.5 Flash

Name: Gemini 2.5 Flash
Brand: Google
Price: 0.30 USD per 1M input tokens
Rating: 54.6 (8 reviews)

Tags: Closed, Google, Proprietary, text, vision

Gemini 2.5, released 1y ago

Avg score: 54.6 / 100

Context: 1.0M tokens
Output limit: 66k tokens
Input price: $0.30 /M tokens
Output price: $2.50 /M tokens

Pricing verified 2mo ago
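As a rough illustration of what the per-million prices mean for a single request, here is a minimal sketch. The function name and the example token counts are hypothetical; only the two prices come from the listing.

```python
# Prices per million tokens, taken from the listing above.
INPUT_PRICE_PER_M = 0.30   # USD per 1M input tokens
OUTPUT_PRICE_PER_M = 2.50  # USD per 1M output tokens


def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request at the listed per-million rates."""
    total = input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    return total / 1_000_000


# Hypothetical request: 20,000 prompt tokens and 1,000 completion tokens.
cost = request_cost_usd(20_000, 1_000)
print(f"${cost:.4f}")  # prints $0.0085
```

Output tokens cost roughly eight times as much as input tokens here, so long completions tend to dominate the bill even when prompts are large.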

Benchmarks

preference

Chatbot Arena Elo (Fresh)

Elo

Crowdsourced pairwise human preference rankings of LLM responses. Higher Elo means more frequently preferred by users.
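To make "more frequently preferred" concrete, the sketch below uses the textbook Elo expected-score formula with the conventional 400-point scale. This is an assumption for illustration; the leaderboard's own rating fit may differ in detail, and the ratings used are made up.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A is preferred over model B under standard Elo."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Hypothetical ratings: a 100-point gap.
print(round(elo_expected_score(1300, 1200), 3))  # prints 0.64
```

Under this model a 100-point lead corresponds to being preferred in about 64% of head-to-head comparisons, and equal ratings give 50%.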

coding

LiveCodeBench (Fresh)

% pass@1

Continuously refreshed coding benchmark drawing from LeetCode, AtCoder, and Codeforces; reduces benchmark contamination.
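For readers unfamiliar with the unit, pass@1 is the chance that a single sampled solution passes all tests. A minimal sketch of the commonly used unbiased pass@k estimator follows, assuming `n` samples per problem of which `c` pass; whether this benchmark computes its score exactly this way is not stated on this page.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k from n samples with c correct ones."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical problem: 10 samples, 3 of them pass.
print(pass_at_k(10, 3, 1))
```

For k = 1 the estimator reduces to c / n, and the benchmark-level figure is the mean of this value across problems.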

agentic

SWE-bench Verified (Some risk)

% resolved

Real GitHub issues solved end-to-end. Verified subset is a 500-task human-validated slice of SWE-bench.

Terminal-Bench 2 (Fresh)

Long-horizon shell-and-filesystem tasks executed in a sandboxed terminal, scored by whether the agent's final state matches a target state. Tests practical tool-using ability for everyday devops and data-wrangling work; one of the hardest agentic benchmarks today.

performance

Output Speed: N/A

tok/s

Median sustained output speed in tokens per second on the model's first-party API for medium-length prompts. Higher is faster.

Time to First Token: N/A

Median time from request to first output chunk in milliseconds on the model's first-party API for medium-length prompts. Lower is snappier; reasoning models are penalised here because they think before talking.
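The two performance metrics above can be derived from three timestamps per request. The sketch below assumes that "sustained" speed means output tokens divided by the time between the first and last chunk; that definition, the function names, and the sample runs are all illustrative assumptions.

```python
from statistics import median


def time_to_first_token_ms(sent_s: float, first_chunk_s: float) -> float:
    """Milliseconds from sending the request to receiving the first chunk."""
    return (first_chunk_s - sent_s) * 1000.0


def output_speed_tok_per_s(tokens: int, first_chunk_s: float, last_chunk_s: float) -> float:
    """Tokens per second over the streaming window (first to last chunk)."""
    duration = last_chunk_s - first_chunk_s
    if duration <= 0:
        raise ValueError("last chunk must arrive after the first chunk")
    return tokens / duration


# Hypothetical timings in seconds for three requests.
runs = [
    {"sent": 0.0, "first": 0.5, "last": 2.5, "tokens": 400},
    {"sent": 0.0, "first": 0.25, "last": 4.25, "tokens": 600},
    {"sent": 0.0, "first": 1.0, "last": 3.0, "tokens": 500},
]

median_ttft = median(time_to_first_token_ms(r["sent"], r["first"]) for r in runs)
median_speed = median(
    output_speed_tok_per_s(r["tokens"], r["first"], r["last"]) for r in runs
)
print(median_ttft, median_speed)  # prints 500.0 200.0
```

Reporting the median rather than the mean keeps one slow request from skewing either figure.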

math

FrontierMath Tiers 1-3 (Fresh)

Mathematical research problems spanning analysis, algebra, combinatorics and number theory. Tiers 1-3 are progressively harder; even frontier reasoning models only solve a small fraction. The hardest publicly reported benchmark for general mathematical reasoning.

composite

Frontier Composite (Fresh)

ECI

Saturation-resistant composite capability score stitched together from ~40 underlying benchmarks using Item Response Theory. Each benchmark is weighted by its fitted difficulty and discriminative slope, so doing well on hard, contamination-resistant evals (FrontierMath, ARC-AGI 2, Humanity's Last Exam) moves the score and saturated benchmarks contribute almost nothing. Imported per-model from Epoch AI's published index; we anchor it to the same min-max scale we use for every other benchmark so it's directly weightable in scenarios.
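To show why saturated benchmarks contribute almost nothing, here is a sketch using a generic two-parameter logistic (2PL) Item Response Theory model, followed by a min-max rescale to 0-100. This is not Epoch AI's actual fitting code; the 2PL form, the parameter values, and the 0-100 range are assumptions for illustration.

```python
import math


def p_solve(ability: float, difficulty: float, slope: float) -> float:
    """2PL probability that a model of given ability succeeds on an item."""
    return 1.0 / (1.0 + math.exp(-slope * (ability - difficulty)))


def item_information(ability: float, difficulty: float, slope: float) -> float:
    """Fisher information of a 2PL item: slope^2 * p * (1 - p)."""
    p = p_solve(ability, difficulty, slope)
    return slope ** 2 * p * (1.0 - p)


def min_max_scale(value: float, lo: float, hi: float) -> float:
    """Map value from [lo, hi] onto a 0-100 scale."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    return 100.0 * (value - lo) / (hi - lo)


# Hypothetical strong model (ability 2.0) on an easy and a hard benchmark.
info_saturated = item_information(2.0, difficulty=-3.0, slope=1.0)
info_hard = item_information(2.0, difficulty=2.0, slope=1.0)
print(info_saturated < info_hard)  # prints True
```

An item far below a model's ability is solved almost every time, so it carries little information about that ability; an item near the model's level carries the most.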


Hosted endpoints

Host	Input $/M	Output $/M	Context	Quant
Host T	$0.30	$2.50	1.0M	unknown
Host H	$0.30	$2.50	1.0M	unknown
Host V	$0.30	$2.50	1.0M	unknown
Host K	$0.30	$2.50	1.0M	unknown

Anonymised third-party hosts. Sorted by lowest output price.
