Scenario guide
Best AI models for Customer Support Bot
A high-volume B2C chatbot. We weight Arena (human preference) and IFEval (does it follow your formatting instructions?) heavily, and lean on cost because volume is enormous. Reliability axes (format-adherence, safety-handling) also carry a modest weight, because a chatbot that answers off-format or false-refuses benign requests is unusable regardless of how smart it is.
Rankings use the same scenario weights and cost blending as the interactive leaderboard on AI Model Analyzer. Data is min-max normalised per benchmark; missing scores are skipped without penalty.
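As a rough illustration of the normalisation step described above, here is a minimal Python sketch. It assumes that "skipped without penalty" means a model's weighted average is taken only over the benchmarks it has scores for, with the weights renormalised; the page does not spell this out. The function names, model names, weights, and raw numbers are all hypothetical, and cost blending is left out because its formula is not given here.

```python
from typing import Dict, Optional

# model -> benchmark -> raw score (None when the score is missing)
Scores = Dict[str, Dict[str, Optional[float]]]


def min_max_normalise(raw: Scores) -> Scores:
    """Rescale each benchmark to [0, 1] across models; missing scores stay None."""
    benchmarks = {b for per_model in raw.values() for b in per_model}
    bounds = {}
    for b in benchmarks:
        present = [s[b] for s in raw.values() if s.get(b) is not None]
        if present:
            bounds[b] = (min(present), max(present))

    out: Scores = {}
    for model, per_model in raw.items():
        out[model] = {}
        for b, value in per_model.items():
            if value is None or b not in bounds:
                out[model][b] = None
                continue
            lo, hi = bounds[b]
            # If every model ties, the range is zero; 0.5 is an arbitrary choice (assumption).
            out[model][b] = 0.5 if hi == lo else (value - lo) / (hi - lo)
    return out


def weighted_score(
    normalised: Dict[str, Optional[float]], weights: Dict[str, float]
) -> Optional[float]:
    """Weighted mean over the benchmarks a model actually has; missing ones are skipped."""
    used = [(normalised[b], w) for b, w in weights.items() if normalised.get(b) is not None]
    total_weight = sum(w for _, w in used)
    if total_weight == 0:
        return None  # nothing to score on
    return sum(v * w for v, w in used) / total_weight


# Illustrative inputs only: hypothetical models, scores, and weights.
raw = {
    "model_a": {"arena": 1400.0, "ifeval": 90.0},
    "model_b": {"arena": 1300.0, "ifeval": 80.0},
    "model_c": {"arena": 1350.0, "ifeval": None},  # missing IFEval score
}
weights = {"arena": 0.6, "ifeval": 0.4}

normalised = min_max_normalise(raw)
scores = {model: weighted_score(normalised[model], weights) for model in raw}
```

In this sketch `model_c` is scored on Arena alone, so its missing IFEval result neither helps nor hurts it relative to the others.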
- 1. DeepSeek V3 | DeepSeek | Score 81.9 | Q 86.1 | In $0.27/M
- 2. Gemini 3 Flash | Google | Score 81.3 | Q 91.1 | In $0.30/M
- 3. Qwen3 235B (Thinking) | Alibaba (Qwen) | Score 78.6 | Q 75.2 | In $0.20/M
- 4. Gemini 2.5 Flash | Google | Score 78.4 | Q 86.3 | In $0.30/M
- 5. DeepSeek R1 | DeepSeek | Score 77.6 | Q 87.4 | In $0.55/M
- 6. DeepSeek V3 (Thinking) | DeepSeek | Score 77.2 | Q 78.2 | In $0.27/M
- 7. Gemini 2.0 Flash | Google | Score 77.1 | Q 65.8 | In $0.10/M
- 8. Kimi K2 | Moonshot (Kimi) | Score 75.7 | Q 85.7 | In $0.60/M
- 9. Gemini 3 Pro | Google | Score 75.4 | Q 98.4 | In $1.25/M
- 10. Qwen3 235B | Alibaba (Qwen) | Score 73.0 | Q 65.9 | In $0.20/M
- 11. GPT-5.5 | OpenAI | Score 72.5 | Q 95.9 | In $1.50/M
- 12. GPT-5.4 | OpenAI | Score 72.1 | Q 95.2 | In $1.50/M
- 13. Gemini 2.5 Pro | Google | Score 71.2 | Q 91.5 | In $1.25/M
- 14. GPT-5 nano | OpenAI | Score 68.3 | Q 47.2 | In $0.05/M
- 15. Gemini 1.5 Flash | Google | Score 68.3 | Q 47.6 | In $0.08/M