Scenario guide
Best AI models for Customer Support Bot
A high-volume B2C chatbot. We weight Arena (human preference) and IFEval (does it follow your formatting instructions?) heavily, and lean on cost because volume is enormous. Reliability axes (format adherence, safety handling) also carry a modest weight, because a chatbot that answers off-format or falsely refuses benign requests is unusable regardless of how smart it is.
Rankings use the same scenario weights and cost blending as the interactive leaderboard on AI Model Analyzer. Data is min-max normalised per benchmark; missing scores are skipped without penalty.
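As a rough illustration of the normalisation step described above, here is a minimal Python sketch. It assumes that "skipped without penalty" means the weights are renormalised over the benchmarks a model actually has; the leaderboard's exact formula and its cost blending are not given here, so cost is left out. The function names, model names, weights, and raw scores are all hypothetical.

```python
from typing import Dict, Optional

# model name -> benchmark name -> raw score (None means the score is missing)
Scores = Dict[str, Dict[str, Optional[float]]]


def min_max_normalise(raw: Scores) -> Scores:
    """Rescale each benchmark to [0, 1] using the min and max over models that have a score."""
    benchmarks = {b for scores in raw.values() for b in scores}
    bounds = {}
    for b in benchmarks:
        values = [s[b] for s in raw.values() if s.get(b) is not None]
        if values:
            bounds[b] = (min(values), max(values))

    normalised: Scores = {}
    for model, scores in raw.items():
        normalised[model] = {}
        for b, value in scores.items():
            if value is None or b not in bounds:
                normalised[model][b] = None  # keep missing scores missing
                continue
            lo, hi = bounds[b]
            # Assumption: if every model ties on a benchmark, map the score to 0.0.
            normalised[model][b] = 0.0 if hi == lo else (value - lo) / (hi - lo)
    return normalised


def weighted_score(norm: Dict[str, Optional[float]],
                   weights: Dict[str, float]) -> Optional[float]:
    """Weighted mean over the benchmarks that are present.

    Assumption: missing benchmarks are dropped and the remaining weights are
    renormalised, so a gap neither helps nor hurts the model.
    """
    present = [(b, w) for b, w in weights.items() if norm.get(b) is not None]
    total = sum(w for _, w in present)
    if total == 0:
        return None
    return sum(norm[b] * w for b, w in present) / total


# Hypothetical raw scores and scenario weights, for illustration only.
raw_scores: Scores = {
    "model_a": {"arena": 1400.0, "ifeval": 90.0},
    "model_b": {"arena": 1300.0, "ifeval": None},
    "model_c": {"arena": 1350.0, "ifeval": 80.0},
}
scenario_weights = {"arena": 0.6, "ifeval": 0.4}

normalised = min_max_normalise(raw_scores)
ranking = sorted(
    ((model, weighted_score(scores, scenario_weights))
     for model, scores in normalised.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
```

Under this reading, `model_b` is scored on Arena alone rather than being given a zero for the IFEval result it lacks.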
- 1. Qwen3 235B (Thinking), Alibaba (Qwen): Score 82.3, Q 81.4, In $0.20/M
- 2. Gemini 3 Flash, Google: Score 82.0, Q 92.3, In $0.30/M
- 3. DeepSeek V3 (Thinking), DeepSeek: Score 81.2, Q 84.8, In $0.27/M
- 4. GLM-4.6, Zhipu AI (GLM): Score 80.3, Q 90.1, In $0.50/M
- 5. DeepSeek V3, DeepSeek: Score 80.2, Q 83.2, In $0.27/M
- 6. Gemini 2.0 Flash, Google: Score 79.4, Q 69.5, In $0.10/M
- 7. GLM-4.7, Zhipu AI (GLM): Score 78.9, Q 87.8, In $0.50/M
- 8. DeepSeek R1, DeepSeek: Score 76.5, Q 85.6, In $0.55/M
- 9. Gemini 2.5 Flash, Google: Score 76.2, Q 82.6, In $0.30/M
- 10. Qwen3 235B, Alibaba (Qwen): Score 75.4, Q 69.9, In $0.20/M
- 11. Gemini 2.5 Pro, Google: Score 74.7, Q 97.3, In $1.25/M
- 12. Kimi K2, Moonshot (Kimi): Score 74.7, Q 84.0, In $0.60/M
- 13. GPT-5.5, OpenAI: Score 74.6, Q 99.4, In $1.50/M
- 14. Gemini 3 Pro, Google: Score 73.5, Q 95.3, In $1.25/M
- 15. GPT-5 nano, OpenAI: Score 70.7, Q 51.2, In $0.05/M