June 2026
AI Coder Bench
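A benchmark leaderboard built on real-world programming tasks. 50 tasks covering bug fixing, feature development, refactoring, system design, and debugging & explanation.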
| # | Change | Model | Vendor | Score | Coverage |
|---|---|---|---|---|---|
| 1 | = | Gemini 2.5 Pro | Google | 87.7 | 3/4 |
| 2 | = | Grok 4.3 | xAI | 86.0 | 3/4 |
| 3 | = | DeepSeek-V4-Pro | DeepSeek | 84.0 | 3/4 |
| 4 | = | DeepSeek-V4-Flash | DeepSeek | 83.5 | 1/4 |
| 5 | = | Claude Opus 4.8 | Anthropic | 81.4 | 3/4 |
| 6 | = | GPT-5.4 | OpenAI | 77.7 | 3/4 |
| 7 | = | Claude Sonnet 4.6 | Anthropic | 75.6 | 4/4 |
| 8 | = | GPT-5.5 | OpenAI | 74.7 | 4/4 |
| 9 | = | Qwen3.7-Max | Qwen | 74.3 | 2/4 |
| 10 | = | Gemini 2.5 Flash | Google | 72.1 | 3/4 |
| 11 | = | Mistral Large 3 | Mistral AI | 62.2 | 1/4 |
| 12 | = | GPT-5.4 mini | OpenAI | 61.3 | 4/4 |
| 13 | = | Kimi K2.6 | Moonshot AI | 59.1 | 1/4 |
| 14 | = | Phi-4 Mini | Microsoft | 59.1 | 1/4 |
| 15 | ↓1 | Command R+ 08-2024 | Cohere | 56.7 | 1/4 |
| 16 | ↓1 | Claude Haiku 4.5 | Anthropic | 56.4 | 3/4 |
| 17 | = | Phi-4 | Microsoft | 45.1 | 1/4 |
| 18 | ↓2 | Qwen3.6-Plus | Qwen | 32.1 | 2/4 |
| 19 | ↓2 | Codestral | Mistral AI | 25.0 | 2/4 |
| 20 | = | Llama 4 Maverick | Meta | 15.6 | 1/4 |
| 21 | = | Yi-Lightning | 01.AI | 12.9 | 1/4 |
| 22 | ↓4 | Command R 03-2024 | Cohere | 12.0 | 1/4 |
S ≥ 90 · A ≥ 80 · B ≥ 70 · C < 70
Scores aggregated from public benchmarks · updated weekly
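The tier legend above amounts to a simple threshold lookup. The following is a minimal sketch of that mapping; the function name `tier` is hypothetical, and it assumes scores sit on a 0 to 100 scale with each threshold inclusive at its lower bound, as the "≥" signs suggest.

```python
def tier(score: float) -> str:
    """Map a leaderboard score to a tier letter using the legend's thresholds."""
    if score >= 90:
        return "S"
    if score >= 80:
        return "A"
    if score >= 70:
        return "B"
    return "C"


# A few scores taken from the table above.
for name, score in [("Gemini 2.5 Pro", 87.7), ("GPT-5.4", 77.7), ("Phi-4", 45.1)]:
    print(f"{name}: {score} -> tier {tier(score)}")
```

Under these thresholds no model in the current table reaches tier S, since the top score is 87.7.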
Scoring method
| Weight | Benchmark | What it measures |
|---|---|---|
| 35% | Aider Polyglot | 225 real coding tasks across 6 languages |
| 35% | SWE-bench Verified | 500 real GitHub issues, % resolved |
| 20% | LiveCodeBench | Ongoing competitive programming, Pass@1 |
| 10% | EvalPlus (HumanEval+) | Stricter HumanEval test suite |
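The weights above suggest a weighted average of the four benchmark results. The page does not say how models with missing benchmarks are handled, so the sketch below makes an explicit assumption: weights are renormalised over whichever benchmarks have data, and the count of available benchmarks is reported as an n/4 figure like the one shown next to each score. The dictionary keys, the function name `weighted_score`, and the example inputs are all hypothetical and are not taken from the leaderboard's data.

```python
from __future__ import annotations

# Weights from the scoring method; the key names are illustrative.
WEIGHTS = {
    "aider_polyglot": 0.35,
    "swe_bench_verified": 0.35,
    "livecodebench": 0.20,
    "evalplus": 0.10,
}


def weighted_score(results: dict[str, float]) -> tuple[float, str]:
    """Combine per-benchmark percentages (0-100) into one score.

    Assumption: weights are renormalised over the benchmarks that have
    results. The leaderboard may handle missing data differently.
    """
    available = {k: v for k, v in results.items() if k in WEIGHTS}
    if not available:
        raise ValueError("no recognised benchmark results")
    total_weight = sum(WEIGHTS[k] for k in available)
    score = sum(WEIGHTS[k] * v for k, v in available.items()) / total_weight
    return round(score, 1), f"{len(available)}/{len(WEIGHTS)}"


# Made-up inputs for illustration only.
full = {"aider_polyglot": 80, "swe_bench_verified": 70,
        "livecodebench": 60, "evalplus": 90}
partial = {"aider_polyglot": 80, "swe_bench_verified": 70}

print(weighted_score(full))
print(weighted_score(partial))
```

Renormalising keeps scores on the same 0 to 100 scale regardless of coverage, but it also means a model measured on one benchmark is not directly comparable to one measured on all four, which is worth keeping in mind when reading rows with low coverage.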