AI Coder Bench: Leaderboard

Models: 22

Tasks: 1277

Period: June 2026

#	Change	Model	Vendor	Tags	Score	Coverage	Aider	SWE-bench	LiveCode	EvalPlus
1	=	Gemini 2.5 Pro	Google	reasoning, vision, long-context	87.7	3/4	83.1	—	100.0	79.3
2	=	Grok 4.3	xAI	reasoning, agentic, long-context	86.0	3/4	79.6	—	100.0	80.5
3	=	DeepSeek-V4-Pro	DeepSeek	flagship, reasoning, coding	84.0	3/4	74.2	—	100.0	86.6
4	=	DeepSeek-V4-Flash	DeepSeek	fast, cheap, coding	83.5	1/4	—	—	—	83.5
5	=	Claude Opus 4.8	Anthropic	flagship, reasoning, vision	81.4	3/4	72.0	—	100.0	77.4
6	=	GPT-5.4	OpenAI	workhorse, coding	77.7	3/4	61.7	—	100.0	89.0
7	=	Claude Sonnet 4.6	Anthropic	workhorse, coding, vision	75.6	4/4	64.9	70.6	100.0	81.7
8	=	GPT-5.5	OpenAI	flagship, reasoning	74.7	4/4	88.0	65.0	100.0	11.0
9	=	Qwen3.7-Max	Qwen	flagship, reasoning, coding	74.3	2/4	59.6	—	100.0	—
10	=	Gemini 2.5 Flash	Google	workhorse, vision, long-context	72.1	3/4	55.1	—	100.0	75.6
11	=	Mistral Large 3	Mistral AI	flagship, vision, coding	62.2	1/4	—	—	—	62.2
12	=	GPT-5.4 mini	OpenAI	fast, cheap	61.3	4/4	32.9	59.8	100.0	89.0
13	=	Kimi K2.6	Moonshot AI	flagship, reasoning, vision	59.1	1/4	59.1	—	—	—
14	=	Phi-4 Mini	Microsoft	fast, cheap, open-source	59.1	1/4	—	—	—	59.1
15	↓1	Command R+ 08-2024	Cohere	flagship, reasoning, coding	56.7	1/4	—	—	—	56.7
16	↓1	Claude Haiku 4.5	Anthropic	fast, cheap	56.4	3/4	28.0	—	100.0	68.9
17	=	Phi-4	Microsoft	fast, cheap, coding	45.1	1/4	—	—	—	45.1
18	↓2	Qwen3.6-Plus	Qwen	coding, reasoning	32.1	2/4	16.4	—	—	87.2
19	↓2	Codestral	Mistral AI	coding, code-completion	25.0	2/4	11.1	—	—	73.8
20	=	Llama 4 Maverick	Meta	workhorse, vision, open-source	15.6	1/4	15.6	—	—	—
21	=	Yi-Lightning	01.AI	fast, cheap, coding	12.9	1/4	12.9	—	—	—
22	↓4	Command R 03-2024	Cohere	coding, multilingual, rag	12.0	1/4	12.0	—	—	—

S ≥ 90 · A ≥ 80 · B ≥ 70 · C < 70
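The tier legend above can be read as a simple threshold lookup. The sketch below is a minimal illustration, assuming the thresholds apply to any 0 to 100 score shown on the page; the page does not say whether tiers grade the composite score, the individual benchmark cells, or both. The function name `tier` is hypothetical.

```python
def tier(score: float) -> str:
    """Map a 0-100 score to a letter tier using the legend's thresholds.

    Assumption: boundaries are inclusive on the lower end (90.0 is S),
    as the "≥" signs in the legend suggest.
    """
    if score >= 90:
        return "S"
    if score >= 80:
        return "A"
    if score >= 70:
        return "B"
    return "C"


if __name__ == "__main__":
    # Composite scores taken from the table above.
    for name, score in [("Gemini 2.5 Pro", 87.7), ("GPT-5.5", 74.7), ("Codestral", 25.0)]:
        print(f"{name}: {score} -> tier {tier(score)}")
```

Under this reading, no model's composite score reaches tier S in this period, since the top score is 87.7.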

Scores aggregated from public benchmarks · updated weekly

Scoring method

35% Aider Polyglot

225 real coding tasks across 6 languages

35% SWE-bench Verified

500 real GitHub issues, % resolved

20% LiveCodeBench

Ongoing competitive programming, Pass@1

10% EvalPlus (HumanEval+)

Stricter HumanEval test suite
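The weights above can be combined into the composite score roughly as follows. The page does not state how missing benchmarks (shown as "—") are handled; this sketch assumes the weights are renormalized over the benchmarks a model actually has results for, which reproduces the displayed scores for the rows checked here (for example, Gemini 2.5 Pro at 87.7 with 3/4 benchmarks). The names `WEIGHTS` and `composite_score` and the dictionary keys are hypothetical.

```python
from typing import Dict, Optional, Tuple

# Weights from the scoring section (fractions of 1.0).
WEIGHTS: Dict[str, float] = {
    "aider": 0.35,      # Aider Polyglot
    "swe_bench": 0.35,  # SWE-bench Verified
    "livecode": 0.20,   # LiveCodeBench
    "evalplus": 0.10,   # EvalPlus (HumanEval+)
}


def composite_score(results: Dict[str, Optional[float]]) -> Tuple[float, int]:
    """Return (score rounded to one decimal, number of benchmarks used).

    Assumption: a missing result (None) is skipped and the remaining
    weights are rescaled to sum to 1, rather than counting as zero.
    """
    available = {
        name: value
        for name, value in results.items()
        if name in WEIGHTS and value is not None
    }
    if not available:
        raise ValueError("at least one benchmark result is required")
    total_weight = sum(WEIGHTS[name] for name in available)
    weighted_sum = sum(WEIGHTS[name] * value for name, value in available.items())
    return round(weighted_sum / total_weight, 1), len(available)


if __name__ == "__main__":
    gemini = {"aider": 83.1, "swe_bench": None, "livecode": 100.0, "evalplus": 79.3}
    sonnet = {"aider": 64.9, "swe_bench": 70.6, "livecode": 100.0, "evalplus": 81.7}
    print(composite_score(gemini))  # (87.7, 3)
    print(composite_score(sonnet))  # (75.6, 4)
```

One consequence of renormalizing is that a model with a single result is scored on that benchmark alone, so composite scores with 1/4 coverage are not directly comparable to those with 4/4.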