Total models: 22
Tasks: 1277
Period: June 2026
| # | Change | Model | Provider | Tags | Score |
|---|---|---|---|---|---|
| 1 | = | Gemini 2.5 Pro | Google | reasoning, vision, long-context | 87.7 (3/4) |
| 2 | = | Grok 4.3 | xAI | reasoning, agentic, long-context | 86.0 (3/4) |
| 3 | = | DeepSeek-V4-Pro | DeepSeek | flagship, reasoning, coding | 84.0 (3/4) |
| 4 | = | DeepSeek-V4-Flash | DeepSeek | fast, cheap, coding | 83.5 (1/4) |
| 5 | = | Claude Opus 4.8 | Anthropic | flagship, reasoning, vision | 81.4 (3/4) |
| 6 | = | GPT-5.4 | OpenAI | workhorse, coding | 77.7 (3/4) |
| 7 | = | Claude Sonnet 4.6 | Anthropic | workhorse, coding, vision | 75.6 (4/4) |
| 8 | = | GPT-5.5 | OpenAI | flagship, reasoning | 74.7 (4/4) |
| 9 | = | Qwen3.7-Max | Qwen | flagship, reasoning, coding | 74.3 (2/4) |
| 10 | = | Gemini 2.5 Flash | Google | workhorse, vision, long-context | 72.1 (3/4) |
| 11 | = | Mistral Large 3 | Mistral AI | flagship, vision, coding | 62.2 (1/4) |
| 12 | = | GPT-5.4 mini | OpenAI | fast, cheap | 61.3 (4/4) |
| 13 | = | Kimi K2.6 | Moonshot AI | flagship, reasoning, vision | 59.1 (1/4) |
| 14 | = | Phi-4 Mini | Microsoft | fast, cheap, open-source | 59.1 (1/4) |
| 15 | ↓1 | Command R+ 08-2024 | Cohere | flagship, reasoning, coding | 56.7 (1/4) |
| 16 | ↓1 | Claude Haiku 4.5 | Anthropic | fast, cheap | 56.4 (3/4) |
| 17 | = | Phi-4 | Microsoft | fast, cheap, coding | 45.1 (1/4) |
| 18 | ↓2 | Qwen3.6-Plus | Qwen | coding, reasoning | 32.1 (2/4) |
| 19 | ↓2 | Codestral | Mistral AI | coding, code-completion | 25.0 (2/4) |
| 20 | = | Llama 4 Maverick | Meta | workhorse, vision, open-source | 15.6 (1/4) |
| 21 | = | Yi-Lightning | 01.AI | fast, cheap, coding | 12.9 (1/4) |
| 22 | ↓4 | Command R 03-2024 | Cohere | coding, multilingual, rag | 12.0 (1/4) |
S ≥ 90 · A ≥ 80 · B ≥ 70 · C < 70
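
The tier thresholds in the legend can be read as a simple lookup. The sketch below is one way to express it, assuming the bands are checked from highest to lowest and that a score exactly on a boundary falls into the higher tier; the function name `tier` is hypothetical and not taken from the page.

```python
def tier(score: float) -> str:
    """Map a 0-100 score to a tier, using the legend's thresholds."""
    if score >= 90:
        return "S"
    if score >= 80:
        return "A"
    if score >= 70:
        return "B"
    return "C"  # everything below 70


# Scores taken from the ranking for illustration.
for name, score in [("Gemini 2.5 Pro", 87.7), ("GPT-5.4", 77.7), ("Mistral Large 3", 62.2)]:
    print(f"{name}: {tier(score)}")
```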

Scores aggregated from public benchmarks · updated weekly

Scoring method
35% Aider Polyglot: 225 real coding tasks across 6 languages
35% SWE-bench Verified: 500 real GitHub issues, % resolved
20% LiveCodeBench: ongoing competitive programming, Pass@1
10% EvalPlus (HumanEval+): stricter HumanEval test suite
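
The weights above suggest a weighted average over the four benchmarks. The following is a minimal sketch of that calculation, not the site's actual implementation. It rests on two assumptions the page does not state: that the n/4 figure beside each score counts the benchmarks with a reported result, and that missing benchmarks are handled by renormalizing the weights over the ones available. The dictionary keys, the function name `aggregate`, and the example numbers are all hypothetical.

```python
from typing import Mapping

# Weights from the scoring method; the key names are hypothetical.
WEIGHTS = {
    "aider_polyglot": 0.35,
    "swe_bench_verified": 0.35,
    "livecodebench": 0.20,
    "evalplus": 0.10,
}


def aggregate(scores: Mapping[str, float]) -> tuple[float, int]:
    """Return (weighted score, number of benchmarks used).

    Assumption: benchmarks without a result are skipped and the
    remaining weights are renormalized to sum to 1.
    """
    available = {k: v for k, v in scores.items() if k in WEIGHTS}
    if not available:
        raise ValueError("no recognized benchmark results")
    total_weight = sum(WEIGHTS[k] for k in available)
    weighted_sum = sum(WEIGHTS[k] * v for k, v in available.items())
    return round(weighted_sum / total_weight, 1), len(available)


# Made-up results for a model with three of four benchmarks reported.
score, used = aggregate(
    {"aider_polyglot": 80.0, "swe_bench_verified": 70.0, "livecodebench": 60.0}
)
print(f"{score} ({used}/4)")
```

Under this renormalization, a model with a single strong result can outrank one measured on all four benchmarks, so the n/4 figure is worth reading alongside the score.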