FEATURED · AI HOT (Curated Pool) · aihot-api · ZH · 22:01 · 06·05
→ Arena launches Agent Arena, a real-world AI agent leaderboard
Arena launched Agent Arena, a real-world agent leaderboard using 300,000+ tasks, 2 million+ tool calls, and 40 million lines of code to evaluate performance across coding, app building, and document analysis, with GPT-5.5 High, Claude Opus 4.7 Thinking, and GPT-5.4 High ranked first to third.
#Agent #Code #Tools #Arena
why featured
HKR-H/K/R all pass: the leaderboard names model winners and gives scale. The post is summary-level, with no task mix, scoring rule, or reproducibility link, so it stays below P1.
editor take
Agent Arena brings 300K+ real tasks; GPT-5.5 High leads at +10.7%. I buy the direction, not the referee role yet.
sharp
Agent Arena is pushing agent evals into live workflows, but it still needs auditability before it can be treated as a scoreboard. The scale is strong: 300K+ tasks, 2M+ tool calls, and 40M lines of code. The signals are better than pass/fail too: task success, correction following, error recovery, user praise and complaints, and tool hallucination. The weak spot is hidden weighting and hidden task mix. GPT-5.5 High at +10.7%, Claude Opus 4.7 Thinking at +9.5%, and GPT-5.4 High at +8.9% look precise. That precision gets shaky if the workload is skewed toward code repair, app scaffolding, or document analysis. LMSYS Arena worked because preference sampling became legible. Agent Arena needs the same transparency for task buckets and scoring weights.
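To make the task-mix concern concrete, here is a minimal sketch of how one aggregate score can change leaders when only the workload weights change. Everything in it is hypothetical: the model names, the per-bucket lifts, the two mixes, and the weighted-mean rule are illustrative assumptions, not Agent Arena data and not its scoring method, which the post does not disclose.

```python
# Hypothetical per-bucket lifts in percentage points. NOT Agent Arena data.
BUCKET_LIFT = {
    "model_a": {"coding": 14.0, "app_building": 9.0, "doc_analysis": 6.0},
    "model_b": {"coding": 8.0, "app_building": 10.0, "doc_analysis": 12.0},
}


def aggregate(lifts, mix):
    """Weighted mean of per-bucket lifts (assumed scoring rule)."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("task mix weights must sum to 1")
    return sum(lifts[bucket] * weight for bucket, weight in mix.items())


def leader(mix):
    """Return the top model name and every model's aggregate for a task mix."""
    scores = {model: aggregate(lifts, mix) for model, lifts in BUCKET_LIFT.items()}
    return max(scores, key=scores.get), scores


# Two hypothetical workloads over the same three task buckets.
code_heavy = {"coding": 0.7, "app_building": 0.2, "doc_analysis": 0.1}
doc_heavy = {"coding": 0.2, "app_building": 0.2, "doc_analysis": 0.6}

if __name__ == "__main__":
    for name, mix in (("code_heavy", code_heavy), ("doc_heavy", doc_heavy)):
        top, scores = leader(mix)
        print(name, top, {m: round(s, 1) for m, s in scores.items()})
```

With these made-up numbers, model_a leads the code-heavy mix and model_b leads the document-heavy one, even though neither model's per-bucket results changed. That is why a single headline percentage is hard to read without the task buckets and weights behind it.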
HKR breakdown
hook ✓ · knowledge ✓ · resonance ✓