FEATURED · AI HOT (Curated Pool) · aihot-api · ZH · 17:00 · 05·16
→ Latest Open Artifacts #21: Gemma 4, DeepSeek V4, Kimi K2.6, MiMo 2.5, GLM-5.1, and More
Open-model teams released Gemma 4, DeepSeek V4, Kimi K2.6, MiMo 2.5, GLM-5.1, and other models this month. The post says they were tested under CAISI’s V4 evaluation framework, but the RSS snippet does not disclose scores.
#Benchmarking #Gemma #DeepSeek #Kimi
why featured
HKR (H/K/R) all pass: a dense open-model roster, a named CAISI V4 evaluation framework, and clear practitioner relevance for model choice. Missing scores and reproducible detail keep it in the 78–84 band.
editor take
Don’t buy the “open model bonanza” framing too fast; CAISI’s V4 shows how benchmark choice can stretch the gap narrative.
sharp
CAISI is making the open-model gap sound cleaner than the evidence supports. The post says V4 uses nine benchmarks, but DeepSeek V4’s large Elo hit comes mostly from three of them: CTF-Archive-Diamond, scored by extrapolating from a subset; PortBench, which is private to CAISI; and ARC-AGI-2, scored differently from the public leaderboards. One private benchmark plus two special-case treatments can bend the aggregate.
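To make the arithmetic concrete, here is a minimal sketch of how three outlier benchmarks can move a nine-benchmark aggregate. Everything in it is an assumption for illustration: the gap values are invented, the `public_1` to `public_6` names are placeholders, and a plain unweighted mean stands in for whatever Elo-based aggregation CAISI actually uses, which the snippet does not describe.

```python
from statistics import mean

# The three benchmarks the commentary flags as contested.
CONTESTED = ("CTF-Archive-Diamond", "PortBench", "ARC-AGI-2")

# Hypothetical gap per benchmark (closed-model score minus open-model score).
# All values are invented for illustration; none are CAISI results.
gaps = {
    "public_1": 5.0,
    "public_2": 5.0,
    "public_3": 5.0,
    "public_4": 5.0,
    "public_5": 5.0,
    "public_6": 5.0,
    "CTF-Archive-Diamond": 20.0,
    "PortBench": 20.0,
    "ARC-AGI-2": 20.0,
}


def aggregate_gap(gaps, exclude=()):
    """Unweighted mean gap over the benchmarks not listed in `exclude`."""
    kept = [gap for name, gap in gaps.items() if name not in exclude]
    return mean(kept)


if __name__ == "__main__":
    print("all nine benchmarks:", aggregate_gap(gaps))
    print("without the contested three:", aggregate_gap(gaps, exclude=CONTESTED))
```

With these made-up inputs the mean gap halves once the three contested benchmarks are dropped. That says nothing about the real size of the effect; it only shows why a leave-out check is worth running before trusting a single aggregate number.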
I buy Interconnects’ pushback more than the headline. A bash loop with a fixed token budget is not how Claude Code or OpenCode elicit performance from coding models. The Bun Zig-to-Rust port, with 1 million LOC changed, is a nasty counterexample to benchmark claims that porting apps is currently impossible. Open models do trail closed frontier models, but this Elo story depends too heavily on the harness.
HKR breakdown
hook ✓ knowledge ✓ resonance ✓