FEATURED · QbitAI (量子位) · WeChat · rss · ZH · 04:00 · 05·30
→RUC and Zhizhi Institute Open-Source Claw Agent Data, Training, and Evaluation Pipeline
Renmin University of China and Zhizhi Institute open-sourced ClawGym, a Claw Agent framework with 13.5K synthetic executable tasks, 200 benchmark tasks, model checkpoints, training data, and training code; ClawGym-30B-A3B scores 56.82 on ClawGym-Bench and exceeds Qwen3-235B-A23B in the reported evaluation.
#Agent #Tools #Benchmarking #Renmin University of China
why featured
HKR (hook, knowledge, resonance) all pass: ClawGym ships data, code, checkpoints, and eval tasks rather than just a leaderboard. Its impact is developer-facing, which ranks it below a major lab model release or a market-moving event.
editor take
ClawGym’s punchline is executable workspace training, not the 30B-beats-235B headline; a 200-task benchmark is still a small arena.
sharp
ClawGym pushes agent evaluation in the right direction: a task ends in files, paths, tables, scripts, and checked artifacts, not in a model saying “done.” The concrete hooks matter: 13.5K executable synthetic tasks, 200 benchmark tasks, and traces averaging 13 turns, 18.67K tokens, and 15.82 tool calls. That is closer to office-agent pain than generic tool-use demos.
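To make "checked artifacts" concrete, here is a minimal sketch of grading by workspace state rather than by the agent's final message. It is not ClawGym's actual checker: the file name `report.csv`, the `total` column, and the functions `check_workspace` and `passed` are all hypothetical, chosen only to illustrate the idea.

```python
import csv
from pathlib import Path


def check_workspace(root: Path) -> dict[str, bool]:
    """Grade a finished task by inspecting files in the workspace.

    Hypothetical rubric: the agent was asked to produce report.csv
    containing at least one data row and a 'total' column.
    """
    report = root / "report.csv"  # hypothetical expected artifact
    checks = {"report_exists": report.is_file()}

    rows: list[dict[str, str]] = []
    if checks["report_exists"]:
        with report.open(newline="") as fh:
            rows = list(csv.DictReader(fh))

    # Content checks: a file that exists but is empty still fails.
    checks["has_rows"] = len(rows) > 0
    checks["has_total_column"] = bool(rows) and "total" in rows[0]
    return checks


def passed(checks: dict[str, bool]) -> bool:
    """A task passes only if every artifact check holds."""
    return all(checks.values())
```

Under this kind of rubric, an agent that replies "done" without writing the file scores zero, which is the property the paragraph above credits.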
I would discount the “30B beats 235B” framing. ClawGym-30B-A3B scores 56.82 on ClawGym-Bench and reportedly beats Qwen3-235B-A23B, but the benchmark comes from the same project and has only 200 tasks. The stronger claim is the reported 86.00 on the external PinchBench. As with SWE-bench, agent benchmarks quickly become training-route billboards. Open data, code, and checkpoints help; third-party reruns are the next credibility test.
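A rough way to see why 200 tasks is a small arena: the sketch below computes a normal-approximation 95% interval around the reported score. It rests on assumptions the source does not confirm, namely that 56.82 is a percentage of independently graded pass/fail tasks; if the benchmark uses partial credit, the numbers would differ. The function name `pass_rate_interval` is illustrative.

```python
import math


def pass_rate_interval(score_pct: float, n_tasks: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation confidence interval for a pass rate, in percent.

    Assumes each task is an independent pass/fail trial (an assumption,
    not something the source states about ClawGym-Bench scoring).
    """
    p = score_pct / 100.0
    standard_error = math.sqrt(p * (1.0 - p) / n_tasks)
    half_width = z * standard_error * 100.0
    return score_pct - half_width, score_pct + half_width


low, high = pass_rate_interval(56.82, 200)
print(f"approx. 95% interval: {low:.1f} to {high:.1f}")
```

Under these assumptions the interval spans roughly plus or minus 7 points, so a narrow lead over a larger model on the same 200 tasks may not be distinguishable from sampling noise.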
HKR breakdown
hook ✓ · knowledge ✓ · resonance ✓