FEATURED · arXiv · cs.CL · atom · EN · 17:59 · 05·22
→SkillOpt: Executive Strategy for Self-Evolving Agent Skills
SkillOpt uses a separate optimizer model to convert scored rollouts into bounded text edits and accepts only changes that improve validation. It ranks best or tied across all 52 evaluated cells and improves on GPT-5.5's no-skill accuracy by 23.5 points in direct chat, 24.8 in Codex, and 19.1 in Claude Code.
#Agent #Fine-tuning #Benchmarking #SkillOpt
why featured
All three HKR checks (H/K/R) pass: the hook is self-evolving skills, the paper reports a 23.5-point gain across 52 evaluated cells, and agent builders care about automated skill updates. It stays in the 78–84 range because this is a single arXiv paper, not a major model or product release.
editor take
SkillOpt turns prompt tinkering into validated training; 52 cells is strong, but I want leakage checks and baseline implementations first.
sharp
SkillOpt’s sharp move is treating an agent skill as trainable external state, not dressing up reflection loops again. A separate optimizer model converts scored rollouts into bounded add/delete/replace edits, then accepts only edits that improve held-out validation. Deployment adds zero extra inference calls. That is much closer to an optimizer than GEPA, TextGrad, or EvoSkill-style text search.
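The accept-only-if-validation-improves loop described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `Edit` type, `apply_edit`, `optimize`, the `MAX_EDIT_CHARS` bound, and the keyword-based `toy_validate` scorer are all hypothetical names chosen for the example, and a fixed list of proposed edits stands in for the separate optimizer model that would read scored rollouts.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

MAX_EDIT_CHARS = 200  # assumed bound on the size of one edit (hypothetical value)


@dataclass(frozen=True)
class Edit:
    op: str         # "add", "delete", or "replace"
    index: int      # line position the edit targets
    text: str = ""  # new line for add/replace; unused for delete


def apply_edit(skill: List[str], edit: Edit) -> Optional[List[str]]:
    """Return an edited copy of the skill, or None if the edit is invalid."""
    if len(edit.text) > MAX_EDIT_CHARS:
        return None
    lines = list(skill)
    if edit.op == "add" and 0 <= edit.index <= len(lines):
        lines.insert(edit.index, edit.text)
    elif edit.op == "delete" and 0 <= edit.index < len(lines):
        del lines[edit.index]
    elif edit.op == "replace" and 0 <= edit.index < len(lines):
        lines[edit.index] = edit.text
    else:
        return None
    return lines


def optimize(
    skill: List[str],
    proposals: List[Edit],
    validate: Callable[[List[str]], float],
) -> Tuple[List[str], float]:
    """Keep an edit only if it strictly improves the held-out validation score."""
    best, best_score = list(skill), validate(skill)
    for edit in proposals:
        candidate = apply_edit(best, edit)
        if candidate is None:
            continue  # out-of-bounds or oversized edit: skip it
        score = validate(candidate)
        if score > best_score:  # ties and regressions are rejected
            best, best_score = candidate, score
    return best, best_score


# Toy stand-in for a validation set: fraction of required keywords present.
VALIDATION_KEYWORDS = ["units", "cite"]


def toy_validate(skill: List[str]) -> float:
    text = " ".join(skill).lower()
    return sum(k in text for k in VALIDATION_KEYWORDS) / len(VALIDATION_KEYWORDS)


skill = ["Answer concisely."]
proposals = [
    Edit("add", 1, "Always state units."),    # improves validation: accepted
    Edit("replace", 0, "Answer at length."),  # no gain: rejected
    Edit("add", 2, "Cite the source."),       # improves validation: accepted
    Edit("delete", 1),                        # lowers validation: rejected
]
final_skill, final_score = optimize(skill, proposals, toy_validate)
print(final_skill, final_score)
```

The result is plain text, which is consistent with the zero-extra-inference-call claim: nothing from the optimization loop has to run at deployment. In a real setup the validation split used for the accept gate would need to be disjoint from the reported test split.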
The numbers are hard to ignore: six benchmarks, seven target models, three harnesses, and best-or-tied results across all 52 evaluated cells. On GPT-5.5, it reports +23.5 points in direct chat, +24.8 inside Codex, and +19.1 inside Claude Code. My first checks would be validation/test separation and whether the human, Trace2Skill, GEPA, and EvoSkill baselines were implemented strongly. If those survive, this is a credible path to agent gains without touching weights.
HKR breakdown
hook ✓ · knowledge ✓ · resonance ✓