● P1 · arXiv · cs.CL · atom · EN · 23:54 · 04·13
→ From Plan to Action: How Well Do Agents Follow the Plan?
This paper analyzes 16,991 SWE-agent trajectories on SWE-bench Verified and Pro to measure how closely coding agents follow instructed plans. A standard plan improves issue resolution, periodic reminders reduce violations, and a weak plan hurts more than no plan. The snippet does not disclose the four LLM names or per-plan gains across the eight variants.
#Agent #Code #Benchmarking #SWE-agent
why featured
Strong HKR-H/K/R: it turns a familiar agent failure mode into a measurable result across 16,991 SWE-agent traces and adds a practical claim: bad plans can hurt more than no plan. Not P1 because the abstract leaves the four model names and per-variant gains undisclosed.
editor take
The paper analyzes 16,991 SWE-agent runs and lands on an uncomfortable point: many agents are not executing plans, just replaying memorized workflows.
sharp
The paper measures plan compliance across 16,991 SWE-agent trajectories, and my read is pretty blunt: this exposes a hole in how we evaluate coding agents. A solved task does not mean the agent followed the instructed strategy. The abstract already gives three hard signals: a standard plan improves resolution, periodic reminders reduce violations and raise success, and a weak plan hurts more than no plan. That alone knocks down a lot of the current “agents can autonomously plan” narrative.
I’ve thought for a while that SWE-bench-style evaluation mixes up two different things: “can patch this benchmark issue” and “can work through a disciplined problem-solving process.” Those are not the same skill. A lot of code agents already have an internalized workflow from training: navigate repo, find likely files, attempt a patch, run some validation, iterate. That can come from code corpora, issue discussions, prior agent traces, and benchmark leakage in the broad sense. The abstract says that without an explicit plan, agents fall back to workflows internalized during training, and that tracks with what many teams have seen since the ReAct and SWE-agent wave: the trajectory looks deliberate, but a lot of it is just habit.
The most interesting claim here is that adding extra task-relevant phases early in the plan can degrade performance. I buy that. Recent coding models are usually responsive to high-level structure, but they often resist overly rigid stage constraints when those constraints conflict with the model’s learned solve order. You get a weird failure mode: the agent half-follows the plan, burns tool calls, and still reverts to its preferred path. I’ve seen adjacent behavior in internal agent evals discussed over the last year: checklists make logs look cleaner, while pass rates stay flat or fall. I haven’t read the full paper yet, so I can’t verify whether they separate “better-looking trajectory” from “genuinely better execution” in a rigorous way.
I do have two pushbacks. First, the abstract withholds the four LLM names and the per-variant gains across eight plan conditions. That is a big omission. If most of the lift comes from weaker models, then the story is “plans compensate for capability gaps.” If stronger frontier models also gain consistently, then the story is larger: plan-following itself is undertrained. Those are different conclusions. Second, SWE-agent runs in a fairly structured environment with a clear task shape: inspect, reproduce, patch, validate. I would not automatically extend this result to browser agents, research agents, or multi-agent systems where phase boundaries are much fuzzier.
Honestly, the paper matters because it redirects the problem. The issue is not just writing better plans. The issue is that current training recipes often assume the model already knows how to obey a plan, and prompts are just there to specify one. This paper suggests that assumption is weak. That lines up with the broader process-supervision debate from the last year: if you only reward the final patch or benchmark pass, models will learn shortcuts, not disciplined execution. If plan compliance becomes measurable, agent evaluation starts moving from outcome-only scoring toward auditable process. I’m not ready to call this a methods breakthrough from the snippet alone. The missing details are too important. Still, it puts a neglected question on the table in a way the field has needed for a while.
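To make the outcome-versus-process distinction concrete, here is a minimal sketch of scoring plan compliance separately from task resolution. It is not the paper's metric, which the snippet does not describe. The `Trajectory` type, the per-step phase labels, and the violation rules (off-plan step, step back to an earlier phase, skipped phase) are all assumptions made for illustration; the phase names reuse the inspect, reproduce, patch, validate shape mentioned above.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    # Hypothetical record: one phase label per agent step, plus the outcome.
    phases: list[str]
    resolved: bool


def plan_violations(plan: list[str], phases: list[str]) -> int:
    """Count off-plan steps, backward steps, and plan phases never visited.

    These rules are an illustrative assumption, not the paper's definition.
    """
    order = {name: i for i, name in enumerate(plan)}
    violations = 0
    furthest = -1
    visited = set()
    for phase in phases:
        idx = order.get(phase)
        if idx is None:
            violations += 1  # step does not belong to any planned phase
            continue
        visited.add(phase)
        if idx < furthest:
            violations += 1  # agent went back to an earlier phase
        else:
            furthest = idx
    violations += len(set(plan) - visited)  # planned phases that were skipped
    return violations


def summarize(plan: list[str], runs: list[Trajectory]) -> dict[str, int]:
    """Report outcome and process counts side by side."""
    compliant = [plan_violations(plan, r.phases) == 0 for r in runs]
    return {
        "runs": len(runs),
        "resolved": sum(r.resolved for r in runs),
        "compliant": sum(compliant),
        "resolved_but_noncompliant": sum(
            r.resolved and not ok for r, ok in zip(runs, compliant)
        ),
    }


PLAN = ["inspect", "reproduce", "patch", "validate"]
RUNS = [
    Trajectory(["inspect", "reproduce", "patch", "validate"], True),
    Trajectory(["inspect", "patch", "validate"], True),   # skips reproduce
    Trajectory(["patch", "inspect", "validate"], False),  # out of order
]

if __name__ == "__main__":
    print(summarize(PLAN, RUNS))
```

The `resolved_but_noncompliant` count is the number an outcome-only score hides: runs that pass the benchmark while ignoring the instructed plan.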
HKR breakdown
hook ✓ · knowledge ✓ · resonance ✓
SCORE 86 · H1·K1·R1