FEATURED · arXiv · cs.LG · atom · EN · 04:00 · 05·26
→When Does Multi-Agent RL Improve LLM Workflows? Workflow, Scale, and Policy-Sharing Tradeoffs
The paper compares Shared-Policy and Isolated-Policy training across Eval-Opt, Voting, and Orch-Workers workflows, math and code tasks, and 0.6B, 1.7B, and 4B models, finding gains depend on workflow, task, and scale rather than policy sharing alone.
#Agent #Reasoning #Code #Research release
why featured
HKR-H/K/R pass, but the article discloses the setup, not the key findings or reproducible artifact. This clears the featured floor at 72, not the 78+ research band.
editor take
Multi-agent RL is not free accuracy from extra roles; this paper pins failures on workflow topology and gradient routing.
sharp
Multi-agent RL’s failure mode is not the policy-sharing switch; it is where the workflow sends training pressure. The paper tests Eval-Opt, Voting, and Orch-Workers across math and code, with 0.6B, 1.7B, and 4B models. The useful claim is restrained: multi-agent RL often beats the base model, but the gain is tied to workflow, task, and scale.
The mechanism is the part practitioners should steal. Isolated-Policy reaches higher peak accuracy, then more often falls into a terminal accuracy cliff. In Voting and Orch-Workers, parallel same-role agents on shared prompts amplify per-role gradients. Shared-Policy does not fix stability; dominant roles capture the shared policy through asymmetric per-step gradient mass. I like this because it cuts against the lazy “more agents learn better collaboration” story.
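A toy way to see the "gradient mass" point, as a rough sketch rather than the paper's method: assume a Shared-Policy update sums a token-level loss over every agent in the step and that each trained token carries equal weight. Each role's share of the update is then just its agent count times its tokens. The `Role` class, `gradient_mass_share`, and all the numbers below are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Role:
    name: str
    n_agents: int          # parallel agents playing this role in one step (assumed)
    tokens_per_agent: int  # trained tokens each agent contributes per step (assumed)


def gradient_mass_share(roles):
    """Fraction of a summed token-level loss that each role contributes.

    Toy assumption: every trained token carries equal gradient weight, so a
    role's mass is simply n_agents * tokens_per_agent.
    """
    mass = {r.name: r.n_agents * r.tokens_per_agent for r in roles}
    total = sum(mass.values())
    if total == 0:
        raise ValueError("no trained tokens in this step")
    return {name: m / total for name, m in mass.items()}


# Hypothetical Orch-Workers step: one orchestrator, four parallel workers.
orch_workers = [
    Role("orchestrator", n_agents=1, tokens_per_agent=100),
    Role("worker", n_agents=4, tokens_per_agent=100),
]
shares = gradient_mass_share(orch_workers)
print(shares)
```

Under these made-up numbers the worker role supplies four fifths of the per-step mass, which is the kind of asymmetry the summary describes as a dominant role capturing the shared policy. Real per-token gradient norms are not uniform, so treat this as bookkeeping intuition only.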
HKR breakdown
hook ✓ knowledge ✓ resonance ✓