04:00
5d ago
Outcome-Based RL Provably Leads Transformers to Reason, but Only With the Right Data
The paper proves that single-layer Transformers trained with outcome-only RL can learn an iterative, vertex-by-vertex traversal algorithm on a synthetic graph task, but only if the training distribution contains enough simple examples (those requiring fewer reasoning steps) for policy-gradient learning to remain feasible.
75
SCORE
H1·K1·R1