FEATURED · 最佳拍档 (BestPartners) · atom · ZH · 23:00 · 04·11
Breaking RLHF scaling bottlenecks: DeepMind raises data efficiency 10x with information-directed exploration
A Google DeepMind team reports that online RLHF plus information-directed exploration on Gemma 9B reaches about a 55% win rate with under 20k preference labels, versus about 200k for offline RLHF. The post describes four algorithms: offline, periodic, online, and information-directed exploration. Online training uses batches of 64 prompts with 16 sampled responses per prompt, and the ENN head adds under 5% parameters. The takeaway is methodological, not that RLHF failed. The post also says the results use feedback simulated by Gemini 1.5 Pro, and that the 1000x gain is an extrapolation toward 1M labels.
#Alignment #Fine-tuning #Reasoning #Google DeepMind
why featured
HKR-H/K/R all pass: the 10x label-efficiency claim is a strong hook, and the post includes concrete setup details. I kept it at 77 because this is a secondary video summary, feedback is simulated with Gemini 1.5 Pro, and the 1000x figure is an extrapolation.
editor take
DeepMind got Gemma 9B to roughly the win rate of offline RLHF at 200k labels while using under 20k. This does not bury RLHF; it exposes how much labeling budget low-information feedback pipelines waste.
sharp
DeepMind cut Gemma 9B’s preference-label demand from about 200k to under 20k for roughly the same win-rate level. My read is simple: this is not RLHF being “saved” by one trick; it is the field finally fixing two old mistakes at once—training on stale preference data and asking humans to label pairs that carry very little information.
The four-stage ladder in the article matters because it isolates where the gain comes from. Offline RLHF collects data once, trains a reward model, then optimizes policy. Periodic RLHF refreshes that loop in chunks. Online RLHF updates reward model and policy every batch. Information-directed exploration adds uncertainty-aware querying with an ENN-style reward head. The useful part is not the slogan about 10x efficiency. The useful part is that the setup is concrete enough to inspect: batches of 64 prompts, 16 sampled responses per prompt, and an ENN head that adds under 5% parameters. That is the difference between an alignment paper and a motivational poster.
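To make the ladder concrete, here is a minimal sketch of how the refresh schedules differ. The batch shape (64 prompts, 16 responses each) comes from the article; the function name `refresh_batches`, the mode strings, and the default `period` are my own illustrative assumptions, not anything taken from the paper.

```python
# Batch shape reported in the article.
PROMPTS_PER_BATCH = 64
RESPONSES_PER_PROMPT = 16


def refresh_batches(mode: str, num_batches: int, period: int = 10) -> list[int]:
    """Return the batch indices at which fresh preference data is collected
    from the current policy and the reward model is retrained.

    The mode names and `period` are hypothetical labels for this sketch.
    """
    if num_batches <= 0:
        return []
    if mode == "offline":
        # Collect once up front, train the reward model, then optimize.
        return [0]
    if mode == "periodic":
        # Refresh the collect/train loop in chunks.
        return list(range(0, num_batches, period))
    if mode in ("online", "info_directed"):
        # Update reward model and policy every batch.
        return list(range(num_batches))
    raise ValueError(f"unknown mode: {mode}")


# Responses sampled per batch under the article's settings.
responses_per_batch = PROMPTS_PER_BATCH * RESPONSES_PER_PROMPT
```

In this framing, information-directed exploration keeps the online schedule and changes only which comparison is sent out for a label.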
I’ve thought for a while that the anti-RLHF narrative got ahead of the evidence in 2024 and 2025. A lot of teams saw weak scaling from more preference data and concluded that preference learning had hit a ceiling. I never fully bought that. In many stacks, the real problem was that data collection stayed off-policy for too long, the reward model learned from an older policy distribution, and annotators spent time comparing easy pairs that the model already separated well. This paper basically quantifies that common-sense complaint: preference labels are not all equally valuable.
My main pushback is the “1000x gain” framing. The article itself says that number is an extrapolation toward 1 million labels, not a measured result. That matters. Extrapolations on log-scaled curves are fragile because they assume the slope holds after the regime changes. Two failure modes show up all the time: reward-model error compounds on harder examples, and online policy updates change the response distribution enough that yesterday’s uncertainty estimate stops being calibrated. We have seen too many big claims in AI that shrink once the curve bends. So I would keep the observed claim and quarantine the projected one.
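A back-of-envelope check shows how much the projection leans on the slope holding. Assume, purely for illustration, that both methods' win rates are linear in log10(labels). That log-linear form and the function names below are my assumptions, not the article's. Under it, the efficiency gain grows as a power of the label count, with an exponent equal to the slope ratio minus one.

```python
import math


def required_slope_ratio(gain_start: float, gain_end: float,
                         n_start: float, n_end: float) -> float:
    """Slope ratio (efficient / baseline) needed for the gain to grow from
    gain_start at n_start labels to gain_end at n_end labels, ASSUMING both
    win-rate curves are linear in log10(labels)."""
    return 1.0 + math.log(gain_end / gain_start) / math.log(n_end / n_start)


def projected_gain(gain_start: float, n_start: float, n_end: float,
                   slope_ratio: float) -> float:
    """Gain at n_end under the same log-linear assumption."""
    return gain_start * (n_end / n_start) ** (slope_ratio - 1.0)


# Article figures: about 10x at roughly 20k labels, 1000x projected at 1M.
ratio_needed = required_slope_ratio(10, 1000, 20_000, 1_000_000)  # about 2.18

# Hypothetical what-if: the slope advantage drops to 1.5 past 20k labels.
gain_if_curve_bends = projected_gain(10, 20_000, 1_000_000, 1.5)  # about 71x
```

Under these toy assumptions, the 1000x figure needs the efficient curve to stay about 2.18 times as steep as the baseline across the whole range. If the advantage sags to 1.5, the same arithmetic gives roughly 71x.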
The other caveat is even bigger: the feedback comes from a Gemini 1.5 Pro simulator, not from large-scale human raters. That makes the experiment cheaper, cleaner, and more reproducible. It also narrows what the result proves. If the judge shares stylistic preferences or hidden biases with the training loop, a higher win rate can partly mean “better at pleasing this evaluator.” This is not a new problem. Reward hacking and judge overfitting have been recurring issues across alignment work, and cross-judge robustness is usually where the shiny result gets less shiny. I couldn’t find evidence in the provided text that they fully solved that here.
The “affirmative nudge” detail is more important than it sounds. Adding a small positive offset to the policy gradient target is basically a stability patch for online RLHF. That sounds mundane, but a lot of online RLHF systems fail for mundane reasons. If the reward signal is too harsh around indifference, the policy can spiral into collapse after a few bad batches. A cheap mechanism that stops the policy from tanking is not cosmetic. It addresses one of the biggest reasons online RLHF has looked better on paper than in practice.
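The provided text does not give the formula, so the following is only one hypothetical reading of "a small positive offset": a target centered on indifference (preference probability 0.5) with a constant added. The names `policy_gradient_target` and `share_pushed_down`, the `nudge` value, and the sample batch are all assumptions made for illustration.

```python
def policy_gradient_target(pref_prob: float, nudge: float = 0.0) -> float:
    """Hypothetical weight on a sampled response's log-probability gradient.

    pref_prob is the reward model's probability that the sampled response
    wins its comparison. Centering at 0.5 means indifference gives zero
    push; `nudge` shifts that point slightly toward reinforcement.
    """
    return (pref_prob - 0.5) + nudge


def share_pushed_down(pref_probs: list[float], nudge: float = 0.0) -> float:
    """Fraction of a batch whose target is negative (probability lowered)."""
    targets = [policy_gradient_target(p, nudge) for p in pref_probs]
    return sum(t < 0 for t in targets) / len(targets)


# Made-up batch clustered around indifference, for illustration only.
batch = [0.46, 0.48, 0.49, 0.51, 0.52, 0.70]

without_nudge = share_pushed_down(batch)           # half the batch is penalized
with_nudge = share_pushed_down(batch, nudge=0.05)  # none of it is
```

The point of the toy batch is that near-indifferent samples flip from being penalized to being mildly reinforced, which is the kind of effect that would keep a few noisy batches from dragging the policy down.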
The ENN piece also fits a broader pattern. Active learning has long taught us that selecting the most informative examples beats random labeling. The hard part in LLM alignment is getting uncertainty estimates that are cheap and stable enough to use online. DeepMind’s choice to keep the backbone fixed for the uncertainty heads and add relatively small head parameters looks like an engineering compromise, not a purity play. I like that. If uncertainty estimation costs too much, you save annotation budget and lose it back in compute.
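As a rough sketch of uncertainty-aware querying, the code below scores every pair among a prompt's sampled responses by how much a set of small reward heads disagree, then picks the pair with the most disagreement. The linear heads, the Bradley-Terry preference form, and variance as the acquisition score are stand-ins I chose for illustration; the provided text does not spell out the paper's actual selection rule, and the feature vectors here are random placeholders for a frozen backbone's outputs.

```python
import itertools
import math
import random
import statistics


def head_reward(weights: list[float], features: list[float]) -> float:
    """Reward from one small linear head on fixed backbone features."""
    return sum(w * x for w, x in zip(weights, features))


def preference_prob(r_a: float, r_b: float) -> float:
    """Bradley-Terry probability that response a is preferred over b."""
    return 1.0 / (1.0 + math.exp(-(r_a - r_b)))


def most_informative_pair(heads, response_features):
    """Return ((i, j), score) for the pair whose preference probability
    varies most across heads (a simple disagreement proxy)."""
    best_pair, best_score = None, -1.0
    indices = range(len(response_features))
    for i, j in itertools.combinations(indices, 2):
        probs = [
            preference_prob(head_reward(h, response_features[i]),
                            head_reward(h, response_features[j]))
            for h in heads
        ]
        score = statistics.pvariance(probs)
        if score > best_score:
            best_pair, best_score = (i, j), score
    return best_pair, best_score


# Placeholder data: 10 hypothetical heads, 16 sampled responses, 8 features.
rng = random.Random(0)
heads = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(10)]
responses = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(16)]
pair, score = most_informative_pair(heads, responses)
```

With 16 responses this checks 120 pairs per prompt using only the small heads, which is why keeping the backbone fixed matters for cost: the expensive forward pass happens once per response, not once per head.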
Still, I would not assume clean transfer from Gemma 9B to frontier-scale models. A 9B model is large enough to be meaningful, but it is not a Gemini-class deployment environment. As models get larger, response spaces widen, distribution drift gets nastier, and “sample 16 responses and choose the most informative pair” may stop being enough coverage. The paper’s mechanism scales conceptually. Whether it scales economically and robustly is a separate question.
So my take is that this work upgrades RLHF by fixing the sampling policy around feedback, not by overturning alignment doctrine. The industry spent years pouring money into bigger preference datasets while underinvesting in three basic questions: which comparisons deserve a label, when the reward model should refresh, and how uncertainty should guide querying. DeepMind put those pieces together in one system and gave enough operational detail to take seriously. The headline language about “breaking the RLHF scaling bottleneck” feels too aggressive for where the evidence stands. If this holds with real humans, across multiple judges, and on larger models, then we can talk about a bottleneck moving. For now, I see a strong paper that puts online RLHF back in the serious-methods bucket.
HKR breakdown
hook ✓ · knowledge ✓ · resonance ✓
SCORE 83
H1·K1·R1