arXiv · cs.LG · atom · EN · 04:00 · 05·12
→Navigating LLM Valley: From AdamW to Memory-Efficient and Matrix-Based Optimizers
This survey organizes LLM optimizer research into 7 groups, spanning AdamW, memory-efficient variants, curvature-aware methods, low-rank approaches, and matrix-based optimizers such as Muon, and it argues that benchmarks should report convergence, stability, memory overhead, wall-clock efficiency, token efficiency, and implementation complexity together.
#Fine-tuning #Inference-opt #Benchmarking #Research release
why featured
HKR-K is solid: the 7 optimizer classes and six benchmark dimensions add usable structure. HKR-R is limited to training-infra readers, and the numerical-optimization topic keeps it in the 60-71 band rather than featured.
editor take
This survey splits LLM optimizers into 7 buckets; AdamW-to-Muon claims now need memory, stability, and wall-clock receipts.
HKR breakdown
hook — · knowledge ✓ · resonance ✓