NEW · arXiv · cs.LG · atom · EN · 04:00 · 06·09
→Revisiting Training Scale: An Empirical Study of Token Count, Power Consumption, and Parameter Efficiency
The study trains a 1.1B-parameter TinyLlama while holding the GPU, architecture, optimizer settings, and epoch count fixed, and finds that parameter efficiency declines strictly monotonically as the training token count rises from 500K to 1M to 2M.
#Benchmarking #Inference-opt #TinyLlama #Research release
why featured
HKR-K is solid: fixed setup, token counts, and a testable monotonic-efficiency claim. HKR-R comes from training cost, but HKR-H is weak and the 500K–2M-token scale keeps it in the 60–71 band.
editor take
TinyLlama 1.1B loses parameter efficiency as training tokens rise from 500K to 1M to 2M; the scale is tiny, but energy belongs in scaling tables.
HKR breakdown
hook – · knowledge ✓ · resonance ✓