[Submitted on 26 Oct 2025]
Adaptive Exponential Moving Average Mixing: Analysis of a Negative Result in Language Model Optimization
Abstract: We present a detailed analysis of Adaptive Exponential Moving Average Mixing, an optimization approach for language model training that combines fast and slow momentum terms with adaptive scaling based on gradient variance. While the proposed method achieved a validation loss of 5.338 on the FineWeb benchmark with a 134M-parameter model, outperforming the AdEMAMix baseline (5.4239), it fell short of the AdamW baseline (4.9266). Through extensive ablation studies and analysis, we identify key limitations of the approach and offer insights into the challenges of developing novel optimization methods for language models.
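The abstract describes the update as a mixture of a fast and a slow gradient EMA whose mixing weight is modulated by a gradient-variance signal. The sketch below is a minimal, non-authoritative illustration of that idea only: the class name `AdaptiveEMAMix`, the specific mixing formula, the Adam-style second-moment normalizer, and all hyperparameter values are assumptions, since the paper's exact update rule is not given in the abstract.

```python
import torch


class AdaptiveEMAMix(torch.optim.Optimizer):
    """Illustrative AdEMAMix-style optimizer: a fast EMA (beta1) and a slow
    EMA (beta3) of the gradient are mixed, with the slow term's weight
    down-scaled where the running gradient variance is large. The exact
    update rule from the paper is not known; this is an assumed form."""

    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999, 0.9999),
                 alpha=5.0, eps=1e-8, weight_decay=0.0):
        defaults = dict(lr=lr, betas=betas, alpha=alpha, eps=eps,
                        weight_decay=weight_decay)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            beta1, beta2, beta3 = group["betas"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                g = p.grad
                state = self.state[p]
                if not state:
                    state["m_fast"] = torch.zeros_like(p)  # fast grad EMA
                    state["m_slow"] = torch.zeros_like(p)  # slow grad EMA
                    state["v"] = torch.zeros_like(p)       # EMA of grad^2

                m_fast, m_slow, v = state["m_fast"], state["m_slow"], state["v"]
                m_fast.mul_(beta1).add_(g, alpha=1 - beta1)
                m_slow.mul_(beta3).add_(g, alpha=1 - beta3)
                v.mul_(beta2).addcmul_(g, g, value=1 - beta2)

                # Adaptive mixing (assumed form): shrink the slow-momentum
                # weight where the normalized gradient variance is large.
                rel_var = v / (m_fast.pow(2) + group["eps"])
                mix = group["alpha"] / (1.0 + rel_var)

                # Adam-style normalization; bias correction omitted for brevity.
                update = (m_fast + mix * m_slow) / (v.sqrt() + group["eps"])
                if group["weight_decay"] != 0:
                    p.mul_(1 - group["lr"] * group["weight_decay"])
                p.add_(update, alpha=-group["lr"])
```

The only "adaptive" ingredient here is the `mix` factor; the original AdEMAMix optimizer instead uses a fixed mixing coefficient with a warm-up schedule, so the variance-based scaling above is purely a reading of the mechanism described in the abstract, not the authors' implementation.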
Submission history
[v1] Sun, 26 Oct 2025 02:14 UTC