[Submitted on 23 Oct 2025]
Stable Momentum Optimization for Language Models: Analysis of a Negative Result
Abstract: This paper presents a detailed analysis of StableMomentum, a momentum-based optimizer designed for training large language models. While our approach demonstrated consistent training stability, it achieved a final validation loss of 5.045 compared to the AdamW baseline of 4.927 on the FineWeb dataset using a 134M-parameter Qwen architecture. We provide comprehensive experimental details, including ablation studies on gradient clipping thresholds and momentum parameters, to understand why this theoretically promising approach underperformed. Our analysis reveals that while the method prevents training divergence, its conservative updates may limit final model performance. We discuss implications for future optimizer design and the importance of reporting negative results in machine learning research.
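The abstract does not state StableMomentum's actual update rule, so the following is only an illustrative sketch of the general family it describes: a momentum-based update combined with gradient clipping, the two ingredients the ablations cover. The class name `ClippedMomentumSGD` and the hyperparameter names `lr`, `momentum`, and `clip_threshold` are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch only: the paper's StableMomentum update rule is not given
# in the abstract. This shows a generic momentum optimizer with per-parameter
# gradient clipping, the two components the ablation studies mention.
import torch


class ClippedMomentumSGD(torch.optim.Optimizer):
    """Assumed design: SGD with momentum plus gradient-norm clipping."""

    def __init__(self, params, lr=1e-3, momentum=0.9, clip_threshold=1.0):
        defaults = dict(lr=lr, momentum=momentum, clip_threshold=clip_threshold)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()
        for group in self.param_groups:
            lr = group["lr"]
            beta = group["momentum"]
            clip = group["clip_threshold"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                g = p.grad
                # Clip the per-parameter gradient norm before it enters the
                # momentum buffer: this bounds each update (stability), but the
                # resulting conservative steps are the kind of behavior the
                # paper's analysis associates with weaker final loss.
                norm = g.norm()
                if norm > clip:
                    g = g * (clip / (norm + 1e-12))
                buf = self.state[p].setdefault(
                    "momentum_buffer", torch.zeros_like(p)
                )
                buf.mul_(beta).add_(g)
                p.add_(buf, alpha=-lr)
        return loss
```

Under this framing, the paper's ablations over gradient clipping thresholds and momentum parameters would correspond to sweeping `clip_threshold` and `momentum`; how closely this matches the actual StableMomentum algorithm is not determinable from the abstract.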
Submission history
[v1] Thu, 23 Oct 2025 18:03 UTC