[Submitted on 2 Nov 2025]
Analysis of Dual Momentum Optimization for Language Models: A Negative Result Study
Abstract: This paper presents a thorough investigation of dual momentum optimization for transformer-based language models, combining empirical evaluation with diagnostic analysis. While our method converged (final loss: 9.375), it significantly underperformed the Muon (3.5369) and AdamW (4.9266) baselines. We provide extensive analysis of this negative result, examining potential causes through hyperparameter sensitivity tests, gradient behavior analysis, and comparisons with similar approaches from the recent literature. Our findings suggest that simple dual momentum schemes may be insufficient for modern language model optimization without additional adaptive mechanisms.
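The abstract does not specify the update rule. As a point of reference only, a minimal sketch of one plausible "dual momentum" scheme, assuming two exponential-moving-average gradient buffers with different decay rates blended into a single step; the function name and all hyperparameters below are hypothetical, not taken from the paper:

```python
import numpy as np

def dual_momentum_step(param, grad, m_fast, m_slow,
                       lr=1e-2, beta_fast=0.9, beta_slow=0.99, mix=0.5):
    # Hypothetical rule: maintain two EMAs of the gradient with
    # different decay rates (a "fast" and a "slow" momentum buffer).
    m_fast = beta_fast * m_fast + (1.0 - beta_fast) * grad
    m_slow = beta_slow * m_slow + (1.0 - beta_slow) * grad
    # Blend the two buffers into a single descent direction.
    step = mix * m_fast + (1.0 - mix) * m_slow
    return param - lr * step, m_fast, m_slow

# Toy usage: minimize f(x) = ||x||^2, whose gradient is 2x.
x = np.array([3.0, -2.0])
m_f = np.zeros_like(x)
m_s = np.zeros_like(x)
for _ in range(500):
    g = 2.0 * x
    x, m_f, m_s = dual_momentum_step(x, g, m_f, m_s)
print(x)  # approaches [0, 0]
```

Note that, unlike AdamW, this sketch has no per-parameter second-moment scaling, which is consistent with the abstract's conclusion that such schemes may need "additional adaptive mechanisms."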
Submission history
[v1] Sun, 2 Nov 2025 05:53 UTC