[Submitted on 29 Oct 2025]
SpectraMix: Analyzing the Failure Modes of a Dual Momentum Optimizer for Language Models
Abstract: This paper presents a thorough investigation of SpectraMix, an optimizer that combines fast and slow exponential moving averages (EMAs) of the gradient with adaptive mixing coefficients for language model training. Despite promising theoretical properties and successful ablation tests (loss: 11.93), SpectraMix significantly underperformed AdamW (loss: 4.93) in full-scale evaluation (loss: 12.00). We provide complete implementation details, theoretical analysis, and extensive diagnostic experiments to explain this performance gap. Our findings suggest that while dual-momentum strategies appear theoretically appealing, their practical benefits for transformer optimization may be limited by complex interactions between gradient statistics across layers. This work contributes a cautionary case study in optimizer development and offers concrete recommendations for evaluating novel optimization methods.
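The abstract does not give the exact update rule, but a dual-momentum step of the kind described (fast and slow gradient EMAs blended by an adaptive coefficient) might be sketched as follows. The agreement-based mixing rule, the hyperparameter values, and the function name `spectramix_step` are illustrative assumptions, not the paper's actual method:

```python
import numpy as np

def spectramix_step(param, grad, state, lr=1e-3,
                    beta_fast=0.9, beta_slow=0.99, eps=1e-8):
    """One hypothetical dual-momentum update: mix a fast and a slow
    gradient EMA with a coefficient derived from their agreement."""
    # Fast and slow exponential moving averages of the gradient.
    state["m_fast"] = beta_fast * state["m_fast"] + (1 - beta_fast) * grad
    state["m_slow"] = beta_slow * state["m_slow"] + (1 - beta_slow) * grad
    # Hypothetical adaptive mixing: weight the fast EMA more strongly
    # when the two momenta point in similar directions (cosine similarity).
    num = float(np.dot(state["m_fast"].ravel(), state["m_slow"].ravel()))
    den = np.linalg.norm(state["m_fast"]) * np.linalg.norm(state["m_slow"]) + eps
    alpha = 0.5 * (1.0 + num / den)  # maps similarity [-1, 1] to [0, 1]
    update = alpha * state["m_fast"] + (1 - alpha) * state["m_slow"]
    return param - lr * update
```

On a simple convex objective this update behaves like ordinary momentum; the abstract's point is precisely that such behavior need not transfer to transformer training, where per-layer gradient statistics interact in more complex ways.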
Submission history
[v1] Wed, 29 Oct 2025 03:53 UTC