[Submitted on 5 Nov 2025]
Multi-Scale Adaptive Momentum: A Novel Optimizer for Transformer Language Models
Abstract: We present Multi-Scale Adaptive Momentum (MSAM), a novel optimizer that combines multiple momentum scales with layer-wise adaptation for Transformer training. MSAM automatically adjusts momentum weights and learning rates based on gradient statistics and layer type, applying more aggressive updates to attention layers while maintaining stability in embedding and normalization layers. Extensive experiments on the FineWeb benchmark demonstrate MSAM's advantages: (1) it achieves a validation loss of 4.860 versus AdamW's 4.927, (2) it maintains stable training dynamics, and (3) it is particularly effective for attention layers. We provide theoretical justification for the multi-momentum approach and validate it through comprehensive ablations.
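
The abstract names the ingredients (several momentum time scales, weights adapted from gradient statistics, and layer-type-dependent learning rates) but not the update rule itself. The PyTorch sketch below is one plausible reading under those assumptions only: the betas, the cosine-similarity weighting of the momentum buffers, the layer-name heuristics, and the scaling factors are illustrative stand-ins, not the authors' actual MSAM method.

    # Illustrative sketch of a multi-momentum optimizer with layer-type
    # learning-rate scaling. All constants and heuristics are assumptions;
    # the paper's abstract does not specify the MSAM update rule.
    import torch


    class MultiMomentumSketch(torch.optim.Optimizer):
        def __init__(self, named_params, lr=3e-4, betas=(0.8, 0.9, 0.99), eps=1e-8):
            params = []
            for name, p in named_params:
                # Hypothetical layer-type scaling: larger steps for attention
                # weights, smaller ones for embedding and normalization params.
                if "attn" in name or "attention" in name:
                    scale = 1.2
                elif "embed" in name or "norm" in name:
                    scale = 0.5
                else:
                    scale = 1.0
                params.append({"params": [p], "lr_scale": scale})
            super().__init__(params, dict(lr=lr, betas=betas, eps=eps))

        @torch.no_grad()
        def step(self, closure=None):
            for group in self.param_groups:
                betas, eps = group["betas"], group["eps"]
                lr = group["lr"] * group["lr_scale"]
                for p in group["params"]:
                    if p.grad is None:
                        continue
                    g = p.grad
                    state = self.state[p]
                    if not state:
                        state["moms"] = [torch.zeros_like(p) for _ in betas]
                        state["var"] = torch.zeros_like(p)
                    # One momentum buffer per time scale (one per beta).
                    for m, beta in zip(state["moms"], betas):
                        m.mul_(beta).add_(g, alpha=1 - beta)
                    # Weight each scale by its agreement with the current gradient,
                    # a stand-in for the gradient-statistics adaptation.
                    sims = torch.stack([
                        torch.cosine_similarity(m.flatten(), g.flatten(), dim=0).clamp(min=0.0)
                        for m in state["moms"]
                    ])
                    weights = sims / (sims.sum() + eps)
                    update = sum(w * m for w, m in zip(weights, state["moms"]))
                    # Adam-style second-moment normalization for stability.
                    state["var"].mul_(0.999).addcmul_(g, g, value=0.001)
                    p.addcdiv_(update, state["var"].sqrt().add_(eps), value=-lr)

As a usage illustration, MultiMomentumSketch(model.named_parameters()) would pick up the layer-type scales from parameter names such as "self_attn.*" or "norm1.weight" in a standard torch.nn.TransformerEncoderLayer; the name-matching rules are an assumption of this sketch.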
Submission history
[v1] Wed, 5 Nov 2025 14:15 UTC