[Submitted on 22 Oct 2025]
Adaptive Momentum Optimization for Language Models: A Hybrid Approach
Abstract: We present an adaptive momentum optimizer that combines Nesterov momentum with a smooth learning-rate warmup and decoupled weight decay. While modern optimizers such as AdamW dominate deep learning practice, opportunities remain for improvement in their momentum handling and adaptation mechanisms. Our method integrates three key components: (1) Nesterov momentum for improved gradient direction estimates, (2) a smooth square-root warmup schedule for stable early training, and (3) decoupled weight decay following recent best practice. Experiments on a 134M-parameter transformer show that our method achieves competitive performance (validation loss 5.344), though it falls short of AdamW (4.927). We analyze the training dynamics and discuss implications for future optimizer design.
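The abstract names the three components but gives no implementation details, so the following is a minimal sketch of how Nesterov momentum, a square-root warmup, and decoupled (AdamW-style) weight decay might be combined in a single update step. The class name, hyperparameter defaults, and warmup length are assumptions for illustration, not the authors' code.

```python
import math
import torch
from torch.optim import Optimizer


class NesterovSqrtWarmupWD(Optimizer):
    """Hypothetical sketch: Nesterov momentum + sqrt warmup + decoupled weight decay.

    Hyperparameter defaults are illustrative assumptions, not values from the paper.
    """

    def __init__(self, params, lr=1e-3, momentum=0.9, weight_decay=0.01, warmup_steps=1000):
        defaults = dict(lr=lr, momentum=momentum,
                        weight_decay=weight_decay, warmup_steps=warmup_steps)
        super().__init__(params, defaults)
        self._step_count = 0

    @torch.no_grad()
    def step(self, closure=None):
        loss = closure() if closure is not None else None
        self._step_count += 1
        for group in self.param_groups:
            # Smooth square-root warmup: scale grows as sqrt(t / warmup_steps), capped at 1.
            warmup = min(1.0, math.sqrt(self._step_count / group["warmup_steps"]))
            lr = group["lr"] * warmup
            mu = group["momentum"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                # Decoupled weight decay: shrink the weights directly, not via the gradient.
                p.mul_(1.0 - lr * group["weight_decay"])
                state = self.state[p]
                if "momentum_buffer" not in state:
                    state["momentum_buffer"] = torch.zeros_like(p)
                buf = state["momentum_buffer"]
                # Nesterov momentum: update the buffer, then apply a look-ahead step
                # of the form grad + mu * buf.
                buf.mul_(mu).add_(p.grad)
                p.add_(buf.mul(mu).add_(p.grad), alpha=-lr)
        return loss
```

In this sketch the warmup simply rescales the base learning rate each step, and the decay term follows the decoupled formulation of Loshchilov and Hutter (as in AdamW) by acting on the parameters rather than being folded into the gradient.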
Submission history
[v1] Wed, 22 Oct 2025 18:45 UTC