[Submitted on 31 Oct 2025]
Layer-Adaptive Dual Momentum: A Comprehensive Optimizer for Transformer Language Models
Abstract: We present Layer-Adaptive Dual Momentum (LADM), a novel optimizer that combines two momentum buffers with layer-wise learning-rate adaptation. In experiments on the FineWeb benchmark with a 134M-parameter transformer, LADM achieves a validation loss of 4.386, an 11% improvement over AdamW (4.927), while maintaining comparable memory efficiency. We provide a detailed analysis of the momentum dynamics and the sensitivity of the layer adaptation, and compare against state-of-the-art methods, including the Muon baseline (3.537). The paper includes complete implementation details, ablation studies, and a discussion of limitations to support reproducibility and future improvements.
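The abstract does not reproduce the update rule, so the following is only a minimal PyTorch-style sketch of what a dual-momentum, layer-adaptive step could look like. The class name, the hyperparameters (base_lr, beta_fast, beta_slow, mix), and the norm-ratio scaling used for the layer-wise adaptation are illustrative assumptions, not the paper's implementation.

```python
# Sketch of a dual-momentum optimizer with per-tensor (layer-wise) step scaling.
# Hyperparameter names and the scaling rule are assumptions for illustration.
import torch


class DualMomentumSketch(torch.optim.Optimizer):
    def __init__(self, params, base_lr=1e-3, beta_fast=0.9, beta_slow=0.99,
                 mix=0.5, eps=1e-8):
        defaults = dict(base_lr=base_lr, beta_fast=beta_fast,
                        beta_slow=beta_slow, mix=mix, eps=eps)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state[p]
                if not state:
                    # Two momentum buffers: a fast-decaying one and a slow one.
                    state["m_fast"] = torch.zeros_like(p)
                    state["m_slow"] = torch.zeros_like(p)
                m_fast, m_slow = state["m_fast"], state["m_slow"]
                m_fast.mul_(group["beta_fast"]).add_(p.grad, alpha=1 - group["beta_fast"])
                m_slow.mul_(group["beta_slow"]).add_(p.grad, alpha=1 - group["beta_slow"])
                # Blend the two buffers into a single update direction.
                update = group["mix"] * m_fast + (1 - group["mix"]) * m_slow
                # Layer-wise learning-rate adaptation (illustrative choice):
                # scale each tensor's step by the ratio of its parameter norm
                # to its update norm, in the spirit of LARS-style trust ratios.
                p_norm = p.norm().clamp(min=group["eps"])
                u_norm = update.norm().clamp(min=group["eps"])
                scale = (p_norm / u_norm).item()
                p.add_(update, alpha=-group["base_lr"] * scale)
```

Because each parameter tensor roughly corresponds to one layer's weight matrix, scaling the step per tensor is one common way to realize "layer-wise" adaptation; the optimizer otherwise plugs into the standard torch.optim loop (zero_grad, backward, step). The paper's actual adaptation rule may differ.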
Submission history
[v1] Fri, 31 Oct 2025 15:31 UTC