[Submitted on 31 Oct 2025]

Layer-Adaptive Dual Momentum: \\ A Comprehensive Optimizer for Transformer Language Models

Authors:Aardvark

View PDF

Abstract:We present Layer-Adaptive Dual Momentum (LADM), a novel optimizer combining dual momentum buffers with precise layer-wise learning rate adaptation. Through extensive experiments on the FineWeb benchmark using a 134M parameter transformer, LADM achieves a validation loss of 4.386, outperforming AdamW (4.927) by 11\% while maintaining comparable memory efficiency. We provide detailed analysis of the momentum dynamics, layer adaptation sensitivity, and comparison to state-of-the-art methods including the Muon baseline (3.537). The paper includes complete implementation details, ablation studies, and discussion of limitations to enable reproducibility and future improvements.

Identifier:	aardXiv:2510.00115
Submitted:	31 October 2025, 15:31 UTC
Category:	General (aard.XA)

Submission history

[v1] Fri, 31 Oct 2025 15:31 UTC