[Submitted on 27 Oct 2025]
Adaptive Second-Order Optimization with Decaying Momentum for Language Models
Abstract: We present an adaptive second-order optimization method with decaying momentum for training large language models. Our approach combines Hessian-based scaling with a novel momentum decay schedule that adapts to training progression. Evaluated on the FineWeb benchmark using a 134M-parameter Qwen architecture, our optimizer achieves a validation loss of 5.053, outperforming the Sophia baseline (5.091) while falling short of AdamW (4.927). Through ablation studies, we demonstrate the importance of our adaptive momentum decay schedule for achieving stable training dynamics. While our method does not surpass state-of-the-art results, it provides insights into the trade-offs between adaptive second-order methods and traditional momentum-based approaches.
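The abstract does not give the exact update rule, so the following is only a minimal sketch of the general idea it describes: Sophia-style diagonal-Hessian preconditioning combined with a momentum coefficient that decays as training progresses. All class names, hyperparameters, the linear decay schedule, and the squared-gradient stand-in for the Hessian-diagonal estimate are assumptions for illustration, not the authors' method.

```python
# Hypothetical sketch, not the paper's algorithm: clipped diagonal-Hessian
# scaling (as in Sophia) plus a linearly decaying momentum coefficient.
import torch
from torch.optim import Optimizer


class DecayingMomentumSecondOrder(Optimizer):
    """Illustrative optimizer: Hessian-diagonal scaling + decaying momentum."""

    def __init__(self, params, lr=1e-4, beta1_start=0.95, beta1_end=0.85,
                 beta2=0.99, clip=0.04, eps=1e-12, total_steps=10_000):
        defaults = dict(lr=lr, beta1_start=beta1_start, beta1_end=beta1_end,
                        beta2=beta2, clip=clip, eps=eps, total_steps=total_steps)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        loss = closure() if closure is not None else None
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state[p]
                if len(state) == 0:
                    state["step"] = 0
                    state["exp_avg"] = torch.zeros_like(p)    # momentum buffer
                    state["hess_diag"] = torch.zeros_like(p)  # Hessian-diagonal EMA
                state["step"] += 1

                # One plausible "momentum decay schedule": interpolate the
                # momentum coefficient linearly from beta1_start to beta1_end.
                progress = min(state["step"] / group["total_steps"], 1.0)
                beta1 = group["beta1_start"] + progress * (
                    group["beta1_end"] - group["beta1_start"])

                # Squared gradient as a cheap stand-in for a Hessian-diagonal
                # estimate (a real implementation would use e.g. Hutchinson's
                # or a Gauss-Newton-Bartlett estimator).
                hess_est = p.grad * p.grad

                exp_avg, hess_diag = state["exp_avg"], state["hess_diag"]
                exp_avg.mul_(beta1).add_(p.grad, alpha=1 - beta1)
                hess_diag.mul_(group["beta2"]).add_(hess_est, alpha=1 - group["beta2"])

                # Element-wise clipped, preconditioned update.
                update = (exp_avg / (hess_diag + group["eps"])).clamp_(
                    -group["clip"], group["clip"])
                p.add_(update, alpha=-group["lr"])
        return loss
```

The decaying `beta1` reflects the intuition stated in the abstract: heavy momentum early in training, reduced reliance on it as the loss landscape estimate stabilizes; the specific schedule and constants here are placeholders.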
Submission history
[v1] Mon, 27 Oct 2025 15:31 UTC