[Submitted on 27 Oct 2025]
Ortho-Adaptive Momentum: A Novel Optimizer for Transformer Training
Abstract: We present Ortho-Adaptive Momentum (OAM), a new optimizer designed for training transformer-based language models. OAM combines adaptive momentum estimation with layer-wise orthogonalization of parameter updates, which proves particularly beneficial for attention layers. Our method achieves a validation loss of 4.213 on the FineWeb benchmark, outperforming an AdamW baseline (4.927) while maintaining training stability. Through extensive ablation studies, we demonstrate the importance of careful hyperparameter tuning and learning-rate warmup for optimal performance, and we show that an adaptive gradient-clipping scheme helps keep training stable. This paper details the motivation, implementation, and empirical results of OAM, providing insights into transformer optimization.
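The abstract does not spell out the update rule, so the following is only a minimal PyTorch sketch of how an optimizer of this shape might look: Adam-style first and second moments, an orthogonalization step applied to matrix-shaped (e.g. attention projection) updates, and norm-based adaptive gradient clipping. The class name OAMSketch, all hyperparameter defaults, the Newton-Schulz orthogonalization, and the clipping rule are assumptions for illustration, not the authors' implementation.

```python
import torch


def _orthogonalize(update, steps=5, eps=1e-7):
    # Approximately orthogonalize a 2D update via Newton-Schulz iteration
    # (an assumed choice; the paper does not specify the procedure).
    x = update / (update.norm() + eps)
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x


class OAMSketch(torch.optim.Optimizer):
    """Hypothetical sketch: adaptive moments + orthogonalized 2D updates
    + per-parameter adaptive gradient clipping."""

    def __init__(self, params, lr=3e-4, betas=(0.9, 0.95), eps=1e-8, clip=1.0):
        super().__init__(params, dict(lr=lr, betas=betas, eps=eps, clip=clip))

    @torch.no_grad()
    def step(self, closure=None):
        for group in self.param_groups:
            b1, b2 = group["betas"]
            eps, clip, lr = group["eps"], group["clip"], group["lr"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                g = p.grad

                # Adaptive gradient clipping: bound the gradient norm relative
                # to the parameter norm (assumed form of "adaptive clipping").
                max_norm = clip * (p.norm() + eps)
                if g.norm() > max_norm:
                    g = g * (max_norm / (g.norm() + eps))

                state = self.state[p]
                if not state:
                    state["step"] = 0
                    state["m"] = torch.zeros_like(p)
                    state["v"] = torch.zeros_like(p)
                state["step"] += 1
                m, v = state["m"], state["v"]

                # Adam-style bias-corrected moment estimates.
                m.mul_(b1).add_(g, alpha=1 - b1)
                v.mul_(b2).addcmul_(g, g, value=1 - b2)
                m_hat = m / (1 - b1 ** state["step"])
                v_hat = v / (1 - b2 ** state["step"])
                update = m_hat / (v_hat.sqrt() + eps)

                # Layer-wise orthogonalization, applied only to matrix-shaped
                # weights such as attention projections.
                if update.ndim == 2:
                    update = _orthogonalize(update, eps=eps)

                p.add_(update, alpha=-lr)
```

Under these assumptions the optimizer would be used like any other torch.optim optimizer, e.g. OAMSketch(model.parameters(), lr=3e-4), with the warmup schedule the abstract emphasizes supplied by a separate learning-rate scheduler.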
Submission history
[v1] Mon, 27 Oct 2025 13:52 UTC