[Submitted on 1 Nov 2025]
OrthoAdam: Adaptive Orthogonal Gradient Processing for Transformer Optimization
Abstract: We present OrthoAdam, a novel optimizer that combines adaptive gradient orthogonalization with momentum-based optimization for training transformer language models. OrthoAdam dynamically adjusts the strength of gradient orthogonalization based on gradient magnitude and training progress, while maintaining the benefits of Adam's adaptive learning rates. Our method achieves a validation loss of 3.809 on the FineWeb benchmark, outperforming AdamW by 23.7% and ranking second overall on the Aardvark optimizer leaderboard. The key innovation lies in our adaptive orthogonalization approach, which helps escape poor local minima early in training while preventing over-orthogonalization later. Comprehensive experiments demonstrate OrthoAdam's effectiveness and stability across different training phases.
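Since the abstract describes the method only at a high level, the sketch below illustrates one plausible reading of it: an Adam-style update in which each matrix gradient is blended with an orthogonalized version of itself, with a blend strength that decays over training and shrinks for small gradients. The class name `OrthoAdamSketch`, the SVD-based orthogonalization, the `ortho_max`/`ortho_decay_steps` hyperparameters, and the tanh-scaled schedule are all assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of an Adam variant with adaptive gradient
# orthogonalization. All algorithmic details below are assumptions.
import torch


def orthogonalize(grad: torch.Tensor) -> torch.Tensor:
    """Project a 2-D gradient onto the nearest semi-orthogonal matrix via SVD."""
    u, _, vh = torch.linalg.svd(grad, full_matrices=False)
    return u @ vh


class OrthoAdamSketch(torch.optim.Optimizer):
    """Adam-style update whose gradient is blended with its orthogonalized form.

    The blend weight decays with training progress and shrinks for small
    gradients, mimicking the "strong early, weaker late" behaviour the
    abstract describes. This is a guess at the method, not the paper's code.
    """

    def __init__(self, params, lr=3e-4, betas=(0.9, 0.95), eps=1e-8,
                 ortho_max=1.0, ortho_decay_steps=10_000):
        defaults = dict(lr=lr, betas=betas, eps=eps,
                        ortho_max=ortho_max, ortho_decay_steps=ortho_decay_steps)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            beta1, beta2 = group["betas"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                g = p.grad
                state = self.state[p]
                if not state:
                    state["step"] = 0
                    state["exp_avg"] = torch.zeros_like(p)
                    state["exp_avg_sq"] = torch.zeros_like(p)
                state["step"] += 1
                t = state["step"]

                # Adaptive orthogonalization: only for matrix parameters, with
                # a strength that decays over training and scales with the
                # gradient norm (assumed schedule).
                if g.ndim == 2:
                    progress = min(t / group["ortho_decay_steps"], 1.0)
                    strength = group["ortho_max"] * (1.0 - progress)
                    strength = strength * torch.tanh(g.norm())  # weaker for tiny grads
                    g = (1.0 - strength) * g + strength * orthogonalize(g)

                # Standard Adam moment updates with bias correction.
                exp_avg, exp_avg_sq = state["exp_avg"], state["exp_avg_sq"]
                exp_avg.mul_(beta1).add_(g, alpha=1 - beta1)
                exp_avg_sq.mul_(beta2).addcmul_(g, g, value=1 - beta2)
                m_hat = exp_avg / (1 - beta1 ** t)
                v_hat = exp_avg_sq / (1 - beta2 ** t)
                p.add_(m_hat / (v_hat.sqrt() + group["eps"]), alpha=-group["lr"])
```

Under these assumptions the orthogonalization acts as a preconditioner early in training (when the blend weight is near `ortho_max`) and fades to plain Adam as `step` approaches `ortho_decay_steps`, which is one way to realize the "escape poor local minima early, avoid over-orthogonalization later" behaviour the abstract claims.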
Submission history
[v1] Sat, 1 Nov 2025 18:56 UTC