[Submitted on 4 Nov 2025]
StableOrthoGrad: Orthogonal Gradient Processing for Stable Transformer Optimization
Abstract: We present StableOrthoGrad, an optimizer that combines adaptive momentum with selective orthogonal gradient processing for transformer language models. The method applies iterative orthogonalization to self-attention weight gradients while retaining standard adaptive updates elsewhere. We derive the orthogonal projection from first principles and analyze its convergence properties. On a 134M-parameter Qwen model, StableOrthoGrad reaches a validation loss of 4.801, improving over AdamW (4.927) while demonstrating superior training stability. Comprehensive ablation studies validate our design choices and show consistent benefits across hyperparameter settings.
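The abstract does not specify the orthogonalization procedure. A minimal sketch of what "iterative orthogonalization applied selectively to self-attention gradients" could look like is below; it assumes a Newton-Schulz iteration (as used in optimizers such as Muon) and a hypothetical name-based filter `is_attention_weight`, neither of which is confirmed by the paper.

```python
import torch

def orthogonalize(grad: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Approximately orthogonalize a 2-D gradient via Newton-Schulz iteration.

    Pushes grad toward the orthogonal factor U V^T of its SVD without
    computing the SVD. Coefficients follow the quintic iteration used by
    Muon; the paper's exact iteration may differ (assumption).
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    X = grad.float()
    transpose = X.shape[0] > X.shape[1]
    if transpose:
        X = X.T
    X = X / (X.norm() + eps)  # normalize so the iteration stays stable
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    if transpose:
        X = X.T
    return X.to(grad.dtype)

def selective_update(model: torch.nn.Module, lr_ortho: float = 0.02) -> None:
    """Apply orthogonalized updates only to self-attention weight matrices.

    `is_attention_weight` is a hypothetical filter; in practice the split is
    usually made once when building optimizer parameter groups, with the
    remaining parameters handled by an ordinary AdamW step.
    """
    def is_attention_weight(name: str, p: torch.Tensor) -> bool:
        return p.ndim == 2 and any(k in name for k in ("q_proj", "k_proj", "v_proj", "o_proj"))

    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        if is_attention_weight(name, p):
            p.data.add_(orthogonalize(p.grad), alpha=-lr_ortho)
        # Non-attention parameters fall through to a standard AdamW optimizer.
```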
Submission history
[v1] Tue, 4 Nov 2025 02:46 UTC