[Submitted on 31 Oct 2025]
OrthoGrad: A Negative Result in Riemannian Optimization for Transformers
Abstract: We present OrthoGrad, a hybrid optimizer that combines Riemannian updates for attention layers with adaptive momentum elsewhere, and report its failure to improve on standard baselines in language model training. Although theoretically motivated by the benefits of orthogonality constraints in recurrent networks, our extensive experiments on a 134M-parameter transformer show that OrthoGrad matches AdamW (validation loss 4.928 vs. 4.927) but falls well short of Muon (3.537). We analyze this negative result through ablation studies, orthogonality measurements, and computational profiling, and conclude that current Riemannian methods may not offer practical benefits for standard transformer architectures despite their theoretical appeal. This work contributes a carefully documented negative result to guide future optimizer research.
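For intuition, the sketch below illustrates the kind of hybrid scheme the abstract describes: an orthogonality-preserving (Stiefel-manifold) step for 2-D attention weight matrices and AdamW for all other parameters. The tangent-space projection, the QR retraction, the class name HybridOptimizer, and the way parameters are split are illustrative assumptions on my part, not the paper's actual OrthoGrad implementation.

```python
# Hypothetical sketch of a hybrid optimizer in the spirit of the abstract:
# Riemannian-style updates for attention matrices, AdamW for everything else.
# The specific projection/retraction below is an assumption, not the paper's method.
import torch
from torch.optim import AdamW


def riemannian_step(W: torch.Tensor, grad: torch.Tensor, lr: float) -> torch.Tensor:
    """One illustrative Stiefel-manifold step.

    Assumes W has orthonormal columns and at least as many rows as columns
    (transpose wide matrices before calling in practice).
    """
    # Project the Euclidean gradient onto the tangent space at W: G - W * sym(W^T G).
    WtG = W.T @ grad
    sym = 0.5 * (WtG + WtG.T)
    rgrad = grad - W @ sym
    # Retract via reduced QR so the updated point keeps orthonormal columns.
    Q, R = torch.linalg.qr(W - lr * rgrad)
    # Resolve the sign ambiguity of QR so the retraction is deterministic.
    Q = Q * torch.sign(torch.diagonal(R)).unsqueeze(0)
    return Q


class HybridOptimizer:
    """Riemannian steps for attention weight matrices, AdamW elsewhere."""

    def __init__(self, attn_params, other_params, lr=1e-3):
        self.attn_params = list(attn_params)
        self.lr = lr
        self.adamw = AdamW(other_params, lr=lr, weight_decay=0.01)

    @torch.no_grad()
    def step(self):
        for p in self.attn_params:
            if p.grad is not None and p.ndim == 2:
                p.copy_(riemannian_step(p, p.grad, self.lr))
        self.adamw.step()

    def zero_grad(self):
        for p in self.attn_params:
            p.grad = None
        self.adamw.zero_grad()
```

A split like this is typically built by filtering a model's named parameters (e.g., routing query/key/value projection weights to the Riemannian branch); the abstract does not specify how OrthoGrad partitions parameters, so any such routing here is an assumption.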
Submission history
[v1] Fri, 31 Oct 2025 18:49 UTC