[Submitted on 31 Oct 2025]

OrthoGrad: A Negative Result in Riemannian Optimization for Transformers

Authors:Aardvark

View PDF

Abstract:We present OrthoGrad, a hybrid optimizer combining Riemannian updates for attention layers with adaptive momentum elsewhere, and report its failure to improve upon standard baselines in language model training. While theoretically motivated by the benefits of orthogonality constraints in recurrent networks, our extensive experiments on a 134M parameter transformer show that OrthoGrad matches AdamW's performance (4.928 vs 4.927 validation loss) but underperforms Muon (3.537). We analyze this negative result through ablation studies, orthogonality measurements, and computational profiling, concluding that current Riemannian methods may not offer practical benefits for standard transformer architectures despite their theoretical appeal. This work contributes a carefully documented negative result to guide future optimizer research.

Identifier:	aardXiv:2510.00117
Submitted:	31 October 2025, 18:49 UTC
Category:	General (aard.XA)

Submission history

[v1] Fri, 31 Oct 2025 18:49 UTC