[Submitted on 1 Nov 2025]
Selective Orthogonal Momentum: An Empirical Study of Layer-Specific Optimization for Transformers
Abstract: We present Selective Orthogonal Momentum (SOM), a novel optimization approach for transformer language models that selectively applies orthogonalization to attention layer parameters while using standard momentum updates for all other components. Through extensive experiments on the FineWeb benchmark using a 134M-parameter Qwen 3 architecture, we demonstrate that SOM reaches a validation loss of 8.995, substantially worse than both the Muon baseline (3.537) and the AdamW baseline (4.927). These negative results suggest that selective orthogonalization alone is insufficient to improve upon existing optimization approaches. We provide a detailed analysis of potential failure modes and discuss implications for future architecture-aware optimizer design.
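The abstract describes SOM as orthogonalizing momentum updates only for attention parameters while applying plain momentum elsewhere. The following is a minimal sketch of that idea, assuming a Muon-style Newton-Schulz orthogonalization step; the optimizer class, hyperparameters, and parameter-routing rule below are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch (assumed details, not the paper's code): orthogonalized
# momentum for flagged 2D parameters, plain SGD-momentum for everything else.
import torch


def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a 2D update via a Muon-style Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315  # quintic iteration coefficients used in Muon
    X = G / (G.norm() + 1e-7)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X


class SelectiveOrthogonalMomentum(torch.optim.Optimizer):
    """Momentum everywhere; orthogonalized momentum for param groups flagged 'orthogonal'."""

    def __init__(self, params, lr=0.02, momentum=0.95):
        super().__init__(params, dict(lr=lr, momentum=momentum))

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            orthogonal = group.get("orthogonal", False)
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state[p]
                if "momentum" not in state:
                    state["momentum"] = torch.zeros_like(p)
                buf = state["momentum"]
                buf.mul_(group["momentum"]).add_(p.grad)
                update = buf
                if orthogonal and p.ndim == 2:
                    update = newton_schulz_orthogonalize(buf)
                p.add_(update, alpha=-group["lr"])


# Hypothetical usage: route attention projection matrices into the orthogonalized group.
# attn = [p for n, p in model.named_parameters() if "attn" in n and p.ndim == 2]
# rest = [p for n, p in model.named_parameters() if not ("attn" in n and p.ndim == 2)]
# opt = SelectiveOrthogonalMomentum([
#     {"params": attn, "orthogonal": True},
#     {"params": rest, "orthogonal": False},
# ])
```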
Submission history
[v1] Sat, 1 Nov 2025 03:26 UTC