[Submitted on 5 Nov 2025]
SelectiveMuon: A Hybrid Optimizer Combining Orthogonal Updates for Attention Layers with Adaptive Methods
Abstract: We introduce SelectiveMuon, a hybrid optimizer that applies Muon-style orthogonal updates selectively to the attention-layer parameters of transformer language models while optimizing all other parameters with AdamW. In extensive experiments on the FineWeb benchmark with a 134M-parameter model, SelectiveMuon reaches a validation loss of 4.258 (mean over 3 seeds), outperforming AdamW (4.927), while incurring only 15% additional compute time versus the 35% overhead of applying Muon to every layer. We provide a theoretical analysis of the convergence properties and practical guidelines for implementation.
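The abstract describes the scheme at a high level: route attention weight matrices to a Muon-style orthogonalized-momentum update and everything else to AdamW. The following is a minimal sketch of that routing in PyTorch. The Newton-Schulz orthogonalization follows the public Muon reference implementation; the name-based attention/other split, the hyperparameter values, and the `SelectiveMuon` wrapper class below are illustrative assumptions, not the authors' code.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a 2-D update via Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315  # coefficients from the public Muon reference
    X = G.bfloat16()
    transposed = G.size(0) > G.size(1)
    if transposed:
        X = X.mT
    X = X / (X.norm() + 1e-7)  # normalize so the iteration converges
    for _ in range(steps):
        A = X @ X.mT
        X = a * X + (b * A + c * A @ A) @ X
    if transposed:
        X = X.mT
    return X.to(G.dtype)

class SelectiveMuon:
    """Hybrid optimizer: Muon-style updates for attention weights, AdamW elsewhere.

    The selection rule (2-D weights whose name mentions attention) is an
    assumption about how the paper's parameter split might be realized.
    """
    def __init__(self, named_params, lr_muon=0.02, lr_adamw=3e-4,
                 momentum=0.95, weight_decay=0.01):
        muon_params, adamw_params = [], []
        for name, p in named_params:
            if p.ndim == 2 and any(k in name for k in ("attn", "attention")):
                muon_params.append(p)
            else:
                adamw_params.append(p)
        self.muon_params = muon_params
        self.bufs = [torch.zeros_like(p) for p in muon_params]  # momentum buffers
        self.momentum = momentum
        self.lr_muon = lr_muon
        self.adamw = torch.optim.AdamW(adamw_params, lr=lr_adamw,
                                       weight_decay=weight_decay)

    @torch.no_grad()
    def step(self):
        for p, buf in zip(self.muon_params, self.bufs):
            if p.grad is None:
                continue
            buf.mul_(self.momentum).add_(p.grad)       # accumulate momentum
            update = newton_schulz_orthogonalize(buf)  # orthogonalized direction
            # Shape-dependent scaling, as in the Muon reference.
            update *= max(1.0, p.size(0) / p.size(1)) ** 0.5
            p.add_(update, alpha=-self.lr_muon)
        self.adamw.step()  # all non-attention parameters take an AdamW step

    def zero_grad(self):
        for p in self.muon_params:
            p.grad = None
        self.adamw.zero_grad(set_to_none=True)

# Hypothetical usage: opt = SelectiveMuon(model.named_parameters())
```

Because only the (typically 2-D, square-ish) attention projections pass through the Newton-Schulz loop, the orthogonalization cost is paid on a fraction of the parameters, which is consistent with the abstract's reported 15% overhead versus 35% for full Muon.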
Submission history
[v1] Wed, 5 Nov 2025 00:24 UTC