[Submitted on 3 Nov 2025]
Revisiting Dynamic Orthogonal Adaptive Momentum: An Analysis of Hybrid Optimization for Transformers
Abstract: This paper presents a detailed empirical analysis of Dynamic Orthogonal Adaptive Momentum (DOAM), investigating the challenges of combining orthogonal gradient processing with adaptive optimization for transformer language models. Through extensive experiments on the FineWeb benchmark with a 134M-parameter Qwen model, we demonstrate that while layer-specific orthogonalization provides measurable benefits (particularly for attention layers), naive combinations with adaptive methods underperform specialized approaches. Our evaluation includes 5 random seeds per configuration, detailed ablation studies, and comparisons against 8 baseline optimizers. The results reveal fundamental tradeoffs between orthogonality constraints and parameter adaptation that help explain why hybrid approaches often struggle to outperform specialized methods such as Muon (3.537 loss) or even AdamW (4.927 loss), with DOAM reaching 5.669 loss. We provide practical insights for future optimizer design and highlight open challenges in this space.
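To make the "orthogonal gradient processing combined with adaptive optimization" idea concrete, below is a minimal sketch of what a naive orthogonalize-then-adapt hybrid could look like: a Muon-style Newton-Schulz orthogonalization of a 2-D gradient followed by standard Adam-style moment tracking. The function names, constants, and update rule here are illustrative assumptions for exposition only; they are not the paper's actual DOAM algorithm.

```python
import torch

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    """Approximately orthogonalize a 2-D gradient via Newton-Schulz iteration
    (the scheme used by Muon-style optimizers); returns a near-semi-orthogonal
    matrix of the same shape as G."""
    a, b, c = 3.4445, -4.7750, 2.0315        # quintic iteration coefficients
    X = G.bfloat16()
    transposed = G.size(0) > G.size(1)
    if transposed:
        X = X.T
    X = X / (X.norm() + eps)                  # normalize so the iteration converges
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X
    if transposed:
        X = X.T
    return X.to(G.dtype)

def hybrid_step(param, m, v, t, lr=3e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One illustrative hybrid update on a 2-D weight: orthogonalize the raw
    gradient, then apply Adam-style first/second-moment adaptation.
    A sketch of the hybrid concept, not the paper's DOAM update rule."""
    g = newton_schulz_orthogonalize(param.grad)
    m.mul_(beta1).add_(g, alpha=1 - beta1)             # first moment
    v.mul_(beta2).addcmul_(g, g, value=1 - beta2)      # second moment
    m_hat = m / (1 - beta1 ** t)                       # bias correction
    v_hat = v / (1 - beta2 ** t)
    param.data.add_(m_hat / (v_hat.sqrt() + eps), alpha=-lr)
```

Note how the per-coordinate rescaling by the second moment in the final line re-weights entries of the orthogonalized direction individually, which is one concrete way the tension between orthogonality constraints and parameter adaptation described in the abstract can arise.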
Submission history
[v1] Mon, 3 Nov 2025 13:19 UTC