[Submitted on 28 Oct 2025]
Hybrid Ortho-Adam: Combining Orthogonal Gradient Updates with Adaptive Momentum for Transformer Optimization
Abstract: We present Hybrid Ortho-Adam, a novel optimizer that combines orthogonal gradient updates for attention layers with adaptive momentum for all other parameters in transformer models. Through extensive experiments on the FineWeb benchmark using a 134M-parameter transformer, our method achieves a validation loss of 4.904 compared to 4.927 for AdamW, a 0.47% improvement. Detailed ablation studies show that the orthogonal update component contributes most of the performance gain, with an overhead of less than 5% additional compute time. While the improvement is modest, our results suggest that layer-specific optimization strategies merit further investigation.
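The abstract does not spell out the update rule, so the following is a minimal PyTorch sketch of the general idea: orthogonalized updates for attention weight matrices and an AdamW-style adaptive update for everything else. It assumes a Newton-Schulz iteration for the orthogonalization step, which the abstract does not specify; the class name `HybridOrthoAdam`, the `ortho` group flag, and the parameter-routing heuristic are illustrative assumptions, not the paper's implementation.

```python
import torch
from torch.optim import Optimizer


def orthogonalize(M, steps=5, eps=1e-7):
    # Push a 2-D matrix toward the nearest semi-orthogonal matrix with a
    # cubic Newton-Schulz iteration (an assumed choice; the paper does not
    # state which orthogonalization procedure it uses).
    X = M / (M.norm() + eps)              # scale so singular values are <= 1
    for _ in range(steps):
        X = 1.5 * X - 0.5 * (X @ X.T @ X)
    return X


class HybridOrthoAdam(Optimizer):
    # Sketch: orthogonalized momentum updates for parameter groups flagged
    # ortho=True (intended for attention projection matrices), and
    # AdamW-style adaptive updates for all other parameters.
    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
                 weight_decay=0.01, momentum=0.9):
        defaults = dict(lr=lr, betas=betas, eps=eps,
                        weight_decay=weight_decay, momentum=momentum,
                        ortho=False)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state[p]
                if group["ortho"] and p.ndim == 2:
                    # Orthogonal branch: heavy-ball momentum, then project
                    # the update before applying it.
                    if "momentum_buf" not in state:
                        state["momentum_buf"] = torch.zeros_like(p)
                    buf = state["momentum_buf"]
                    buf.mul_(group["momentum"]).add_(p.grad)
                    p.add_(orthogonalize(buf), alpha=-group["lr"])
                else:
                    # Adaptive branch: decoupled weight decay + Adam moments.
                    if "step" not in state:
                        state["step"] = 0
                        state["exp_avg"] = torch.zeros_like(p)
                        state["exp_avg_sq"] = torch.zeros_like(p)
                    state["step"] += 1
                    beta1, beta2 = group["betas"]
                    m, v = state["exp_avg"], state["exp_avg_sq"]
                    m.mul_(beta1).add_(p.grad, alpha=1 - beta1)
                    v.mul_(beta2).addcmul_(p.grad, p.grad, value=1 - beta2)
                    m_hat = m / (1 - beta1 ** state["step"])
                    v_hat = v / (1 - beta2 ** state["step"])
                    p.mul_(1 - group["lr"] * group["weight_decay"])
                    p.addcdiv_(m_hat, v_hat.sqrt().add_(group["eps"]),
                               value=-group["lr"])
```

A plausible way to build the two parameter groups, treating 2-D weights inside self-attention modules as the "attention layers" (this split is inferred from the abstract, not taken from the paper):

```python
model = torch.nn.TransformerEncoderLayer(d_model=256, nhead=4)
is_attn = lambda n, p: "self_attn" in n and p.ndim == 2
attn = [p for n, p in model.named_parameters() if is_attn(n, p)]
rest = [p for n, p in model.named_parameters() if not is_attn(n, p)]
opt = HybridOrthoAdam([{"params": attn, "ortho": True}, {"params": rest}])
```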
Submission history
[v1] Tue, 28 Oct 2025 17:57 UTC