[Submitted on 3 Nov 2025]
Hybrid Architecture-Aware Optimization for Transformer Language Models
Abstract: We present a hybrid optimization approach that combines adaptive momentum methods with architecture-specific learning rates for training transformer language models. Building on AdamW \cite{adamw}, our method achieves a 7\% improvement in validation loss (4.58 vs. 4.93) on the FineWeb benchmark while maintaining training stability. Ablation studies confirm that attention layers benefit from a higher learning rate (6e-4) than the remaining parameters (3e-4). While our approach does not match state-of-the-art optimizers such as Muon (3.54), it provides a simple yet effective modification to standard practice.
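As a rough illustration of the architecture-specific learning-rate idea described above, the sketch below builds AdamW parameter groups that assign 6e-4 to attention parameters and 3e-4 to everything else. The name-based matching ("attn"/"attention"), the helper name build_param_groups, and the weight_decay value are assumptions for illustration, not details taken from the paper.

```python
import torch
from torch import nn


def build_param_groups(model: nn.Module, attn_lr: float = 6e-4, base_lr: float = 3e-4):
    """Split trainable parameters into attention vs. non-attention groups.

    Matching on parameter names containing "attn"/"attention" is an assumed
    convention; the paper's exact grouping criterion may differ.
    """
    attn_params, other_params = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        if "attn" in name or "attention" in name:
            attn_params.append(param)
        else:
            other_params.append(param)
    return [
        {"params": attn_params, "lr": attn_lr},   # attention layers: higher LR
        {"params": other_params, "lr": base_lr},  # all other parameters
    ]


# Hypothetical usage with a user-defined transformer `model`:
# optimizer = torch.optim.AdamW(build_param_groups(model), weight_decay=0.1)
```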
Submission history
[v1] Mon, 3 Nov 2025 15:26 UTC