[Submitted on 30 Oct 2025]
An Empirical Study of Optimizer Modifications for Language Model Training
Abstract: This paper presents a systematic evaluation of novel optimizer designs for training transformer-based language models, building on recent work in adaptive optimization \cite{adamw, lamb}. Through extensive experimentation with gradient momentum scaling \cite{gms}, orthogonal updates \cite{orthoopt}, and layer-specific adaptations \cite{layeradapt}, we demonstrate the difficulty of improving upon the AdamW baseline. Our controlled experiments show that while these modifications appear theoretically promising, they fail to provide practical improvements, with our best custom optimizer achieving a validation loss of 10.807 compared to AdamW's 4.927. We analyze potential reasons for these failures and provide recommendations for future optimizer research.
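The abstract does not specify how "gradient momentum scaling" is defined, so the sketch below is only a minimal, hypothetical illustration of the kind of AdamW modification evaluated: a scalar momentum_scale factor (an assumed parameter, not from the paper) applied to the bias-corrected first moment of an otherwise standard AdamW update in PyTorch. The class name ScaledMomentumAdamW and all hyperparameters are illustrative assumptions, not the authors' implementation.

# Hypothetical sketch, not the paper's optimizer: AdamW with a scalar
# `momentum_scale` applied to the first-moment estimate before the update.
import torch

class ScaledMomentumAdamW(torch.optim.Optimizer):
    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
                 weight_decay=0.01, momentum_scale=1.0):
        defaults = dict(lr=lr, betas=betas, eps=eps,
                        weight_decay=weight_decay, momentum_scale=momentum_scale)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        for group in self.param_groups:
            beta1, beta2 = group["betas"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state[p]
                if len(state) == 0:
                    state["step"] = 0
                    state["exp_avg"] = torch.zeros_like(p)
                    state["exp_avg_sq"] = torch.zeros_like(p)
                state["step"] += 1
                exp_avg, exp_avg_sq = state["exp_avg"], state["exp_avg_sq"]

                # Decoupled weight decay, as in AdamW.
                p.mul_(1 - group["lr"] * group["weight_decay"])

                # Standard exponential moving averages of gradient and squared gradient.
                exp_avg.mul_(beta1).add_(p.grad, alpha=1 - beta1)
                exp_avg_sq.mul_(beta2).addcmul_(p.grad, p.grad, value=1 - beta2)

                # Bias correction terms.
                bias_c1 = 1 - beta1 ** state["step"]
                bias_c2 = 1 - beta2 ** state["step"]

                # Assumed modification: rescale the first moment by `momentum_scale`.
                m_hat = exp_avg * (group["momentum_scale"] / bias_c1)
                v_hat = exp_avg_sq / bias_c2
                p.addcdiv_(m_hat, v_hat.sqrt().add_(group["eps"]), value=-group["lr"])

With momentum_scale=1.0 this reduces to plain AdamW, which is one way to set up the kind of controlled comparison against the AdamW baseline that the abstract describes.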
Submission history
[v1] Thu, 30 Oct 2025 00:07 UTC