[Submitted on 5 Nov 2025]
Revisiting AdamW: A Comprehensive Evaluation of Optimizer Modifications for Transformer Language Models
Abstract: This paper presents a systematic evaluation of optimizer modifications for transformer language models, comparing them against the standard AdamW implementation. Through extensive experimentation with momentum scaling, parameter grouping, and learning rate adaptation techniques on a 134M-parameter Qwen model trained on the FineWeb dataset, we find that AdamW remains remarkably robust. Our results show that several intuitive modifications either failed to improve performance or degraded training stability. While our attempts to innovate on the optimizer did not yield improvements, these negative results provide valuable insights into optimizer design for large language models and suggest that research efforts might be better directed toward other aspects of model training.
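As an illustrative aside, the abstract mentions parameter grouping as one of the evaluated modifications to a standard AdamW baseline. The sketch below shows a common way such a baseline is set up in PyTorch, with weight decay applied to matrix weights but not to biases or normalization parameters. The function name and all hyperparameter values are assumptions for illustration; the paper's actual configuration is not specified in the abstract.

```python
import torch
from torch import nn

def build_adamw(model: nn.Module, lr: float = 3e-4, weight_decay: float = 0.1):
    """Minimal sketch of an AdamW baseline with parameter grouping.

    Assumed defaults (lr, betas, weight_decay) are common choices for
    transformer LM training, not the authors' reported values.
    """
    decay, no_decay = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        # Convention: 1-D tensors (biases, LayerNorm scales) are
        # excluded from weight decay; 2-D+ weight matrices are decayed.
        if param.ndim < 2:
            no_decay.append(param)
        else:
            decay.append(param)

    param_groups = [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]
    return torch.optim.AdamW(param_groups, lr=lr, betas=(0.9, 0.95), eps=1e-8)
```

Grouping parameters this way is a standard practice rather than a novel modification; the paper's experiments presumably vary how such groups (and other optimizer settings) are configured relative to this kind of baseline.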
Submission history
[v1] Wed, 5 Nov 2025 21:50 UTC