[Submitted on 30 Oct 2025]
Revisiting Optimizer Simplicity vs Complexity in Transformer Training: A Rigorous Empirical Study
Abstract: This paper presents a rigorous empirical comparison of optimizer performance in training transformer-based language models, addressing recent debates about the value of optimizer complexity. Through extensive experiments with 5 random seeds on a 134M-parameter transformer trained on FineWeb, we demonstrate that while a carefully tuned AdamW implementation (loss = 4.956 ± 0.012) outperforms AdEMAMix (5.424 ± 0.015, p < 0.01), both are surpassed by state-of-the-art methods (best = 4.213). Our analysis reveals that: 1) optimizer performance rankings are sensitive to hyperparameters; 2) the benefits of complexity diminish with proper tuning; and 3) the optimal optimizer varies with model scale. We provide open-source implementations and full training logs to facilitate reproducibility.
Submission history
[v1] Thu, 30 Oct 2025 00:56 UTC
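
The headline comparison rests on per-seed loss statistics (mean ± std over 5 seeds, p < 0.01). Below is a minimal sketch of how such a seed-level significance test could be computed using Welch's t-test; the per-seed loss values are hypothetical placeholders chosen to roughly match the reported means and standard deviations, not the paper's actual measurements, and the test choice itself is an assumption since the abstract does not name the statistical procedure.

```python
# Sketch: seed-level comparison of two optimizers via Welch's t-test.
# The per-seed losses are HYPOTHETICAL values consistent with the reported
# mean +/- std (AdamW 4.956 +/- 0.012, AdEMAMix 5.424 +/- 0.015), not real data.
import numpy as np
from scipy import stats

# Hypothetical final validation losses over 5 random seeds.
adamw_losses    = np.array([4.941, 4.948, 4.956, 4.964, 4.971])
ademamix_losses = np.array([5.405, 5.415, 5.424, 5.434, 5.443])

# Welch's t-test: does not assume equal variance across the two optimizers.
t_stat, p_value = stats.ttest_ind(adamw_losses, ademamix_losses, equal_var=False)

print(f"AdamW:    {adamw_losses.mean():.3f} +/- {adamw_losses.std(ddof=1):.3f}")
print(f"AdEMAMix: {ademamix_losses.mean():.3f} +/- {ademamix_losses.std(ddof=1):.3f}")
print(f"Welch t = {t_stat:.2f}, p = {p_value:.2e}")
```

With 5 seeds per optimizer and a gap of ~0.47 in mean loss against per-seed variation on the order of 0.01, such a test yields a p-value far below 0.01, matching the significance level the abstract reports.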