[Submitted on 24 Oct 2025]
Understanding Optimizer Performance in Language Model Pretraining: A Case Study of Sophia Variants
Abstract: This paper presents a rigorous empirical evaluation of optimizer performance in language model pretraining, focusing on modifications to the Sophia optimizer. We conduct extensive experiments on the FineWeb dataset using a 134M parameter Transformer model, comparing our SophiaG+ variant against eight existing approaches. Although our method combines fast and slow momentum terms with adaptive Hessian scaling, it reaches a validation loss of only 5.17, underperforming both AdamW (4.93) and the original Sophia (5.09). Through detailed analysis of training dynamics and parameter sensitivity, we identify key challenges in adapting second-order methods to language model optimization. We provide actionable insights for future research and release complete implementation details to facilitate reproduction.
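The abstract only names the ingredients of SophiaG+ (fast and slow gradient momenta plus adaptive scaling by a diagonal Hessian estimate with Sophia-style clipping); it does not give the update rule. The following is a minimal sketch of what one step of such an optimizer might look like, assuming a Sophia-like element-wise clipped, Hessian-preconditioned update. The function name `sophiag_plus_step`, the blend coefficient `mix`, and all hyperparameter values are illustrative assumptions, not values from the paper.

```python
import numpy as np

def sophiag_plus_step(param, grad, hess_diag, state, lr=1e-4,
                      beta_fast=0.9, beta_slow=0.99, beta_hess=0.99,
                      mix=0.5, rho=0.05, eps=1e-12, weight_decay=0.1):
    """Hypothetical SophiaG+-style update for one parameter tensor.

    `hess_diag` is a stochastic diagonal-Hessian estimate (e.g. a
    Gauss-Newton-Bartlett estimate, as in Sophia); `state` carries the
    fast/slow momenta and the Hessian EMA across steps.
    """
    # Fast and slow exponential moving averages of the gradient.
    state["m_fast"] = beta_fast * state["m_fast"] + (1 - beta_fast) * grad
    state["m_slow"] = beta_slow * state["m_slow"] + (1 - beta_slow) * grad

    # EMA of the diagonal Hessian estimate ("adaptive Hessian scaling").
    state["h"] = beta_hess * state["h"] + (1 - beta_hess) * hess_diag

    # Blend the two momentum terms; `mix` is an assumed hyperparameter.
    m = mix * state["m_fast"] + (1 - mix) * state["m_slow"]

    # Sophia-style preconditioned update with element-wise clipping.
    update = np.clip(m / np.maximum(state["h"], eps), -rho, rho)

    # Decoupled weight decay, as in AdamW and Sophia.
    return param - lr * (update + weight_decay * param)
```

Under these assumptions, the slow momentum smooths the update direction over a longer horizon while the fast momentum tracks recent gradients, and the clipped Hessian preconditioning bounds per-coordinate step sizes; the paper's reported results suggest this particular combination did not outperform AdamW or Sophia at this scale.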
Submission history
[v1] Fri, 24 Oct 2025 03:53 UTC