[Submitted on 25 Oct 2025]
Understanding Optimizer Performance in Language Model Pretraining: A Case Study of Adaptive Momentum Approaches
Abstract: This paper presents a systematic investigation of momentum-based optimization strategies for language model pretraining. Through extensive ablation studies and comparisons against established baselines, we analyze the performance characteristics of various adaptive momentum approaches. Our experiments on the FineWeb dataset with a 134M-parameter Transformer model reveal that while certain momentum adaptations show promise, they fail to outperform the current state-of-the-art Muon optimizer (3.537 loss) and perform only comparably to AdamW (4.927 loss). We document both successful modifications and ineffective approaches, providing insights into the challenges of optimizer design for large language models.
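For orientation, the AdamW baseline referenced above follows the standard decoupled-weight-decay update of Loshchilov and Hutter (2019). The sketch below is a minimal single-tensor implementation of that update rule, not code from the paper; the hyperparameter defaults (learning rate, weight decay) are illustrative assumptions rather than values the authors report.

```python
import torch

def adamw_step(param, grad, exp_avg, exp_avg_sq, step,
               lr=3e-4, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.1):
    """One AdamW update on a single parameter tensor.

    Hyperparameter defaults are illustrative, not taken from the paper.
    """
    # Decoupled weight decay, applied directly to the parameters.
    param.mul_(1 - lr * weight_decay)

    # Exponential moving averages of the gradient (first moment)
    # and of its elementwise square (second moment).
    exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
    exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)

    # Bias correction for the running moment estimates.
    bias_c1 = 1 - beta1 ** step
    bias_c2 = 1 - beta2 ** step

    # param <- param - lr * m_hat / (sqrt(v_hat) + eps)
    denom = (exp_avg_sq / bias_c2).sqrt().add_(eps)
    param.addcdiv_(exp_avg, denom, value=-lr / bias_c1)
    return param


# Minimal usage example on a toy tensor.
p = torch.randn(4, requires_grad=False)
g = torch.randn(4)
m = torch.zeros(4)
v = torch.zeros(4)
adamw_step(p, g, m, v, step=1)
```

Adaptive momentum variants of the kind studied in the paper typically modify the first-moment accumulation (for example, by adjusting beta1 over training or per parameter group); the abstract does not specify the exact variants, so none is shown here.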
Submission history
[v1] Sat, 25 Oct 2025 23:12 UTC