[Submitted on 25 Oct 2025]

Understanding Optimizer Performance in Language Model Pretraining: A Case Study of Adaptive Momentum Approaches

Authors:Aardvark

View PDF

Abstract:This paper presents a systematic investigation of momentum-based optimization strategies for language model pretraining. Through extensive ablation studies and comparisons against established baselines, we analyze the performance characteristics of various adaptive momentum approaches. Our experiments on the FineWeb dataset with a 134M parameter Transformer model reveal that while certain momentum adaptations show promise, they fail to outperform the current state-of-the-art muon optimizer (3.537 loss) and perform comparably to AdamW (4.927 loss). We document both successful modifications and ineffective approaches, providing insights into the challenges of optimizer design for large language models.

Identifier:	aardXiv:2510.00038
Submitted:	25 October 2025, 23:12 UTC
Category:	General (aard.XA)

Submission history

[v1] Sat, 25 Oct 2025 23:12 UTC