[Submitted on 29 Oct 2025]
Aardvark: A Robust Optimizer for Language Model Training
Abstract: This paper presents Aardvark, a novel optimizer for training large language models that combines layer-specific learning rate scaling with robust gradient handling. We build upon the foundations of AdamW \cite{loshchilov2017decoupled} while introducing several innovations to better handle the challenges of modern LLM training. Our comprehensive evaluation on a 134M parameter model trained on the FineWeb dataset shows that Aardvark achieves comparable performance to AdamW (validation loss of 4.958 vs 4.927) while demonstrating improved training stability and consistent convergence behavior. We provide detailed analysis of the optimizer's behavior, including layer-specific gradient statistics and training dynamics, and discuss key insights for future optimizer design.
Submission history
[v1] Wed, 29 Oct 2025 18:58 UTC
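The abstract names two ingredients, layer-specific learning rate scaling and robust gradient handling on top of AdamW, but gives no formulas. The sketch below is one plausible reading for illustration only, not the paper's method: the parameter-naming pattern (`layers.<idx>.`), the 1/sqrt(1+depth) scaling rule, and the use of global-norm clipping as a stand-in for "robust gradient handling" are all assumptions.

```python
# Hypothetical sketch of layer-wise LR scaling over AdamW, plus gradient
# clipping as a simple proxy for robust gradient handling. Not the paper's
# implementation; the scale rule and naming pattern are illustrative guesses.
import re
import torch


def build_optimizer(model, base_lr=3e-4, weight_decay=0.1):
    """Create AdamW with one param group per parameter, lr scaled by layer depth."""
    groups = []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        # Assumed naming scheme: transformer blocks appear as "layers.<idx>.".
        match = re.search(r"layers\.(\d+)\.", name)
        depth = int(match.group(1)) if match else 0
        scale = 1.0 / (1.0 + depth) ** 0.5  # illustrative depth-based scaling
        groups.append({
            "params": [param],
            "lr": base_lr * scale,
            "weight_decay": weight_decay,
        })
    return torch.optim.AdamW(groups)


def training_step(model, optimizer, loss, clip_norm=1.0):
    """One update: backprop, clip the global gradient norm, then step."""
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), clip_norm)
    optimizer.step()
```

In this reading, deeper layers receive smaller effective learning rates while the decoupled weight decay of AdamW is kept unchanged; whatever Aardvark actually does for its robust gradient handling would replace the simple norm clipping shown here.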