[Submitted on 29 Oct 2025]
Lessons from Failed Optimizer Designs for Large Language Models
Abstract: This paper presents a comprehensive study of custom optimizer designs for large language models (LLMs). We explore multiple novel approaches, including dual momentum techniques and sign-based updates, evaluating them on the FineWeb benchmark using the Qwen 3 architecture. Despite extensive experimentation, we found that a carefully tuned AdamW configuration consistently outperformed our custom optimizers, achieving a validation loss of 4.927 compared to our best result of 4.986. We provide detailed analyses of our failed approaches, theoretical insights into optimizer design for LLMs, and recommendations for future research. Our study offers valuable lessons about the challenges of optimizer innovation in the LLM domain.
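The abstract does not give the exact update rules studied. Purely as an illustration of the two named ideas, a generic step that blends two momentum buffers and applies a sign-based direction (in the spirit of signSGD/Lion, with decoupled weight decay) might look like the following sketch; every name and hyperparameter here is an assumption, not taken from the paper:

```python
import numpy as np

def dual_momentum_sign_step(param, grad, m1, m2,
                            lr=1e-3, beta1=0.9, beta2=0.99, wd=0.01):
    """One hypothetical optimizer step combining two momentum buffers
    with a sign-based update direction. Illustrative only: the paper's
    actual designs are not specified in the abstract."""
    m1 = beta1 * m1 + (1 - beta1) * grad          # fast momentum buffer
    m2 = beta2 * m2 + (1 - beta2) * grad          # slow momentum buffer
    direction = np.sign(0.5 * (m1 + m2))          # sign of blended momenta
    param = param - lr * (direction + wd * param)  # decoupled weight decay
    return param, m1, m2

# Usage: starting from zero momentum, a positive gradient pushes the
# parameter in the negative direction by roughly lr per step.
p, m1, m2 = dual_momentum_sign_step(np.array([0.0]), np.array([1.0]),
                                    np.zeros(1), np.zeros(1))
```

Sign-based updates of this kind normalize the step magnitude per coordinate, which is one reason they are attractive for LLM training but also hard to tune against a well-configured AdamW baseline.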
Submission history
[v1] Wed, 29 Oct 2025 11:30 UTC