[Submitted on 25 Oct 2025]
Re-examining Layer-Adaptive Modifications to AdamW: A Systematic Negative Result
Abstract: This paper presents a thorough investigation of layer-adaptive modifications to the AdamW optimizer for language model pretraining. We systematically evaluate the effects of introducing layer-specific learning rate scaling and dynamic epsilon adaptation in a 134M-parameter transformer model trained on the FineWeb dataset. Despite theoretical motivations and careful implementation, our modifications failed to improve upon the baseline AdamW optimizer (validation loss 4.9437 vs. 4.9266 for the baseline). We document our complete experimental process, including four ablation studies, and analyze potential reasons for this negative result. The work provides empirical evidence of the difficulty of improving upon well-tuned baseline optimizers and suggests directions for future research at larger scales.
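To make the setup concrete, the following is a minimal sketch in PyTorch of how layer-specific learning rates and per-group epsilon adaptation can be wired into AdamW through parameter groups. The depth-based scaling rule, the gradient-norm-driven epsilon update, and the helper names (build_layer_adaptive_adamw, adapt_epsilon) are illustrative assumptions, not the implementation evaluated in the paper.

    # Minimal sketch (not the paper's implementation): layer-adaptive AdamW in PyTorch.
    # Assumptions: the 1/sqrt(depth+1) learning-rate scale and the gradient-norm-based
    # epsilon update are illustrative choices, not the rules used in the paper.
    import math
    import torch
    import torch.nn as nn

    def build_layer_adaptive_adamw(model: nn.Module, base_lr: float = 3e-4,
                                   base_eps: float = 1e-8, weight_decay: float = 0.1):
        """Create one AdamW parameter group per block so each layer can carry
        its own learning rate and epsilon."""
        groups = []
        blocks = list(model.children())  # assumes a simple sequential stack of blocks
        for depth, block in enumerate(blocks):
            params = [p for p in block.parameters() if p.requires_grad]
            if not params:
                continue
            groups.append({
                "params": params,
                "lr": base_lr / math.sqrt(depth + 1),  # assumed depth-dependent scaling
                "eps": base_eps,
                "weight_decay": weight_decay,
            })
        return torch.optim.AdamW(groups, betas=(0.9, 0.95))

    def adapt_epsilon(optimizer: torch.optim.AdamW, floor: float = 1e-8, ceil: float = 1e-6):
        """Toy dynamic-epsilon rule: raise eps for groups with small gradient norms
        to damp noisy updates (purely illustrative)."""
        for group in optimizer.param_groups:
            grad_sq = sum(p.grad.pow(2).sum().item()
                          for p in group["params"] if p.grad is not None)
            grad_norm = math.sqrt(grad_sq)
            group["eps"] = min(ceil, max(floor, floor / (grad_norm + 1e-12)))

    # Usage with a tiny stand-in "model" of stacked linear blocks.
    if __name__ == "__main__":
        model = nn.Sequential(*[nn.Linear(64, 64) for _ in range(4)])
        opt = build_layer_adaptive_adamw(model)
        x, y = torch.randn(8, 64), torch.randn(8, 64)
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()
        adapt_epsilon(opt)
        opt.step()

Grouping parameters per layer keeps the modification compatible with the standard AdamW update while leaving each group's hyperparameters free to vary during training.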
Submission history
[v1] Sat, 25 Oct 2025 16:53 UTC