[Submitted on 5 Nov 2025]
Revisiting Layer-Adaptive Optimization for Transformer Language Models: A Large-Scale Empirical Study
Abstract: We present a comprehensive empirical evaluation of layer-adaptive optimization techniques for transformer language models, testing 12 variants across models ranging from 134M to 1B parameters. Through extensive experiments with rigorous statistical testing (5 random seeds each), we demonstrate that, while theoretically appealing, layer-specific adaptation strategies consistently underperform the AdamW baseline in both final performance (p < 0.01) and training stability. Our analysis reveals that modern transformer architectures naturally balance gradient scales across layers, reducing the need for explicit layer-wise adaptation. We provide practical recommendations for optimizer selection and identify promising directions for future research.
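For readers unfamiliar with the technique family under study, the sketch below shows one common form of layer-adaptive optimization: a LAMB-style "trust ratio" that rescales an AdamW-like update per parameter tensor so each layer takes a comparably sized relative step. This is an illustrative example only, assuming PyTorch; the class name and hyperparameters are placeholders, and it is not one of the 12 variants evaluated in the paper.

```python
# Minimal sketch of layer-adaptive optimization (LAMB-style trust ratio
# on top of an AdamW-like update). Illustrative only; not the paper's
# specific variants.
import torch
from torch.optim import Optimizer


class LayerAdaptiveAdamW(Optimizer):
    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
                 weight_decay=0.01):
        defaults = dict(lr=lr, betas=betas, eps=eps, weight_decay=weight_decay)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            beta1, beta2 = group["betas"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state[p]
                if len(state) == 0:
                    state["step"] = 0
                    state["exp_avg"] = torch.zeros_like(p)
                    state["exp_avg_sq"] = torch.zeros_like(p)
                state["step"] += 1
                m, v = state["exp_avg"], state["exp_avg_sq"]

                # Standard Adam moment estimates with bias correction.
                m.mul_(beta1).add_(p.grad, alpha=1 - beta1)
                v.mul_(beta2).addcmul_(p.grad, p.grad, value=1 - beta2)
                m_hat = m / (1 - beta1 ** state["step"])
                v_hat = v / (1 - beta2 ** state["step"])

                # AdamW-style direction with decoupled weight decay.
                update = m_hat / (v_hat.sqrt() + group["eps"])
                update = update + group["weight_decay"] * p

                # Layer-adaptive step: scale the update by the ratio of the
                # parameter norm to the update norm, so every layer moves by
                # a similar relative amount regardless of raw gradient scale.
                p_norm = p.norm()
                u_norm = update.norm()
                if p_norm > 0 and u_norm > 0:
                    trust_ratio = (p_norm / u_norm).item()
                else:
                    trust_ratio = 1.0

                p.add_(update, alpha=-group["lr"] * trust_ratio)
```

The per-tensor trust ratio is the distinguishing feature of this family: plain AdamW omits it and applies the same learning rate to every layer, which, per the abstract's findings, is already sufficient for modern transformer architectures.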
Submission history
[v1] Wed, 5 Nov 2025 18:14 UTC