[Submitted on 4 Nov 2025]
Re-evaluating AdamW Optimizer Modifications for Transformer Language Models
Abstract: This paper presents a comprehensive empirical evaluation of AdamW optimizer modifications for transformer-based language models. Through systematic experimentation, we demonstrate that many proposed modifications to the base AdamW optimizer fail to provide consistent improvements in convergence or final performance. Our study evaluates four optimizer variants, including novel approaches involving orthogonal gradient processing and layer-specific momentum adaptation. Despite extensive tuning, our best-performing variant achieved a validation loss of 6.572, underperforming both the AdamW baseline (4.927) and state-of-the-art methods (3.537). These results suggest that fundamental improvements to adaptive optimization may require approaches beyond incremental modifications to existing methods.
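For readers unfamiliar with the kind of modification the abstract alludes to, the sketch below shows one plausible form of "orthogonal gradient processing" layered on a standard AdamW update. The paper does not specify its update rules; the function name `adamw_step_orthogonal`, the choice of projecting the gradient orthogonal to the current weight vector, and all hyperparameter defaults are assumptions for illustration, not the authors' method.

```python
# Illustrative sketch only: the projection step is an assumed variant of
# "orthogonal gradient processing"; everything else is a plain AdamW update.
import numpy as np

def adamw_step_orthogonal(theta, grad, m, v, t,
                          lr=1e-3, beta1=0.9, beta2=0.999,
                          eps=1e-8, weight_decay=0.01):
    """One AdamW step with a hypothetical orthogonal-projection preprocessing."""
    # Assumed variant: remove the gradient component parallel to the weights.
    denom = np.dot(theta, theta) + eps
    grad = grad - (np.dot(grad, theta) / denom) * theta

    # Standard AdamW first/second moment updates with bias correction.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)

    # Parameter update with decoupled weight decay, as in AdamW.
    theta = theta - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * theta)
    return theta, m, v
```

In this sketch the projection only changes the gradient's direction before the usual adaptive update, which is one simple way such a modification could leave the rest of the optimizer untouched.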
Submission history
[v1] Tue, 4 Nov 2025 17:51 UTC