[Submitted on 30 Oct 2025]
Attentive Spectral Adam: A Novel Optimizer for Transformer Training
Abstract: We present Attentive Spectral Adam (ASA), a novel optimizer designed specifically for transformer models. ASA combines adaptive moment estimation with layer-specific learning rates and spectral normalization to better handle the unique characteristics of transformer architectures. Our method achieves a validation loss of 4.549 on the FineWeb benchmark using a Qwen 3 architecture, representing a 7.67% improvement over the AdamW baseline. The key innovation is the integration of estimated spectral norms into the update rule, allowing for more stable training while maintaining computational efficiency.
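The abstract does not give the exact update rule, so the following is only a minimal sketch of one plausible form: Adam-style moment estimates whose per-layer step size is damped by a cheap power-iteration estimate of the layer's spectral norm. All names (`asa_step`, `spectral_norm_estimate`), the damping factor `1 / (1 + sigma)`, and the state layout are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def spectral_norm_estimate(W, u, n_iters=1):
    # One step of power iteration: a cheap running estimate of ||W||_2.
    for _ in range(n_iters):
        v = W.T @ u
        v /= np.linalg.norm(v) + 1e-12
        u = W @ v
        u /= np.linalg.norm(u) + 1e-12
    sigma = u @ W @ v
    return sigma, u

def asa_step(W, grad, state, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
    """One hypothetical ASA update for a single weight matrix.

    Standard Adam moments, with the step size scaled per layer by the
    estimated spectral norm of that layer's weights (a guessed form;
    the abstract does not specify the actual rule).
    """
    b1, b2 = betas
    state["t"] += 1
    state["m"] = b1 * state["m"] + (1 - b1) * grad
    state["v"] = b2 * state["v"] + (1 - b2) * grad ** 2
    m_hat = state["m"] / (1 - b1 ** state["t"])
    v_hat = state["v"] / (1 - b2 ** state["t"])
    sigma, state["u"] = spectral_norm_estimate(W, state["u"])
    # Layer-specific scale: shrink the step as the layer's spectral
    # norm grows, keeping updates small relative to the weight scale.
    layer_lr = lr / (1.0 + sigma)
    return W - layer_lr * m_hat / (np.sqrt(v_hat) + eps)

# Usage on a single toy weight matrix.
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 4)) * 0.1
state = {"t": 0, "m": np.zeros_like(W), "v": np.zeros_like(W),
         "u": rng.standard_normal(8)}
grad = rng.standard_normal(W.shape)
W_new = asa_step(W, grad, state)
```

Because the power-iteration vector `u` is carried in the optimizer state and refreshed with a single iteration per step, the spectral-norm estimate adds only one extra matrix-vector product pair per layer per step, consistent with the abstract's claim of maintained computational efficiency.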
Submission history
[v1] Thu, 30 Oct 2025 17:34 UTC