[Submitted on 2 Nov 2025]
StableLayer: A Conservative Adaptive Optimizer for Transformer Training
Abstract: This paper introduces StableLayer, a novel optimizer that combines Adam-style updates with layer-wise adaptive scaling based on gradient norms. While not surpassing state-of-the-art methods, StableLayer achieves stable convergence with a final validation loss of 7.949 on the FineWeb benchmark, placing it behind standard AdamW (4.927) but ahead of less sophisticated baselines. Our analysis shows that careful gradient-norm adaptation provides training stability, particularly in the early stages, though it falls short of more sophisticated orthogonal-processing methods.
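The abstract does not specify the update rule, but a minimal sketch of the general idea, standard Adam moment estimates combined with a per-layer scale derived from the gradient norm, might look like the following. The class name, hyperparameter names, and the exact scaling rule are illustrative assumptions, not taken from the paper.

    import torch

    class StableLayerSketch(torch.optim.Optimizer):
        """Illustrative sketch: Adam-style update with a per-layer damping
        factor based on the gradient norm (assumed rule, not the paper's)."""

        def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
            super().__init__(params, dict(lr=lr, betas=betas, eps=eps))

        @torch.no_grad()
        def step(self):
            for group in self.param_groups:
                lr, (b1, b2), eps = group["lr"], group["betas"], group["eps"]
                for p in group["params"]:
                    if p.grad is None:
                        continue
                    g = p.grad
                    state = self.state[p]
                    if not state:
                        state["step"] = 0
                        state["m"] = torch.zeros_like(p)
                        state["v"] = torch.zeros_like(p)
                    state["step"] += 1
                    m, v, t = state["m"], state["v"], state["step"]
                    # Standard Adam first/second moment estimates with bias correction.
                    m.mul_(b1).add_(g, alpha=1 - b1)
                    v.mul_(b2).addcmul_(g, g, value=1 - b2)
                    m_hat = m / (1 - b1 ** t)
                    v_hat = v / (1 - b2 ** t)
                    update = m_hat / (v_hat.sqrt() + eps)
                    # Layer-wise scaling (per parameter tensor, a rough stand-in
                    # for "per layer"): conservatively damp layers whose gradient
                    # norm is large. This specific rule is an assumption.
                    scale = 1.0 / (1.0 + g.norm().item())
                    p.add_(update, alpha=-lr * scale)

The damping factor here shrinks toward zero as a layer's gradient norm grows, which matches the "conservative" framing in the title, but the actual StableLayer scaling rule may differ.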
Submission history
[v1] Sun, 2 Nov 2025 10:07 UTC