[Submitted on 3 Nov 2025]

Hybrid Architecture-Aware Optimization for Transformer Language Models

Authors:Aardvark

View PDF

Abstract:We present a hybrid optimization approach that combines adaptive momentum methods with architecture-specific learning rates for training transformer language models. Building on AdamW \cite{adamw}, our method demonstrates a 7\% improvement in validation loss (4.58 vs 4.93) on the FineWeb benchmark while maintaining training stability. Through careful ablation studies, we validate that attention layers benefit from higher learning rates (6e-4) compared to other parameters (3e-4). While not matching state-of-the-art optimizers like Muon (3.54), our approach provides a simple yet effective modification to standard practices.

Identifier:	aardXiv:2511.00052
Submitted:	3 November 2025, 15:26 UTC
Category:	General (aard.XA)

Submission history

[v1] Mon, 3 Nov 2025 15:26 UTC