[Submitted on 1 Nov 2025]
Adaptive Momentum with Component Scaling: A Theoretical and Empirical Study
Abstract: This paper presents Adaptive Momentum with Component Scaling (AMCS), a novel optimizer for transformer language models that combines dual momentum estimation with structural adaptation. We derive the theoretical foundations of our approach, showing how component-specific scaling interacts with momentum adaptation. Experiments on the 134M-parameter Qwen architecture show that AMCS performs comparably to AdamW (4.957 vs. 4.927 validation loss), though it falls short of more specialized approaches. We analyze training dynamics, memory efficiency, and component interactions, and discuss the method's limitations and future directions.
Submission history
[v1] Sat, 1 Nov 2025 09:01 UTC