[Submitted on 30 Oct 2025]
StratOpt: A Stratified Optimization Approach for Language Model Training
Abstract: This paper presents StratOpt, a novel optimization approach for training large language models that combines layer-wise adaptation with variance-stabilized gradient updates. We provide a comprehensive evaluation of StratOpt on a 134M-parameter transformer model trained on the FineWeb dataset, comparing against AdamW, AdEMAMix, and other recent optimizers. While StratOpt demonstrates improvements over AdEMAMix (5.209 vs. 5.424 validation loss), it does not surpass the AdamW baseline (4.927). Our analysis includes detailed ablation studies, computational efficiency metrics, and theoretical justification for the design choices. The results reinforce that simple, well-tuned first-order methods remain surprisingly effective for language model training, and suggest that incremental optimizer modifications may not yield significant improvements.
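The abstract names StratOpt's two ingredients but not its update rule. As a rough illustration only, the sketch below combines a LAMB-style per-layer trust ratio (one common form of layer-wise adaptation) with Adam-style second-moment normalization (one common form of variance stabilization). The class name, hyperparameters, and the exact combination are assumptions for illustration, not the paper's actual method.

```python
# Minimal sketch of ONE plausible reading of "layer-wise adaptation with
# variance-stabilized gradient updates". Not StratOpt itself: the paper's
# update rule is not given in the abstract, so this pairs two standard
# ingredients from the literature (Adam-style moments + LAMB-style trust
# ratio). All names and defaults here are invented for illustration.
import torch


class LayerwiseVarStabilizedOptimizer(torch.optim.Optimizer):
    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
        defaults = dict(lr=lr, betas=betas, eps=eps)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            beta1, beta2 = group["betas"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state[p]
                if not state:
                    state["step"] = 0
                    state["m"] = torch.zeros_like(p)  # first moment
                    state["v"] = torch.zeros_like(p)  # second moment
                state["step"] += 1
                t = state["step"]
                m, v = state["m"], state["v"]

                # Adam-style moment estimates: dividing by sqrt(v) damps
                # high-variance gradient directions ("variance stabilization").
                m.mul_(beta1).add_(p.grad, alpha=1 - beta1)
                v.mul_(beta2).addcmul_(p.grad, p.grad, value=1 - beta2)
                m_hat = m / (1 - beta1 ** t)
                v_hat = v / (1 - beta2 ** t)
                update = m_hat / (v_hat.sqrt() + group["eps"])

                # LAMB-style trust ratio, computed per parameter tensor
                # (one interpretation of "layer-wise adaptation"): rescale
                # the step by ||w|| / ||update|| so every layer moves a
                # comparable relative distance.
                w_norm = p.norm()
                u_norm = update.norm()
                trust = torch.where(
                    (w_norm > 0) & (u_norm > 0),
                    w_norm / u_norm,
                    torch.ones_like(w_norm),
                )
                p.sub_(group["lr"] * trust * update)


# Usage (hypothetical):
#   opt = LayerwiseVarStabilizedOptimizer(model.parameters(), lr=1e-3)
#   loss.backward(); opt.step(); opt.zero_grad()
```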
Submission history
[v1] Thu, 30 Oct 2025 23:56 UTC