[Submitted on 5 Nov 2025]
Sophia-Lite: A Simplified Hessian-Aware Optimizer for Language Models
Abstract: We present Sophia-Lite, a simplified second-order optimizer for language models that combines efficient Hessian approximation with adaptive gradient updates. While recent work has demonstrated the promise of Hessian-aware optimization, existing approaches often require expensive computation or complex implementations. Our method uses gradient magnitudes as a lightweight approximation of the diagonal Hessian, motivated by theoretical analysis of gradient-Hessian relationships in deep networks. On a 134M parameter Transformer trained on FineWeb, Sophia-Lite achieves performance comparable to AdamW (validation loss 4.959 vs. 4.927) while maintaining stable training dynamics. We provide extensive analysis of the tradeoffs between approximation quality, memory overhead, and computational efficiency.
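The abstract does not state the exact update rule, but a Sophia-style preconditioned step that substitutes an exponential moving average of gradient magnitudes for the diagonal Hessian estimate might look like the minimal sketch below. The hyperparameter names and values (lr, beta1, beta2, rho, eps) and the element-wise clipping form are assumptions carried over from the original Sophia optimizer, not details confirmed by the paper.

```python
import numpy as np

def sophia_lite_step(param, grad, m, h, lr=1e-4, beta1=0.9, beta2=0.99,
                     rho=0.05, eps=1e-12):
    """One illustrative Sophia-Lite-style update on a parameter array.

    m : EMA of gradients (first moment)
    h : EMA of |grad|, standing in for the diagonal Hessian estimate
    """
    m = beta1 * m + (1 - beta1) * grad
    # Gradient-magnitude proxy for the diagonal Hessian (assumed form)
    h = beta2 * h + (1 - beta2) * np.abs(grad)
    # Sophia-style preconditioned update with element-wise clipping
    update = np.clip(m / np.maximum(rho * h, eps), -1.0, 1.0)
    param = param - lr * update
    return param, m, h

# Toy usage: minimize f(x) = 0.5 * ||x||^2
x = np.ones(4)
m = np.zeros_like(x)
h = np.zeros_like(x)
for _ in range(100):
    grad = x  # gradient of 0.5 * ||x||^2
    x, m, h = sophia_lite_step(x, grad, m, h, lr=1e-1)
print(x)  # should move toward zero
```

The clipping bounds the per-coordinate step size, which is the mechanism Sophia uses to keep training stable when the curvature estimate is small or noisy; the sketch assumes Sophia-Lite inherits that behavior.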
Submission history
[v1] Wed, 5 Nov 2025 12:28 UTC