[Submitted on 3 Nov 2025]
FSDP-Compatible Optimizer Design: Lessons from a Negative Result
Abstract: This paper presents a cautionary case study in optimizer design for Fully Sharded Data Parallel (FSDP) training of transformer language models. While attempting to develop an improved optimizer compatible with FSDP constraints, we encountered fundamental limitations that prevented us from successfully implementing more advanced optimization techniques. Our final implementation, which pairs AdamW with layer-specific learning rates, achieved a validation loss of 5.54, underperforming both the AdamW baseline (4.93) and state-of-the-art methods (3.81-4.55). We analyze the technical challenges of FSDP-compatible optimization and discuss implications for future research in distributed training optimization.
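As a rough illustration of the kind of layer-specific learning-rate scheme the abstract refers to, the sketch below groups a toy model's parameters by layer and assigns each group its own learning rate under AdamW. The model, the depth-based scaling rule, and the learning-rate values are illustrative assumptions, not the paper's configuration; under FSDP, building per-layer parameter groups like this generally presupposes access to the original (unflattened) parameters, e.g. via use_orig_params=True.

    import torch
    import torch.nn as nn

    # Toy stand-in for a transformer language model; the real architecture
    # and layer names used in the paper are not specified here.
    model = nn.Sequential(
        nn.Embedding(1000, 64),
        nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True),
        nn.Linear(64, 1000),
    )

    # One parameter group per top-level layer, each with its own learning rate
    # (the base rate and the 1/(depth+1) scaling are illustrative choices only).
    base_lr = 3e-4
    param_groups = []
    for idx, layer in enumerate(model):
        param_groups.append({"params": layer.parameters(), "lr": base_lr / (idx + 1)})

    optimizer = torch.optim.AdamW(param_groups, weight_decay=0.01)

    # Note: FSDP flattens parameters per wrapped module, so per-layer groups
    # like these typically require wrapping with use_orig_params=True
    # (PyTorch >= 2.0) so the optimizer still sees the original parameters.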