[Submitted on 3 Nov 2025]
FSDP-Compatible Optimizer Design: Lessons from a Negative Result
Abstract: This paper presents a cautionary case study in optimizer design for Fully Sharded Data Parallel (FSDP) training of transformer language models. While attempting to develop an improved optimizer compatible with FSDP constraints, we encountered fundamental limitations that prevented us from successfully implementing more advanced optimization techniques. Our final implementation, which pairs AdamW with layer-specific learning rates, achieved a validation loss of 5.54, underperforming both the AdamW baseline (4.93) and state-of-the-art methods (3.81-4.55). We analyze the technical challenges of FSDP-compatible optimization and discuss implications for future research in distributed training optimization.
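As a rough illustration of the kind of layer-specific learning-rate scheme the abstract refers to, the sketch below groups a toy model's parameters by layer and assigns each group its own learning rate under AdamW. The model, the depth-based scaling rule, and the learning-rate values are illustrative assumptions, not the paper's configuration; under FSDP, building per-layer parameter groups like this generally presupposes access to the original (unflattened) parameters, e.g. via use_orig_params=True.

    import torch
    import torch.nn as nn

    # Toy stand-in for a transformer language model; the real architecture
    # and layer names used in the paper are not specified here.
    model = nn.Sequential(
        nn.Embedding(1000, 64),
        nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True),
        nn.Linear(64, 1000),
    )

    # One parameter group per top-level layer, each with its own learning rate
    # (the base rate and the 1/(depth+1) scaling are illustrative choices only).
    base_lr = 3e-4
    param_groups = []
    for idx, layer in enumerate(model):
        param_groups.append({"params": layer.parameters(), "lr": base_lr / (idx + 1)})

    optimizer = torch.optim.AdamW(param_groups, weight_decay=0.01)

    # Note: FSDP flattens parameters per wrapped module, so per-layer groups
    # like these typically require wrapping with use_orig_params=True
    # (PyTorch >= 2.0) so the optimizer still sees the original parameters.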