fix(fsdp): add CLI argument for outer_dp_sharding_strategy #3053
Conversation
This commit adds the missing --outer-dp-sharding-strategy CLI argument to enable HSDP (Hybrid-Sharded Data Parallelism) with outer-layer optimizer sharding in Megatron-LM training scripts. Fixes NVIDIA#3038
shjwudp left a comment:
LGTM. Thanks for your contribution!
/ok to test dde6725
Thank you for your contribution! NVIDIA Megatron-LM is currently transitioning to development on Github. We will aim to review your PR after we complete our transition and stabilize our Github development process. Thank you for your understanding.
/ok to test 3f903f9 |
Summary
This PR adds the missing `--outer-dp-sharding-strategy` CLI argument to enable HSDP (Hybrid-Sharded Data Parallelism) with outer-layer optimizer sharding in Megatron-LM training scripts.

Changes

- Added the `--outer-dp-sharding-strategy` CLI argument in `megatron/training/arguments.py` (see the sketch after this list)
- Supported values: `'no_shard'` (default) and `'optim'`
- The `'optim'` option requires `--data-parallel-sharding-strategy optim_grads_params`
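Since the PR diff itself is not reproduced here, the following is a minimal sketch of how such an argument is typically registered with argparse; the argument group title, helper name, and help text are assumptions, not the PR's actual code.

```python
import argparse

def _add_outer_dp_sharding_args(parser: argparse.ArgumentParser) -> argparse.ArgumentParser:
    # Hypothetical sketch; the real change lives in megatron/training/arguments.py
    # and reuses that file's existing argument groups and help text.
    group = parser.add_argument_group(title='distributed')
    group.add_argument('--outer-dp-sharding-strategy', type=str, default='no_shard',
                       choices=['no_shard', 'optim'],
                       help='Sharding strategy for the outer data-parallel group when '
                            'using Megatron-FSDP (HSDP). The "optim" choice requires '
                            '--data-parallel-sharding-strategy optim_grads_params.')
    return parser
```

Argparse derives the attribute name `args.outer_dp_sharding_strategy` from the flag by replacing dashes with underscores, so it lines up with the `DistributedDataParallelConfig` field of the same name.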
Motivation

The underlying Megatron-FSDP codebase already supports the `outer_dp_sharding_strategy` parameter:

- `DistributedDataParallelConfig.outer_dp_sharding_strategy` exists (default: `'no_shard'`)
- The `fully_shard_model()` API accepts an `outer_dp_sharding_strategy` parameter
- `mcore_fsdp_adapter.py` checks for `outer_dp_sharding_strategy != "no_shard"`

However, users could not configure this via command-line arguments, preventing them from using HSDP with outer optimizer sharding, which can provide significant memory savings (up to a 4x reduction in optimizer state memory for an outer-DP size of 4).
Usage Example
```
python pretrain_gpt.py \
    --use-megatron-fsdp \
    --data-parallel-sharding-strategy optim_grads_params \
    --outer-dp-sharding-strategy optim \
    --num-distributed-optimizer-instances 4 \
    ...
```

Implementation Details
The implementation leverages the existing automatic parameter passing mechanism in `training.py` (lines 1281-1283), which automatically extracts all `DistributedDataParallelConfig` fields from `args`. Therefore, only adding the CLI argument is needed; no changes to `training.py` are required.
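To make that mechanism concrete, here is an illustrative (not verbatim) sketch of the pattern: build the config by copying over every dataclass field whose name also exists on the parsed `args` namespace. The trimmed-down config class and helper name below are assumptions for illustration only.

```python
import dataclasses

@dataclasses.dataclass
class DistributedDataParallelConfig:
    # Only the fields relevant to this PR are shown; the real config has many more.
    data_parallel_sharding_strategy: str = 'no_shard'
    outer_dp_sharding_strategy: str = 'no_shard'

def ddp_config_from_args(args) -> DistributedDataParallelConfig:
    # Copy every config field that has a matching attribute on args, so any new
    # CLI flag with a matching name is picked up without touching this code.
    kwargs = {
        f.name: getattr(args, f.name)
        for f in dataclasses.fields(DistributedDataParallelConfig)
        if hasattr(args, f.name)
    }
    return DistributedDataParallelConfig(**kwargs)
```

This is why the PR can stop at `arguments.py`: once `args.outer_dp_sharding_strategy` exists, the field flows into the config automatically.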
Validation logic already exists in `fully_shard.py` (see the sketch below):

- Only `'no_shard'` and `'optim'` are supported for `outer_dp_sharding_strategy`
- With `'optim'`, `data_parallel_sharding_strategy` must be `'optim_grads_params'`
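Below is a hedged sketch of those two checks, written as a standalone helper for clarity; the actual checks in `fully_shard.py` may be structured differently.

```python
def validate_outer_dp_sharding_strategy(outer_dp_sharding_strategy: str,
                                        data_parallel_sharding_strategy: str) -> None:
    # Hypothetical helper illustrating the validation rules described above.
    if outer_dp_sharding_strategy not in ('no_shard', 'optim'):
        raise ValueError(
            f"Unsupported outer_dp_sharding_strategy: {outer_dp_sharding_strategy!r}; "
            "only 'no_shard' and 'optim' are supported."
        )
    if (outer_dp_sharding_strategy == 'optim'
            and data_parallel_sharding_strategy != 'optim_grads_params'):
        raise ValueError(
            "outer_dp_sharding_strategy='optim' requires "
            "data_parallel_sharding_strategy='optim_grads_params'."
        )
```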
Testing

- The new argument follows the same pattern as the existing sharding argument (e.g. `--data-parallel-sharding-strategy`)
- The value is passed to `DistributedDataParallelConfig` via the existing mechanism
- Validation is covered by the existing checks in `fully_shard.py`

Fixes #3038