fix(fsdp): add CLI argument for outer_dp_sharding_strategy #3053
Conversation
This commit adds the missing --outer-dp-sharding-strategy CLI argument to enable HSDP (Hybrid-Sharded Data Parallelism) with outer-layer optimizer sharding in Megatron-LM training scripts. Fixes NVIDIA#3038
shjwudp left a comment:
LGTM. Thanks for your contribution!
/ok to test dde6725
Thank you for your contribution! NVIDIA Megatron-LM is currently transitioning to development on Github. We will aim to review your PR after we complete our transition and stabilize our Github development process. Thank you for your understanding.
/ok to test 3f903f9 |
Summary
This PR adds the missing `--outer-dp-sharding-strategy` CLI argument to enable HSDP (Hybrid-Sharded Data Parallelism) with outer-layer optimizer sharding in Megatron-LM training scripts.

Changes

- Added the `--outer-dp-sharding-strategy` CLI argument in `megatron/training/arguments.py` (see the sketch after this list)
- Supported values: `'no_shard'` (default) and `'optim'`
- The `'optim'` option requires `--data-parallel-sharding-strategy optim_grads_params`
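Since the PR diff itself is not reproduced here, the following is a minimal sketch of how such an argument is typically registered with argparse; the argument group title, helper name, and help text are assumptions, not the PR's actual code.

```python
import argparse

def _add_outer_dp_sharding_args(parser: argparse.ArgumentParser) -> argparse.ArgumentParser:
    # Hypothetical sketch; the real change lives in megatron/training/arguments.py
    # and reuses that file's existing argument groups and help text.
    group = parser.add_argument_group(title='distributed')
    group.add_argument('--outer-dp-sharding-strategy', type=str, default='no_shard',
                       choices=['no_shard', 'optim'],
                       help='Sharding strategy for the outer data-parallel group when '
                            'using Megatron-FSDP (HSDP). The "optim" choice requires '
                            '--data-parallel-sharding-strategy optim_grads_params.')
    return parser
```

Argparse derives the attribute name `args.outer_dp_sharding_strategy` from the flag by replacing dashes with underscores, so it lines up with the `DistributedDataParallelConfig` field of the same name.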
Motivation

The underlying Megatron-FSDP codebase already supports the `outer_dp_sharding_strategy` parameter:

- `DistributedDataParallelConfig.outer_dp_sharding_strategy` exists (default: `'no_shard'`)
- The `fully_shard_model()` API accepts an `outer_dp_sharding_strategy` parameter
- `mcore_fsdp_adapter.py` checks for `outer_dp_sharding_strategy != "no_shard"`

However, users could not configure this via command-line arguments, preventing them from using HSDP with outer optimizer sharding, which can provide significant memory savings (up to a 4x reduction in optimizer state memory for an outer-DP size of 4).
Usage Example
```
python pretrain_gpt.py \
    --use-megatron-fsdp \
    --data-parallel-sharding-strategy optim_grads_params \
    --outer-dp-sharding-strategy optim \
    --num-distributed-optimizer-instances 4 \
    ...
```

Implementation Details
The implementation leverages the existing automatic parameter passing mechanism in `training.py` (lines 1281-1283), which automatically extracts all `DistributedDataParallelConfig` fields from `args`. Therefore, only adding the CLI argument is needed; no changes to `training.py` are required.
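To make that mechanism concrete, here is an illustrative (not verbatim) sketch of the pattern: build the config by copying over every dataclass field whose name also exists on the parsed `args` namespace. The trimmed-down config class and helper name below are assumptions for illustration only.

```python
import dataclasses

@dataclasses.dataclass
class DistributedDataParallelConfig:
    # Only the fields relevant to this PR are shown; the real config has many more.
    data_parallel_sharding_strategy: str = 'no_shard'
    outer_dp_sharding_strategy: str = 'no_shard'

def ddp_config_from_args(args) -> DistributedDataParallelConfig:
    # Copy every config field that has a matching attribute on args, so any new
    # CLI flag with a matching name is picked up without touching this code.
    kwargs = {
        f.name: getattr(args, f.name)
        for f in dataclasses.fields(DistributedDataParallelConfig)
        if hasattr(args, f.name)
    }
    return DistributedDataParallelConfig(**kwargs)
```

This is why the PR can stop at `arguments.py`: once `args.outer_dp_sharding_strategy` exists, the field flows into the config automatically.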
Validation logic already exists in `fully_shard.py` (see the sketch below):

- Only `'no_shard'` and `'optim'` are supported for `outer_dp_sharding_strategy`
- With `'optim'`, `data_parallel_sharding_strategy` must be `'optim_grads_params'`
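Below is a hedged sketch of those two checks, written as a standalone helper for clarity; the actual checks in `fully_shard.py` may be structured differently.

```python
def validate_outer_dp_sharding_strategy(outer_dp_sharding_strategy: str,
                                        data_parallel_sharding_strategy: str) -> None:
    # Hypothetical helper illustrating the validation rules described above.
    if outer_dp_sharding_strategy not in ('no_shard', 'optim'):
        raise ValueError(
            f"Unsupported outer_dp_sharding_strategy: {outer_dp_sharding_strategy!r}; "
            "only 'no_shard' and 'optim' are supported."
        )
    if (outer_dp_sharding_strategy == 'optim'
            and data_parallel_sharding_strategy != 'optim_grads_params'):
        raise ValueError(
            "outer_dp_sharding_strategy='optim' requires "
            "data_parallel_sharding_strategy='optim_grads_params'."
        )
```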
Testing

- The new argument follows the same pattern as the existing sharding argument (e.g. `--data-parallel-sharding-strategy`)
- The value is passed to `DistributedDataParallelConfig` via the existing mechanism
- Validation is covered by the existing checks in `fully_shard.py`

Fixes #3038