
Conversation

@liuyun7345
Contributor

Summary

This PR adds the missing --outer-dp-sharding-strategy CLI argument to enable HSDP (Hybrid-Sharded Data Parallelism) with outer-layer optimizer sharding in Megatron-LM training scripts.

Changes

  • Added --outer-dp-sharding-strategy CLI argument in megatron/training/arguments.py (a sketch follows this list)
  • Supports values: 'no_shard' (default) and 'optim'
  • The 'optim' option requires --data-parallel-sharding-strategy optim_grads_params
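
For illustration, the new argument presumably follows the same argparse pattern as the existing --data-parallel-sharding-strategy option; the argument group, variable name, and help text below are assumptions rather than the verbatim change:

# Sketch only: illustrative argparse entry for arguments.py, not the exact committed code.
group.add_argument('--outer-dp-sharding-strategy', type=str, default='no_shard',
                   choices=['no_shard', 'optim'],
                   help='Sharding strategy for the outer data-parallel group when using '
                        'Megatron-FSDP (HSDP). "optim" requires '
                        '--data-parallel-sharding-strategy optim_grads_params.')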

Motivation

The underlying Megatron-FSDP codebase already supports the outer_dp_sharding_strategy parameter:

  • DistributedDataParallelConfig.outer_dp_sharding_strategy exists (default: 'no_shard')
  • fully_shard_model() API accepts outer_dp_sharding_strategy parameter
  • mcore_fsdp_adapter.py checks for outer_dp_sharding_strategy != "no_shard"

However, users could not configure this via command-line arguments, preventing them from using HSDP with outer optimizer sharding, which can provide significant memory savings (up to a 4x reduction in optimizer state memory for an outer-DP size of 4).

Usage Example

python pretrain_gpt.py \
    --use-megatron-fsdp \
    --data-parallel-sharding-strategy optim_grads_params \
    --outer-dp-sharding-strategy optim \
    --num-distributed-optimizer-instances 4 \
    ...

Implementation Details

The implementation leverages the existing automatic parameter-passing mechanism in training.py (lines 1281-1283), which extracts all DistributedDataParallelConfig fields from args. As a result, only the CLI argument needs to be added; no changes to training.py are required.
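
As a hedged sketch (not the verbatim code at training.py lines 1281-1283), the mechanism described above amounts to copying every matching attribute from the parsed args namespace into the DDP config:

import dataclasses

# Illustrative only: collect args attributes whose names match
# DistributedDataParallelConfig fields, then build the config from them.
kwargs = {
    f.name: getattr(args, f.name)
    for f in dataclasses.fields(DistributedDataParallelConfig)
    if hasattr(args, f.name)
}
ddp_config = DistributedDataParallelConfig(**kwargs)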

Validation logic already exists in fully_shard.py (sketched after this list):

  • Only 'no_shard' and 'optim' are supported for outer_dp_sharding_strategy
  • When using 'optim', data_parallel_sharding_strategy must be 'optim_grads_params'
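
A hedged sketch of what those checks amount to (the actual assertions in fully_shard.py may be worded and placed differently):

# Illustrative validation, mirroring the two constraints listed above.
assert outer_dp_sharding_strategy in ('no_shard', 'optim'), \
    f"Unsupported outer_dp_sharding_strategy: {outer_dp_sharding_strategy}"
if outer_dp_sharding_strategy == 'optim':
    assert data_parallel_sharding_strategy == 'optim_grads_params', \
        "outer_dp_sharding_strategy='optim' requires " \
        "data_parallel_sharding_strategy='optim_grads_params'"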

Testing

  • Code follows existing patterns (similar to --data-parallel-sharding-strategy)
  • Parameter automatically passed to DistributedDataParallelConfig via existing mechanism
  • Validation logic already exists in fully_shard.py
  • No syntax errors

Fixes #3038

@copy-pr-bot

copy-pr-bot bot commented Jan 23, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@ko3n1g ko3n1g requested a review from a team January 23, 2026 03:20
Contributor

@shjwudp shjwudp left a comment

LGTM. Thanks for your contribution!

@shjwudp shjwudp added the Final Review and module: megatron-fsdp labels Jan 23, 2026
@shjwudp
Contributor

shjwudp commented Jan 23, 2026

/ok to test dde6725

@github-actions
Contributor

Thank you for your contribution!

NVIDIA Megatron-LM is currently transitioning to development on Github. We will aim to review your PR after we complete our transition and stabilize our Github development process.

Thank you for your understanding.

@ko3n1g ko3n1g added this to the Core 0.16 milestone Jan 23, 2026
@BoxiangW BoxiangW requested a review from a team January 23, 2026 19:33
@chtruong814 chtruong814 added the needs-follow-up label Jan 25, 2026
@shjwudp
Contributor

shjwudp commented Jan 26, 2026

/ok to test 3f903f9

@shjwudp shjwudp added this pull request to the merge queue Jan 26, 2026
Merged via the queue into NVIDIA:main with commit bb42a00 Jan 26, 2026
48 checks passed
@chtruong814 chtruong814 removed the needs-follow-up label Jan 26, 2026

Labels

community-request · Final Review · module: megatron-fsdp

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[QUESTION] Missing CLI argument for outer_dp_sharding_strategy in Megatron-LM training scripts

4 participants