Benchmark: Model benchmark - Add Multi-Node Support for LLaMA Benchmarks #759
base: main
Conversation
@microsoft-github-policy-service agree company="Microsoft"
## Multi-node LLaMA Benchmarks

SuperBench uses [torchrun](https://docs.pytorch.org/docs/stable/elastic/run.html) for multi-node LLaMA benchmarks based on PyTorch. Follow the steps below.
Why does this say it is "for multi-node LLaMA benchmarks"? It looks like the change should target all models that run in the torch.distributed mode.
NCCL_SOCKET_IFNAME: 'eth0'
NCCL_IB_DISABLE: '1'
NCCL_IGNORE_DISABLED_P2P: '0'
MASTER_ADDR: '10.0.0.6'  # Example of rank 0 node IP
Can we make these two parameters, MASTER_ADDR and MASTER_PORT, optional? For example, give the port a default value and choose the first node as the default address. The reason is that in some automated cases we'd like to use the existing configuration file directly and not change it for each run.
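A minimal sketch of how such defaults could be resolved in the runner; `DEFAULT_MASTER_PORT`, `resolve_master_endpoint`, and the `host_list` argument are illustrative names, not part of this PR:

```python
import os

# Illustrative default; 29500 is the port torchrun commonly falls back to.
DEFAULT_MASTER_PORT = '29500'


def resolve_master_endpoint(env, host_list):
    """Pick MASTER_ADDR/MASTER_PORT, falling back to defaults.

    env       -- environment section of the benchmark config (dict)
    host_list -- ordered list of node hostnames/IPs known to the runner
    """
    # Use the explicitly configured address if present, otherwise the first node.
    master_addr = env.get('MASTER_ADDR') or host_list[0]
    # Use the explicitly configured port if present, otherwise a fixed default.
    master_port = env.get('MASTER_PORT') or DEFAULT_MASTER_PORT
    return master_addr, master_port


# Example: a config that omits both values still resolves to usable settings.
addr, port = resolve_master_endpoint({}, ['10.0.0.6', '10.0.0.7'])
# addr == '10.0.0.6', port == '29500'
```

With a fallback like this, an unchanged configuration file would still launch, while explicit MASTER_ADDR/MASTER_PORT entries would override the defaults.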
self._local_world_size = int(os.environ['LOCAL_WORLD_SIZE'])
self._multi_node = True if self._world_size != self._local_world_size else False

if self._multi_node:
Do we really need to distinguish between multi_node and single_node here? The current implementation should already work for multiple nodes; it handles both multi-node distributed benchmarking and single-node (multiple cards) parallel benchmarking. Have you checked whether the current implementation works as-is?
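For context, a self-contained sketch of the detection logic under discussion: torchrun exports `WORLD_SIZE` (total ranks) and `LOCAL_WORLD_SIZE` (ranks on this node) for every worker, so comparing the two indicates whether ranks span more than one node. This is an illustration, not the PR's exact code:

```python
import os


def is_multi_node():
    """Return True when the job spans more than one node.

    torchrun sets WORLD_SIZE (total ranks) and LOCAL_WORLD_SIZE
    (ranks on this node); they differ only when more than one node
    participates in the job.
    """
    world_size = int(os.environ.get('WORLD_SIZE', 1))
    local_world_size = int(os.environ.get('LOCAL_WORLD_SIZE', world_size))
    return world_size != local_world_size
```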
f'--nnodes={mode.node_num} --rdzv-endpoint=$MASTER_ADDR:$MASTER_PORT '
f'--rdzv-id={random.randint(100, 999)} --rdzv-backend=c10d ' if
What happens if we use --nnodes=$NNODES --node_rank=$NODE_RANK --master_addr=$MASTER_ADDR --master_port=$MASTER_PORT instead of the "rdzv"-related parameters? Does it still work?
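For reference, the two launcher styles being compared, written as the kind of f-string command construction the runner uses; `node_num`, `proc_num`, and `train.py` are illustrative placeholders:

```python
import random

node_num, proc_num = 2, 8  # illustrative values

# Style used in this PR: c10d rendezvous, every node points at the same endpoint.
rdzv_cmd = (
    f'torchrun --nnodes={node_num} --nproc_per_node={proc_num} '
    f'--rdzv-backend=c10d --rdzv-endpoint=$MASTER_ADDR:$MASTER_PORT '
    f'--rdzv-id={random.randint(100, 999)} train.py'
)

# Static style asked about in the comment: explicit rank and master address per node.
static_cmd = (
    f'torchrun --nnodes={node_num} --nproc_per_node={proc_num} '
    f'--node_rank=$NODE_RANK --master_addr=$MASTER_ADDR '
    f'--master_port=$MASTER_PORT train.py'
)
```

Both forms are accepted by torchrun: the static form requires each node to be told its own rank, while the c10d rendezvous form lets workers discover their ranks through the shared endpoint.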
Description
Enable distributed multi-node execution for LLaMA benchmarks in SuperBench using
torchrun. Includes configuration for multiple nodes and GPUs, dynamic rendezvous setup, and documentation updates to guide users on running multi-node benchmarks.

Files Changed
- `docs/user-tutorial/benchmarks/model-benchmarks.md`: documents running multi-node benchmarks with `torchrun`, covering `node_num`, `proc_num`, `MASTER_ADDR`, and `MASTER_PORT`.
- `superbench/benchmarks/model_benchmarks/pytorch_base.py`: adds multi-node detection (`_multi_node`) and initializes the `torch.distributed` process group for multi-node training, using `rank`, `world_size`, and master node info via a `TCPStore`.
- `superbench/runner/runner.py`: adds `--nnodes` and `--rdzv-endpoint` arguments to `torchrun` if multi-node mode is enabled.
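A hedged sketch of the kind of initialization the `pytorch_base.py` change describes, building a `TCPStore` on the master node and passing it to `torch.distributed.init_process_group`; the exact arguments in the PR may differ:

```python
import os

import torch.distributed as dist
from torch.distributed import TCPStore

# These environment variables are set by torchrun on every worker.
rank = int(os.environ['RANK'])
world_size = int(os.environ['WORLD_SIZE'])
master_addr = os.environ['MASTER_ADDR']
master_port = int(os.environ['MASTER_PORT'])

# Rank 0 hosts the store; every other rank connects to it.
store = TCPStore(master_addr, master_port, world_size, is_master=(rank == 0))
dist.init_process_group(backend='nccl', store=store, rank=rank, world_size=world_size)
```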