Benchmark: Model benchmark - Add Multi-Node Support for LLaMA Benchmarks #759
base: main
Conversation
@microsoft-github-policy-service agree company="Microsoft"
## Multi-node LLaMA Benchmarks

SuperBench uses [torchrun](https://docs.pytorch.org/docs/stable/elastic/run.html) for multi-node LLaMA benchmarks based on PyTorch. Follow the steps below.
Why does this say it is "for multi-node LLaMA benchmarks"? It looks like the change should target all models that run in the torch.distributed mode.
NCCL_SOCKET_IFNAME: 'eth0'
NCCL_IB_DISABLE: '1'
NCCL_IGNORE_DISABLED_P2P: '0'
MASTER_ADDR: '10.0.0.6'  # Example of rank 0 node IP
Can we make these two parameters, MASTER_ADDR and MASTER_PORT, optional? For example, give the port a default value and choose the first node as the default address. The reason is that in some automated cases we'd like to use the existing configuration file directly and not change it for each run.
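A minimal sketch of how such defaults could be resolved in the runner; `DEFAULT_MASTER_PORT`, `resolve_master_endpoint`, and the `host_list` argument are illustrative names, not part of this PR:

```python
import os

# Illustrative default; 29500 is the port torchrun commonly falls back to.
DEFAULT_MASTER_PORT = '29500'


def resolve_master_endpoint(env, host_list):
    """Pick MASTER_ADDR/MASTER_PORT, falling back to defaults.

    env       -- environment section of the benchmark config (dict)
    host_list -- ordered list of node hostnames/IPs known to the runner
    """
    # Use the explicitly configured address if present, otherwise the first node.
    master_addr = env.get('MASTER_ADDR') or host_list[0]
    # Use the explicitly configured port if present, otherwise a fixed default.
    master_port = env.get('MASTER_PORT') or DEFAULT_MASTER_PORT
    return master_addr, master_port


# Example: a config that omits both values still resolves to usable settings.
addr, port = resolve_master_endpoint({}, ['10.0.0.6', '10.0.0.7'])
# addr == '10.0.0.6', port == '29500'
```

With a fallback like this, an unchanged configuration file would still launch, while explicit MASTER_ADDR/MASTER_PORT entries would override the defaults.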
self._local_world_size = int(os.environ['LOCAL_WORLD_SIZE'])
self._multi_node = True if self._world_size != self._local_world_size else False

if self._multi_node:
Do we really need to distinguish between multi_node and single_node here? The current implementation should already work for multiple nodes; it handles both multi-node distributed benchmarking and single-node (multiple cards) parallel benchmarking. Have you checked whether the current implementation works as-is?
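For context, a self-contained sketch of the detection logic under discussion: torchrun exports `WORLD_SIZE` (total ranks) and `LOCAL_WORLD_SIZE` (ranks on this node) for every worker, so comparing the two indicates whether ranks span more than one node. This is an illustration, not the PR's exact code:

```python
import os


def is_multi_node():
    """Return True when the job spans more than one node.

    torchrun sets WORLD_SIZE (total ranks) and LOCAL_WORLD_SIZE
    (ranks on this node); they differ only when more than one node
    participates in the job.
    """
    world_size = int(os.environ.get('WORLD_SIZE', 1))
    local_world_size = int(os.environ.get('LOCAL_WORLD_SIZE', world_size))
    return world_size != local_world_size
```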
f'--nnodes={mode.node_num} --rdzv-endpoint=$MASTER_ADDR:$MASTER_PORT '
f'--rdzv-id={random.randint(100, 999)} --rdzv-backend=c10d ' if
What happens if we use --nnodes=$NNODES --node_rank=$NODE_RANK --master_addr=$MASTER_ADDR --master_port=$MASTER_PORT instead of the "rdzv"-related parameters? Does it still work?
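For reference, the two launcher styles being compared, written as the kind of f-string command construction the runner uses; `node_num`, `proc_num`, and `train.py` are illustrative placeholders:

```python
import random

node_num, proc_num = 2, 8  # illustrative values

# Style used in this PR: c10d rendezvous, every node points at the same endpoint.
rdzv_cmd = (
    f'torchrun --nnodes={node_num} --nproc_per_node={proc_num} '
    f'--rdzv-backend=c10d --rdzv-endpoint=$MASTER_ADDR:$MASTER_PORT '
    f'--rdzv-id={random.randint(100, 999)} train.py'
)

# Static style asked about in the comment: explicit rank and master address per node.
static_cmd = (
    f'torchrun --nnodes={node_num} --nproc_per_node={proc_num} '
    f'--node_rank=$NODE_RANK --master_addr=$MASTER_ADDR '
    f'--master_port=$MASTER_PORT train.py'
)
```

Both forms are accepted by torchrun: the static form requires each node to be told its own rank, while the c10d rendezvous form lets workers discover their ranks through the shared endpoint.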
Description
Enable distributed multi-node execution for LLaMA benchmarks in SuperBench using
torchrun. Includes configuration for multiple nodes and GPUs, dynamic rendezvous setup, and documentation updates to guide users on running multi-node benchmarks.

Files Changed
- `docs/user-tutorial/benchmarks/model-benchmarks.md`: documents running multi-node benchmarks with `torchrun`, covering `node_num`, `proc_num`, `MASTER_ADDR`, and `MASTER_PORT`.
- `superbench/benchmarks/model_benchmarks/pytorch_base.py`: adds multi-node detection (`_multi_node`) and initializes the `torch.distributed` process group for multi-node training, using `rank`, `world_size`, and master node info via a `TCPStore`.
- `superbench/runner/runner.py`: adds `--nnodes` and `--rdzv-endpoint` arguments to `torchrun` if multi-node mode is enabled.
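A hedged sketch of the kind of initialization the `pytorch_base.py` change describes, building a `TCPStore` on the master node and passing it to `torch.distributed.init_process_group`; the exact arguments in the PR may differ:

```python
import os

import torch.distributed as dist
from torch.distributed import TCPStore

# These environment variables are set by torchrun on every worker.
rank = int(os.environ['RANK'])
world_size = int(os.environ['WORLD_SIZE'])
master_addr = os.environ['MASTER_ADDR']
master_port = int(os.environ['MASTER_PORT'])

# Rank 0 hosts the store; every other rank connects to it.
store = TCPStore(master_addr, master_port, world_size, is_master=(rank == 0))
dist.init_process_group(backend='nccl', store=store, rank=rank, world_size=world_size)
```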