Update the multi-tenants example to support torch.compile and disable lazy mode #47

Open · wants to merge 8 commits into base: master
27 changes: 26 additions & 1 deletion PyTorch/examples/multi_tenants/README.md
@@ -15,6 +15,10 @@ For further information on training deep learning models using Gaudi, refer to [
Please follow the instructions provided in the [Gaudi Installation Guide](https://docs.habana.ai/en/latest/Installation_Guide/GAUDI_Installation_Guide.html) to set up the
environment including the `$PYTHON` environment variable. The guide will walk you through the process of setting up your system to run the model on Gaudi.

### Create Docker Container and Set up Python

Please follow the instructions provided in [Run Using Containers on Habana Base AMI](https://docs.habana.ai/en/latest/AWS_User_Guides/Habana_Deep_Learning_AMI.html#run-using-containers-on-habana-base-ami) to pull the Docker image and launch the container. Make sure to set up Python inside the container by following [Model References Requirements](https://docs.habana.ai/en/latest/AWS_User_Guides/Habana_Deep_Learning_AMI.html#model-references-requirements).

### Clone Intel Gaudi Model-References

In the docker container, clone this repository and switch to the branch that matches your Intel Gaudi software version. You can run the [`hl-smi`](https://docs.habana.ai/en/latest/Management_and_Monitoring/System_Management_Tools_Guide/System_Management_Tools.html#hl-smi-utility-options) utility to determine the Intel Gaudi software version.
@@ -62,10 +66,31 @@ You can run multiple jobs in parallel using the script described in the following

### multi_tenants_resnet_pt.sh

#### Run 2 ResNet50 Jobs on a Total of 8 HPUs with torch.compile Enabled

Running the script without any arguments invokes 2 ResNet50 jobs in parallel, each using 4 Gaudis.

```bash
bash multi_tenants_resnet_pt.sh
```

#### Run 2 ResNet50 Jobs on a Total of 4 HPUs with torch.compile Enabled

You can also provide two sets of module IDs as script arguments. The following command invokes 2 jobs in parallel, each using 2 Gaudis.

```bash
bash multi_tenants_resnet_pt.sh "0,1" "2,3"
```
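The essence of the parallel invocation above is standard shell job control. The sketch below illustrates the assumed pattern (the `run_job` function, log file names, and the `echo` standing in for the real training command are all invented for illustration): each job runs in the background with its own `HABANA_VISIBLE_MODULES` value, and `wait` blocks until both finish.

```shell
# Hypothetical sketch of the parallel-launch pattern: one background job per
# module set, then wait for both. `echo` stands in for the training command.
run_job() {
  local modules="$1" log="$2"
  HABANA_VISIBLE_MODULES="$modules" echo "training on modules $modules" > "$log" &
}
run_job "0,1" job0.log
run_job "2,3" job1.log
wait  # do not return until both background jobs complete
cat job0.log job1.log
```

Because each job is scoped to its own module set, the two trainings do not contend for the same Gaudi cards.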

#### `HABANA_VISIBLE_MODULES`

Running `hl-smi -Q index,module_id -f csv` outputs, in CSV format, the mapping of each card index to its module ID. Use this to determine which module IDs are available for parallel training. The `HABANA_VISIBLE_MODULES` environment variable and the model Python script arguments must be set to different values for each of the two jobs.
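The CSV output can be processed mechanically; here is a minimal sketch (the sample text below is invented for illustration — on a real system it would come from `hl-smi -Q index,module_id -f csv`):

```shell
# Hypothetical sample of `hl-smi -Q index,module_id -f csv` output; on a real
# system, replace the literal with the command's actual output.
SAMPLE='index, module_id
0, 6
1, 7'
# Skip the header row and print "index -> module_id" pairs.
MAPPING=$(echo "$SAMPLE" | awk -F', *' 'NR > 1 { printf "%s -> %s\n", $1, $2 }')
echo "$MAPPING"
```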

`HABANA_VISIBLE_MODULES` is an environment variable holding a comma-separated list of module IDs, each a single-digit integer. The same integer must not be used by multiple jobs running in parallel:
- For jobs with 4 Gaudis, it is recommended to set this to "0,1,2,3" or "4,5,6,7".
- For jobs with 2 Gaudis, it is recommended to set this to "0,1", "2,3", "4,5", or "6,7".
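Since the variable is a plain comma-separated list, a launcher can derive the per-job card count from it directly. This is an assumed pattern, not taken verbatim from `multi_tenants_resnet_pt.sh`:

```shell
# Split HABANA_VISIBLE_MODULES on commas to count how many cards this job
# will use, e.g. to pass a matching world size to the training launcher.
export HABANA_VISIBLE_MODULES="4,5,6,7"
IFS=',' read -r -a MODULES <<< "$HABANA_VISIBLE_MODULES"
NUM_CARDS=${#MODULES[@]}
echo "Job will use $NUM_CARDS cards: ${MODULES[*]}"
```

Deriving the count from the variable keeps the module list as the single source of truth, so the card count and the visible modules cannot drift apart.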

## Changelog
### 1.16.2
- Added torch.compile support to improve training performance.
- Lazy mode support is deprecated for this example.
3 changes: 2 additions & 1 deletion PyTorch/examples/multi_tenants/multi_tenants_resnet_pt.sh
@@ -1,4 +1,5 @@
#!/bin/bash
export PT_HPU_LAZY_MODE=0
export MASTER_ADDR=localhost

SCRIPT_DIR=`dirname $(readlink -e ${BASH_SOURCE[0]})`
@@ -53,7 +54,7 @@ function run() {
--dl-time-exclude=False \
--custom-lr-values ${LR_VALUES} \
--custom-lr-milestones ${LR_MILESTONES} \
--seed=123 --run-lazy-mode=False --use_torch_compile 1> $STDOUT_LOG 2> $STDERR_LOG &

echo "Job ${JOB_ID} starts with ${NUM} cards, stdout: ${STDOUT_LOG}, stderr: ${STDERR_LOG}"
JOB_ID=$((JOB_ID+1))