diff --git a/PyTorch/examples/multi_tenants/README.md b/PyTorch/examples/multi_tenants/README.md
index a3e728ccc..c93cc1ad4 100644
--- a/PyTorch/examples/multi_tenants/README.md
+++ b/PyTorch/examples/multi_tenants/README.md
@@ -15,6 +15,10 @@ For further information on training deep learning models using Gaudi, refer to [
 
 Please follow the instructions provided in the [Gaudi Installation Guide](https://docs.habana.ai/en/latest/Installation_Guide/GAUDI_Installation_Guide.html) to set up the environment including the `$PYTHON` environment variable. The guide will walk you through the process of setting up your system to run the model on Gaudi.
 
+### Create Docker Container and Set up Python
+
+Please follow the instructions provided in [Run Using Containers on Habana Base AMI](https://docs.habana.ai/en/latest/AWS_User_Guides/Habana_Deep_Learning_AMI.html#run-using-containers-on-habana-base-ami) to pull the docker image and launch the container. Then set up Python inside the docker container by following [Model References Requirements](https://docs.habana.ai/en/latest/AWS_User_Guides/Habana_Deep_Learning_AMI.html#model-references-requirements).
+
 ### Clone Intel Gaudi Model-References
 
 In the docker container, clone this repository and switch to the branch that matches your Intel Gaudi software version. You can run the [`hl-smi`](https://docs.habana.ai/en/latest/Management_and_Monitoring/System_Management_Tools_Guide/System_Management_Tools.html#hl-smi-utility-options) utility to determine the Intel Gaudi software version.
@@ -62,10 +66,31 @@ You can run multiple jobs in parallel using the script described in the followin
 
 ### multi_tenants_resnet_pt.sh
 
-Running `multi_tenants_resnet_pt.sh` script without setting any arguments invokes 2 ResNet50 jobs in parallel, each using 4 Gaudis. User can also provide two sets of module IDs as the script arguments, i.e., `multi_tenants_resnet_pt.sh "0,1" "2,3"`, invokes 2 jobs in parallel, each using 2 Gaudis. Using the command `hl-smi -Q index,module_id -f csv` will produce a .csv file which will show the corresponding to the AIP number mapped to module_id. This can be used to find which module IDs are available for parallel training. The `HABANA_VISIBLE_MODULES` environment variable and model python script arguments need to be explicitly specified as different values for both jobs.
+#### Run 2 ResNet50 Jobs on a Total of 8 HPUs with torch.compile Enabled
+
+Running the script without any arguments invokes 2 ResNet50 jobs in parallel, each using 4 Gaudis.
+
+```bash
+bash multi_tenants_resnet_pt.sh
+```
+
+#### Run 2 ResNet50 Jobs on a Total of 4 HPUs with torch.compile Enabled
+
+You can also provide two sets of module IDs as the script arguments. The following command invokes 2 jobs in parallel, each using 2 Gaudis.
+
+```bash
+bash multi_tenants_resnet_pt.sh "0,1" "2,3"
+```
 
 #### `HABANA_VISIBLE_MODULES`
 
+Running `hl-smi -Q index,module_id -f csv` prints, in CSV format, the mapping between each AIP index and its module_id. This can be used to find which module IDs are available for parallel training. The `HABANA_VISIBLE_MODULES` environment variable and the model's Python script arguments must be explicitly set to different values for the two jobs.
+
 `HABANA_VISIBLE_MODULES` is an environment variable for the list of module IDs, composed by a sequence of single digit integers.
 The same integer should not be used by multiple jobs running in parallel: For jobs with 4 Gaudis, it is recommended to set this to "0,1,2,3" or "4,5,6,7". For jobs with 2 Gaudis, it is recommended to set this to "0,1", "2,3", "4,5", or "6,7".
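+
+For illustration, assuming all eight modules are free, two 4-Gaudi jobs could be pinned to non-overlapping sets as sketched below; the values simply follow the recommendation above, and when using `multi_tenants_resnet_pt.sh` the equivalent values are passed as the script arguments rather than exported manually:
+
+```bash
+hl-smi -Q index,module_id -f csv          # list the AIP-index-to-module_id mapping
+export HABANA_VISIBLE_MODULES="0,1,2,3"   # first job's environment
+# in the second job's environment, use a non-overlapping set instead:
+# export HABANA_VISIBLE_MODULES="4,5,6,7"
+```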
+
+## Changelog
+### 1.16.2
+ - Added torch.compile support to improve training performance.
+ - Lazy mode support is deprecated for this example.
diff --git a/PyTorch/examples/multi_tenants/multi_tenants_resnet_pt.sh b/PyTorch/examples/multi_tenants/multi_tenants_resnet_pt.sh
index 6a03df9eb..1b1818a20 100755
--- a/PyTorch/examples/multi_tenants/multi_tenants_resnet_pt.sh
+++ b/PyTorch/examples/multi_tenants/multi_tenants_resnet_pt.sh
@@ -1,4 +1,5 @@
 #!/bin/bash
+export PT_HPU_LAZY_MODE=0
 export MASTER_ADDR=localhost
 
 SCRIPT_DIR=`dirname $(readlink -e ${BASH_SOURCE[0]})`
@@ -53,7 +54,7 @@ function run() {
         --dl-time-exclude=False \
         --custom-lr-values ${LR_VALUES} \
         --custom-lr-milestones ${LR_MILESTONES} \
-        --seed=123 1> $STDOUT_LOG 2> $STDERR_LOG &
+        --seed=123 --run-lazy-mode=False --use_torch_compile 1> $STDOUT_LOG 2> $STDERR_LOG &
     echo "Job ${JOB_ID} starts with ${NUM} cards, stdout: ${STDOUT_LOG}, stderr: ${STDERR_LOG}"
     JOB_ID=$((JOB_ID+1))
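
Taken together, these changes switch each tenant from lazy mode to eager execution with `torch.compile`: `PT_HPU_LAZY_MODE=0` disables lazy mode at the Gaudi PyTorch bridge level, and `--run-lazy-mode=False --use_torch_compile` tell the ResNet50 training script to compile the model instead. A rough per-tenant sketch of the new flag combination is shown below; `train.py` is a placeholder for the training entry point the script drives, and the distributed launch wrapper plus the dataset, learning-rate, and logging arguments are omitted:

```bash
# Illustrative only: the real command line is assembled inside multi_tenants_resnet_pt.sh.
export PT_HPU_LAZY_MODE=0   # run the Gaudi PyTorch bridge in eager mode rather than lazy mode
$PYTHON train.py --run-lazy-mode=False --use_torch_compile --seed=123
```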