From 02549e7e9ba56a6f522221eff92b1a783431c757 Mon Sep 17 00:00:00 2001 From: Dmitry Starov Date: Fri, 25 Oct 2024 11:52:41 +0200 Subject: [PATCH 1/5] [DOC] Fix non-rendering alerts && replace syntax `bash` with `shell` --- soperator/README.md | 230 ++++++++++++++++++++++++-------------------- 1 file changed, 125 insertions(+), 105 deletions(-) diff --git a/soperator/README.md b/soperator/README.md index d3ec0857..3a7c13e4 100644 --- a/soperator/README.md +++ b/soperator/README.md @@ -44,69 +44,81 @@ These checks are implemented as usual Slurm jobs - they stay in the same queue w Make sure you have the following programs installed on your machine. -- [Terraform CLI](https://developer.hashicorp.com/terraform/install) +### Terraform CLI - > [!IMPORTANT] - > The minimum version of Terraform needed for this recipe is `1.8.0`. +> [!IMPORTANT] +> The minimum version of Terraform needed for this recipe is `1.8.0`. - ```console - $ terraform version - Terraform v1.9.8 - on darwin_arm64 - ... - ``` +[How to install](https://developer.hashicorp.com/terraform/install). -- [Nebius CLI](https://docs.nebius.ai/cli/install) +```console +$ terraform version +Terraform v1.9.8 +on darwin_arm64 +... +``` - ```console - $ nebius version - 0.11.2 - ``` +### Nebius CLI - [Authorize it](https://docs.nebius.com/cli/configure/#authorize-with-a-user-account) with a user account. +[How to install](https://docs.nebius.ai/cli/install). -- [kubectl](https://kubernetes.io/docs/tasks/tools/#kubectl) +```console +$ nebius version +0.11.2 +``` - ```console - $ kubectl version - Client Version: v1.31.1 - ... - ``` +[Authorize it](https://docs.nebius.com/cli/configure/#authorize-with-a-user-account) with a user account. -- [AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html) +### `kubectl` - ```console - $ aws --version - aws-cli/2.17.20 Python/3.11.9 Darwin/23.6.0 exe/x86_64 - ``` +[How to install](https://kubernetes.io/docs/tasks/tools/#kubectl). -- [jq](https://jqlang.github.io/jq/download/) +```console +$ kubectl version +Client Version: v1.31.1 +... +``` - ```console - $ jq --version - jq-1.7.1 - ``` +### AWS CLI -- `md5sum` +[How to install](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html). - We use `md5sum` utility to generate unique S3 bucket IDs. - - `md5sum` is often pre-installed on most of Unix-like OSs. Ensure that you have it installed on your machine. - - ```shell - which md5sum - ``` - - > [!TIP] - > To install `md5sum` on macOS, you have to install GNU coreutils that includes it. - > ```shell - > brew install coreutils - > ``` +```console +$ aws --version +aws-cli/2.17.20 Python/3.11.9 Darwin/23.6.0 exe/x86_64 +``` -- [direnv](https://direnv.net/#basic-installation) +### `jq` - `direnv` is a tool for automatic loading of directory-scoped environment variables. - It can find and load variables from e.g. `.envrc` file. +[How to install](https://jqlang.github.io/jq/download/). + +```console +$ jq --version +jq-1.7.1 +``` + +### `md5sum` + +We use `md5sum` utility to generate unique S3 bucket IDs. + +`md5sum` is often pre-installed on most of Unix-like OSs. Ensure that you have it installed on your machine. + +```shell +which md5sum +``` + +> [!TIP] +> To install `md5sum` on macOS, you have to install GNU coreutils that includes it. +> ```shell +> brew install coreutils +> ``` + +### `direnv` + +`direnv` is a tool for automatic loading of directory-scoped environment variables. +It can find and load variables from e.g. `.envrc` file. 
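For a quick sense of the workflow (a minimal sketch, not specific to this recipe; the `.envrc` used here may define more variables than shown), `direnv` only loads an `.envrc` after you approve it:

```shell
# a typical .envrc exports variables, for example:
#   export NEBIUS_PROJECT_ID='<your-project-id>'
# direnv refuses to load a new or changed .envrc until it is explicitly allowed
direnv allow .
# show which .envrc (if any) is currently loaded for this directory
direnv status
```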
+ +[How to install](https://direnv.net/#basic-installation). ## Step-by-step guide @@ -246,8 +258,8 @@ Let's create a S3 bucket in Object Storage, which will be used by Terraform to s tfstate-slurm-k8s- ``` - > [!NOTE] - > `NEBIUS_BUCKET_NAME` contains unique bucket name dedicated to the project inside your tenant. +> [!NOTE] +> `NEBIUS_BUCKET_NAME` contains unique bucket name dedicated to the project inside your tenant. 2. Create a bucket: @@ -258,24 +270,9 @@ Let's create a S3 bucket in Object Storage, which will be used by Terraform to s --versioning-policy 'enabled' ``` - > [!NOTE] - > `--versioning-policy 'enabled'` allows you to keep track of versions made by Terraform. - > It gives you a possibility to roll back to specified version of TF state in case your installation is broken. - -3. Add the key, the Nebius AI region ID and the Object Storage endpoint URL to the AWS CLI configuration - - ```bash - aws configure set aws_access_key_id "${NEBIUS_SA_ACCESS_KEY_AWS_ID}" - ``` - ```bash - aws configure set aws_secret_access_key "${NEBIUS_SA_SECRET_ACCESS_KEY}" - ``` - ```bash - aws configure set region 'eu-north1' - ``` - ```bash - aws configure set endpoint_url 'https://storage.eu-north1.nebius.cloud:443' - ``` +> [!NOTE] +> `--versioning-policy 'enabled'` allows you to keep track of versions made by Terraform. +> It gives you a possibility to roll back to specified version of TF state in case your installation is broken. ### Set environment variables @@ -384,13 +381,30 @@ IAM token is present > ``` Once you loaded `.envrc` file into your environment, you'll get `.aws_secret_access_key` and - files created in your installation directory. +files created in your installation directory. > [!IMPORTANT] > Make sure that: > - `.aws_secret_access_key` file is not empty > - `terraform_backend_override.tf` file contains valid bucket name +### Configure AWS CLI + +Add the key, the Nebius AI region ID and the Object Storage endpoint URL to the AWS CLI configuration: + +```shell +aws configure set aws_access_key_id "${AWS_ACCESS_KEY_ID}" +``` +```shell +aws configure set aws_secret_access_key "${AWS_SECRET_ACCESS_KEY}" +``` +```shell +aws configure set region 'eu-north1' +``` +```shell +aws configure set endpoint_url 'https://storage.eu-north1.nebius.cloud:443' +``` + ### Initialize Terraform To initialize a Terraform project, download all referenced providers and modules, execute: @@ -440,7 +454,7 @@ root@login-0:~# Take a look on the list of Slurm workers: -```bash +```shell sinfo -Nl ``` @@ -448,7 +462,7 @@ Make sure they all are in `idle` state. In order to connect to a specific worker, use the following command: -```bash +```shell srun -w -Z --pty bash ``` @@ -469,22 +483,25 @@ Additionally, you can [try out the special features](#try-out-special-features) There is a [test](test) directory. Enter it and run the script that uploads several batch job scripts to your cluster: -```bash +```shell ./prepare_for_quickcheck.sh -u root -k -a ${SLURM_IP} ``` Within an SSH session to the Slurm cluster, execute: -```bash +```shell cd /quickcheck - -sbatch hello.sh +``` +```shell +sbatch hello.sh && \ tail -f outputs/hello.out - -sbatch nccl.sh +``` +```shell +sbatch nccl.sh && \ tail -f outputs/nccl.out - -sbatch enroot.sh +``` +```shell +sbatch enroot.sh && \ tail -f outputs/enroot.out ``` @@ -532,7 +549,7 @@ filestore_jail_submounts = [{ Or, you can use the same filestore for multiple clusters. 
In order to do this, create it on your own with the Nebius CLI -```bash +```shell nebius compute filesystem create \ --parent-id "${NEBIUS_PROJECT_ID}" \ --name 'shared-mlperf-sd' \ @@ -559,14 +576,14 @@ It will attach the storage to your cluster at `/mlperf-sd` directory. Enter the [test](test) directory and run the script that uploads several batch job scripts to your cluster: -```bash +```shell ./prepare_for_mlperf_sd.sh -u root -k -a ${SLURM_IP} ``` Within an SSH session to the Slurm cluster, execute: -```bash -cd /opt/mlperf-sd +```shell +cd /opt/mlperf-sd && \ ./prepare_env.sh ``` @@ -576,26 +593,26 @@ downloading datasets and checkpoints. > [!NOTE] > The actual working directory for this benchmark is located at the root level - `/mlperf-sd`. > -> ```cd -> /mlperf-sd +> ```shell +> cd /mlperf-sd > ``` Wait until the job finishes. You can track the progress by running: -```bash +```shell watch squeue ``` Or checking the `aws_download.log` output: -```bash +```shell tail -f aws_download.log ``` Once it's done, start the benchmark: -```bash -cd /mlperf-sd/training/stable_diffusion +```shell +cd /mlperf-sd/training/stable_diffusion && \ ./scripts/slurm/sbatch.sh ``` @@ -606,7 +623,7 @@ If your setup consists of 2 worker nodes with 8 H100 GPU on each, you can compar Also, you can execute -```bash +```shell ./parselog -f nogit/logs/your_log_file ``` @@ -615,7 +632,7 @@ In order to parse your log file and calculate the result.
Usage example ->```bash +>```shell >./parselog -f nogit/logs/reference_02x08x08_1720163290.out -g 2xH100 >``` >```text @@ -663,7 +680,7 @@ filestore_jail_submounts = [{ Or, you can use the same filestore for multiple clusters. In order to do this, create it on your own with the Nebius CLI -```bash +```shell nebius compute filesystem create \ --parent-id "${NEBIUS_PROJECT_ID}" \ --name 'shared-mlperf-gpt3' \ @@ -690,14 +707,14 @@ It will attach the storage to your cluster at `/gpt3` directory. Enter the [test](test) directory and run the script that uploads several batch job scripts to your cluster: -```bash +```shell ./prepare_for_mlperf_gpt3.sh -u root -k -a ${SLURM_IP} ``` Within an SSH session to the Slurm cluster, execute: -```bash -cd /opt/mlperf-gpt3 +```shell +cd /opt/mlperf-gpt3 && \ ./init.sh ``` @@ -712,8 +729,9 @@ Once initialisation is done, start the benchmark: > [!NOTE] > The actual working directory for this benchmark is located at the root level - `/gpt3`. -```bash -cd /gpt3 && ./run.sh +```shell +cd /gpt3 && \ +./run.sh ``` ### Try out special features @@ -729,7 +747,7 @@ There's a wrapper script `createuser`, which:
Usage example ->```bash +>```shell > createuser pierre >``` >```text @@ -754,13 +772,15 @@ There's a wrapper script `createuser`, which: >```
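Because the root filesystem is shared, an account created on the login node should be visible on the worker nodes as well. A minimal sketch of such a check, assuming the `pierre` user from the example above:

```shell
# resolve the new user on a worker node; the UID and GID should match the ones shown on the login node
srun id pierre
```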
-You can also check how new packages are installed into the shared filesystem: +You can also check how new packages are installed into the shared filesystem. -```bash -# Install the package on the login node +Install the package on the login node: +```shell apt update && apt install -y neofetch +``` -# Run it on a worker node +Run it on a worker node: +```shell srun neofetch ``` @@ -775,7 +795,7 @@ If everything is OK with GPUs on your nodes, the launch of CronJob will finish s In order to simulate GPU performance issues on one of the nodes, you can launch another NCCL test with half of available GPUs just before triggering the CronJob: -```bash +```shell srun -w worker-0 -Z --gpus=4 bash -c "/usr/bin/all_reduce_perf -b 512M -e 16G -f 2 -g 4" ``` @@ -784,12 +804,12 @@ srun -w worker-0 -Z --gpus=4 bash -c "/usr/bin/all_reduce_perf -b 512M -e 16G -f After that, `worker-0` should become drained: -```bash +```shell sinfo -Nl ``` You can see the verbose details in the Reason field of this node description: -```bash +```shell scontrol show node worker-0 ``` From 2dfb3251193f3762b922ab7f96ff159359e81ecc Mon Sep 17 00:00:00 2001 From: Dmitry Starov Date: Fri, 25 Oct 2024 11:55:13 +0200 Subject: [PATCH 2/5] [DOC] Remove GPT3 check for a while --- soperator/README.md | 160 -------------------------------------------- 1 file changed, 160 deletions(-) diff --git a/soperator/README.md b/soperator/README.md index 3a7c13e4..1270cd9f 100644 --- a/soperator/README.md +++ b/soperator/README.md @@ -474,7 +474,6 @@ Now you can check how it executes compute jobs. We offer two kind of checks: - [Quick](#quickly-check-the-slurm-cluster); - [MLCommons Stable Diffusion](#run-mlcommons-stable-diffusion-benchmark). -- [MLCommons GPT3](#run-mlcommons-gpt3-benchmark). Additionally, you can [try out the special features](#try-out-special-features) Soperator provides. @@ -654,162 +653,3 @@ In order to parse your log file and calculate the result. > max: 23.62s >```
- -#### Run MLCommons GPT3 benchmark - -If you are going to run the MLCommons GPT3 benchmark, you will probably need large storage for it. - -
-Creating storage for GPT3 benchmark - -You can create with this Terraform recipe, as in provided [terraform.tfvars](installations/example): - -```terraform -# Shared filesystems to be mounted inside jail. -# --- -filestore_jail_submounts = [{ - name = "mlperf-gpt3" - mount_path = "/gpt3" - spec = { - size_gibibytes = 8192 - block_size_kibibytes = 4 - } -}] -``` - -Or, you can use the same filestore for multiple clusters. -In order to do this, create it on your own with the Nebius CLI - -```shell -nebius compute filesystem create \ - --parent-id "${NEBIUS_PROJECT_ID}" \ - --name 'shared-mlperf-gpt3' \ - --type 'network_ssd' \ - --size-bytes 8796093022208 -``` - -And provide its ID to the recipe as follows: - -```terraform -# Shared filesystems to be mounted inside jail. -# --- -filestore_jail_submounts = [{ - name = "mlperf-gpt3" - mount_path = "/gpt3" - existing = { - id = "" - } -}] -``` - -It will attach the storage to your cluster at `/gpt3` directory. -
- -Enter the [test](test) directory and run the script that uploads several batch job scripts to your cluster: - -```shell -./prepare_for_mlperf_gpt3.sh -u root -k -a ${SLURM_IP} -``` - -Within an SSH session to the Slurm cluster, execute: - -```shell -cd /opt/mlperf-gpt3 && \ -./init.sh -``` - -This script: -- Clones the necessary parts from MLCommons git repository, and configures it for our cluster setup; -- Downloads dataset; -- Downloads checkpoint; -- Creates a Run script. - -Once initialisation is done, start the benchmark: - -> [!NOTE] -> The actual working directory for this benchmark is located at the root level - `/gpt3`. - -```shell -cd /gpt3 && \ -./run.sh -``` - -### Try out special features - -#### Shared root filesystem - -You can create a new user on a login node and have it appear on all nodes in the cluster. -There's a wrapper script `createuser`, which: -- Creates a new user & group; -- Adds they to sudoers; -- Creates a home directory with the specified public SSH key. - -
-Usage example - ->```shell -> createuser pierre ->``` ->```text -> Adding user `pierre' ... -> Adding new group `pierre' (1004) ... -> Adding new user `pierre' (1004) with group `pierre' ... -> Creating home directory `/home/pierre' ... -> Copying files from `/etc/skel' ... -> New password: ******** -> Retype new password: ******** -> passwd: password updated successfully -> Changing the user information for pierre -> Enter the new value, or press ENTER for the default -> Full Name []: Pierre Dunn -> Room Number []: 123 -> Work Phone []: -> Home Phone []: -> Other []: Slurm expert -> Is the information correct? [Y/n] y -> Enter the SSH public key, or press ENTER to avoid creating a key: -> ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIKzxkjzPQ4EyZSjan4MLGFSA18idpZicoKW7HC4YmwgN pierre.dunn@gmail.com ->``` -
- -You can also check how new packages are installed into the shared filesystem. - -Install the package on the login node: -```shell -apt update && apt install -y neofetch -``` - -Run it on a worker node: -```shell -srun neofetch -``` - -#### Periodic GPU health checks - -The NCCL tests are launched from the `-nccl-benchmark` K8s CronJob. - -You can trigger this job manually if you don't want to wait until the next execution time. - -If everything is OK with GPUs on your nodes, the launch of CronJob will finish successfully. - -In order to simulate GPU performance issues on one of the nodes, you can launch another NCCL test with half of available -GPUs just before triggering the CronJob: - -```shell -srun -w worker-0 -Z --gpus=4 bash -c "/usr/bin/all_reduce_perf -b 512M -e 16G -f 2 -g 4" -``` - -> [!NOTE] -> We set the `-Z` option here, so it will ignore GPUs allocated in concurrent jobs. - -After that, `worker-0` should become drained: - -```shell -sinfo -Nl -``` - -You can see the verbose details in the Reason field of this node description: - -```shell -scontrol show node worker-0 -``` From 68815763a50e5406e8faeb158f6c32405985d994 Mon Sep 17 00:00:00 2001 From: Dmitry Starov Date: Fri, 25 Oct 2024 12:14:35 +0200 Subject: [PATCH 3/5] [DOC] Getting IP address --- soperator/README.md | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/soperator/README.md b/soperator/README.md index 1270cd9f..46a2732f 100644 --- a/soperator/README.md +++ b/soperator/README.md @@ -450,6 +450,14 @@ $ ./login.sh -k ~/.ssh/id_rsa root@login-0:~# ``` +> [!NOTE] +> You can get the IP for connection from `./login.sh`. +> But it would be faster to get it via following command: +> ```shell +> terraform show -json \ +> | jq -r '.values.root_module.child_modules[].resources[] | select(.address | endswith("terraform_data.connection_ip")).values.output' +> ``` + ### Check it out Take a look on the list of Slurm workers: From e083ee7972817cc9d978763158e4a88502a62e8d Mon Sep 17 00:00:00 2001 From: Dmitry Starov Date: Fri, 25 Oct 2024 12:23:41 +0200 Subject: [PATCH 4/5] Bump version --- soperator/SUBVERSION | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/soperator/SUBVERSION b/soperator/SUBVERSION index 0cfbf088..00750edc 100644 --- a/soperator/SUBVERSION +++ b/soperator/SUBVERSION @@ -1 +1 @@ -2 +3 From 667485905827e73fcdaab633d74293406237a157 Mon Sep 17 00:00:00 2001 From: Dmitry Starov Date: Fri, 25 Oct 2024 12:29:13 +0200 Subject: [PATCH 5/5] [DOC] Bump Nebius CLI version --- soperator/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/soperator/README.md b/soperator/README.md index 46a2732f..8c0f5740 100644 --- a/soperator/README.md +++ b/soperator/README.md @@ -64,7 +64,7 @@ on darwin_arm64 ```console $ nebius version -0.11.2 +0.11.6 ``` [Authorize it](https://docs.nebius.com/cli/configure/#authorize-with-a-user-account) with a user account.