From 02549e7e9ba56a6f522221eff92b1a783431c757 Mon Sep 17 00:00:00 2001 From: Dmitry Starov Date: Fri, 25 Oct 2024 11:52:41 +0200 Subject: [PATCH 1/5] [DOC] Fix non-rendering alerts && replace syntax `bash` with `shell` --- soperator/README.md | 230 ++++++++++++++++++++++++-------------------- 1 file changed, 125 insertions(+), 105 deletions(-) diff --git a/soperator/README.md b/soperator/README.md index d3ec0857..3a7c13e4 100644 --- a/soperator/README.md +++ b/soperator/README.md @@ -44,69 +44,81 @@ These checks are implemented as usual Slurm jobs - they stay in the same queue w Make sure you have the following programs installed on your machine. -- [Terraform CLI](https://developer.hashicorp.com/terraform/install) +### Terraform CLI - > [!IMPORTANT] - > The minimum version of Terraform needed for this recipe is `1.8.0`. +> [!IMPORTANT] +> The minimum version of Terraform needed for this recipe is `1.8.0`. - ```console - $ terraform version - Terraform v1.9.8 - on darwin_arm64 - ... - ``` +[How to install](https://developer.hashicorp.com/terraform/install). -- [Nebius CLI](https://docs.nebius.ai/cli/install) +```console +$ terraform version +Terraform v1.9.8 +on darwin_arm64 +... +``` - ```console - $ nebius version - 0.11.2 - ``` +### Nebius CLI - [Authorize it](https://docs.nebius.com/cli/configure/#authorize-with-a-user-account) with a user account. +[How to install](https://docs.nebius.ai/cli/install). -- [kubectl](https://kubernetes.io/docs/tasks/tools/#kubectl) +```console +$ nebius version +0.11.2 +``` - ```console - $ kubectl version - Client Version: v1.31.1 - ... - ``` +[Authorize it](https://docs.nebius.com/cli/configure/#authorize-with-a-user-account) with a user account. -- [AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html) +### `kubectl` - ```console - $ aws --version - aws-cli/2.17.20 Python/3.11.9 Darwin/23.6.0 exe/x86_64 - ``` +[How to install](https://kubernetes.io/docs/tasks/tools/#kubectl). -- [jq](https://jqlang.github.io/jq/download/) +```console +$ kubectl version +Client Version: v1.31.1 +... +``` - ```console - $ jq --version - jq-1.7.1 - ``` +### AWS CLI -- `md5sum` +[How to install](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html). - We use `md5sum` utility to generate unique S3 bucket IDs. - - `md5sum` is often pre-installed on most of Unix-like OSs. Ensure that you have it installed on your machine. - - ```shell - which md5sum - ``` - - > [!TIP] - > To install `md5sum` on macOS, you have to install GNU coreutils that includes it. - > ```shell - > brew install coreutils - > ``` +```console +$ aws --version +aws-cli/2.17.20 Python/3.11.9 Darwin/23.6.0 exe/x86_64 +``` -- [direnv](https://direnv.net/#basic-installation) +### `jq` - `direnv` is a tool for automatic loading of directory-scoped environment variables. - It can find and load variables from e.g. `.envrc` file. +[How to install](https://jqlang.github.io/jq/download/). + +```console +$ jq --version +jq-1.7.1 +``` + +### `md5sum` + +We use `md5sum` utility to generate unique S3 bucket IDs. + +`md5sum` is often pre-installed on most of Unix-like OSs. Ensure that you have it installed on your machine. + +```shell +which md5sum +``` + +> [!TIP] +> To install `md5sum` on macOS, you have to install GNU coreutils that includes it. +> ```shell +> brew install coreutils +> ``` + +### `direnv` + +`direnv` is a tool for automatic loading of directory-scoped environment variables. +It can find and load variables from e.g. `.envrc` file. 
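For a quick sense of the workflow (a minimal sketch, not specific to this recipe; the `.envrc` used here may define more variables than shown), `direnv` only loads an `.envrc` after you approve it:

```shell
# a typical .envrc exports variables, for example:
#   export NEBIUS_PROJECT_ID='<your-project-id>'
# direnv refuses to load a new or changed .envrc until it is explicitly allowed
direnv allow .
# show which .envrc (if any) is currently loaded for this directory
direnv status
```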
+ +[How to install](https://direnv.net/#basic-installation). ## Step-by-step guide @@ -246,8 +258,8 @@ Let's create a S3 bucket in Object Storage, which will be used by Terraform to s tfstate-slurm-k8s- ``` - > [!NOTE] - > `NEBIUS_BUCKET_NAME` contains unique bucket name dedicated to the project inside your tenant. +> [!NOTE] +> `NEBIUS_BUCKET_NAME` contains unique bucket name dedicated to the project inside your tenant. 2. Create a bucket: @@ -258,24 +270,9 @@ Let's create a S3 bucket in Object Storage, which will be used by Terraform to s --versioning-policy 'enabled' ``` - > [!NOTE] - > `--versioning-policy 'enabled'` allows you to keep track of versions made by Terraform. - > It gives you a possibility to roll back to specified version of TF state in case your installation is broken. - -3. Add the key, the Nebius AI region ID and the Object Storage endpoint URL to the AWS CLI configuration - - ```bash - aws configure set aws_access_key_id "${NEBIUS_SA_ACCESS_KEY_AWS_ID}" - ``` - ```bash - aws configure set aws_secret_access_key "${NEBIUS_SA_SECRET_ACCESS_KEY}" - ``` - ```bash - aws configure set region 'eu-north1' - ``` - ```bash - aws configure set endpoint_url 'https://storage.eu-north1.nebius.cloud:443' - ``` +> [!NOTE] +> `--versioning-policy 'enabled'` allows you to keep track of versions made by Terraform. +> It gives you a possibility to roll back to specified version of TF state in case your installation is broken. ### Set environment variables @@ -384,13 +381,30 @@ IAM token is present > ``` Once you loaded `.envrc` file into your environment, you'll get `.aws_secret_access_key` and - files created in your installation directory. +files created in your installation directory. > [!IMPORTANT] > Make sure that: > - `.aws_secret_access_key` file is not empty > - `terraform_backend_override.tf` file contains valid bucket name +### Configure AWS CLI + +Add the key, the Nebius AI region ID and the Object Storage endpoint URL to the AWS CLI configuration: + +```shell +aws configure set aws_access_key_id "${AWS_ACCESS_KEY_ID}" +``` +```shell +aws configure set aws_secret_access_key "${AWS_SECRET_ACCESS_KEY}" +``` +```shell +aws configure set region 'eu-north1' +``` +```shell +aws configure set endpoint_url 'https://storage.eu-north1.nebius.cloud:443' +``` + ### Initialize Terraform To initialize a Terraform project, download all referenced providers and modules, execute: @@ -440,7 +454,7 @@ root@login-0:~# Take a look on the list of Slurm workers: -```bash +```shell sinfo -Nl ``` @@ -448,7 +462,7 @@ Make sure they all are in `idle` state. In order to connect to a specific worker, use the following command: -```bash +```shell srun -w -Z --pty bash ``` @@ -469,22 +483,25 @@ Additionally, you can [try out the special features](#try-out-special-features) There is a [test](test) directory. Enter it and run the script that uploads several batch job scripts to your cluster: -```bash +```shell ./prepare_for_quickcheck.sh -u root -k -a ${SLURM_IP} ``` Within an SSH session to the Slurm cluster, execute: -```bash +```shell cd /quickcheck - -sbatch hello.sh +``` +```shell +sbatch hello.sh && \ tail -f outputs/hello.out - -sbatch nccl.sh +``` +```shell +sbatch nccl.sh && \ tail -f outputs/nccl.out - -sbatch enroot.sh +``` +```shell +sbatch enroot.sh && \ tail -f outputs/enroot.out ``` @@ -532,7 +549,7 @@ filestore_jail_submounts = [{ Or, you can use the same filestore for multiple clusters. 
In order to do this, create it on your own with the Nebius CLI -```bash +```shell nebius compute filesystem create \ --parent-id "${NEBIUS_PROJECT_ID}" \ --name 'shared-mlperf-sd' \ @@ -559,14 +576,14 @@ It will attach the storage to your cluster at `/mlperf-sd` directory. Enter the [test](test) directory and run the script that uploads several batch job scripts to your cluster: -```bash +```shell ./prepare_for_mlperf_sd.sh -u root -k -a ${SLURM_IP} ``` Within an SSH session to the Slurm cluster, execute: -```bash -cd /opt/mlperf-sd +```shell +cd /opt/mlperf-sd && \ ./prepare_env.sh ``` @@ -576,26 +593,26 @@ downloading datasets and checkpoints. > [!NOTE] > The actual working directory for this benchmark is located at the root level - `/mlperf-sd`. > -> ```cd -> /mlperf-sd +> ```shell +> cd /mlperf-sd > ``` Wait until the job finishes. You can track the progress by running: -```bash +```shell watch squeue ``` Or checking the `aws_download.log` output: -```bash +```shell tail -f aws_download.log ``` Once it's done, start the benchmark: -```bash -cd /mlperf-sd/training/stable_diffusion +```shell +cd /mlperf-sd/training/stable_diffusion && \ ./scripts/slurm/sbatch.sh ``` @@ -606,7 +623,7 @@ If your setup consists of 2 worker nodes with 8 H100 GPU on each, you can compar Also, you can execute -```bash +```shell ./parselog -f nogit/logs/your_log_file ``` @@ -615,7 +632,7 @@ In order to parse your log file and calculate the result.
Usage example ->```bash +>```shell >./parselog -f nogit/logs/reference_02x08x08_1720163290.out -g 2xH100 >``` >```text @@ -663,7 +680,7 @@ filestore_jail_submounts = [{ Or, you can use the same filestore for multiple clusters. In order to do this, create it on your own with the Nebius CLI -```bash +```shell nebius compute filesystem create \ --parent-id "${NEBIUS_PROJECT_ID}" \ --name 'shared-mlperf-gpt3' \ @@ -690,14 +707,14 @@ It will attach the storage to your cluster at `/gpt3` directory. Enter the [test](test) directory and run the script that uploads several batch job scripts to your cluster: -```bash +```shell ./prepare_for_mlperf_gpt3.sh -u root -k -a ${SLURM_IP} ``` Within an SSH session to the Slurm cluster, execute: -```bash -cd /opt/mlperf-gpt3 +```shell +cd /opt/mlperf-gpt3 && \ ./init.sh ``` @@ -712,8 +729,9 @@ Once initialisation is done, start the benchmark: > [!NOTE] > The actual working directory for this benchmark is located at the root level - `/gpt3`. -```bash -cd /gpt3 && ./run.sh +```shell +cd /gpt3 && \ +./run.sh ``` ### Try out special features @@ -729,7 +747,7 @@ There's a wrapper script `createuser`, which:
Usage example ->```bash +>```shell > createuser pierre >``` >```text @@ -754,13 +772,15 @@ There's a wrapper script `createuser`, which: >```
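Because the root filesystem is shared, an account created on the login node should be visible on the worker nodes as well. A minimal sketch of such a check, assuming the `pierre` user from the example above:

```shell
# resolve the new user on a worker node; the UID and GID should match the ones shown on the login node
srun id pierre
```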
-You can also check how new packages are installed into the shared filesystem: +You can also check how new packages are installed into the shared filesystem. -```bash -# Install the package on the login node +Install the package on the login node: +```shell apt update && apt install -y neofetch +``` -# Run it on a worker node +Run it on a worker node: +```shell srun neofetch ``` @@ -775,7 +795,7 @@ If everything is OK with GPUs on your nodes, the launch of CronJob will finish s In order to simulate GPU performance issues on one of the nodes, you can launch another NCCL test with half of available GPUs just before triggering the CronJob: -```bash +```shell srun -w worker-0 -Z --gpus=4 bash -c "/usr/bin/all_reduce_perf -b 512M -e 16G -f 2 -g 4" ``` @@ -784,12 +804,12 @@ srun -w worker-0 -Z --gpus=4 bash -c "/usr/bin/all_reduce_perf -b 512M -e 16G -f After that, `worker-0` should become drained: -```bash +```shell sinfo -Nl ``` You can see the verbose details in the Reason field of this node description: -```bash +```shell scontrol show node worker-0 ``` From 2dfb3251193f3762b922ab7f96ff159359e81ecc Mon Sep 17 00:00:00 2001 From: Dmitry Starov Date: Fri, 25 Oct 2024 11:55:13 +0200 Subject: [PATCH 2/5] [DOC] Remove GPT3 check for a while --- soperator/README.md | 160 -------------------------------------------- 1 file changed, 160 deletions(-) diff --git a/soperator/README.md b/soperator/README.md index 3a7c13e4..1270cd9f 100644 --- a/soperator/README.md +++ b/soperator/README.md @@ -474,7 +474,6 @@ Now you can check how it executes compute jobs. We offer two kind of checks: - [Quick](#quickly-check-the-slurm-cluster); - [MLCommons Stable Diffusion](#run-mlcommons-stable-diffusion-benchmark). -- [MLCommons GPT3](#run-mlcommons-gpt3-benchmark). Additionally, you can [try out the special features](#try-out-special-features) Soperator provides. @@ -654,162 +653,3 @@ In order to parse your log file and calculate the result. > max: 23.62s >```
- -#### Run MLCommons GPT3 benchmark - -If you are going to run the MLCommons GPT3 benchmark, you will probably need large storage for it. - -
-Creating storage for GPT3 benchmark - -You can create with this Terraform recipe, as in provided [terraform.tfvars](installations/example): - -```terraform -# Shared filesystems to be mounted inside jail. -# --- -filestore_jail_submounts = [{ - name = "mlperf-gpt3" - mount_path = "/gpt3" - spec = { - size_gibibytes = 8192 - block_size_kibibytes = 4 - } -}] -``` - -Or, you can use the same filestore for multiple clusters. -In order to do this, create it on your own with the Nebius CLI - -```shell -nebius compute filesystem create \ - --parent-id "${NEBIUS_PROJECT_ID}" \ - --name 'shared-mlperf-gpt3' \ - --type 'network_ssd' \ - --size-bytes 8796093022208 -``` - -And provide its ID to the recipe as follows: - -```terraform -# Shared filesystems to be mounted inside jail. -# --- -filestore_jail_submounts = [{ - name = "mlperf-gpt3" - mount_path = "/gpt3" - existing = { - id = "" - } -}] -``` - -It will attach the storage to your cluster at `/gpt3` directory. -
- -Enter the [test](test) directory and run the script that uploads several batch job scripts to your cluster: - -```shell -./prepare_for_mlperf_gpt3.sh -u root -k -a ${SLURM_IP} -``` - -Within an SSH session to the Slurm cluster, execute: - -```shell -cd /opt/mlperf-gpt3 && \ -./init.sh -``` - -This script: -- Clones the necessary parts from MLCommons git repository, and configures it for our cluster setup; -- Downloads dataset; -- Downloads checkpoint; -- Creates a Run script. - -Once initialisation is done, start the benchmark: - -> [!NOTE] -> The actual working directory for this benchmark is located at the root level - `/gpt3`. - -```shell -cd /gpt3 && \ -./run.sh -``` - -### Try out special features - -#### Shared root filesystem - -You can create a new user on a login node and have it appear on all nodes in the cluster. -There's a wrapper script `createuser`, which: -- Creates a new user & group; -- Adds they to sudoers; -- Creates a home directory with the specified public SSH key. - -
-Usage example - ->```shell -> createuser pierre ->``` ->```text -> Adding user `pierre' ... -> Adding new group `pierre' (1004) ... -> Adding new user `pierre' (1004) with group `pierre' ... -> Creating home directory `/home/pierre' ... -> Copying files from `/etc/skel' ... -> New password: ******** -> Retype new password: ******** -> passwd: password updated successfully -> Changing the user information for pierre -> Enter the new value, or press ENTER for the default -> Full Name []: Pierre Dunn -> Room Number []: 123 -> Work Phone []: -> Home Phone []: -> Other []: Slurm expert -> Is the information correct? [Y/n] y -> Enter the SSH public key, or press ENTER to avoid creating a key: -> ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIKzxkjzPQ4EyZSjan4MLGFSA18idpZicoKW7HC4YmwgN pierre.dunn@gmail.com ->``` -
- -You can also check how new packages are installed into the shared filesystem. - -Install the package on the login node: -```shell -apt update && apt install -y neofetch -``` - -Run it on a worker node: -```shell -srun neofetch -``` - -#### Periodic GPU health checks - -The NCCL tests are launched from the `-nccl-benchmark` K8s CronJob. - -You can trigger this job manually if you don't want to wait until the next execution time. - -If everything is OK with GPUs on your nodes, the launch of CronJob will finish successfully. - -In order to simulate GPU performance issues on one of the nodes, you can launch another NCCL test with half of available -GPUs just before triggering the CronJob: - -```shell -srun -w worker-0 -Z --gpus=4 bash -c "/usr/bin/all_reduce_perf -b 512M -e 16G -f 2 -g 4" -``` - -> [!NOTE] -> We set the `-Z` option here, so it will ignore GPUs allocated in concurrent jobs. - -After that, `worker-0` should become drained: - -```shell -sinfo -Nl -``` - -You can see the verbose details in the Reason field of this node description: - -```shell -scontrol show node worker-0 -``` From 68815763a50e5406e8faeb158f6c32405985d994 Mon Sep 17 00:00:00 2001 From: Dmitry Starov Date: Fri, 25 Oct 2024 12:14:35 +0200 Subject: [PATCH 3/5] [DOC] Getting IP address --- soperator/README.md | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/soperator/README.md b/soperator/README.md index 1270cd9f..46a2732f 100644 --- a/soperator/README.md +++ b/soperator/README.md @@ -450,6 +450,14 @@ $ ./login.sh -k ~/.ssh/id_rsa root@login-0:~# ``` +> [!NOTE] +> You can get the IP for connection from `./login.sh`. +> But it would be faster to get it via following command: +> ```shell +> terraform show -json \ +> | jq -r '.values.root_module.child_modules[].resources[] | select(.address | endswith("terraform_data.connection_ip")).values.output' +> ``` + ### Check it out Take a look on the list of Slurm workers: From e083ee7972817cc9d978763158e4a88502a62e8d Mon Sep 17 00:00:00 2001 From: Dmitry Starov Date: Fri, 25 Oct 2024 12:23:41 +0200 Subject: [PATCH 4/5] Bump version --- soperator/SUBVERSION | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/soperator/SUBVERSION b/soperator/SUBVERSION index 0cfbf088..00750edc 100644 --- a/soperator/SUBVERSION +++ b/soperator/SUBVERSION @@ -1 +1 @@ -2 +3 From 667485905827e73fcdaab633d74293406237a157 Mon Sep 17 00:00:00 2001 From: Dmitry Starov Date: Fri, 25 Oct 2024 12:29:13 +0200 Subject: [PATCH 5/5] [DOC] Bump Nebius CLI version --- soperator/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/soperator/README.md b/soperator/README.md index 46a2732f..8c0f5740 100644 --- a/soperator/README.md +++ b/soperator/README.md @@ -64,7 +64,7 @@ on darwin_arm64 ```console $ nebius version -0.11.2 +0.11.6 ``` [Authorize it](https://docs.nebius.com/cli/configure/#authorize-with-a-user-account) with a user account.