This repository is the simplest way to create a high-availability
SLURM cluster on Crusoe Cloud.
To get started, create a file named `terraform.tfvars` with the cluster
parameters. For example `.tfvars` files, see the `examples` directory.
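A minimal `terraform.tfvars` could look roughly like the sketch below; the compute node variable names here are illustrative placeholders rather than this module's actual interface, so copy the exact names from the example `.tfvars` files:

```hcl
# Illustrative sketch only -- the variable names below are placeholders;
# see the examples directory for the variables this module actually defines.
slurm_compute_node_type  = "a100-80gb-sxm-ib.8x"  # instance type for the compute nodes
slurm_compute_node_count = 2                      # number of compute nodes
```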
```sh
terraform init
terraform apply
```
By default, this solution will create a high-availability SLURM cluster with:

- 2 `c1a.16x` head nodes
- 2 `c1a.16x` login nodes
- 1 `s1a.80x` NFS node
- `n` compute nodes of any instance type
This solution currently supports four tiers of storage:
Each `slurm-compute-node` instance supports up to 7.5 TiB of local scratch
storage, depending on the instance type. If present, scratch storage is located
at `/scratch/local`. The local scratch storage is erased whenever the compute
node is stopped.
The local scratch sizes on common compute node instance types are:

- `a40.8x`: 7.5 TiB
- `a100.8x`: 7.5 TiB
- `a100-80gb.8x`: 7.5 TiB
- `a100-80gb-sxm-ib.8x`: 7.5 TiB
- `h100-80gb-sxm-ib.8x`: 7.5 TiB
- `l40s-48gb.8x`: None
- `c1a.176x`: None
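As a rough sketch of how the ephemeral local scratch is typically used (the paths, dataset name, and job layout below are illustrative placeholders, not part of this solution), a batch job can stage data into `/scratch/local` and copy results back to persistent storage before it finishes:

```sh
#!/bin/bash
#SBATCH --job-name=local-scratch-demo
#SBATCH --nodes=1

# Stage input data from the persistent /home share into node-local scratch.
# Anything left under /scratch/local is lost when the compute node is stopped.
SCRATCH_DIR=/scratch/local/$SLURM_JOB_ID
mkdir -p "$SCRATCH_DIR"
cp "/home/$USER/dataset.tar" "$SCRATCH_DIR/"

# ... run the workload against $SCRATCH_DIR ...

# Copy results back to persistent storage before the job ends.
cp -r "$SCRATCH_DIR/results" "/home/$USER/"
```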
Each cluster supports up to 51.2 TiB of shared scratch storage, depending on
the instance type of `slurm-nfs-node-0`. This scratch storage is located at
`/scratch/shared`. The shared scratch storage is erased whenever
`slurm-nfs-node-0` is stopped.
The `slurm_nfs_node_type` variable can optionally be set in the `terraform.tfvars` file
to configure the instance type used to create `slurm-nfs-node-0`. If left unconfigured,
this will default to `s1a.80x`.
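For example, to use the smaller `s1a.40x` NFS node type:

```hcl
# Instance type for slurm-nfs-node-0 (defaults to s1a.80x if unset).
slurm_nfs_node_type = "s1a.40x"
```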
The shared scratch sizes on common NFS node instance types are:

- `s1a.80x`: 51.2 TiB
- `s1a.40x`: 25.6 TiB
- `c1a.176x`: None
The `slurm-nfs-node-0` node exports a persistent `/home` directory that is mounted by
all login nodes and all compute nodes. This offers up to 10 TiB of persistent shared
storage.
The `slurm_nfs_home_size` variable can optionally be set in the `terraform.tfvars` file
to configure the size of the `/home` NFS share. If left unconfigured, this will default
to 10 TiB. Note that 10 TiB is the maximum currently supported by Crusoe Cloud.
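A smaller share could be requested along these lines; the value format shown here is an assumption, so confirm the expected format against the example `.tfvars` files:

```hcl
# Size of the /home NFS share -- the "5TiB" string format is an assumption;
# check the examples directory for the format this module actually expects.
slurm_nfs_home_size = "5TiB"
```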
Note: the lifecycle of the shared home directory is tied to the lifecycle of the cluster. Deleting the cluster will delete the shared home directory. A separate shared data directory is recommended for storing critical data.
The `slurm_shared_volumes` variable can be used to attach additional shared storage
volumes to the cluster. This is the recommended way to add high-performance persistent
file storage to your cluster.
```hcl
slurm_shared_volumes = [{
  id          = "8146e3ef-4192-4f59-b2d6-6a8b3dfe5cf3"
  name        = "data-01"
  mode        = "read-write"
  mount_point = "/mnt/data-01"
}]
```
To add additional users to your cluster, configure the `slurm_users` variable in your
`terraform.tfvars` file and run `terraform apply`. The following example adds three
additional users `user1`, `user2`, and `user3` to the SLURM cluster.
```hcl
# slurm users configuration
slurm_users = [{
  name       = "user1"
  uid        = 1001
  ssh_pubkey = "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIIjPRr0iVR4mgzJy0ehnM5hWX4O86hM1bVTgdi5g3nkZ [email protected]"
}, {
  name       = "user2"
  uid        = 1002
  ssh_pubkey = "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIIjPRr0iVR4mgzJy0ehnM5hWX4O86hM1bVTgdi5g3nkZ [email protected]"
}, {
  name       = "user3"
  uid        = 1003
  ssh_pubkey = "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIIjPRr0iVR4mgzJy0ehnM5hWX4O86hM1bVTgdi5g3nkZ [email protected]"
}]
```
This solution provides support for NVIDIA Enroot and Pyxis.
```sh
srun --container-image=<image> <cmd>
```
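As a quick, illustrative sanity check (using a public Docker Hub image as a stand-in for your own container image):

```sh
# Pull a public Ubuntu image with Enroot/Pyxis and print the release inside the container.
srun --container-image=ubuntu:22.04 grep PRETTY /etc/os-release
```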
This solution includes PMIx support for running Open MPI applications.
```sh
srun --mpi=pmix <cmd>
```
For example, this can be used to run the nccl-tests included in the `ubuntu22.04-nvidia-slurm` image.
```sh
# Run 2-node NCCL AllReduce Benchmark using RDMA Transport.
export NCCL_IB_HCA=^mlx5_0:1
srun -N 2 --ntasks-per-node=8 --gres=gpu:8 --cpus-per-task=22 --mpi=pmix /opt/nccl-tests/build/all_reduce_perf -b 1M -e 1G -f 2 -g 1

# Run 2-node NCCL AllReduce Benchmark using Ethernet Transport.
export NCCL_IB_DISABLE=1
export NCCL_IBEXT_DISABLE=1
srun -N 2 --ntasks-per-node=8 --gres=gpu:8 --cpus-per-task=22 --mpi=pmix /opt/nccl-tests/build/all_reduce_perf -b 1M -e 1G -f 2 -g 1
```
This solution uses a secondary head node that takes over within 10 seconds if the primary head node stops responding. As long as at least one head node is still responsive, the cluster remains usable.
Note that NFS is not deployed in a high-availability configuration.
If `slurm-nfs-node-0` goes down, then none of the login nodes or compute
nodes will be able to mount the `/home` directory. This will prevent users
from logging in to any `slurm-login-node` or `slurm-compute-node`. The cluster
will recover gracefully once `slurm-nfs-node-0` is brought back online.
During this time, it is still possible to log in to any of the nodes as
the `root` user.
If one of the login nodes goes offline, the other login nodes will remain fully functional. As long as cluster users store data in their home directory, which is backed by an NFS share, they should be able to SSH into another login node and continue using the cluster.
When a compute node is rebooted or stops responding for more than 5 minutes,
it will be marked as `DOWN`.
```
sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
batch*       up   infinite      1  down* slurm-compute-node-0
batch*       up   infinite      7   idle slurm-compute-node-[1-7]
```
Once the compute node is back online, it will automatically rejoin the cluster as long as it is healthy.
Once the compute node has been returned to the cluster, it should enter an `IDLE` state.
```
sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
batch*       up   infinite      8   idle slurm-compute-node-[0-7]
```
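If a node stays `DOWN` even after it comes back online, standard SLURM commands (not specific to this solution) can be used to inspect the reason and, if the node is actually healthy, return it to service:

```sh
# Show the node's state and the reason SLURM marked it DOWN.
scontrol show node slurm-compute-node-0

# If the node is healthy but still marked DOWN, ask SLURM to resume it.
scontrol update nodename=slurm-compute-node-0 state=resume
```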