This guide is intended for users of Sapling (sapling.stanford.edu). If you're a new user, please follow the steps below for first-time setup. The rest of the guide describes how to use the machine.
Please follow these steps when you first set up your Sapling account.
Change your password:

```
passwd
```

Set up passwordless SSH within the cluster. This makes access to the compute nodes easier, so that when you run jobs you don't need to enter your password:

```
ssh-keygen -t ed25519 # just enter an empty password
cat ~/.ssh/id_ed25519.pub >> ~/.ssh/authorized_keys
```
We have a `#sapling` channel on Zulip. If you have any questions, that's the best place to ask. That is also where we will post announcements when there are changes, maintenance, etc. for the machine. (Please ask for the signup link if you are not already on our Zulip instance.)
Sapling is a shared research machine and therefore may work differently than other machines you are used to. While the rules on Sapling are intentionally flexible (to allow a broad range of research to be performed), it is still important to understand how your usage of the machine impacts other users and to share the machine well so that everyone can get their work done.
Here are some guidelines that we adhere to when using Sapling:
- The head node is a shared resource. Do NOT run compute- or memory-intensive workloads there.
  - Builds are fine as long as they don't go on too long and you use a reasonable parallelism setting (e.g., `make -j16`). For longer builds, consider launching an interactive job (see below).
  - If you start a long-running process like an IDE, please make sure it doesn't use too much memory, as this will cut into the available memory for everyone else.
- Allocate compute nodes through SLURM. Do NOT directly SSH to a compute node:
  - Do this: `srun -N 1 -n 1 -c 40 -p gpu --pty bash --login`
  - Don't do this: `ssh g0001`
  - If for some reason you need SSH, then allocate the node through `salloc` before you SSH to it:

    ```
    salloc -n 1 -N 1 -c 40 -p gpu --exclusive
    ssh $SLURM_NODELIST
    ```

    Be sure to close out your session when you are done with it so that the nodes are returned to the queue.
- You are responsible for your usage of the machine. Keep track of your jobs and make sure they are running as expected. If you intend to run something for a long time (say, more than 4 hours), it's a good idea to let everyone know on the `#sapling` channel on Zulip. This is especially true if you intend to do performance experiments for a paper, so that we know not to interrupt your jobs.
  - For example, you can say: "I'm going to be running experiments on 2 GPU nodes for 8 hours. Please don't interrupt my jobs if you see them in the queue, and let me know if there are any problems."
  - You can also set limits on your jobs to make sure they don't hang indefinitely. When you run `sbatch`, `salloc`, or `srun`, you can add the flag `--time=HH:MM:SS` to specify a maximum running time of `HH` hours, `MM` minutes, and `SS` seconds. (See the sketch after this list.)
- If something goes wrong, we will contact you on Zulip. For example, if we see a job has been running a long time, we might double check that this is intentional and not the result of an unintended hang. Other users may also contact you if you are using a lot of resources and they can't get work done. Please respond to them (and to us) so that we can coordinate usage of the machine. If you do not respond, we may kill your jobs, especially if they use excessive resources.
- Please avoid excessive usage of the machine. For example, using all 4 GPU nodes for many hours is not reasonable, as it will block all other users from using GPUs. If you need to do so for an experiment, please let us know on `#sapling` so we can talk about it and schedule the experiment to avoid preventing other users from doing their work.
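To make the `--time` flag concrete, here is a minimal sketch (the command is the interactive GPU example from this guide; the 2-hour limit is arbitrary):

```
# Interactive session on a GPU node that SLURM will end after at most 2 hours.
srun -N 1 -n 1 -c 40 -p gpu --time=02:00:00 --pty bash --login
```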
These instructions are the fastest way to get started with Legion or Regent on Sapling.
For Legion:

```
git clone -b master https://github.com/StanfordLegion/legion.git
srun -n 1 -N 1 -c 40 -p gpu --exclusive --pty bash --login
module load cuda
cd legion/examples/circuit
LG_RT_DIR=$PWD/../../runtime USE_CUDA=1 make -j20
./circuit -ll:gpu 1
```
For Regent:

```
git clone -b master https://github.com/StanfordLegion/legion.git
srun -n 1 -N 1 -c 40 -p gpu --exclusive --pty bash --login
module load cmake cuda llvm
cd legion/language
./install.py --debug --cuda
./regent.py examples/circuit_sparse.rg -fcuda 1 -ll:gpu 1
```
Sapling consists of four sets of nodes:
| Type | Name | Memory | CPU (Cores) | GPU |
|---|---|---|---|---|
| Head | sapling | 256 GB | Intel Xeon Silver 4316 (20 cores) | |
| CPU | c0001 to c0004 | 256 GB | 2x Intel Xeon CPU E5-2640 v4 (2x10 cores) | |
| GPU | g0001 to g0004 | 256 GB | 2x Intel Xeon CPU E5-2640 v4 (2x10 cores) | 4x Tesla P100 (Pascal) |
| CI | n0000 to n0002 | 48 GB | 2x Intel Xeon X5680 (2x6 cores) | 2x Tesla C2070 (Fermi) |
When you log in, you'll get to the head node. Note that, because it uses a different architecture from the CPU/GPU nodes, it is probably best to use one of those nodes to build and run your software. (See below for machine access instructions.)
The following filesystems are mounted on NFS and are available on all nodes in the cluster.
| Path | Filesystem | Total Capacity | Quota | Replication Factor |
|---|---|---|---|---|
| `/home` | ZFS | 7 TiB | 100 GiB | 2x |
| `/scratch` | ZFS | 7 TiB | None | None |
| `/scratch2` | ZFS | 7 TiB | None | None |
Please note that `/home` has a quota. Larger files may be placed on `/scratch` or `/scratch2`, but please be careful about disk usage. If your usage is excessive, you may be contacted to reduce it.

You may check your `/home` quota usage with:

```
df -h $HOME
```
Some things to keep in mind while using the machine:
As of the May 2023 upgrade, we now maintain uniform software across the machine (e.g., the module system below, and the base OS). However, the head node and compute nodes still use different generations of Intel CPUs. It is usually possible to build compatible software, as long as you do not specify an `-march=native` flag (or similar) while building. However, note that Legion and Regent set `-march=native` by default, so builds from the head node should not be expected to work on the compute nodes. There are two possible solutions to this:
1. Build on a compute node. (See below for how to launch an interactive job.)
2. Disable `-march=native`:
   - For Legion, with the Make build system: set `MARCH=broadwell`
   - For Legion, with the CMake build system: set `-DBUILD_MARCH=broadwell`
   - For Regent: requires modifications to Terra, so it is easiest to follow (1) above.
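As a minimal sketch of option 2 for the Make build (directory layout and variables taken from the Legion quickstart above; the parallelism setting is arbitrary):

```
# Build the circuit example on the head node, targeting Broadwell instead of
# the host CPU so the resulting binary also runs on the compute nodes.
cd legion/examples/circuit
LG_RT_DIR=$PWD/../../runtime MARCH=broadwell make -j16
```

For a CMake build, pass `-DBUILD_MARCH=broadwell` on the `cmake` command line instead.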
Sapling has a very minimal module system. The intention is to provide compilers, MPI, and SLURM. All other packages should be installed on a per-user basis with Spack (see below).
To get started with the module system, we recommend adding the following to your `~/.bashrc`:

```
module load slurm mpi cmake
```

If you wish to use CUDA, you may also add:

```
module load cuda
```
Note: as of the May 2023 upgrade, we now maintain a uniform module system across the machine. All modules should be available on all nodes. For example, CUDA is available even on nodes without a GPU, including the head node. This should make it easier to build software that runs across the entire cluster.
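If you want to see what the module system provides, the standard module commands work (nothing Sapling-specific is assumed here):

```
module avail   # list every module available on the cluster
module list    # list the modules currently loaded in this shell
```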
To launch an interactive, single-node job (e.g., for building software):
```
srun -N 1 -n 1 -c 40 -p cpu --pty bash --login
```
Here's a breakdown of the parts in this command:

- `srun`: we're going to launch the job immediately (as opposed to, say, via a batch script).
- `-N 1` (a.k.a. `--nodes 1`): request one node.
- `-n 1` (a.k.a. `--ntasks 1`): we're going to run one "task". This is the number of processes to launch; in this case we only want one copy of bash running.
- `-c 40` (a.k.a. `--cpus-per-task 40`): the number of CPUs per task. This is important, or else your job will be bound to a single core.
- `-p cpu` (a.k.a. `--partition cpu`): select the CPU partition. (Change to `gpu` if you want to use GPU nodes, or skip if you don't care.)
- `--pty`: because it's an interactive job, we want the terminal to be set up like an interactive shell. You'd skip this on a batch job.
- `bash`: the command to run. (Replace this with your shell of choice.)
To launch a batch job, you might do something like the following:
```
sbatch my_batch_script.sh
```

Where `my_batch_script.sh` contains:

```
#!/bin/bash
#SBATCH -N 2
#SBATCH -n 2
#SBATCH -c 40
#SBATCH -p cpu

srun hostname
```
Note that, because the flags (`-N 2 -n 2 -c 40 -p cpu`) were provided on the `#SBATCH` lines in the script, it is not necessary to provide them when calling `srun`. SLURM will automatically pick them up and use them for any `srun` commands contained in the script.

After the job runs, you will get a file like `slurm-12.out` that contains the job output. In this case, that output would look something like:

```
c0001.stanford.edu
c0002.stanford.edu
```
Note: this workaround is no longer required. MPI should now detect the SLURM job properly.

Previously, MPI had not been built with SLURM compatibility enabled. That meant that Legion, Regent, and MPI jobs could not be launched with `srun`. A workaround for this was to use `mpirun` instead. For example, in a 2-node job, you might do:

```
mpirun -n 2 -npernode 1 -bind-to none ...
```

Now, instead of doing this, you can simply do:

```
srun -n 2 -N 2 -c 40 ...
```

(Note: the `-c 40` is required to make sure your job is given access to all the cores on the node.)
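As a filled-in version of that `srun` line (a sketch only; the `./circuit` binary and `-ll:gpu` flag come from the Legion quickstart above, and the partition choice is just an example):

```
# Run the circuit example on 2 GPU nodes, one process per node, one GPU each.
srun -n 2 -N 2 -c 40 -p gpu ./circuit -ll:gpu 1
```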
Reminder: all software should be installed and built on the compute nodes themselves. Please run the following in an interactive SLURM job (see Launching Interactive Jobs above).
Note: these instructions are condensed from the Spack documentation at https://spack.readthedocs.io/en/latest/. For more advanced topics, please see the original documentation.
```
git clone https://github.com/spack/spack.git
echo "if [[ \$(hostname) = c0* || \$(hostname) = g0* ]]; then source \$HOME/spack/share/spack/setup-env.sh; fi" >> ~/.bashrc
source $HOME/spack/share/spack/setup-env.sh
spack compiler find
spack external find openmpi
```

At that point, you should be able to install Spack packages. E.g.:

```
spack install legion
```
(Note: you probably don't want to do this, because most Legion users prefer to be on `master` or `control_replication`, but it should work regardless.)
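For a more typical per-user workflow, you install whatever tool you need and load it into your environment. A sketch (the package name `htop` is just an example of a package Spack provides):

```
spack install htop   # build and install into your own Spack tree
spack load htop      # put it on your PATH for this shell
```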
Sapling is a mixed-mode machine. While SLURM is the default job
scheduler, users can still use SSH to directly access nodes (and for
some purposes, may need to do so). Therefore, when you are doing
something performance-sensitive, please let us know on the #sapling
channel that you intend to do so. Similarly, please watch the #sapling
channel to make sure you're not stepping on what other users are doing.
How to...
To upgrade the NVIDIA driver, contact `action@cs`. They installed the CUDA driver originally on the `g000*` nodes, and know how to upgrade it.

For posterity, here is the upgrade procedure used (as of 2022-09-15), though you can let the admins do this:

```
wget https://us.download.nvidia.com/XFree86/Linux-x86_64/515.65.01/NVIDIA-Linux-x86_64-515.65.01.run
systemctl stop nvidia-persistenced.service
chmod +x NVIDIA-Linux-x86_64-515.65.01.run
./NVIDIA-Linux-x86_64-515.65.01.run --uninstall
./NVIDIA-Linux-x86_64-515.65.01.run -q -a -n -X -s
systemctl start nvidia-persistenced.service
```
Upgrading the CUDA toolkit and module files is something we can do ourselves. (Note: for the driver, see above.)

```
cd admin/cuda
./install_cuda_toolkit.sh
./install_cudnn.sh # note: requires download (see script)
cd ../modules
./setup_modules.sh
```
OS package upgrades we can also do ourselves, but watch out for potential upgrade hazards (e.g., GCC minor version updates, Linux kernel upgrades):

```
sudo apt update
sudo apt upgrade
sudo reboot
```

Important: check the status of the NVIDIA driver after this. If `nvidia-smi` breaks, see above.
We are responsible for maintaining Docker on the compute nodes.

```
cd admin
./install_docker.sh
```

Do NOT add users to the `docker` group. This is equivalent to adding them to `sudo`. Instead, see the rootless setup below.
From: https://docs.docker.com/engine/security/rootless/

Install the `uidmap` utility:

```
sudo apt update
sudo apt install uidmap
```

Create a range of at least 65536 UIDs/GIDs for the user in `/etc/subuid` and `/etc/subgid`:

```
$ cat /etc/subuid
test_docker:655360:65536
$ cat /etc/subgid
test_docker:655360:65536
```
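One way to create these ranges, assuming a reasonably recent `usermod` (the username `test_docker` and the ID range are copied from the example output above):

```
# Allocate 65536 subordinate UIDs and GIDs starting at 655360 for the user.
sudo usermod --add-subuids 655360-720895 --add-subgids 655360-720895 test_docker
```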
Within the user account, run:

```
salloc -n 1 -N 1 -c 40 -p gpu --exclusive
ssh $SLURM_NODELIST
dockerd-rootless-setuptool.sh install
```

(The SSH is required because user-level systemctl seems to be highly sensitive to how you log into the node. Otherwise you'll get an error saying "systemd not detected".)

Make sure to export the variables printed by the command above. E.g.:

```
export PATH=/usr/bin:$PATH
export DOCKER_HOST=unix:///run/user/$(id -u)/docker.sock
```

Relocate Docker's internal storage into `/tmp/$USER` to avoid issues with NFS:

```
mkdir -p /tmp/$USER
mkdir -p ~/.config/docker
echo '{"data-root":"/tmp/'$USER'"}' > ~/.config/docker/daemon.json
systemctl --user stop docker
systemctl --user start docker
```

Try running a container:

```
docker run -ti ubuntu:22.04
```

Clean up old containers (WARNING: may delete data):

```
docker container prune
```
We are responsible for maintaining GitLab Runner on the compute nodes. See `admin/gitlab` for some sample scripts.
There are three ways to reboot nodes, with progressively more aggressive settings:
1. Soft reboot. SSH to the node and run:

   ```
   sudo reboot
   ```

   IMPORTANT: be sure you are on the compute node and not the head node when you do this. Otherwise you will kill the machine for everyone.

   In some cases, I've seen nodes not come back after a soft reboot, so a hard reboot may be required. Or, if the node is out of memory, it may not be possible to get a shell on the node to run `sudo reboot` from.

2. Hard reboot. Run:

   ```
   ipmitool -U IPMI_USER -H c0001-ipmi -I lanplus chassis power cycle
   ```

   Be sure to change `IPMI_USER` to your IPMI username (different from your regular username!) and `c0001` to the node you want to reboot.

   This cuts power to the node and restarts it. There is also a soft reboot setting with `soft`, but I do not think it is particularly useful compared to `sudo reboot` (though it does not require SSH).

3. Otherwise, contact `action@cs`. It may be that there is a hardware issue preventing the node from coming back up.
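Relatedly, you can check whether a node actually came back on by querying its power state over IPMI (same `IPMI_USER` and `c0001` substitutions as above):

```
ipmitool -U IPMI_USER -H c0001-ipmi -I lanplus chassis power status
```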
Check the state of SLURM nodes with:

```
sinfo
```

To set the SLURM state of a node to `S`:

```
sudo /usr/local/slurm-23.02.1/bin/scontrol update NodeName=c0001 State=S
```

Here are some states you might find useful:

- `DRAIN`: This prevents any further jobs from being scheduled on the node, but allows current jobs to complete. Recommended when you want to do maintenance but don't want to disrupt jobs on the system.
- `DOWN`: This kills any jobs currently running on the node and prevents further jobs from running. This is usually not required, but may be useful if something gets really messed up and a job cannot be killed by the system.
- `RESUME`: Makes the node available to schedule jobs. Note that any issues have to be resolved before doing this, or else the node will just go back into the `DOWN` state again.

For the `DRAIN` and `DOWN` states, a `Reason` argument is also required. Please use this to indicate why the node is down (e.g., maintenance).
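For example, a sketch of draining `c0001` for maintenance and returning it to service afterwards (the node name and reason string are placeholders):

```
# Stop new jobs from landing on the node, but let running jobs finish.
sudo /usr/local/slurm-23.02.1/bin/scontrol update NodeName=c0001 State=DRAIN Reason="planned maintenance"

# When the work is done, return the node to service.
sudo /usr/local/slurm-23.02.1/bin/scontrol update NodeName=c0001 State=RESUME
```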
- Jobs do not complete, but remain in the `CG` (completing) state.

  There seem to be two possible causes for this:

  - Either there is something wrong with the compute node itself (e.g., out of memory) that is preventing it from killing the job, or
  - There is some sort of network issue. For example, we have seen DNS issues that created these symptoms. If `ping sapling` fails or connects to `127.0.1.1` instead of the head node, this is the likely culprit.

  Remember that SLURM itself will keep retrying to clear the completing job, so if it stays in the queue, it's because of an ongoing (not just one-time) issue.

  Steps to diagnose:

  - SSH to the failing node and see if it looks healthy (`htop`).
  - Try `ping sapling` from the compute node and see if it works.
  - `/var/log/syslog` is unlikely to be helpful at default SLURM logging levels. You can edit `/etc/slurm.conf` to define log files and levels if you need more information.

  Once the issue is resolved, the node state will usually fix itself, but if that doesn't happen, you can force it to reset by setting the node to `DOWN` and then `RESUME` again (see above).
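A quick way to see which jobs are stuck in the completing state and where they are running (a sketch; the format string is just one reasonable choice):

```
# List only jobs in the CG state: job ID, partition, user, node count, and nodes.
squeue --states=CG --format="%.10i %.9P %.8u %.6D %R"
```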