Research Clusters
Computing clusters allow us to outsource computationally expensive jobs to external computer banks. If you are unsure whether a job should be submitted to a cluster, first ask another RA and then proceed to MG and JMS if necessary.
The Simple Linux Utility for Resource Management (SLURM) is used by most computing clusters. You can find a comprehensive guide to SLURM on its homepage and a quick cheatsheet at this Slurm 101 website.
The basic workflow on these clusters is the same as on your local computer. Always check and test your code carefully before submitting it to a cluster.
Sherlock is Stanford's computing cluster and uses SLURM. The following are some tips for getting started quickly:
You can access Sherlock from your terminal by running:
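```bash
# Replace <SUNetID> with your Stanford SUNet ID (see the Sherlock documentation)
ssh <SUNetID>@login.sherlock.stanford.edu
```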
When logging in for the first time, check this Sherlock guide to set up your credentials.
On Sherlock, we should clone our directories into personal folders in `gentzkow`, which can be accessed by running `cd $OAK`.
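For example, a typical setup might look like the sketch below; the personal folder name and repository URL are placeholders.

```bash
# Go to the lab's group storage and create a personal folder
cd $OAK
mkdir -p <your_name>
cd <your_name>

# Clone the project repository
git clone <repository_url>
```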
Upon first logging into the cluster, you can install the software you need by trying the following in order:
- Use `module avail` to see if the software is already installed on the cluster. If so, use `module load` to load it (see the sketch after this list).
- Use `brew install <name>` if the software is available in Homebrew.
- Use `wget` and `unzip`/`tar` to download and extract the software manually.
- Contact Sherlock support if all of the above methods fail.
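For instance, checking for and loading already-installed software might look like this; the module name and version are illustrative.

```bash
# List available modules matching "python"
module avail python

# Load a specific version from the list
module load python/3.6.1
```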
You should follow the protocol below to set up your environment (a combined example follows the list).
- Explicitly specify your Python or Conda path at the beginning of your `.bashrc` script by setting `export PYTHONPATH=$OAK/<YourPythonPathInSherlock>/python3.6/site-packages:$PYTHONPATH`, or, if using Conda, `source <YourCondaPathInSherlock>/conda.sh` followed by `conda activate <CONDAENVIRONMENT>`.
- Load the needed modules.
- Initialize your environment as required by your project (usually in a repository README or wiki).
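Putting these pieces together, the relevant part of a `.bashrc` might look like the sketch below; the paths, environment name, and module versions are placeholders, not confirmed values.

```bash
# Option 1: point Python at the project's packages (path is a placeholder)
export PYTHONPATH=$OAK/<YourPythonPathInSherlock>/python3.6/site-packages:$PYTHONPATH

# Option 2: activate a Conda environment instead (path and name are placeholders)
source <YourCondaPathInSherlock>/conda.sh
conda activate <CONDAENVIRONMENT>

# Load the modules your project needs (versions are illustrative)
module load python/3.6.1
module load gcc/10.1.0
```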
There are two testing steps that you should follow.
- Test your bash script before submitting a job, to be sure that the environment is set up correctly.
One way to do this is to have your sbatch script call only a specific `lib` file.
These files gather functions that are used throughout the repository and therefore run quickly, but they need a fully working environment to run, so they serve as a fast check of your environment setup.
In general, your jobs are assigned a node depending on their priority and the resources requested. If you allocate your job only to the `gentzkow` partition (see below), it will run whenever the partition has a free node, regardless of the time or memory requested. However, if you opt for multiple partitions, the allocation depends on the time and memory requested.
Since your environment test is fast, we suggest adjusting the requested time and memory and allocating your job to multiple partitions with `#SBATCH --partition=gentzkow,owners` to access whichever is free.
Remember to remove the `owners` partition when you run your full job, as you do not have priority on it. A minimal test script is sketched after this list.
- Test your full script by running it on a subset of the data. You can also test your code interactively by running the command `sdev -m16GB`.
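As an illustration, an environment-check job script might look like the sketch below; it assumes a Python `lib` file, and the job name, resource requests, paths, and module version are placeholders.

```bash
#!/bin/bash
#SBATCH --job-name=env_test
#SBATCH --partition=gentzkow,owners
#SBATCH --time=00:10:00
#SBATCH --mem=4G
#SBATCH --output=env_test_%j.out

# Set up the environment as described above
module load python/3.6.1

# Run a single lib file as a quick check that the environment works
python "$OAK/<YourRepo>/lib/<some_lib_file>.py"
```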
After setting up your environment and testing your code, you can submit a job. Sherlock's guide on submitting and running jobs can be found here.
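For reference, submitting and monitoring a batch job generally looks like this (the script name is a placeholder):

```bash
# Submit the job script; SLURM prints the assigned job ID
sbatch <your_script>.sbatch

# Check the status of your queued and running jobs
squeue -u $USER

# Cancel a job if needed (replace <jobid> with the ID printed by sbatch)
scancel <jobid>
```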
Before submitting a job, review the partition information below.
MG has purchased resources on Stanford's Sherlock 2 cluster, which form the `gentzkow` partition. Jobs submitted to this partition compete only with other lab members' jobs for resources, and interactive jobs can request all available resources. The partition consists of:
- A single Dell C6320 server with 20 cores, 256G of RAM, and a 200G SSD.
You can submit all your jobs to multiple partitions. This is done by setting `#SBATCH --partition=gentzkow,hns,normal` at the beginning of your sbatch script.
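For example, the header of such a script might look like this sketch; the job name and resource requests are placeholders.

```bash
#!/bin/bash
#SBATCH --job-name=full_run
#SBATCH --partition=gentzkow,hns,normal
#SBATCH --time=12:00:00
#SBATCH --mem=32G
```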
The Humanities and Sciences Dean's Office at Stanford has purchased Sherlock nodes for its researchers' exclusive use. These nodes belong to the hns partition. This partition consists of:
- 10 CPU nodes, which have 64 GB of RAM and 16 multi-core CPUs, and
- hns_gpu, a graphics processing unit (GPU) node with 128 GB of RAM, 16 CPUs, and 8 Tesla K80 GPUs (1.87 Tflops double precision and 5.60 Tflops single precision each), and
- a large memory node with 1.5 TB of RAM. Job requests that require over 64 GB of RAM are automatically sent to this node.
Access these nodes by adding `-p hns` to your job submission requests. Similarly, you can run GPU jobs by adding `-p hns_gpu --gres gpu:1` to your request commands.
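For instance, a quick interactive GPU session could be requested as follows (the time limit is illustrative):

```bash
# Request one GPU on the hns_gpu partition for an interactive session
srun -p hns_gpu --gres gpu:1 --time=01:00:00 --pty bash
```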