Home
- Cayuga is a private cluster with restricted access to members of cayuga_xxxx projects/groups.
- Access is restricted to connections from the Weill or Ithaca VPNs.
- For access to the Cayuga cluster send email to [email protected]. Please include Cayuga in the subject area.
- Running Rocky 8.5 and built with OpenHPC 2 and Slurm 20.11.9
- Cluster networking: EDR Infiniband
- New users might find the Getting Started on Cayuga information helpful
- Once you have completed the form for gaining ssh access to your account on the Cayuga cluster, you will receive a welcome email guiding you through: https://www.cac.cornell.edu/techdocs/clusters/cayuga/
Login nodes: [cayuga-login1,cayuga-login2,cayuga-vis1].cac.cornell.edu
- access via ssh using public/private keys
- You will have access to all 3 login nodes in order to submit jobs to the scheduler and to access your project data files on the /athena storage and your home directories.
- ALL jobs must be run via the slurm scheduler. DO NOT run your jobs on the login nodes. We may cancel any jobs running on the login nodes rather than via the scheduler, as this can affect all other users. If you need assistance running jobs, please send email to: [email protected]
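As a minimal sketch of the expected workflow (the script name and job ID are hypothetical):
```bash
sbatch myjob.sh      # submit a batch script to the scheduler (myjob.sh is a hypothetical example)
squeue -u $USER      # check the status of your queued and running jobs
scancel 12345        # cancel a job by its job ID (12345 is illustrative)
```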
- Qty 1: A100 GPU node
* g0001: CPUs=128 Boards=1 SocketsPerBoard=2 CoresPerSocket=32 ThreadsPerCore=2 RealMemory=1024000 GPU [0-3]: NVIDIA A100: 80GB PCIe
- Qty 2: A40 GPU node
* g00[2-3]: CPUs=128 Boards=1 SocketsPerBoard=2 CoresPerSocket=32 ThreadsPerCore=2 RealMemory=1024000 GPU [0-3]: NVIDIA A40: 48GB PCIe
- Qty 21: CPU nodes (hyperthreading ON)
* c00[01-11]: CPUs=112 Boards=1 SocketsPerBoard=2 CoresPerSocket=28 ThreadsPerCore=2 RealMemory=768000
* c00[12-21]: CPUs=128 Boards=1 SocketsPerBoard=2 CoresPerSocket=32 ThreadsPerCore=2 RealMemory=512000
EDR Infiniband
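To see these node definitions from the cluster itself, the standard Slurm query commands can be used, for example:
```bash
sinfo -N -l                 # per-node listing with CPU, memory, and state
scontrol show node g0001    # full Slurm definition of a specific node
```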
- Path: ~ OR $HOME OR /home/fs01/<cwid>
- Users' home directories are located on an NFS export from the Cayuga head node.
- Most data should go in your /athena project folder, but some smaller data sets may make more sense to keep in your $HOME:
- Scripts, code, profiles, and other files and user-installed software for which $HOME is the assumed location.
- Small datasets or low I/O applications that don't benefit from a high-performance filesystem.
- Data rarely or never accessed from compute nodes.
- Applications where client-side caching is important: binaries, libraries, virtual/conda environments, Singularity containers (unless staging to /tmp on compute nodes is feasible).
- Data in users' home directories are NOT backed up; users are responsible for backing up their own data.
- Path: /athena
- Parallel file system (3.8 PB)
- Each Cayuga project has a scratch directory set up as: /athena/cayuga_####/scratch/[cwid]
- There is also a symlink from the labname per cayuga project: /athena/[labname] --> /athena/cayuga_####
- Almost all computing should be done in your /athena/cayuga_####/scratch/[cwid] directory (not in your $HOME) to avoid causing heavy I/O on the home filesystem and affecting other users.
- Recommended method of transferring files to the cayuga endpoint https://www.cac.cornell.edu/TechDocs/files/FileTransferGlobus/
- cayuga globus endpoint is: cac#cayuga
- If you are using rsync to copy your data to the cayuga cluster, you need to use your ssh key, just as you do for login. An example rsync line:
- rsync -avhP --progress -e "ssh -i ~/.ssh/your_cayuga_key" /Path_to_FromDir_Data [cwid]@cayuga-login1.cac.cornell.edu:/athena/[labname]/scratch/[cwid]/
- The cluster scheduler is Slurm v22.05.2.
- Slurm Quick Start
- Slurm Information
There are currently 2 partitions on the cayuga cluster that everyone can submit to:
- scu-cpu: PartitionName=scu-cpu Nodes=c00[01-21] Default=YES MaxTime=7-0
- scu-gpu: PartitionName=scu-gpu Nodes=g000[1-3] Default=NO MaxTime=7-0
- Access to the above partitions is regulated through a slurm fairshare system.
- To request a specific number of GPUs (either A40 or A100), add the request to your srun/sbatch command:
* for the A40: --gres=gpu:a40:<# of requested GPUs>
* for the A100: --gres=gpu:a100:<# of requested GPUs>
- Example: have two A40 GPUs assigned to an interactive bash session:
    [cayuga-login1 ~]$ srun -p scu-gpu --gres=gpu:a40:2 --pty bash
    bash-4.4$ hostname
    g0002
    bash-4.4$ nvidia-smi
    Wed Aug 30 15:46:06 2023
    +---------------------------------------------------------------------------------------+
    | NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
    |-----------------------------------------+----------------------+----------------------+
    | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf          Pwr:Usage/Cap |          Memory-Usage | GPU-Util  Compute M. |
    |                                         |                      |               MIG M. |
    |=========================================+======================+======================|
    |   0  NVIDIA A40                     On  | 00000000:17:00.0 Off |                    0 |
    |  0%   28C    P8             27W / 300W  |    0MiB / 46068MiB   |      0%      Default |
    |                                         |                      |                  N/A |
    +-----------------------------------------+----------------------+----------------------+
    |   1  NVIDIA A40                     On  | 00000000:65:00.0 Off |                    0 |
    |  0%   27C    P8             30W / 300W  |    0MiB / 46068MiB   |      0%      Default |
    |                                         |                      |                  N/A |
    +-----------------------------------------+----------------------+----------------------+

    +---------------------------------------------------------------------------------------+
    | Processes:                                                                             |
    |  GPU   GI   CI        PID   Type   Process name                            GPU Memory  |
    |        ID   ID                                                             Usage       |
    |=======================================================================================|
    |  No running processes found                                                            |
    +---------------------------------------------------------------------------------------+
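For batch work, a hedged sbatch sketch requesting GPUs in the scu-gpu partition (the walltime, memory, and workload script are illustrative):
```bash
#!/bin/bash
#SBATCH --partition=scu-gpu
#SBATCH --gres=gpu:a40:2      # or --gres=gpu:a100:<count> for the A100 node
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
#SBATCH --time=1-00:00:00

nvidia-smi                    # confirm which GPUs were assigned
python train.py               # train.py is a hypothetical workload
```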
- If you want your job to run on specific hardware types, you can specify constraints with -C.
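A hedged sketch of checking and using node features (feature names are defined per cluster, so the placeholder below is illustrative):
```bash
sinfo -o "%N %f"              # list nodes and the feature tags defined for them
sbatch -C <feature> job.sh    # request nodes carrying a specific feature tag
```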
A lot of software can be installed in a user's $HOME directory without root access, or is easy to build from source. Please check for such options, as well as the virtual environment and container solutions described below, before requesting system-wide software installation (unless there are licensing issues).
Users can manage their own python environment (including installing needed modules) using virtual environments. Please see the documentation on virtual environments on python.org for details.
NOTE: Consider starting with Miniconda if you do not need a multitude of packages; it is smaller and faster to install and update.
- Anaconda can be used to maintain custom environments for R, Python, and many other software packages, including alternate interpreter versions and dependencies.
- Reference to help decide if Miniconda is enough: https://conda.io/docs/user-guide/install/download.html
- Reference for Anaconda R Essentials: https://conda.io/docs/user-guide/tasks/use-r-with-conda.html
- Reference for Linux install: https://conda.io/docs/user-guide/install/linux.html
- Please work through the conda tutorials to assist with managing conda packages.
- We recommend installing miniconda in your Athena directory (/athena/cayuga_0xxx/scratch/username)
- To Install miniconda in your athena directory, follow: https://docs.conda.io/en/latest/miniconda.html
example:
mkdir -p /athena/cayuga_0001/scratch/jhs3001/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O /athena/cayuga_0001/scratch/jhs3001/miniconda3/miniconda.sh
bash /athena/cayuga_0001/scratch/jhs3001/miniconda3/miniconda.sh -b -u -p /athena/cayuga_0001/scratch/jhs3001/miniconda3
rm -rf /athena/cayuga_0001/scratch/jhs3001/miniconda3/miniconda.sh
/athena/cayuga_0001/scratch/jhs3001/miniconda3/bin/conda init bash
logout and back in, or type: source ~/.bashrc
You may also need to run: conda update -n base -c defaults conda
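Once conda is initialized, a minimal sketch of creating and using an environment (the environment name and packages are illustrative):
```bash
conda create -n myproject python=3.10   # 'myproject' is a hypothetical environment name
conda activate myproject
conda install numpy pandas              # install packages into the active environment
conda deactivate
```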
You can create as many virtual environments, each in their own directory, as needed.
- python3.9:
python3.9 -m venv <your virtual environment directory>
You need to activate a virtual environment before using it:
source <your virtual environment directory>/bin/activate
Once such an environment is activated, both python and python3 should become aliases for python3.9.
After activating your virtual environment, you can now install python modules for the activated environment:
- It's always a good idea to update pip first:
pip install --upgrade pip
- Install the module:
pip install <module name>
- List installed python modules in the environment:
pip list
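Putting the above together, a minimal sketch using a hypothetical Athena path and module:
```bash
python3.9 -m venv /athena/cayuga_0001/scratch/jhs3001/envs/myenv   # path is illustrative
source /athena/cayuga_0001/scratch/jhs3001/envs/myenv/bin/activate
pip install --upgrade pip
pip install numpy        # numpy stands in for whatever module you need
pip list
deactivate               # leave the virtual environment when done
```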
Singularity is a container system similar to Docker, but suitable for running in HPC environments without root access. You might want to use Singularity if:
- You're using software or dependencies designed for a different Linux distribution or version than the one on Cayuga.
- Your software is easy to install using a Linux distribution's packaging system, which would require root access.
- There's a Docker or Singularity image available from a registry like Nvidia NGC, Docker Hub, Singularity Hub, or from another cluster user with the software you need.
- There's a Dockerfile or Singularity recipe that is close to what you need with a few modifications.
- You want a reproducible software environment on different systems, or for publication.
To use Singularity, first load the module with: module load singularity
Download an existing image with singularity pull, which doesn't require root access. If multiple people will use the same image, we can publish it in a shared location.
Build a new image with singularity build, which usually must be run on an outside machine where you have root access. Then you can upload it directly to the cluster to run, or transfer it through a container registry.
Run software in the container using singularity run, singularity exec, or singularity shell. Performance will likely be best if you copy the image to local disk on the compute node, or run it from the HOME filesystem (not BeeGFS). Remember that the container potentially works like a different OS distribution and software stack, even though it can access some host filesystems by default (such as HOME), so be careful about interactions with your existing environment (shell startup files, lmod, Anaconda, venv, etc.). Consider using the -c option or maintaining an environment specific to each container you use.
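A minimal sketch of that workflow, assuming a publicly available image (the image and the command run inside it are illustrative):
```bash
module load singularity
singularity pull python_3.11.sif docker://python:3.11    # download an image; no root access needed
singularity exec -c python_3.11.sif python3 --version    # run a command with a contained environment (-c)
```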
Software | Module to set PATH | Notes
---|---|---
Matlab 2023 | module load matlab/2023 | |
Rstudio 2023 4.2.1 | module load rstudio | |
- On your laptop, add to (or create) the file ~/.ssh/config with entries for the login nodes and any compute hosts you may use with RStudio:
    Host cayuga-login1
        Hostname cayuga-login1.cac.cornell.edu
        IdentityFile ~/.ssh/your_cayuga_key
        User your_cwid

    Host c0001
        Hostname c0001
        IdentityFile ~/.ssh/your_cayuga_key
        User your_cwid

    Host c0002
        Hostname c0002
        IdentityFile ~/.ssh/your_cayuga_key
        User your_cwid
- Once the above has been added to your ~/.ssh/config file, login to cayuga-login1.cac.cornell.edu with your key
- type: module load rstudio
- type: rstudio_run
- Follow the output.
For example, let's say you are put on c0002: open a new terminal window on your laptop and type: ssh -J your_cwid@cayuga-login1 -NL [port#]:localhost:[port#] your_cwid@c0002
B. If not connected to VPN: ***Option B will only work if you previously had an account on the Greenberg cluster(aphrodite/pascal)***
- Once the forwarding is setup (your above terminal window on your laptop will appear like it is hanging), bring up a browser on your local box with: http://localhost:[port#]
- login with your cwid and password that was provided upon running rstudio_run
- When done using RStudio, terminate the job by:
* Exit the RStudio session ("power" button in the top right corner of the RStudio window)
* Issue the following command on the login node: scancel -f [your_job_id]
1. Setup ~/.ssh/config on your laptop:
    Host cayuga-login1
        Hostname cayuga-login1.cac.cornell.edu
        IdentityFile ~/.ssh/your_cayuga_key
        User [your_cwid]

    Host c00*
        IdentityFile ~/.ssh/[your_key_name]
        User [your_cwid]
        ProxyCommand ssh -i .ssh/[your_key_name] -W %h:%p cayuga-login1
2. Login to one of the login nodes:
ssh -i ~/.ssh/[your_key_name] [your_cwid]@cayuga-login1
3. From your terminal window on cayuga-login1, type:
module load anaconda3
srun --pty -n1 --mem=8G -p scu-cpu /bin/bash -i
4. Be sure you know which compute node you were placed on; type:
hostname
5. Let's say you were put on c0001. In the same terminal window (which is now on c0001 in our example), type:
jupyter notebook --no-browser --ip 0.0.0.0 --port=8962
NOTE: you will not receive a prompt back in this terminal window
6. Back on your laptop terminal window, type:
ssh -NL 127.0.0.1:8962:c0001:8962 c0001
NOTE: replace the port number you used in step 5 and the node name you were placed on from step 4; each appears in two places on the above line. Again - you will not receive a prompt back in your terminal window on your laptop.
7. In a browser on your laptop, copy and paste the last line of the output from step 5. It should look something like:
http://127.0.0.1:8962/?token=dd21318d568114149b7b169fad09466fc8683b5b1773fd0e
Set up your working environment for each software package using the module command. The module command will activate dependent modules if there are any.
- Show all available modules:
module avail
- Show currently loaded modules:
module list
- Load a module:
module load [software_name/version] (as shown in the output of module avail)
- Unload a module:
module unload [software_name/version]
It is possible to create your own personal modulefiles to support the software and settings you use.
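A minimal sketch, assuming the cluster's module system is Lmod (commonly provided by OpenHPC 2); the directory, tool name, and install path are illustrative:
```bash
mkdir -p ~/modulefiles/mytool
cat > ~/modulefiles/mytool/1.0.lua <<'EOF'
-- minimal Lmod modulefile: put a user-installed tool on PATH (path is illustrative)
prepend_path("PATH", "/athena/cayuga_0001/scratch/jhs3001/mytool/1.0/bin")
EOF
module use ~/modulefiles    # make personal modulefiles visible to the module command
module load mytool/1.0
```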
Software can generally be installed system-wide when:
- It's a simple package install from a standard repository (Rocky, EPEL, OpenHPC-2) with no likely conflicts. Try dnf search as a quick check for package availability (see the example after this list).
- It's required for licensing reasons, subject to additional direct approval by the cluster owner (potentially only the license infrastructure will be installed).
- It can't be installed by the mechanisms above, OR it is version-stable and used by >1 in your group.
- Submit requests by sending email to: [email protected]. Please include 'Cayuga' in the subject area.
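As a quick sketch of checking package availability with dnf (the package name is illustrative):
```bash
dnf search hdf5        # search the configured repositories for matching packages
dnf info hdf5-devel    # show details for a specific package, if available
```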
The default shell for all users is bash. If you would like a different shell because some of your tools have feature requirements, you may submit a request to have your shell changed; you will *not* be able to use chsh to change it yourself. To change your default login shell (on all CAC clusters), send a request to [email protected] asking for a login shell change on the cayuga cluster.
- Head node. Don't run jobs on the head node, as it can make things unresponsive for other users or in the worst case take down the whole cluster. It's ok to compile code, scan files, and do other minor administrative tasks on the head node though.
- Threads. If you have a multithreaded job, you might want to limit the number of threads to something like 1 or 2 per core reserved, or reserve one core per thread or two. The simplest way to do this is usually to use the '-c' option with a value that is double the number of threads (e.g. if you want 4 cores/task, use '-c 8'). Slurm sees each core as being 2 CPUs due to hyperthreading; however, your program might not use hyperthreading well. Many multithreaded programs default to the number of CPUs they see on the system and are not aware of scheduled resources. We are now forcing CPU affinity, so jobs with too many threads should no longer interfere with other jobs (but might hurt their own performance). See the sketch after this list. Ref: More information re. threads
- SIMD jobs / job arrays. If you are running many instances of the same job with different data or settings, please don't just launch tons of separate jobs in a loop. Use a job array: they are easier to monitor, manage, and cancel. Be sure to set a slot limit (% notation) to avoid flooding the queue and to allow others to use the resources too; as a rule of thumb, you should be using less than 25% of any in-demand resource such as GPUs. It's good for there to always be some idle resources in case someone needs to test something quickly using a small amount of resources. Also, please run such jobs at lower priority (using qos or nice) when possible, or otherwise communicate directly with other users in case there's a problem. See the ics-research/ics-cluster-script-examples repository on COECIS Github and the sketch after this list for examples.
- Job priority / nice. Please follow the guidelines for QOS level, and use nice as applicable. Keep in mind that if everyone runs at the highest priority all the time, the priority levels will become useless. See above.
- Interactive use.
- Interactive use with multiple windows.
- Code development/testing/debugging.
- Long-running jobs / checkpoint-and-resume.
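Tying the threads and job-array guidance above together, a hedged sbatch sketch (the partition, resource values, and program are illustrative):
```bash
#!/bin/bash
#SBATCH --partition=scu-cpu
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8      # 4 physical cores; Slurm counts 2 CPUs per core with hyperthreading
#SBATCH --mem=16G
#SBATCH --time=04:00:00
#SBATCH --array=0-99%10        # 100 array tasks, at most 10 running at once (slot limit)

export OMP_NUM_THREADS=4       # limit threads to the cores you reserved
./my_program --input data_${SLURM_ARRAY_TASK_ID}.txt   # my_program and its inputs are hypothetical
```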
- Submit questions or requests by sending email to: [email protected]. Please include Cayuga in the subject area.
- Access to Cayuga requires ssh keys. Step by Step Instructions