Home
- Cayuga is a private cluster with restricted access to members of cayuga_xxxx projects/groups.
- Access is restricted to connections from the Weill or Ithaca VPNs.
- For access to the Cayuga cluster send email to [email protected]. Please include Cayuga in the subject area.
- Running Rocky 8.5 and built with OpenHPC 2 and Slurm 20.11.9
- Cluster networking: EDR Infiniband
- New users might find the Getting Started on Cayuga information helpful
- Once you have completed the form for gaining ssh access to your account on the Cayuga cluster, you will receive a welcome email guiding you through: https://www.cac.cornell.edu/techdocs/clusters/cayuga/
Login nodes: [cayuga-login1,cayuga-login2,cayuga-vis1].cac.cornell.edu
- access via ssh using public/private keys
- You will have access to all 3 login nodes in order to submit jobs to the scheduler and to access your project data files on the /athena storage and your home directories.
- ALL jobs must be run via the slurm scheduler. DO NOT run your jobs on the login nodes. We may cancel any jobs running on the login nodes rather than via the scheduler, as this can affect all other users. If you need assistance running jobs, please send email to: [email protected]
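As a minimal sketch of the expected workflow (the script name and job ID are hypothetical):
```bash
sbatch myjob.sh      # submit a batch script to the scheduler (myjob.sh is a hypothetical example)
squeue -u $USER      # check the status of your queued and running jobs
scancel 12345        # cancel a job by its job ID (12345 is illustrative)
```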
- Qty 1: A100 GPU node
* g0001: CPUs=128 Boards=1 SocketsPerBoard=2 CoresPerSocket=32 ThreadsPerCore=2 RealMemory=1024000 GPU [0-3]: NVIDIA A100: 80GB PCIe
- Qty 2: A40 GPU node
* g00[2-3]: CPUs=128 Boards=1 SocketsPerBoard=2 CoresPerSocket=32 ThreadsPerCore=2 RealMemory=1024000 GPU [0-3]: NVIDIA A40: 48GB PCIe
- Qty 21: CPU nodes (hyperthreading ON)
* c00[01-11]: CPUs=112 Boards=1 SocketsPerBoard=2 CoresPerSocket=28 ThreadsPerCore=2 RealMemory=768000
* c00[12-21]: CPUs=128 Boards=1 SocketsPerBoard=2 CoresPerSocket=32 ThreadsPerCore=2 RealMemory=512000
EDR Infiniband
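To see these node definitions from the cluster itself, the standard Slurm query commands can be used, for example:
```bash
sinfo -N -l                 # per-node listing with CPU, memory, and state
scontrol show node g0001    # full Slurm definition of a specific node
```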
- Path: ~ OR $HOME OR /home/fs01/<cwid>
- Users' home directories are located on an NFS export from the Cayuga head node.
- Most data should go in your /athena project folder, but some smaller data sets may make more sense to keep in your $HOME:
- Scripts, code, profiles, and other files and user-installed software for which $HOME is the assumed location.
- Small datasets or low I/O applications that don't benefit from a high-performance filesystem.
- Data rarely or never accessed from compute nodes.
- Applications where client-side caching is important: binaries, libraries, virtual/conda environments, Singularity containers (unless staging to /tmp on compute nodes is feasible).
- Data in users' home directories are NOT backed up; users are responsible for backing up their own data.
- Path: /athena
- Parallel file system (3.8 PB)
- Each Cayuga project has a scratch directory set up as: /athena/cayuga_####/scratch/[cwid]
- There is also a symlink from the labname per cayuga project: /athena/[labname] --> /athena/cayuga_####
- Almost all computing should be done in your /athena/cayuga_####/scratch/[cwid] directory (not in your $HOME) to avoid causing heavy I/O on the home filesystem and affecting other users.
- Recommended method of transferring files to the cayuga endpoint https://www.cac.cornell.edu/TechDocs/files/FileTransferGlobus/
- cayuga globus endpoint is: cac#cayuga
- If you are using rsync to copy your data to the cayuga cluster, you need to use your ssh key, just as you do for login. An example rsync line:
- rsync -avhP --progress -e "ssh -i ~/.ssh/your_cayuga_key" /Path_to_FromDir_Data [cwid]@cayuga-login1.cac.cornell.edu:/athena/[labname]/scratch/[cwid]/
- The cluster scheduler is Slurm v22.05.2.
- Slurm Quick Start
- Slurm Information
There are currently 2 partitions on the cayuga cluster that everyone can submit to:
- scu-cpu: PartitionName=scu-cpu Nodes=c00[01-21] Default=YES MaxTime=7-0
- scu-gpu: PartitionName=scu-gpu Nodes=g000[1-3] Default=NO MaxTime=7-0
- Access to the above partitions is regulated through a slurm fairshare system.
- To request a specific number of GPUs (either A40 or A100), add the request to your srun/sbatch command:
* for the A40: --gres=gpu:a40:<# of requested GPUs>
* for the A100: --gres=gpu:a100:<# of requested GPUs>
- Example: have two A40 GPUs assigned to an interactive bash session:
    [cayuga-login1 ~]$ srun -p scu-gpu --gres=gpu:a40:2 --pty bash
    bash-4.4$ hostname
    g0002
    bash-4.4$ nvidia-smi
    Wed Aug 30 15:46:06 2023
    +---------------------------------------------------------------------------------------+
    | NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
    |-----------------------------------------+----------------------+----------------------+
    | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf          Pwr:Usage/Cap |          Memory-Usage | GPU-Util  Compute M. |
    |                                         |                      |               MIG M. |
    |=========================================+======================+======================|
    |   0  NVIDIA A40                     On  | 00000000:17:00.0 Off |                    0 |
    |  0%   28C    P8             27W / 300W  |    0MiB / 46068MiB   |      0%      Default |
    |                                         |                      |                  N/A |
    +-----------------------------------------+----------------------+----------------------+
    |   1  NVIDIA A40                     On  | 00000000:65:00.0 Off |                    0 |
    |  0%   27C    P8             30W / 300W  |    0MiB / 46068MiB   |      0%      Default |
    |                                         |                      |                  N/A |
    +-----------------------------------------+----------------------+----------------------+

    +---------------------------------------------------------------------------------------+
    | Processes:                                                                             |
    |  GPU   GI   CI        PID   Type   Process name                            GPU Memory  |
    |        ID   ID                                                             Usage       |
    |=======================================================================================|
    |  No running processes found                                                            |
    +---------------------------------------------------------------------------------------+
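For batch work, a hedged sbatch sketch requesting GPUs in the scu-gpu partition (the walltime, memory, and workload script are illustrative):
```bash
#!/bin/bash
#SBATCH --partition=scu-gpu
#SBATCH --gres=gpu:a40:2      # or --gres=gpu:a100:<count> for the A100 node
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
#SBATCH --time=1-00:00:00

nvidia-smi                    # confirm which GPUs were assigned
python train.py               # train.py is a hypothetical workload
```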
- If you want your job to run on specific hardware types, you can specify constraints with -C.
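A hedged sketch of checking and using node features (feature names are defined per cluster, so the placeholder below is illustrative):
```bash
sinfo -o "%N %f"              # list nodes and the feature tags defined for them
sbatch -C <feature> job.sh    # request nodes carrying a specific feature tag
```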
A lot of software can be installed in a user's $HOME directory without root access, or is easy to build from source. Please check for such options, as well as the virtual environment and container solutions described below, before requesting system-wide software installation (unless there are licensing issues).
Users can manage their own python environment (including installing needed modules) using virtual environments. Please see the documentation on virtual environments on python.org for details.
NOTE: Consider starting with Miniconda if you do not need a multitude of packages; it is smaller and faster to install and update.
- Anaconda can be used to maintain custom environments for R, Python, and many other software packages, including alternate interpreter versions and dependencies.
- Reference to help decide if Miniconda is enough: https://conda.io/docs/user-guide/install/download.html
- Reference for Anaconda R Essentials: https://conda.io/docs/user-guide/tasks/use-r-with-conda.html
- Reference for Linux install: https://conda.io/docs/user-guide/install/linux.html
- Please work through the conda tutorials to assist with managing conda packages.
- We recommend installing miniconda in your Athena directory (/athena/cayuga_0xxx/scratch/username)
- To Install miniconda in your athena directory, follow: https://docs.conda.io/en/latest/miniconda.html
example:
mkdir -p /athena/cayuga_0001/scratch/jhs3001/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O /athena/cayuga_0001/scratch/jhs3001/miniconda3/miniconda.sh
bash /athena/cayuga_0001/scratch/jhs3001/miniconda3/miniconda.sh -b -u -p /athena/cayuga_0001/scratch/jhs3001/miniconda3
rm -rf /athena/cayuga_0001/scratch/jhs3001/miniconda3/miniconda.sh
/athena/cayuga_0001/scratch/jhs3001/miniconda3/bin/conda init bash
logout and back in, or type: source ~/.bashrc
You may also need to run: conda update -n base -c defaults conda
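Once conda is initialized, a minimal sketch of creating and using an environment (the environment name and packages are illustrative):
```bash
conda create -n myproject python=3.10   # 'myproject' is a hypothetical environment name
conda activate myproject
conda install numpy pandas              # install packages into the active environment
conda deactivate
```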
You can create as many virtual environments, each in their own directory, as needed.
- python3.9:
python3.9 -m venv <your virtual environment directory>
You need to activate a virtual environment before using it:
source <your virtual environment directory>/bin/activate
Once such an environment is activated, both python and python3 should become aliases for python3.9.
After activating your virtual environment, you can now install python modules for the activated environment:
- It's always a good idea to update pip first:
pip install --upgrade pip
- Install the module:
pip install <module name>
- List installed python modules in the environment:
pip list
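Putting the above together, a minimal sketch using a hypothetical Athena path and module:
```bash
python3.9 -m venv /athena/cayuga_0001/scratch/jhs3001/envs/myenv   # path is illustrative
source /athena/cayuga_0001/scratch/jhs3001/envs/myenv/bin/activate
pip install --upgrade pip
pip install numpy        # numpy stands in for whatever module you need
pip list
deactivate               # leave the virtual environment when done
```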
Singularity is a container system similar to Docker, but suitable for running in HPC environments without root access. You might want to use Singularity if:
- You're using software or dependencies designed for a different Linux distribution or version than the one on Cayuga.
- Your software is easy to install using a Linux distribution's packaging system, which would require root access.
- There's a Docker or Singularity image available from a registry like Nvidia NGC, Docker Hub, Singularity Hub, or from another cluster user with the software you need.
- There's a Dockerfile or Singularity recipe that is close to what you need with a few modifications.
- You want a reproducible software environment on different systems, or for publication.
To use Singularity, first load the module with: module load singularity
Download an existing image with singularity pull, which doesn't require root access. If multiple people will use the same image, we can publish it in a shared location.
Build a new image with singularity build, which usually must be run on an outside machine where you have root access. Then you can upload it directly to the cluster to run, or transfer it through a container registry.
Run software in the container using singularity run, singularity exec, or singularity shell. Performance will likely be best if you copy the image to local disk on the compute node, or run it from the HOME filesystem (not BeeGFS). Remember that the container potentially works like a different OS distribution and software stack, even though it can access some host filesystems by default (such as HOME), so be careful about interactions with your existing environment (shell startup files, lmod, Anaconda, venv, etc.). Consider using the -c option or maintaining an environment specific to each container you use.
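A minimal sketch of that workflow, assuming a publicly available image (the image and the command run inside it are illustrative):
```bash
module load singularity
singularity pull python_3.11.sif docker://python:3.11    # download an image; no root access needed
singularity exec -c python_3.11.sif python3 --version    # run a command with a contained environment (-c)
```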
Software | Module to set PATH | Notes
---|---|---
Matlab 2023 | module load matlab/2023 | |
Rstudio 2023 4.2.1 | module load rstudio | |
- On your laptop, add to (or create) the file ~/.ssh/config with entries for the login nodes and any compute hosts you may use with RStudio:
    Host cayuga-login1
        Hostname cayuga-login1.cac.cornell.edu
        IdentityFile ~/.ssh/your_cayuga_key
        User your_cwid

    Host c0001
        Hostname c0001
        IdentityFile ~/.ssh/your_cayuga_key
        User your_cwid

    Host c0002
        Hostname c0002
        IdentityFile ~/.ssh/your_cayuga_key
        User your_cwid
- Once the above has been added to your ~/.ssh/config file, login to cayuga-login1.cac.cornell.edu with your key
- type: module load rstudio
- type: rstudio_run
- Follow the output.
For example, let's say you are put on c0002: open a new terminal window on your laptop and type: ssh -J your_cwid@cayuga-login1 -NL [port#]:localhost:[port#] your_cwid@c0002
B. If not connected to VPN: ***Option B will only work if you previously had an account on the Greenberg cluster(aphrodite/pascal)***
- Once the forwarding is setup (your above terminal window on your laptop will appear like it is hanging), bring up a browser on your local box with: http://localhost:[port#]
- login with your cwid and password that was provided upon running rstudio_run
- When done using RStudio, terminate the job by:
* Exit the RStudio session ("power" button in the top right corner of the RStudio window)
* Issue the following command on the login node: scancel -f [your_job_id]
1. Setup ~/.ssh/config on your laptop:
    Host cayuga-login1
        Hostname cayuga-login1.cac.cornell.edu
        IdentityFile ~/.ssh/your_cayuga_key
        User [your_cwid]

    Host c00*
        IdentityFile ~/.ssh/[your_key_name]
        User [your_cwid]
        ProxyCommand ssh -i .ssh/[your_key_name] -W %h:%p cayuga-login1
2. Login to one of the login nodes:
ssh -i ~/.ssh/[your_key_name] [your_cwid]@cayuga-login1
3. From your terminal window on cayuga-login1, type:
module load anaconda3
srun --pty -n1 --mem=8G -p scu-cpu /bin/bash -i
4. Be sure you know which compute node you were placed on; type:
hostname
5. Let's say you were put on c0001. In the same terminal window (which is now on c0001 in our example), type:
jupyter notebook --no-browser --ip 0.0.0.0 --port=8962
NOTE: you will not receive a prompt back in this terminal window
6. Back on your laptop terminal window, type:
ssh -NL 127.0.0.1:8962:c0001:8962 c0001
NOTE: replace the port number you used in step 5 and the node name you were placed on from step 4; each appears in two places on the above line. Again - you will not receive a prompt back in your terminal window on your laptop.
7. In a browser on your laptop, copy and paste the last line of the output from step 5. It should look something like:
http://127.0.0.1:8962/?token=dd21318d568114149b7b169fad09466fc8683b5b1773fd0e
Set up your working environment for each software package using the module command. The module command will activate dependent modules if there are any.
- Show all available modules:
module avail
- Show currently loaded modules:
module list
- Load a module:
module load [software_name/version] (as shown in the output of module avail)
- Unload a module:
module unload [software_name/version]
It is possible to create your own personal modulefiles to support the software and settings you use.
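A minimal sketch, assuming the cluster's module system is Lmod (commonly provided by OpenHPC 2); the directory, tool name, and install path are illustrative:
```bash
mkdir -p ~/modulefiles/mytool
cat > ~/modulefiles/mytool/1.0.lua <<'EOF'
-- minimal Lmod modulefile: put a user-installed tool on PATH (path is illustrative)
prepend_path("PATH", "/athena/cayuga_0001/scratch/jhs3001/mytool/1.0/bin")
EOF
module use ~/modulefiles    # make personal modulefiles visible to the module command
module load mytool/1.0
```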
Software can generally be installed system-wide when:
- It's a simple package install from a standard repository (Rocky, EPEL, OpenHPC-2) with no likely conflicts. Try dnf search as a quick check for package availability (see the example after this list).
- It's required for licensing reasons, subject to additional direct approval by the cluster owner (potentially only the license infrastructure will be installed).
- It can't be installed by the mechanisms above, OR it is version-stable and used by >1 in your group.
- Submit requests by sending email to: [email protected]. Please include 'Cayuga' in the subject area.
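As a quick sketch of checking package availability with dnf (the package name is illustrative):
```bash
dnf search hdf5        # search the configured repositories for matching packages
dnf info hdf5-devel    # show details for a specific package, if available
```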
The default shell for all users is bash. If you would like a different shell because some of your tools have feature requirements, you may submit a request to have your shell changed; you will *not* be able to use chsh to change it yourself. To change your default login shell (on all CAC clusters), send a request to [email protected] asking for a login shell change on the cayuga cluster.
- Head node. Don't run jobs on the head node, as it can make things unresponsive for other users or in the worst case take down the whole cluster. It's ok to compile code, scan files, and do other minor administrative tasks on the head node though.
- Threads. If you have a multithreaded job, you might want to limit the number of threads to something like 1 or 2 per core reserved, or reserve one core per thread or two. The simplest way to do this is usually to use the '-c' option with a value that is double the number of threads (e.g. if you want 4 cores/task, use '-c 8'). Slurm sees each core as being 2 CPUs due to hyperthreading; however, your program might not use hyperthreading well. Many multithreaded programs default to the number of CPUs they see on the system and are not aware of scheduled resources. We are now forcing CPU affinity, so jobs with too many threads should no longer interfere with other jobs (but might hurt their own performance). See the sketch after this list. Ref: More information re. threads
- SIMD jobs / job arrays. If you are running many instances of the same job with different data or settings, please don't just launch tons of separate jobs in a loop. Use a job array: they are easier to monitor, manage, and cancel. Be sure to set a slot limit (% notation) to avoid flooding the queue and to allow others to use the resources too; as a rule of thumb, you should be using less than 25% of any in-demand resource such as GPUs. It's good for there to always be some idle resources in case someone needs to test something quickly using a small amount of resources. Also, please run such jobs at lower priority (using qos or nice) when possible, or otherwise communicate directly with other users in case there's a problem. See the ics-research/ics-cluster-script-examples repository on COECIS Github and the sketch after this list for examples.
- Job priority / nice. Please follow the guidelines for QOS level, and use nice as applicable. Keep in mind that if everyone runs at the highest priority all the time, the priority levels will become useless. See above.
- Interactive use.
- Interactive use with multiple windows.
- Code development/testing/debugging.
- Long-running jobs / checkpoint-and-resume.
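Tying the threads and job-array guidance above together, a hedged sbatch sketch (the partition, resource values, and program are illustrative):
```bash
#!/bin/bash
#SBATCH --partition=scu-cpu
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8      # 4 physical cores; Slurm counts 2 CPUs per core with hyperthreading
#SBATCH --mem=16G
#SBATCH --time=04:00:00
#SBATCH --array=0-99%10        # 100 array tasks, at most 10 running at once (slot limit)

export OMP_NUM_THREADS=4       # limit threads to the cores you reserved
./my_program --input data_${SLURM_ARRAY_TASK_ID}.txt   # my_program and its inputs are hypothetical
```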
- Submit questions or requests by sending email to: [email protected]. Please include Cayuga in the subject area.
- Access to Cayuga requires ssh keys. Step by Step Instructions