Ginsburg HPC Cluster

Access

To gain access to the cluster, first request approval by emailing your UNI and the account name (Ocean Climate Physics: Abernathey) to [email protected], cc'ing Ryan.

Log In

Log in to the secure LDEO VPN with your LDEO username and password. From your terminal, SSH into the submit node: ssh -X <UNI>@ginsburg.rcs.columbia.edu ("-X" enables X11 forwarding). Enter your Columbia password and you will land in your home directory, /burg/home/<UNI>.

Alternatively, you can define an SSH host alias that makes connecting to the host more convenient. In your local .ssh/config file, add the following:

Host gins
        HostName ginsburg.rcs.columbia.edu
        ForwardAgent yes
        User <UNI>
        ServerAliveInterval 60

ServerAliveInterval helps keep your SSH connection alive if your internet connection is unsteady.

Now logging in to Ginsburg is shorter:

ssh -X gins

Note: Logging in directly to the burg filesystem with ssh -X <UNI>@burg.rcs.columbia.edu gives you more RAM to work with. This route is useful if you are working with virtual Python environments on the burg filesystem.
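
If you keep conda environments on burg, a minimal sketch of setting one up under your home directory might look like the following (the environment name and path are illustrative placeholders, not from the official instructions):

module load anaconda
conda create --prefix /burg/home/<UNI>/envs/myenv python=3.8   # "myenv" is a placeholder name
source activate /burg/home/<UNI>/envs/myenv                    # activate the environment by path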

Start a Job with Slurm

Ginsburg uses Slurm to manage the cluster workload. A batch script is used to allocate resources and execute your job. You must specify an account when you're ready to submit a job to the cluster.

| Account | Full Name |
| --- | --- |
| abernathey | Ocean Climate Physics: Abernathey |

The example submit script below runs the job Date, which prints the time and date to an output file (slurm-####.out) written in your current directory. The #### corresponds to the job ID assigned by Slurm.

#!/bin/bash -l
#
# Replace ACCOUNT with your account name before submitting.
#
#SBATCH --account=ACCOUNT        # Replace ACCOUNT with your group account name
#SBATCH --job-name=Date          # The job name
#SBATCH -N 1                     # The number of nodes to request
#SBATCH -c 1                     # The number of cpu cores to use (up to 32 cores per server)
#SBATCH --time=0-0:10            # The time the job will take to run in D-HH:MM
#SBATCH --mem-per-cpu=5G         # The memory the job will use per cpu core

# Run program
date
 
# End of script

Once saved, scripts like this one can be submitted to the cluster:

$ sbatch date.sh

To view your jobs on the system:

$ squeue -u <UNI>

To view information about a job:

$ scontrol show job [job ID]

To cancel a job:

$ scancel [job ID]
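
squeue only lists pending and running jobs. For jobs that have already finished, Slurm's accounting command can be used (a standard Slurm command, shown here as a convenience rather than part of the official Ginsburg instructions):

$ sacct -u <UNI>        # list today's jobs for your user, including completed ones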

Python

An example script using anaconda:

#!/bin/sh
#
#SBATCH --account=ACCOUNT         # Replace ACCOUNT with your group account name
#SBATCH --job-name=HelloWorld     # The job name.
#SBATCH -c 1                      # The number of cpu cores to use
#SBATCH -t 0-0:30                 # Runtime in D-HH:MM
#SBATCH --mem-per-cpu=5gb         # The memory the job will use per cpu core
 
module load anaconda
 
#Command to execute Python program
python example.py
 
#End of script

Save the script as helloworld.sh and submit it:

$ sbatch helloworld.sh
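
The example.py program itself is not included in this guide; as a minimal placeholder, assuming you only want to confirm that the Python environment works, you could create one like this before submitting:

cat > example.py <<'EOF'
# Minimal placeholder program: print a greeting and the Python version
import sys
print("Hello, world from Python", sys.version.split()[0])
EOF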

GPU

OCP owns 4 GPU servers with priority access and the ability to run jobs of up to 5 days. If the OCP GPU servers are not available, Slurm will allocate non-OCP GPU nodes.

Specify the OCP GPU partition in your submit script:

#SBATCH --partition=ocp_gpu       # Request ocp_gpu nodes first. If none are available, the scheduler will request non-OCP gpu nodes.
#SBATCH --gres=gpu:1              # Request 1 gpu (Up to 2 gpus per GPU node)

If you want to take advantage of the GPUs with TensorFlow, make sure you install the correct release:

conda install tensorflow-gpu
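
Putting the directives together, a sketch of a complete GPU batch script might look like the following (the TensorFlow check assumes a TensorFlow 2.x install from the command above; replace it with your own program):

#!/bin/sh
#
#SBATCH --account=ACCOUNT         # Replace ACCOUNT with your group account name
#SBATCH --job-name=GPUTest        # The job name
#SBATCH --partition=ocp_gpu       # Request ocp_gpu nodes first
#SBATCH --gres=gpu:1              # Request 1 gpu
#SBATCH -c 1                      # The number of cpu cores to use
#SBATCH -t 0-1:00                 # Runtime in D-HH:MM
#SBATCH --mem-per-cpu=5gb         # The memory the job will use per cpu core

module load anaconda

# Report whether TensorFlow can see the GPU
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

# End of script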

Some useful modules

| Name | Version | Module |
| --- | --- | --- |
| Anaconda (Python 3.8.5) | 2020.11 | module load anaconda/3-2020.11 |
| Anaconda (Python 2.7.16) | 2019.10 | module load anaconda/2-2019.10 |
| netcdf/gcc | 4.7.4 | netcdf/gcc/64/gcc/64/4.7.4 |

To see all available modules already installed on the cluster, run module avail from the command line. A few more module commands are listed below.
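
These are standard environment-modules subcommands, shown here as a quick reference rather than Ginsburg-specific documentation:

module list                 # show modules currently loaded in your session
module load anaconda        # load a module
module unload anaconda      # unload a module
module purge                # unload all loaded modules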

Jupyter Notebooks

Jupyter notebooks run on a semi-public port that is accessible to other users logged in to a submit node on Ginsburg. It is therefore strongly recommended to set up a password using the following steps:

  1. Load the anaconda Python module:
$ module load anaconda
  2. Initialize your Jupyter environment:
$ jupyter notebook --generate-config
  3. Start a python or ipython session:
$ ipython
  4. Generate a hashed password for web authentication. After you enter a unique password, a hashed password of the form type:salt:hashed-password will be displayed. Copy it.
from notebook.auth import passwd; passwd()
  5. Paste the hashed password into ~/.jupyter/jupyter_notebook_config.py on the line starting with c.NotebookApp.password =, and make sure that line is uncommented (see the sketch below).
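
For illustration only (the hash is a placeholder, not a real value), the relevant line in ~/.jupyter/jupyter_notebook_config.py should end up looking roughly like this:

c.NotebookApp.password = '<type>:<salt>:<hashed-password>'   # paste your own hash here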

Run a Jupyter Notebook

Start an interactive job. This example uses a time limit of one hour.

$ srun --pty -t 0-01:00 -A <ACCOUNT> /bin/bash
$ unset XDG_RUNTIME_DIR                           # get rid of the XDG_RUNTIME_DIR environment variable
$ module load anaconda                            # load anaconda
$ hostname -i                                     # This will print the IP of your interactive job node
$ jupyter notebook --no-browser --ip=<IP>         # Start the jupyter notebook with your node IP

SSH port forwarding

At this point, the port number of your notebook should be displayed. Open another connection to Ginsburg that forwards a local port to the remote node and port. For example:

$ ssh -L 8080:10.43.4.206:8888 <UNI>@ginsburg.rcs.columbia.edu

where 8888 is the remote port number, 8080 is the local port, and 10.43.4.206 is the IP of the interactive job node.

To see the notebook, navigate to a browser on your desktop and go to localhost:8080.
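
If you set up the gins host alias described earlier, the same forwarding can go through it (the ports and IP are the example values from above):

$ ssh -L 8080:10.43.4.206:8888 gins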

Where do I save my files?

Ginsburg has a shared storage server named "burg".

| Location | Storage | What to save |
| --- | --- | --- |
| /burg/home/ | 50 GB | small files, documents, source code, and scripts |
| /burg/abernathey | 1 TB | large data files (NOT BACKED UP!) |

Transfer data to the cluster using SCP:

$ scp MyDataFile <UNI>@motion.rcs.columbia.edu:<DESTINATION_PATH>
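
rsync also works if you prefer it for larger or resumable transfers (not part of the official instructions above; the flags shown are common defaults):

$ rsync -avP MyDataFile <UNI>@motion.rcs.columbia.edu:<DESTINATION_PATH>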

You can access data on Habanero by navigating to the /rigel directory on Ginsburg's transfer or login node.

Still need help?