In SFIM, we use the NIH Biowulf cluster for most of our data analysis. Biowulf is a GNU/Linux parallel processing system designed and built at the NIH that can run large numbers of simultaneous jobs with high memory and processing requirements. This document collates resources for using Biowulf and HPC. Much of it is aimed at setting up your Biowulf account to suit our typical workflows.
- Setting up Biowulf
- Using Biowulf
Sometimes Biowulf and Python can have issues. HPC has a fairly comprehensive guide here that you can read, which covers the default versions of Python installed on Biowulf and common pitfalls. They have also created a guide on how to use conda and mamba to manage environments on Biowulf. Several steps are needed to do this correctly, so we have additional step-by-step instructions for setting up conda on Biowulf.
Note that conda can introduce a naming conflict (`dbus`) on Biowulf. This conflict can cause NoMachine to fail intermittently. The first thing to try is removing the lines that conda inserted into your `.bashrc`, so that conda does not load by default. If that does not work, try removing the `dbus` package with `conda uninstall dbus`. Be aware that this can cause issues if you have packages that depend on it.
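To check whether conda is still being loaded from your startup file, a sketch like the following may help. It assumes the lines were written by `conda init`, which wraps them in `>>> conda initialize >>>` markers. The function name `has_conda_init` is ours and is not part of conda.

```shell
#!/usr/bin/env bash
# Report whether a shell startup file still contains the block written by
# `conda init` (delimited by "# >>> conda initialize >>>" markers).
# has_conda_init is a hypothetical helper name, not a conda command.
has_conda_init() {
    local rcfile="${1:-$HOME/.bashrc}"
    grep -q '>>> conda initialize >>>' "$rcfile" 2>/dev/null
}

if has_conda_init; then
    echo "conda init block found in ~/.bashrc: conda loads in every new shell"
else
    echo "no conda init block found in ~/.bashrc"
fi

# To see whether dbus is installed in the active environment
# (requires conda on your PATH):
#   conda list dbus
```

If the block is present, delete everything between the two marker lines (inclusive) and start a new session before testing NoMachine again.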
These are instructions for setting Biowulf drives to automount on MacOS. This allows you to see your Biowulf drives in Finder as though they are a local directory on your laptop.
To automatically mount Biowulf drives, create a script called `BiowulfAutoMount.scpt` in Script Editor that looks like this:
tell application "Finder"
mount volume "smb://hpcdrive.nih.gov/SFIM_100RUNS" as user name "USERNAME"
mount volume "smb://hpcdrive.nih.gov/NIMH_SFIM" as user name "USERNAME"
mount volume "smb://hpcdrive.nih.gov/USERNAME" as user name "USERNAME"
mount volume "smb://hpcdrive.nih.gov/SFIMLBC" as user name "USERNAME"
mount volume "smb://hpcdrive.nih.gov/SFIM" as user name "USERNAME"
mount volume "smb://hpcdrive.nih.gov/data" as user name "USERNAME"
end tell
Replace `USERNAME` with your NIH username. The `USERNAME` address maps to `/home/USERNAME` on Biowulf. The `/data` address maps to `/data/USERNAME`. All others map to `/data/DIRNAME`, where `DIRNAME` is the directory name. Repeat for as many directories as you need.
Then export it as an application in Script Editor.
When you run it, it will mount the directories.
You should be able to access them in MacOS under `/Volumes/` in Terminal.
Follow additional instructions here for MacOS in order to improve mount performance.
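As a quick check after running the script, a sketch like the following reports which shares are present. It assumes each share appears under `/Volumes/` with the same name as in its `smb://` address, which is typical but not guaranteed. `check_mounts` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Report which expected shares exist under a mount root.
# Returns non-zero if any share is missing.
# check_mounts is a hypothetical helper; the share names come from the script above.
check_mounts() {
    local root="$1"
    shift
    local missing=0 share
    for share in "$@"; do
        if [ -d "$root/$share" ]; then
            echo "mounted: $root/$share"
        else
            echo "missing: $root/$share"
            missing=1
        fi
    done
    return "$missing"
}

# Example: check three of the shares from the AppleScript.
check_mounts /Volumes SFIM NIMH_SFIM data \
    || echo "Some shares are missing; try running BiowulfAutoMount again."
```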
You may find it useful to create an SSH key to connect to Biowulf without having to type in your password every time. To do so, you can follow the instructions from HPC. The main difference here is that you need to create your key on your local laptop, then add the public key to your `~/.ssh/authorized_keys` file on Biowulf.
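The last step can be sketched as a small helper that appends a public key only if it is not already present. This is an illustration under stated assumptions, not the HPC-recommended procedure: the key type (`ed25519`), the file names, and the helper name `add_authorized_key` are all our choices.

```shell
#!/usr/bin/env bash
# On your laptop (assumed key type and file names):
#   ssh-keygen -t ed25519
#   scp ~/.ssh/id_ed25519.pub USERNAME@biowulf.nih.gov:~/laptop_key.pub
#
# On Biowulf, append the copied public key to authorized_keys without
# creating duplicate entries. add_authorized_key is a hypothetical helper.
add_authorized_key() {
    local pubkey_file="$1"
    local auth_file="${2:-$HOME/.ssh/authorized_keys}"
    local ssh_dir
    ssh_dir=$(dirname "$auth_file")

    # sshd ignores authorized_keys when permissions are too open.
    mkdir -p "$ssh_dir"
    chmod 700 "$ssh_dir"
    touch "$auth_file"
    chmod 600 "$auth_file"

    # -x: whole-line match, -F: fixed string, -f: read the pattern from a file
    if ! grep -qxFf "$pubkey_file" "$auth_file"; then
        cat "$pubkey_file" >> "$auth_file"
    fi
}

# Usage on Biowulf:
#   add_authorized_key ~/laptop_key.pub
```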
You will want an SSH key on Biowulf to connect with GitHub via the command line. You can follow the same instructions used for setting up your laptop.
If you are using Git on Biowulf, you might get some weird fatal errors when you start a new session. If this happens, try to restart the ssh-agent and re-add the key (instructions here from GitHub).
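Restarting the agent and re-adding the key usually comes down to two commands, wrapped here in a function for convenience. The key path `~/.ssh/id_ed25519` is an assumption, so substitute whichever key you registered with GitHub. `restart_agent_with_key` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Start a fresh ssh-agent in the current shell and load one private key.
# Define this in your interactive shell (or source it) so that the
# exported agent variables persist after the function returns.
restart_agent_with_key() {
    local key_file="${1:-$HOME/.ssh/id_ed25519}"   # assumed default key path

    # ssh-agent -s prints shell commands that export SSH_AUTH_SOCK and
    # SSH_AGENT_PID; eval applies them to this shell.
    eval "$(ssh-agent -s)" > /dev/null

    # Load the key; you are prompted here if it has a passphrase.
    ssh-add "$key_file"
}

# Usage:
#   restart_agent_with_key ~/.ssh/id_ed25519
#   ssh -T git@github.com    # confirms that GitHub accepts the key
```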
The HPC team has put together a large number of helpful tutorials, including a user guide for common tasks and commands. The following may be particularly useful as you get started:
- Connect to NIH HPC systems on your Mac or via NoMachine
  - You can connect to Biowulf through the Terminal using `ssh`, but NoMachine may be necessary if you are using graphical applications.
- Using Jupyter Notebooks on Biowulf
  - If you want to run a Jupyter Notebook that connects to a compute node, you must create an SSH tunnel. This requires a few specific steps outlined in the HPC documentation.
- NIMH-specific resources
  - Specifically, more information about `spersist` sessions. These can be useful for setting up SSH tunneling when you want to have a longer session.
- Swarm guide
  - Swarm simplifies submitting a group of commands to the batch system on Biowulf.
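As an illustration of the Swarm workflow, a swarm file is a text file with one command per line. The sketch below builds one for a list of subjects. The script name `process_subject.sh`, the subject IDs, and the resource values are placeholders, and the Swarm guide remains the reference for the available options.

```shell
#!/usr/bin/env bash
# Write one command per subject into a swarm file.
# make_swarm_file and process_subject.sh are hypothetical names.
make_swarm_file() {
    local swarm_file="$1"
    shift
    : > "$swarm_file"                      # start from an empty file
    local subj
    for subj in "$@"; do
        echo "bash process_subject.sh $subj" >> "$swarm_file"
    done
}

# Usage on Biowulf (example resource values):
#   make_swarm_file preproc.swarm sub-01 sub-02 sub-03
#   swarm -f preproc.swarm -g 8 -t 4 --module afni
# -g requests GB of memory per command, -t requests threads per command,
# and --module loads the named module before each command runs.
```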
In addition to the HPC user guide and tutorials, the Data Sharing and Science Team has helpfully created additional Biowulf resources. Key tools include the ability to store an environment in an `spersist` node on the cluster and the ability to easily run BIDS and fMRIPrep validation.
When using Jupyterlab, you need to create two SSH tunnels. First, open a terminal and connect to Biowulf:
ssh biowulf.nih.gov
Next, create a tmux session:
module load tmux
tmux new
Now, you can create an spersist session. The command below will also start a VNC server, which is useful if you're using graphically demanding applications (ex. AFNI), but it's not necessary. CPUs and memory can also be edited to suit your needs. The important thing here is that there are two --tunnel flags, which will allow you to connect to Jupyterlab:
spersist --tunnel --tunnel --cpus-per-task=16 --mem=32g --vnc
Copy the SSH command it gives you, then open a new terminal and paste it. It will look something like this:
ssh -L 00000:localhost:00000 -L 11111:localhost:11111 -L 22222:localhost:22222 [email protected]
Make sure you save this command, as you will need to input it whenever you lose connection.
Back in the tmux session, cd to the directory you will be working in and activate your Jupyterlab environment. Then, execute this command, replacing ${PORT1} with the first port in your SSH command. In this example, it is 00000.
jupyter-lab --port ${PORT1} --ip localhost --no-browser
Paste the URL it gives you into your browser and bookmark it. Now you can use Jupyterlab!
Close the tmux window by pressing the 'X' button – do not type 'exit' or it will end the session.
If you ever want to reopen your tmux session, run:
module load tmux
tmux ls
`tmux ls` lists your sessions along with their session numbers. Use the number to reopen your session:
tmux attach -t <session number>
To exit your session, you can simply run:
exit
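Since the JupyterLab command needs the first port from the saved SSH command, you may prefer to extract it rather than copy it by hand. This is a small sketch. `first_port` is a hypothetical helper, and the SSH command shown is a made-up example in the same shape as the one `spersist` prints.

```shell
#!/usr/bin/env bash
# Print the first local port from an "ssh -L port:localhost:port ..." command.
# first_port is a hypothetical helper name.
first_port() {
    # grep -o emits each "-L <port>" match on its own line; keep the first.
    printf '%s\n' "$1" | grep -oE -- '-L [0-9]+' | head -n 1 | cut -d' ' -f2
}

# Example with placeholder ports and username:
SSH_CMD='ssh -L 00000:localhost:00000 -L 11111:localhost:11111 USERNAME@biowulf.nih.gov'
PORT1=$(first_port "$SSH_CMD")
echo "First port: $PORT1"

# Inside the tmux session you could then run:
#   jupyter-lab --port "$PORT1" --ip localhost --no-browser
```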
Biowulf has a module system that is sometimes useful for loading common programs that aren't loaded by default. You can use it to load either the newest version on Biowulf or an older version for a range of programs. Several modules that may be particularly useful for neuroimagers are
module load afni # usually kept up-to-date
module load R
module load git # default git is a very old version. This will load a more up-to-date version
module load fsl
module load fmriprep
module load mriqc
module load matlab
It's a bit messy, but some additional common programs are in `/data/NIMH_SFIM/CommonScripts` and some common anatomical parcellations are in `/data/NIMH_SFIM/CommonParcellations`. These might be out-of-date, but if you want to install a newer version there so that others can benefit, just ask around before updating.
If you're collaborating with others in a directory on Biowulf, you may need to change the permissions to allow others to read or write content. On Biowulf, each file or directory belongs to a group. That group should be `SFIM` or the name of the `/data/[group]` directory. New files sometimes have an individual's user ID as their group, which means others won't be able to see them. `chgrp -R SFIM directory` will change the group for `directory` and all of the files inside. You can then adjust group access with `chmod -R 2770 directory`. The leading `2` sets the setgid bit, which means that new files within the directory will (theoretically) inherit the same group. The next three digits define owner, group, and world permissions: 7 is read/write/execute, 4 is read only, and 0 is no permissions. You can also consider adding `umask 007` to your `~/.bashrc` on Biowulf, which will give all members of a directory's group write access to new files and subdirectories automatically.
Here are some useful options:
chmod -R 2770 directory # Owner and group can read/write/execute
chmod -R 2700 directory # Only owner can read/write/execute
chmod -R 2740 directory # Owner can read/write/execute and group members can read, but not alter files
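To see how the setgid bit and `umask 007` interact, you can try them on a throwaway directory. The sketch below uses a temporary directory rather than real data and relies on GNU `stat -c`, which is available on Linux systems such as Biowulf. `demo_shared_modes` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Create a temporary setgid directory, add a subdirectory and a file under
# umask 007, and print the three resulting octal modes on one line.
demo_shared_modes() {
    local demo_dir shared
    demo_dir=$(mktemp -d)
    shared="$demo_dir/shared"

    mkdir "$shared"
    chmod 2770 "$shared"            # setgid + rwx for owner and group

    (
        umask 007                   # only affects this subshell
        mkdir "$shared/subdir"      # new subdirectory
        touch "$shared/notes.txt"   # new file
    )

    printf '%s %s %s\n' \
        "$(stat -c '%a' "$shared")" \
        "$(stat -c '%a' "$shared/subdir")" \
        "$(stat -c '%a' "$shared/notes.txt")"

    rm -rf "$demo_dir"
}

demo_shared_modes
```

On a Linux filesystem this should print `2770 2770 660`: the subdirectory picks up the setgid bit from its parent, and `umask 007` leaves group write enabled while removing all access for everyone else.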
Unless you are sharing files that do not contain personally identifiable information (PII) with an out-group Biowulf user, never make files world-readable on Biowulf. If you have an ongoing collaboration with someone outside of SFIM, request a new group where you can define who has access. For example, `SFIMLBC` has been used for some collaborations between SFIM and other LBC lab members.
For more details and guidance from the HPC team, see their documentation here.
Every cloned GitHub repo has a named owner. That means that if multiple people are working on the same code in the same Biowulf directory, it will be impossible to see who made which edit. Having multiple people work on code in one location can be convenient if one person is making most of the edits with some paired coding support from someone else. If this is the case, make sure to set `chmod` permissions for the code, including the `.git` directory, so that they both have write access.
If a repo has more than one active contributor, storing the codebase in a single directory can cause problems. In this case, it is better for each person to have their own clone of the repo where they edit and commit changes to GitHub while specifying one location for where code is run from. Concretely, this might look like:
SFIM_shared_dir/ # shared directory on Biowulf
├── data # store data here
├── personA # person A's clone of a git repo
│ └── .git
└── personB # person B's clone of the git repo
└── .git
If files in a shared directory (particularly those in the `.git` directory) have their group changed so that they are associated with a single user rather than the group, you may get an error where Git does not recognize that a repo exists. If this happens, try changing the group of the directory. You may have to run the command specifically on the `.git` directory if your earlier `chgrp` did not touch hidden directories (for example, because it was run on a glob such as `directory/*`, which skips names that start with a dot).
chgrp -R SFIM directory/.git
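Before or after running `chgrp`, it can help to list exactly which files are outside the expected group. The sketch below uses the standard `find` test `-group`. `list_wrong_group` is a hypothetical helper, and `SFIM` and `directory` are the placeholder names used above.

```shell
#!/usr/bin/env bash
# Print every file and directory under DIR whose group is not GROUP.
# find descends into hidden directories such as .git by default.
# list_wrong_group is a hypothetical helper name.
list_wrong_group() {
    local group="$1" dir="$2"
    find "$dir" ! -group "$group" -print
}

# Usage:
#   list_wrong_group SFIM directory
# No output means everything under the directory, including .git,
# already belongs to the group.
```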
If you aren't sure what the best structure for your project is, talk to the Scientific Programmer at the beginning of your project to get things set up.
To interact with GitHub repos when you're connected to a compute node in VSCode via the Remote Developer Extension, you must add the following to `~/.ssh/config` on Biowulf, replacing `$username` with your Biowulf username. Note that for this to work, the repository must be cloned using `https`, not `ssh`.
Host github.com
User git
ProxyCommand /usr/bin/ssh -o ForwardAgent=yes $username@biowulf.nih.gov nc -w 120ms %h %p
For additional information on using VSCode, check out our VSCode Guide.