Home
This wiki documents the steps required to run code for the Short Baseline Neutrino (SBN) program on the Polaris supercomputer at Argonne National Laboratory.
SBN-related code is built on the LArSoft framework. The system libraries required to build and run LArSoft and related packages are provided using a Scientific Linux 7 container. Pre-compiled versions of LArSoft and experiment-specific software are downloaded from manifests available at https://scisoft.fnal.gov/.
Once LArSoft is installed, it becomes possible to load experiment-specific software via ups in the same way as on Fermilab virtual machines, e.g.,
source ${LARSOFT_ROOT}/setup
setup sbndcode v09_75_03_02 -q e20:prof
Disk resources required to run the code are divided into two filesystems available on Polaris: eagle and grand. The grand filesystem contains compiled code and input files, while eagle is used for outputs and transfers.
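For example, a common pattern (the per-user subdirectory and the my-production name below are only illustrative) is to keep software on grand and send job outputs to your project area on eagle:
OUTPUT_DIR=/lus/eagle/projects/neutrinoGPU/${USER}/my-production
mkdir -p "${OUTPUT_DIR}"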
- Request a user account on Polaris with access to the neutrinoGPU project.
- Once logged in, create a local Conda environment and install parsl:
  module use /soft/modulefiles
  module load conda; conda activate
  conda create -n sbn python=3.10
  conda activate sbn
  conda install -y -c conda-forge ndcctools
  pip install parsl
- Clone the sbnd_parsl repository to your home directory. Modify the entry_point.py program to adjust the list of .fcl files, change submission configuration options, etc.
- Submit jobs by running the entry_point.py program, e.g.
  python sbnd_parsl/entry_point.py -o /lus/eagle/projects/neutrinoGPU/my-production
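After submitting, you can confirm that parsl has placed jobs in the queue with:
qstat -u $(whoami)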
The pullProducts script handles downloading and extracting tarballs of pre-compiled Fermilab software. These software distributions can then be loaded via UPS. As an example, we can download the SBND software distribution and its dependencies into our project's larsoft directory via:
./pullProducts /grand/neutrinoGPU/software/larsoft/ slf7 sbnd-v09_78_00 e20 prof
The argument sbnd-v09_78_00 is a software bundle provided by SciSoft at Fermilab. A list of available bundles can be found at https://scisoft.fnal.gov/scisoft/bundles/.
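To check that a distribution unpacked correctly, you can source the products area's setup script (from inside the Scientific Linux 7 container described below) and ask UPS what it sees; a minimal sketch, assuming the larsoft directory used above:
source /grand/neutrinoGPU/software/larsoft/setup
ups list -aK+ sbndcode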
You can test software within an interactive job. To begin an interactive job, create a script called interactive_job.sh with the following contents and run it:
#!/bin/sh
# Start an interactive job
ALLOCATION="neutrinoGPU"
FILESYSTEM="home:grand:eagle"
qsub -I -l select=1 -l walltime=0:45:00 -q debug \
-A "${ALLOCATION}" -l filesystems="${FILESYSTEM}"
Once a slot on the debug queue becomes available, you will be automatically connected to a prompt within the interactive job.
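Once inside the interactive job, you can confirm which node you landed on and that its GPUs are visible, for example:
cat $PBS_NODEFILE   # nodes assigned to this job
nvidia-smi          # GPUs available on the node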
The following script executes a single .fcl file by setting up LArSoft in a singularity container:
#!/bin/bash
# Start singularity and run a .fcl file; intended to be run from inside an interactive job
LARSOFT_DIR="/grand/neutrinoGPU/software/larsoft"
SOFTWARE="sbndcode"
VERSION="v09_78_00"
# SOFTWARE="icaruscode"
# VERSION="v09_78_04"
QUAL="e20:prof"
ALLOCATION="neutrinoGPU"
FILESYSTEM="home:grand:eagle"
CONTAINER="/grand/neutrinoGPU/software/slf7.sif"
module use /soft/spack/gcc/0.6.1/install/modulefiles/Core
module load apptainer
singularity run -B /lus/eagle/ -B /lus/grand/ ${CONTAINER} << EOF
source ${LARSOFT_DIR}/setup
setup ${SOFTWARE} ${VERSION} -q ${QUAL}
lar -c ${@}
EOF
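If the script above is saved as, say, run_fcl.sh (the name is arbitrary), it can be run from inside the interactive job with the .fcl file and any extra lar options as arguments:
chmod +x run_fcl.sh
./run_fcl.sh my_config.fcl -n 5   # substitute the .fcl you actually want to run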
With a properly configured conda environment, you can submit your jobs from the login nodes by running the parsl workflows as regular Python programs, e.g.,
~: python workflow.py
The specific options of your job submission can be defined within your workflow program. The sbnd_parsl code provides some functions for configuring Parsl to run on Polaris.
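Note that the parsl process coordinates the workflow for as long as it runs, so it generally needs to stay alive on the login node. One option (a hedged suggestion, not part of sbnd_parsl itself) is to detach it from your terminal:
nohup python workflow.py > workflow.log 2>&1 &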
The main resource for job-related information can be found here: https://docs.alcf.anl.gov/polaris/running-jobs/. Ideally, you will be able to test your code using the debug queue, which allows you to submit to one node at a time. Once your code works on the debug queue, the other queues, debug-scaling and prod, may be used for larger-scale productions.
Jobs with mis-configured resource requests, e.g., a debug queue job with a walltime larger than 1 hour or more than 2 requested nodes, will not run. Consult the link above for the list of appropriate resource requests. Note that the prod queue is a routing queue: your job will be automatically assigned to a specific small, medium, or large queue depending on the resources requested.
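For example, a batch submission to the prod routing queue might look like the following; the script name run_production.sh and the node count are placeholders, and the current per-queue node and walltime limits should be checked against the ALCF page above:
qsub -q prod -A neutrinoGPU -l select=10 -l walltime=1:00:00 \
    -l filesystems=home:grand:eagle run_production.sh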
- The program pbsq is installed at /grand/neutrinoGPU/software/pbsq. It produces more readable output about job status and can be invoked with pbsq -f neutrinoGPU.
- Once your job is running you can ssh into the worker node. Get the node with qstat -u $(whoami) or via pbsq; it should start with "x". Once connected, you can check the memory usage and other metrics with, e.g., cat /proc/meminfo.
- Individual job history can be checked with qstat -xf <jobid>.
- You can log in to Polaris once via an ssh tunnel, and allow future ssh connections to connect without requiring authentication. Place the function in your computer's local .bashrc or .zshrc file:
  connect_polaris () {
      # macOS (BSD-based ps)
      # s=$(ps -Ao user,pid,%cpu,%mem,vsz,rss,tt,stat,start,time,command \
      #     | grep $(whoami) | sed -e 's/sshd//g' | grep ssh | grep fNT | grep polaris)
      # Unix
      s=$(ps -aux | grep $(whoami) | sed -e 's/sshd//g' | grep ssh | grep fNT | grep polaris)
      if [ -z "$s" ]; then
          echo "Opening background connection to Polaris"
          ssh -fNTY "$@" ${USER}@polaris.alcf.anl.gov
      else
          ssh -Y "$@" ${USER}@polaris.alcf.anl.gov
      fi
  }
- If parsl ends immediately with exit status 0 or crashes, it is usually a job queue issue. The first scenario usually means parsl has put jobs into the queue and exited, while the second could mean there are outstanding held jobs that should be manually removed with jobsub_rm.
- To get additional UPS products that are not listed in a manifest, you can instead use a local manifest with pullProducts. Start by downloading a manifest for a software distribution by passing the -M flag:
  ./pullProducts -M /grand/neutrinoGPU/software/larsoft/ slf7 icarus-v09_78_04 e20 prof
  This will create a file called icarus-09.78.04-Linux64bit+3.10-2.17-e20-prof_MANIFEST.txt. You can modify the file to include additional products. Below, we add specific versions of larbatch and icarus_data requested by icaruscode, which are not listed in the manifest provided by SciSoft:
  icarus_signal_processing v09_78_04 icarus_signal_processing-09.78.04-slf7-x86_64-e20-prof.tar.bz2
  icarusalg v09_78_04 icarusalg-09.78.04-slf7-x86_64-e20-prof.tar.bz2
  icaruscode v09_78_04 icaruscode-09.78.04-slf7-x86_64-e20-prof.tar.bz2
  icarusutil v09_75_00 icarusutil-09.75.00-slf7-x86_64-e20-prof.tar.bz2
  larbatch v01_58_00 larbatch-01.58.00-noarch.tar.bz2 -f NULL
  icarus_data v09_79_02 icarus_data-09.79.02-noarch.tar.bz2 -f NULL
  You can now re-run the pullProducts command with the -l flag to have the script use the local manifest instead. Note that the file name is automatically deduced based on the final three arguments, so do not modify the file name of the downloaded manifest.
  ./pullProducts -l /grand/neutrinoGPU/software/larsoft/ slf7 icarus-v09_78_04 e20 prof
- Part of running CORSIKA requires copying database files. The default method for copying is to use the IFDH tool provided by Fermilab, but this has issues on Polaris. Adding the line
  physics.producers.corsika.ShowerCopyType: "DIRECT"
  to the .fcl file responsible for running CORSIKA suppresses the IFDH copy and uses the system default instead.
- The sbndata package is not listed in the sbnd distribution manifest provided by SciSoft, but it is needed to produce CAF files with flux weights.
- Worker nodes can't write to your home directory, so make sure your job outputs are being sent to the eagle or grand filesystems.
- Both the login nodes and the worker nodes must use the same version of parsl. The parsl version on the worker nodes is chosen based on the worker_init line in the setup of the provider class (sbnd_parsl/utils.py). Specific versions of parsl can be installed in your Python environment on the login nodes via, e.g., pip install --force-reinstall -v "parsl==2023.10.04".
- By default, the tar command used within pullProducts will apply your user's umask to the uncompressed files. This means directories created by this script will typically have the permissions drwxr-xr-x, i.e., they will not be modifiable by other members of your project. To change this, set a umask that does not mask out the group-write bit, i.e., umask 002. Below is the snippet from pullProducts where this can be added (line 117):
  113 if [ ! -e ${working_dir}/${mytar} ]; then
  114   echo "ERROR: could not find ${working_dir}/${mytar}" 1>&2
  115   exit 1
  116 fi
  117 umask 002; tar -C "${product_topdir}" -x -f "${mytar}" || \
  118   { status=$?
  119     cat 1>&2 <<EOF
  120 ERROR: untar of ${working_dir/${mytar} failed
  121 EOF
  If you don't do this, you can still use chmod -R g+w ... on the installation directory, but it will take a very long time!
This is a summary of the steps required to build and run a .cpp file that depends on libtorch. If you want to run on the CPU, you can perform these steps in your home directory; for access to a GPU, use an interactive job on a compute node.
Additionally, to use the hpcviewer GUI you must pass the -X argument when connecting via ssh.
module use /soft/modulefiles
module load conda
conda activate
module swap gcc/12.2.0 gcc/11.2.0
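Before building, you can check that the conda-provided PyTorch is visible and, when on a compute node, that it can see a GPU:
python3 -c 'import torch; print(torch.__version__, torch.cuda.is_available())'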
Now that the environment is set up, we need to create our build/directory structure. The following is just one such example and should be modified to meet the needs of the user.
mkdir example-app
cd example-app
Make an example-app.cpp file of the following form:
#include <torch/torch.h>
#include <iostream>
// If running on CPU, change torch::kCUDA to torch::kCPU
torch::Device device(torch::kCUDA);
int main() {
torch::Tensor tensor = torch::rand({2,3}).to(device);
std::cout << tensor << std::endl;
}
Make a CMakeLists.txt file of the following form:
cmake_minimum_required(VERSION 3.5 FATAL_ERROR)
project(example-app)
find_package(Torch REQUIRED)
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${TORCH_CXX_FLAGS}")
add_executable(example-app example-app.cpp)
target_link_libraries(example-app "${TORCH_LIBRARIES}")
set_property(TARGET example-app PROPERTY CXX_STANDARD 17)
Then create and move to a build directory
mkdir build
cd build
Configure the CMake project and build it with make. Note the difference between the single quotes (apostrophes) and the grave accents (backticks) in this line:
cmake -DCMAKE_PREFIX_PATH=`python3 -c 'import torch;print(torch.utils.cmake_prefix_path)'` ..
make
You can now run the executable:
./example-app
To apply HPCToolkit to the code:
module use /soft/modulefiles
module load hpctoolkit
hpcrun -t ./example-app
hpcstruct OUTPUT_FOLDER_OF_HPCRUN
hpcprof OUTPUT_FOLDER_OF_HPCRUN
This process gives you an additional folder labeled as a database. One can use the hpcviewer GUI to view and analyze the profile. This should be done from a login node instead of a compute node.
hpcviewer OUTPUT_FOLDER_OF_HPCPROF
In order to utilize the spack builds of sbndcode present on Polaris, one has to follow a few short steps.
cd /grand/neutrinoGPU/software/fermi-spack-Jul7/
. spack/share/spack/setup-env.sh
spack env activate sbndcode-09_90_00_env
spack load sbndcode
After this, one can use lar commands. In order to utilize GPU resources (for example via dnnroi inference), one must be on a node with GPU resources (i.e., not a login node). For example, after entering a debug node, one can use the standard workflow fcl files in /grand/neutrinoGPU/software/fermi-spack-Jul7/gpu_workflow_fcls, which have been modified to run dnnroi with GPU resources.
As an example, using an already-produced Geant4 output in /grand/neutrinoGPU/software/fermi-spack-Jul7/Outputs, one can run:
lar -c /grand/neutrinoGPU/software/fermi-spack-Jul7/gpu_workflow_fcls/standard_detsim.fcl -s Outputs/test_g4.root -o Outputs/example_detsim.root
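While a GPU-enabled stage is running, you can confirm from a second shell on the same compute node that the GPU is actually being used, e.g.:
nvidia-smi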