Student Cluster Competition - Tutorial 3
- Checklist
- Managing Your Environment
- Install Lmod
- Running the High Performance LINPACK (HPL) Benchmark on Your Compute Node
- Building and Compiling OpenBLAS and OpenMPI Libraries from Source
- Intel oneAPI Toolkits and Compiler Suite
- LINPACK Theoretical Peak Performance
- Spinning Up a Second Compute Node Using a Snapshot
- HPC Challenge
- Application Benchmarks and System Evaluation
Tutorial 3 demonstrates environment variable manipulation through the use of modules and the compilation and optimization of HPC benchmark software. This introduces the reader to the concepts of environment management and workspace sanity, as well as compilation of various software applications on Linux.
In this tutorial, you will also be spinning up and connecting a second compute node in order to further extend the capabilities of your small VM cluster. More importantly, you will be given detailed specifics on exactly how to go about running application benchmarks across multiple nodes.
In this tutorial you will:
- Understand the importance of having a consistent environment across your cluster.
- Understand the difference between system software and user-installed software (local to the user's home directory).
- Install, configure and use Lmod.
- Understand some of the fundamental considerations around optimizing HPL.
- Understand the pros and cons of compiling libraries from source.
- Install and make use of Intel's oneAPI framework to run HPL.
- Understand theoretical system peak performance.
- Appreciate the significance of the Top500 list and benchmarking.
- Stand up and configure a second compute node, and run applications across a cluster.
- Download and compile the High Performance Computing Challenge (HPCC) benchmark.
- Understand that scientific computing applications are primarily used to conduct scientific research, but can also be used to evaluate system performance.
One of the most central and fundamental problems that you and your team will need to tackle is managing your environment. When you run an application, a number of locations are searched to determine which binary to execute.
For example, if you wanted to know "which" GNU C Compiler your VMs are using by default:
# If you completed the HPL exercise in tutorial 1, you would have installed
# a system-wide GCC on your head node, and can expect an output of /usr/bin/gcc
which gcc
We see that for this particular system, the gcc that will be invoked by default is located in the directory /usr/bin/.
Tip
Try the above command on your compute node, and you should notice that no gcc is found in $PATH...
You will recall that you were required to configure an NFS-mounted home directory. This means that any software that you install into your /home/<USER_DIRECTORY> on your head node will also automatically be available on your compute nodes.
In order for this to work as expected, there are two important conditions that must be satisfied:
- Firstly, you must ensure that your PATH variable is correctly configured on your head node, and that it has a corresponding configuration on your compute node(s); a sketch of how to persist such a configuration is shown after this list. For example, to see the colon (:) separated list of directories that are searched whenever you execute a binary:
echo $PATH
- Secondly, you must ensure that any system dependencies are correctly installed on each of your nodes. For example, this would be a good time to install gcc on your compute node:
sudo dnf install gcc
Instructions for APT (Ubuntu) and Pacman (Arch)
# APT
sudo apt install gcc
# Pacman
sudo pacman -S gcc
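For example, here is a minimal sketch of persisting a PATH change (assuming a hypothetical user-local install prefix of $HOME/opt, which is not something installed earlier in this tutorial), so that every new shell on every node sharing your NFS-mounted home directory picks it up:
# Append the export to ~/.bashrc so new login shells on all nodes see it.
# $HOME/opt/bin is only an example of a user-local install location.
echo 'export PATH=$HOME/opt/bin:$PATH' >> ~/.bashrc
# Re-read the file in the current shell and confirm the result
source ~/.bashrc
echo $PATH
which gcc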
Important
Software on one node will not automatically be installed across all nodes. For example, if you want to monitor the system performance of your head node, you must install and/or run top (or htop or btop) on your head node. Similarly, if you want to do this for your compute node(s), you must install and run the application on all of your compute node(s).
In the next few sections you will be installing and deploying Lmod across your cluster. You will be configuring, building and compiling Lmod from a directory in your /home/<USER> directory. This means that the Lmod binary will be available across your cluster; however, in order to run Lmod on your compute nodes, you must ensure that all Lua system dependencies are installed across all your nodes.
Environment Modules provide a convenient way to dynamically change a user's environment through modulefiles, to simplify software and library use when there are multiple versions of a particular software package (e.g. Python 2.7 and Python 3.x) installed on the system. Environment Module parameters typically involve, among other things, modifying the PATH environment variable for locating a particular package (such as dynamically changing the path to Python from /usr/local/bin/python2.7 to /usr/local/bin/python3).
In this section, you are going to be building and compiling Lmod from source. Lmod is a Lua-based environment module tool for users to easily manipulate their HPC software environment and is used on thousands of HPC systems around the world. Carefully follow these instructions, as there are prerequisites and dependencies that are required to build Lmod, which are slightly different to those required to execute the Lmod binary.
Important
You can build Lmod on either your head node or one of your compute nodes. Since your compute node(s) will, generally speaking, have more CPUs for compute, they will typically be able to build and compile applications much faster than your administrative (or login) head node.
-
Install prerequisites required to build Lmod: From one of your compute nodes, install the following dependencies
- DNF / YUM
# Rocky (or similar RPM based systems: RHEL, Alma, CentOS Stream)
sudo dnf install -y epel-release
sudo dnf install -y git gcc make
- APT
# Ubuntu (or similar APT based systems)
sudo apt update
sudo apt install -y git gcc make
- Pacman
# Arch
sudo pacman -S git gcc make
-
Install dependencies for running and using Lmod: You will need to install these on all the nodes you intend to use with Lmod
- DNF / YUM
# Rocky (or similar RPM based systems: RHEL, Alma, CentOS Stream)
sudo dnf install -y epel-release
sudo dnf install -y tcl-devel tcl tcllib bc
sudo dnf install -y lua lua-posix lua-term
sudo dnf --enablerepo=devel install lua-devel
- APT
# Ubuntu (or similar APT based systems)
sudo apt update
sudo apt install -y tcl tcl-dev lua5.3 lua-posix bc
- Pacman
# Arch
sudo pacman -S lua lua-filesystem lua-posix bc
-
Compile, Build and Install Lmod
The following instructions will be the same regardless of the system you are using. You will be using the Lmod repo from the Texas Advanced Computing Center at the University of Texas to build and compile Lmod from source into your own home directory:
# Clone the repository
git clone https://github.com/TACC/Lmod.git
# Navigate into the Lmod directory
cd Lmod
# Run the configuration script and install into your home directory
./configure --prefix=$HOME/lmod
# Build and install Lmod
make -j$(nproc)
make install
# If you are on Rocky 9.3 and receive an error about "lua.h",
# you can install Lmod from the DNF / YUM package repos.
--prefix: This directive instructs the ./configure command to install Lmod into a specific directory, where the $HOME variable is used as a shortcut for /home/<user>.
-j$(nproc): This directive instructs the make command to build and compile using the maximum number of processors on the system.
Tip
You and your team are STRONGLY encouraged to review and make sure you understand the Compile, Build and Installation instructions for Lmod as these steps will apply to virtually all application benchmarks you will encounter in this competition.
With Lmod installed, you'll now have some new commands on the terminal, namely module <subcommand>. The important ones for you to know and use are module avail, module list, module load and module unload. These commands do the following:
Command | Operation
---|---
module avail | Lists all modules that are available to the user.
module list | Lists all modules that are loaded by the user.
module load <module_name> | Loads a module to the user's environment.
module unload <module_name> | Removes a loaded module from the user's environment.
Lmod also features a shortcut command ml which can perform all of the above commands:
Command | Operation
---|---
ml | Same as module list
ml avail | Same as module avail
ml <module_name> | Same as module load <module_name>
ml -<module_name> | Same as module unload <module_name>
ml foo | Same as module load foo
ml foo -bar | Same as module load foo and module unload bar
Note
Some installed packages will automatically add environment modules to the Lmod system, while others will not and will require you to manually add definitions for them. For example, the Intel oneAPI Toolkits package that we will install later in this tutorial ships configuration scripts that add module files to the system for loading via Lmod.
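If you ever need to add a definition by hand, the sketch below writes a minimal Lua modulefile for a hypothetical package installed under $HOME/opt/myapp (the package name, version and paths are placeholders, not something installed in this tutorial) and registers the directory with Lmod:
# Create a directory for your personal modulefiles and a minimal Lua modulefile
mkdir -p $HOME/modulefiles/myapp
cat > $HOME/modulefiles/myapp/1.0.lua << 'EOF'
-- Hypothetical example: adjust the prefix to your actual installation
local prefix = pathJoin(os.getenv("HOME"), "opt/myapp")
prepend_path("PATH", pathJoin(prefix, "bin"))
prepend_path("LD_LIBRARY_PATH", pathJoin(prefix, "lib"))
whatis("Example user-local package")
EOF
# Tell Lmod about your personal modulefile directory, then load the module
module use $HOME/modulefiles
module avail
module load myapp/1.0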
The High Performance LINPACK (HPL) benchmark is used to measure a system's floating point number processing power. The resulting score (in Floating Point Operations Per Second, or FLOPS for short) is often used to roughly quantify the computational power of an HPC system. HPL requires math libraries to perform its floating point operations as it does not include these by itself and it also requires an MPI installation for communication in order to execute in parallel across multiple CPU cores (and hosts).
A library is a collection of pre-compiled code that provides functionality to other software. This allows the re-use of common code (such as math operations) and simplifies software development. You get two types of libraries on Linux: static and dynamic libraries.
-
Static Libraries
Static libraries are embedded into the binary that you create when you compile your software. In essence, the library that exists on your computer is copied into the executable that gets created at compilation time. This means that the resulting program binary is self-contained and can operate on multiple systems without them needing the libraries installed first. Static libraries are normally files that end with the .a extension, for "archive".
Advantages are that the program can potentially be faster, as it has direct access to the required libraries without having to query the operating system first, but disadvantages include a larger file size and the fact that updating the library requires recompiling (and relinking) the software.
-
Dynamic Libraries
Dynamic libraries are loaded into a compiled program at runtime, meaning that the library that the program needs is not embedded into the executable binary at compilation time. Dynamic libraries are files that normally end with the .so extension, for "shared object".
Advantages are that the file size can be much smaller and the application doesn't need to be recompiled (relinked) when using a different version of the library (as long as there weren't fundamental changes in the library). However, it requires the library to be installed and made available to the program on the operating system. A quick way of inspecting a binary's dynamic libraries is sketched after this list.
Note
Applications (such as HPL) can be configured to use static or dynamic libraries for their math and MPI communication, as mentioned above.
-
Message Passing Interface (MPI)
MPI is a message-passing standard used for parallel software communication. It allows for software to send messages between multiple processes. These processes could be on the local computer (think multiple cores of a CPU or multiple CPUs) as well as on networked computers. MPI is a cornerstone of HPC. There are many implementations of MPI in software such as OpenMPI, MPICH, MVAPICH2 and so forth. To find out more about MPI, please read the following: https://www.linuxtoday.com/blog/mpi-in-thirty-minutes.html
-
Basic Linear Algebra Subprograms Libraries
Basic Linear Algebra Subprograms (BLAS) libraries provide low-level routines for performing common linear algebra operations such as vector and matrix multiplication. These libraries are highly optimized for performance on various hardware architectures and are a fundamental building block in many numerical computing applications.
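To see the difference in practice, you can list the shared objects a dynamically linked binary will load at runtime with ldd; statically linked code will not appear there because it is already embedded in the executable. A quick sketch (the binary and archive paths are just examples):
# List the shared objects (.so files) that a dynamically linked binary depends on
ldd /usr/bin/gcc
# Static archives, by contrast, are plain .a files on disk; ar lists the
# object files they bundle (point this at any .a archive you actually have)
ar t /usr/lib64/libatlas.a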
We need to install the statically ($(LIBdir)/libhpl.a) and dynamically ($(LAdir)/libtatlas.so, $(LAdir)/libsatlas.so) linked libraries that HPL expects to have, as well as the software for MPI. The MPI implementation we're going to use here is OpenMPI, and we will use the Automatically Tuned Linear Algebra Software (ATLAS) math library.
Important
Remember that since the MPI and BLAS libraries are dynamically linked, you need to ensure that ALL of the nodes that you expect to run HPL on have the expected MPI and BLAS libraries installed.
If you've managed to successfully build, compile and run HPL in tutorial 1, and you've managed to successfully configure your NFS home directory export in tutorial 2, then you may proceed. Otherwise, you must discuss with, and seek advice from, an instructor.
-
Install the necessary dependencies on your compute node:
- DNF / YUM
# RHEL, Rocky, Alma, CentOS
sudo dnf update -y
sudo dnf install openmpi atlas openmpi-devel atlas-devel -y
sudo dnf install wget nano -y
- APT
# Ubuntu
sudo apt update
sudo apt install openmpi-bin libopenmpi-dev libatlas-base-dev
sudo apt install wget nano
- Pacman
# Arch
sudo pacman -Syu
sudo pacman -S base-devel openmpi atlas-lapack nano wget
-
Configuring and Tuning HPL
The HPL.dat file (in the same directory as xhpl) defines how the HPL benchmark solves a large, dense system of linear equations in double precision floating point. Selecting appropriate parameters in this file can have a considerable effect on the GFLOPS score that you obtain. The most important parameters are:
- N defines the length of one side of the square 2D array to be solved. It follows that:
  - "Problem Size" and "Memory Usage" are proportional to N x N,
  - "Runtime" is proportional to N x N x N.
We can observe that if you were to double N, your run would use four times as much memory and take roughly eight times as long. If you tripled N, your run would use nine times as much memory. If you made N ten times larger, your run would use a hundred times more memory and take roughly a thousand times as long to run.
- NB defines the block (or chunk) size into which the array is divided. The optimal value is determined by the CPU architecture, such that a block fits in cache. For best performance, N should be a multiple of NB.
- P x Q define the domains (in two dimensions) into which the array is partitioned on a distributed memory system. Therefore P x Q should typically equate more or less to the number of MPI ranks, number of nodes, or number of NUMA domains. For example, if you have 4 single CPU nodes, the permutations for P and Q include [1, 4] and [2, 2]. Similarly, if you have 4 dual socket nodes, the permutations for P and Q include [1, 8], [2, 4], etc.
-
Prepare your environment to rerun your xhpl binary on your compute node
Make sure to open an additional ssh session to your compute node, so that you can monitor your CPU utilization using top (preferably btop or htop).
# Export the path to the OpenMPI library
export PATH=/usr/lib64/openmpi/bin:$PATH
# Edit your HPL.dat file with the following changes
cd ~/hpl/bin/<TEAM_NAME>
nano HPL.dat
-
Make the following changes to your HPL.dat file:
22000        Ns
164          NBs
-
Finally, rerun xhpl and record your GFLOPS score:
./xhpl
Tip
You can find online calculators that will generate an HPL.dat file for you as a starting point, but you will still need to do some tuning if you want to squeeze out maximum performance.
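One rough rule of thumb (our assumption, not an official HPL requirement) is to size N so that the N x N matrix of 8-byte doubles fills roughly 80% of the memory available to the run, i.e. N ≈ sqrt(0.8 × memory_in_bytes / 8), rounded down to a multiple of NB. A bash sketch of that estimate:
# Rough starting point for N: use ~80% of total memory for the N x N matrix
# of 8-byte doubles, then round down to a multiple of NB.
NB=164
MEM_KB=$(awk '/MemTotal/ {print $2}' /proc/meminfo)
N=$(echo "sqrt(0.8 * $MEM_KB * 1024 / 8)" | bc)
N=$(( (N / NB) * NB ))
echo "Suggested starting values: Ns = $N, NBs = $NB"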
Caution
Compiling your entire application stack and tool-chains from source can provide a tremendous performance improvement. However, compiling applications from source can take a very long time and can also be a very tricky process. For this reason, the compilation of gcc itself is omitted from the competition.
You are advised to skip this section if you have fallen behind the pace recommended by the course coordinators. Skipping this section will NOT stop you from completing the remainder of the tutorials.
You now have a functioning HPL benchmark. However, using math libraries (BLAS, LAPACK, ATLAS) from a repository (dnf) will not yield optimal performance, because these repositories contain generic code compiled to work on all x86 hardware. If you were monitoring your compute node during the execution of xhpl, you would have noticed that the OpenMPI and ATLAS configurations restricted HPL to running with no more than two OpenMP threads.
Code compiled specifically for HPC hardware can use instruction sets like AVX, AVX2 and AVX512 (if available) to make better use of the CPU. A (much) higher HPL result is possible if you compile your math library (such as ATLAS, GotoBLAS, OpenBLAS or Intel MKL) from source on the hardware you intend to run the code on.
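Before picking compiler flags, it is worth confirming which of these extensions your compute node's CPU actually advertises; a quick check with lscpu:
# Show the CPU model
lscpu | grep 'Model name'
# List the vector extensions advertised by the CPU (e.g. sse4_2, avx, avx2, avx512f)
lscpu | grep -o -E 'avx512[a-z]*|avx2|avx|sse4_2' | sort -u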
-
Install dependencies
# DNF / YUM (RHEL, Rocky, Alma, CentOS Stream)
sudo dnf group install "Development Tools"
sudo dnf install gfortran git gcc wget
# APT (Ubuntu)
sudo apt install build-essential hwloc libhwloc-dev libevent-dev gfortran wget
# Pacman (Arch)
sudo pacman -S base-devel gfortran git gcc wget
-
Fetch and Compile OpenBLAS Source Files
# Fetch the source files from the GitHub repository
git clone https://github.com/xianyi/OpenBLAS.git
cd OpenBLAS
# Tested against version 0.3.26; you can try and build the `develop` branch
git checkout v0.3.26
# You can adjust the PREFIX to install to your preferred directory
make
make PREFIX=$HOME/opt/openblas install
-
Fetch, Unpack and Compile OpenMPI Source Files
# Fetch and unpack the source files
wget https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.4.tar.gz
tar xf openmpi-4.1.4.tar.gz
cd openmpi-4.1.4
# Pay careful attention to tuning options here, and ensure they correspond
# to your compute node's processor.
#
# If you are unsure, you can replace the `cascadelake` architecture option
# with `native`, however you are expected to determine your compute node's
# architecture using `lscpu` or similar tools.
#
# Once again you can adjust the --prefix to install to your preferred path.
CFLAGS="-Ofast -march=cascadelake -mtune=cascadelake" ./configure --prefix=$HOME/opt/openmpi
# Use the maximum number of threads to compile the application
make -j$(nproc)
make install
-
Compile and Configure HPL
# Copy the Makefile `Make.<TEAM_NAME>` that you'd previously prepared
# and customize it to utilize the OpenBLAS and OpenMPI libraries that
# you have just compiled.
cd ~/hpl
cp Make.<TEAM_NAME> Make.compile_BLAS_MPI
nano Make.compile_BLAS_MPI
-
Edit the platform identifier (architecture), MPI and BLAS paths, and add compiler optimization flags:
ARCH         = compile_BLAS_MPI
MPdir        = $(HOME)/opt/openmpi
MPinc        = -I$(MPdir)/include
MPlib        = $(MPdir)/lib/libmpi.so
LAdir        = $(HOME)/opt/openblas
LAinc        =
LAlib        = $(LAdir)/lib/libopenblas.a
CC           = mpicc
CCFLAGS      = $(HPL_DEFS) -O3 -march=cascadelake -mtune=cascadelake -fopenmp -fomit-frame-pointer -funroll-loops -W -Wall
LDFLAGS      = -O3 -fopenmp
LINKER       = $(CC)
-
You can now compile your new HPL:
# You will also need to temporarily export the following environment
# variables to make OpenMPI available on the system.
export MPI_HOME=$HOME/opt/openmpi
export PATH=$MPI_HOME/bin:$PATH
export LD_LIBRARY_PATH=$MPI_HOME/lib:$LD_LIBRARY_PATH
# Remember that if you make a mistake and need to recompile, first run
# make clean arch=compile_BLAS_MPI
make arch=compile_BLAS_MPI
-
Edit HPL to take advantage of your custom compiled MPI and math libraries
Verify that an xhpl executable binary was in fact produced, and configure your HPL.dat file with reference to the Official HPL Tuning Guide:
cd bin/compile_BLAS_MPI
# As a starting point when running HPL on a single node with a single CPU,
# try setting Ps = 1 and Qs = 1, and Ns = 21976 and NBs = 164
nano HPL.dat
-
Finally, you can run your xhpl binary with custom compiled libraries.
# There is no need to explicitly use `mpirun`, nor do you have to specify
# the number of cores by exporting `OMP_NUM_THREADS`.
# This is because OpenBLAS is multi-threaded by default.
./xhpl
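Before running, it may be worth confirming that the binary really resolves MPI from your home-directory build rather than from a system copy. OpenBLAS will not show up in this listing because it was linked statically (libopenblas.a). A quick check:
# With LD_LIBRARY_PATH still pointing at $HOME/opt/openmpi/lib,
# libmpi.so should resolve to your custom build, not /usr/lib64.
ldd ./xhpl | grep -i mpi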
Tip
Remember to open a new ssh session to your compute node and run top (preferably htop or btop). Better yet, if you are running tmux in your current session, open a new tmux window using C-b c, then ssh to your compute node from there; you can cycle between the two tmux windows using C-b n.
Join the Discussion by replying to the thread with a screenshot of your compute node's CPU threads hard at work.
Intel oneAPI Toolkits provide a comprehensive suite of development tools that span various programming models and architectures. These toolkits help developers optimize their applications for performance across CPUs, GPUs, FPGAs, and other accelerators. Visit Intel oneAPI Toolkits for more information.
Caution
The Intel oneAPI Base and HPC Toolkits can provide considerable improvements to your benchmark results. However, they can be tricky to install and configure. You are advised to skip this section if you have fallen behind the pace recommended by the course coordinators. Skipping this section will NOT stop you from completing the remainder of the tutorials.
You will need to install and configure Intel's oneAPI Base Toolkit which includes Intel's optimized Math Kernel Libraries and Intel's C/C++ Compilers. Additionally, you will also need to install Intel's HPC Toolkit which extends the functionality of the oneAPI Base Toolkit and includes Intel's optimized FORTRAN and MPI Compilers.
You will be making use of the 2024-2 versions of the Intel oneAPI and HPC Toolkits.
-
Optionally, install the following prerequisites and dependencies to make use of Intel's VTune Profiler with a graphical user interface.
# DNF / YUM (RHEL, Rocky, Alma, CentOS Stream)
sudo dnf install libdrm gtk3 libnotify xdg-utils libxcb mesa-libgbm at-spi2-core
# APT (Ubuntu)
sudo apt install libdrm2 libgtk-3-0 libnotify4 xdg-utils libxcb-dri3-0 libgbm1 libatspi2.0-0
# Pacman (Arch)
sudo pacman -S libdrm gtk3 libnotify xdg-utils libxcb mesa-libgbm at-spi2-core
-
Download the offline installers into your HOME directory
-
Intel oneAPI Base Toolkit
wget https://registrationcenter-download.intel.com/akdlm/IRC_NAS/9a98af19-1c68-46ce-9fdd-e249240c7c42/l_BaseKit_p_2024.2.0.634_offline.sh
-
Intel oneAPI HPC Toolkit
wget https://registrationcenter-download.intel.com/akdlm/IRC_NAS/d4e49548-1492-45c9-b678-8268cb0f1b05/l_HPCKit_p_2024.2.0.635_offline.sh
-
-
Use chmod to make the scripts executable:
# First make the BaseKit installer executable
chmod +x l_BaseKit_p_2024.2.0.634_offline.sh
# Then make the HPCKit installer executable
chmod +x l_HPCKit_p_2024.2.0.635_offline.sh
-
Run the installation script using the following command line parameters:
- -a: list of arguments to follow.
- --cli: execute the installer on a command line interface.
- --eula accept: agree to accept the end user license agreement.
More details about the command line arguments can be found within Intel's installation guide.
These must be run separately and you will need to navigate through a number of CLI text prompts and accept the end-user license agreement.
# Run the Intel oneAPI BaseKit installation script
./l_BaseKit_p_2024.2.0.634_offline.sh -a --cli --eula accept
Should there be any missing dependencies, use your system's package manager to install them.
# Run the Intel oneAPI HPCKit installation script
./l_HPCKit_p_2024.2.0.635_offline.sh -a --cli --eula accept
-
Configure your Environment to use Intel oneAPI Toolkits
You can either use the setvars.sh configuration script or modulefiles:
- To have your environment automatically prepared for use with Intel's oneAPI Toolkits, append the following line to your /etc/profile, ~/.bashrc or ~/.profile, or run the command every time you log in to your node:
source ~/intel/oneapi/setvars.sh
- If you managed to successfully configure Lmod, then you can make use of the Intel oneAPI modulefiles setup script:
# Navigate to the location of your Intel oneAPI installation
cd ~/intel/oneapi/
# Execute the modulefiles setup script
./modulefiles-setup.sh
# Return to the top level of your $HOME directory and
# configure Lmod to make use of the newly created modules.
# Alternatively, this line can be appended to /etc/profile or your .bashrc
ml use $HOME/modulefiles
# Make sure the newly created modules are available to use and have been correctly configured
ml avail
Important
You will need to configure your environment each time you log in to a new shell, as is the case when you use mpirun over multiple nodes. You will be shown how to do this automatically when you run HPL over multiple nodes.
You have successfully installed the Intel oneAPI Base and HPC Toolkits, including Intel Compiler Suite and Math Kernel Libraries.
After you've successfully completed the previous section, you will be ready to recompile HPL with Intel's icx compiler and MKL math kernel libraries.
-
Copy and Edit the Make.Linux_Intel64
From your ~/hpl folder, with a properly configured environment, copy and edit the configuration:
# Copy a setup configuration script to use as a template
cp setup/Make.Linux_Intel64 ./
# Edit the configuration file to make use of your Intel oneAPI Toolkit
nano Make.Linux_Intel64
-
Configure your Make.Linux_Intel64
Ensure that you make the following changes and amendments:
CC       = mpiicx
OMP_DEFS = -qopenmp
CCFLAGS  = $(HPL_DEFS) -O3 -w -ansi-alias -z noexecstack -z relro -z now -Wall
-
Compile your HPL Binary using the Intel oneAPI Toolkit
make arch=Linux_Intel64
-
Reuse your HPL.dat from when you compiled OpenMPI and OpenBLAS from source.
Tip
Remember to use tmux to open a new tmux window with C-b c. You can cycle between the tmux windows using C-b n.
It is useful to know what the theoretical FLOPS performance (RPeak) of your hardware is when trying to obtain the highest benchmark result (RMax). RPeak can be derived from the formula:
RPeak [GFLOPS] = CPU Frequency [GHz] x Number of CPU Cores x FLOPs per cycle
Newer CPU architectures allow for 'wider' vector instruction sets which execute multiple floating point operations per CPU cycle. The table below shows the floating point operations per cycle of various instruction sets:
CPU Extension | Floating Point Operations per CPU Cycle
---|---
SSE4.2 | 4
AVX | 8
AVX2 | 16
AVX512 | 32
You can determine your CPU model as well as the instruction extensions supported on your compute node(s) with the command:
lscpu
For the model name, you should see something along the lines of "Intel Xeon Processor (Cascadelake)".
You can determine the maximum and base frequency of your CPU model on the Intel Ark website. Because HPL is a demanding workload, assume the CPU is operating at its base frequency and NOT the boost/turbo frequency. You should have everything you need to calculate the RPeak of your cluster. Typically an efficiency of at least 75% is considered adequate for Intel CPUs (RMax / RPeak > 0.75).
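As a worked example (the 2.1 GHz base frequency below is an assumption for a Cascade Lake class virtual CPU - check Intel Ark for your actual model): a 6-core compute node with AVX512 support gives RPeak = 2.1 GHz x 6 cores x 32 FLOPs/cycle ≈ 403 GFLOPS. The same arithmetic in the shell:
# Worked RPeak example: frequency (GHz) x cores x FLOPs per cycle.
# 2.1 GHz is an assumed base frequency - look up your actual CPU model on Intel Ark.
FREQ_GHZ=2.1
CORES=6
FLOPS_PER_CYCLE=32   # AVX512
echo "RPeak = $(echo "$FREQ_GHZ * $CORES * $FLOPS_PER_CYCLE" | bc) GFLOPS"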
The TOP500 list is a project that ranks and details the 500 most powerful supercomputers in the world. The ranking is based on the High-Performance Linpack (HPL) benchmark, which measures a system's floating point computing power.
-
Go to the Top500 List and compare your results.
Populate the following table by recording your Rmax from HPL results, and calculating your expected Rpeak value.
Rank | System | Threads | Rmax (GFlop/s) | Rpeak (GFlop/s)
---|---|---|---|---
1 | Frontier - HPE - United States | 8 699 904 | 1206 x 10^6 | 1714.81 x 10^6
2 | | | |
3 | | | |
 | Head node | 2 | |
 | Compute node using head node xhpl binary | | |
 | Compute node using custom compiled MPI and BLAS | | |
 | Compute node using Intel oneAPI Toolkits | | |
 | Across two compute nodes | | |
Important
You do NOT need to try and rank your VM's HPL performance. Cores and threads are used interchangeably in this context. Following the recommended configuration and guides, your head node has one CPU package with two compute cores (or threads). Continuing this same analogy, your compute node has one CPU with six cores (or threads).
At this point you are ready to run HPL on your cluster with two compute nodes. From your OpenStack workspace, navigate to Compute → Instances and create a snapshot of your compute node.
Launch a new instance as you did in Tutorial 1 and Tutorial 2, only this time you'll be using the snapshot that you have just created as the boot source.
Pay careful attention to the hostname, network and other configuration settings that may be specific to, and may conflict with, your initial node. Once your two compute nodes have been successfully deployed, are accessible from the head node and have been added to your MPI hosts file, you can continue with running HPL across multiple nodes.
Everything is now in place for you to run HPL across your two compute nodes. You must ensure that all libraries and dependencies are satisfied across your cluster. You must also ensure that your passwordless SSH is properly configured. Your NFS mounted /home directory must be properly configured.
-
Configuring OpenMPI Hosts File
You must configure a hosts (or machinefile) file which contains the IP addresses or hostnames of your compute nodes.
# The slots value indicates the number of processes to run on each node.
# Adjust this number based on the number of CPU cores available on each node.
compute01 slots=1
compute02 slots=1
-
Runtime and Environment Configuration Options for mpirun
Your compute nodes each have a single CPU with multiple OpenMP threads. It is critical that your environment is correctly configured for you to run HPL across your two compute nodes.
- Navigate to the directory where your HPL executable and HPL.dat file are located. Use mpirun to run HPL across the nodes specified in the hosts file.
- Edit your ~/.profile to set environment variables when mpirun creates a new shell.
mpirun
mpirun -np 2 --hostfile hosts ./xhpl
- Navigate to the directory where your HPL executable and
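Note that mpirun starts fresh, non-interactive shells on the remote nodes, so your custom library paths are not automatically inherited there. A sketch of two common approaches (assuming the $HOME/opt/openmpi prefix used earlier): export the variables in ~/.profile as described above, or forward them explicitly with OpenMPI's -x option:
# Option 1: make every new shell pick up the custom MPI (append once to ~/.profile)
echo 'export PATH=$HOME/opt/openmpi/bin:$PATH' >> ~/.profile
echo 'export LD_LIBRARY_PATH=$HOME/opt/openmpi/lib:$LD_LIBRARY_PATH' >> ~/.profile
# Option 2: forward the variables from the current shell to the remote ranks
mpirun -np 2 --hostfile hosts -x PATH -x LD_LIBRARY_PATH ./xhpl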
HPC Challenge (or HPCC) is a benchmark suite which contains 7 micro-benchmarks used to test various performance aspects of your cluster. HPCC includes HPL, which it uses to assess FLOPS performance. Having successfully compiled and executed HPL, setting up HPCC is fairly straightforward (it uses the same Makefile structure).
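Before walking through the steps individually, a consolidated sketch of the whole workflow might look like this (the tarball name and Makefile location are assumptions - check the HPCC download page and README; <TEAM_NAME> and <num_procs> are placeholders as elsewhere in this tutorial):
# Fetch the tarball linked from https://icl.utk.edu/hpcc/software/index.html
# (the version/filename below is an assumption - use the current release)
wget <hpcc-download-url>/hpcc-1.5.0.tar.gz
tar xf hpcc-1.5.0.tar.gz
cd hpcc-1.5.0
# Reuse the HPL Makefile you prepared earlier, placed where the HPCC README
# expects it (typically the hpl/ sub-directory), then build from the base directory
cp ~/hpl/Make.<TEAM_NAME> hpl/
make arch=<TEAM_NAME>
# Edit the input file (hpccinf.txt plays the role of HPL.dat), then run
nano hpccinf.txt
mpirun -np <num_procs> ./hpcc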
- Download HPCC from https://icl.utk.edu/hpcc/software/index.html
- Extract the file, then enter the hpcc/ sub-directory.
- Copy and modify the Makefile.<arch> as you did for the HPL benchmark.
- Compile HPCC from the base directory using make arch=<arch>.
- Edit the hpccinf.txt file
HPCC relies on the input parameter file hpccinf.txt (analogous to HPL.dat). Run HPCC as you did HPL.
- Prepare and format your output
Run the format.pl script to format your benchmark results into a readable form, and compare the HPL score with your standalone HPL result.
# You may need to install perl
./format.pl -w -f hpccoutf.txt
Have the output hpccoutf.txt and your Make.<architecture> ready for the instructors to view on request.
HPC applications are widely used in scientific research and systems evaluation or benchmarking to address complex computational problems. These applications span various fields, including computational chemistry, computational fluid dynamics, cosmology / astrophysics, quantum mechanics, weather forecasting, genomics, to name a few...
These applications are integral to advancing scientific research, enabling researchers to solve complex problems that are otherwise computationally prohibitive. They are also essential for evaluating and benchmarking the performance of high-performance computing systems, ensuring that they meet the demands of cutting-edge research and industrial applications.
You will now build, compile, install and run a few such examples.
GROMACS is a versatile package to perform molecular dynamics, i.e. simulate the Newtonian equations of motion for systems with hundreds to millions of particles. It is primarily designed for biochemical molecules like proteins, lipids and nucleic acids that have a lot of complicated bonded interactions, but since GROMACS is extremely fast at calculating the nonbonded interactions (that usually dominate simulations) many groups are also using it for research on non-biological systems, such as polymers.
Detailed installation instructions can be found at: http://manual.gromacs.org/current/install-guide/index.html, but here's a general installation overview:
-
Ensure you have an up-to-date
cmake
available on your system. -
You will also require a compiler such as the GNU
gcc
, Intelicc
or other, and MPI (OpenMPI, MPICH, Intel MPI or other) be installed on system. Your PATH & LD_LIBRARY_PATH environment variables should be set up to reflect this. -
Compile GROMACS with MPI support from source using
cmake
.
The benchmark (adh_cubic) should complete within a few minutes and has a small memory footprint, it is intended to demonstrate that your installation is working properly. The metric which will be used to assess your performance is the ns/day (number of nanoseconds the model is simulated for per day of computation), quoted at the end of the simulation output. Higher is better.
Ensure that your GROMACS /bin directory is exported to your PATH. You should be able to type gmx_mpi --version in your terminal and have the application information displayed correctly. The first task is to pre-process the input data into a usable format, using the grompp tool:
gmx_mpi grompp -f pme_verlet.mdp -c conf.gro -p topol.top -o md_0_1.tpr
#export PATH and LD_LIBRARY_PATH
mpirun gmx_mpi mdrun -nsteps 5000 -s md_0_1.tpr -g gromacs.log
Then execute the script from your head node, which will in turn launch the simulation using MPI and write output to the log file gromacs.log.
You may modify the mpirun command to optimise performance (significantly), but in order to produce a valid result the simulation must run for 5,000 steps, quoted in the output as:
"5000 steps, 10.0 ps."
Note
Please be able to present the instructors with the output of gmx_mpi --version. Also be able to present the instructors with your Slurm batch script and gromacs.log files for the adh_cubic benchmark.
LAMMPS (Large-scale Atomic/Molecular Massively Parallel Simulator) is a classical molecular dynamics simulation code designed for simulating particles in a variety of fields including materials science, chemistry, physics, and biology. It was originally developed at Sandia National Laboratories and is now maintained by a community of developers. LAMMPS runs on single processors or in parallel using message-passing techniques and a spatial-decomposition of the simulation domain.
The purpose of this benchmark is to demonstrate to you that there are often multiple ways to build and compile many applications.
-
Configure prerequisites and install dependencies
- DNF / YUM
# RHEL, Rocky, Alma, CentOS Stream
sudo dnf groupinstall 'Development Tools' -y
sudo dnf install cmake git -y
sudo dnf install fftw-devel libjpeg-devel libpng-devel libtiff-devel libX11-devel libXext-devel libXrender-devel -y
- APT
# Ubuntu
sudo apt install build-essential cmake git -y
sudo apt install libfftw3-dev libjpeg-dev libpng-dev libtiff-dev libx11-dev libxext-dev libxrender-dev -y
- Pacman
# Arch
sudo pacman -S base-devel cmake git
sudo pacman -S fftw libjpeg-turbo libpng libtiff libx11 libxext libxrender
-
Clone, build and compile LAMMPS with make
Building LAMMPS with traditional Makefiles requires that you have a Makefile.<machine> file appropriate for your system in the src folder.
# Ensure that the correct paths are exported
git clone -b stable https://github.com/lammps/lammps.git
cd lammps/src
# List the different make options
make
# Build a serial LAMMPS executable using GNU g++
# Remember to monitor top / htop / btop in another tmux pane
make serial
# Build a parallel LAMMPS executable with MPI
# If you were frustrated at how long the previous make build took,
# try to build and compile using the -j<num_threads> switch
make mpi
-
Copy the executable binaries to the benchmarks folder
cp lmp_serial ../bench
cp lmp_mpi ../bench
-
Execute the Lennard Jones benchmarks
cd ../bench
# Verify that only one thread is utilized
./lmp_serial -in in.lj
# Verify the number of OpenMP threads utilized
export OMP_NUM_THREADS=<num_threads>
mpirun -np <num_procs> lmp_mpi -in in.lj
-
Rerun your binaries against the Rhodopsin Structure benchmark
The Lennard Jones benchmark might be too short for a proper evaluation. These small benchmarks are often used as an installation validation test.
# Save the output for submission
./lmp_serial < in.rhodo > lmp_serial_rhodo.out
mpirun -np <num_procs> lmp_mpi -in in.rhodo > lmp_mpi_rhodo.out
Important
The following section is included here for illustrative purposes. If you feel that you are falling behind in the competition, you may read through this section without completing it. Limited instructions will be provided, and you will be required to make decisions in terms of your choice of compiler, MPI implementation and FFTW library. This will be good practice for what benchmarks might look like in the Nationals Round of the Student Cluster Competition.
- Build LAMMPS with GCC, OpenMP and OpenMPI using CMake
- In addition to a choice of gcc, MPI implementation and FFTW library, you'll also need to install cmake.
- If you're using the same checkout as before, you need to purge your src directory:
cd lammps/src
# Remove conflicting files from the previous build, uninstall all packages
# make no-all purge
- Configure the build with CMake, then compile and install
cmake ../cmake -D BUILD_MPI=on -D BUILD_OMP=on -D CMAKE_C_COMPILER=gcc -D CMAKE_CXX_COMPILER=g++ -D MPI_C_COMPILER=mpicc -D MPI_CXX_COMPILER=mpicxx
make -j$(nproc)
make DESTDIR=/<path-to-install-dir> install
- Rerun the benchmarks
export OMP_NUM_THREADS=<num_threads>
mpirun -np <num_procs> ./lmp -in <input_file>
IBM's Qiskit is an open-source Software Development Kit (SDK) for working with quantum computers at the level of circuits, pulses, and algorithms. It provides tools for creating and manipulating quantum programs and running them on prototype quantum devices on IBM Quantum Platform or on simulators on a local computer.
Qiskit-Aer is an extension to the Qiskit SDK for using high performance computing resources to simulate quantum computers and programs. It provides interfaces to run quantum circuits with or without noise using a number of various simulation methods. Qiskit-Aer supports leveraging MPI to improve the performance of simulation.
Quantum Volume (QV) is a single-number metric that can be measured using a concrete protocol on near-term quantum computers of modest size. The QV method quantifies the largest random circuit of equal width and depth that the computer successfully implements. Quantum computing systems with high-fidelity operations, high connectivity, large calibrated gate sets, and circuit rewriting tool chains are expected to have higher quantum volumes. Simply put, Quantum Volume is a single number meant to encapsulate the performance of today’s quantum computers, like a classical computer’s transistor count.
For this benchmark, we will be providing you with the details of the script that you will need to write yourself, or download from the competition GitHub repository, in order to successfully conduct the [Quantum Volume Experiment](https://qiskit.org/ecosystem/experiments/dev/manuals/verification/quantum_volume.html).
-
Configure and install dependencies
You will be using Python Pip (PyPI) to configure and install Qiskit. pip is the official tool for installing and using Python packages from various indexes.
- DNF / YUM
# RHEL, Rocky, Alma, CentOS Stream
sudo dnf install python python-pip
- APT
# Ubuntu
sudo apt install python python-pip
- Pacman
# Arch
sudo pacman -S python python-pip
-
Create and Activate a New Virtual Environment
Separate your python projects and ensure that they exist in their own, clean environments:
python -m venv QiskitAer
source QiskitAer/bin/activate
-
Install qiskit-aer
pip install qiskit-aer
-
Save the following in a Python script qv_experiment.py:
from qiskit import *
from qiskit.circuit.library import *
from qiskit_aer import *
import time
import numpy as np

def quant_vol(qubits=15, depth=10):
    sim = AerSimulator(method='statevector', device='CPU')
    circuit = QuantumVolume(qubits, depth, seed=0)
    circuit.measure_all()
    circuit = transpile(circuit, sim)
    start = time.time()
    result = sim.run(circuit, shots=1, seed_simulator=12345).result()
    time_val = time.time() - start
    # Optionally return and print result for debugging
    # Bonus marks available for reading the simulation time directly from `result`
    return time_val

# Call the experiment when the script is executed directly, so that
# `python qv_experiment.py` in the next step actually produces output
if __name__ == "__main__":
    print(f"Simulation wall time: {quant_vol()} s")
-
Parameterize the following variables for the QV experiment
These are used to generate the QV circuits and run them on a backend and on an ideal simulator:
- qubits: number or list of physical qubits to be simulated for the experiment,
- depth: the number of discrete time steps during which the circuit can run gates before the qubits decohere,
- shots: used for sampling statistics, the number of repetitions of each circuit.
-
Run the benchmark by executing the script you've just written:
$ python qv_experiment.py
-
Deactivate the Python virtualenv
deactivate