Add CUDA 12.1.1, CUDA samples, and CUDA related hooks and lmodrc changes #434

Merged
Changes from 6 commits
33 commits
fcc7ddb
Also recreated lmodrc when it was changed in a PR
Dec 19, 2023
2b09d1c
Modified lmodrc to add CUDA support. It now checks if you load the CU…
Dec 19, 2023
62e70ba
Adapt created_lmodrc.py for the new domain
Dec 19, 2023
045c099
Add post_sanitycheck hook for CUDA in order to only ship the files we…
Dec 19, 2023
4a4c6e7
Add (the redistributable part of) CUDA to the softare stack
Dec 19, 2023
0346b22
Add CUDA-Samples to the build list
Dec 19, 2023
5ec4c3b
Merge remote-tracking branch 'upstream/2023.06-software.eessi.io' int…
ocaisa Dec 20, 2023
5905e72
Tweak GPU support implementation
ocaisa Dec 20, 2023
73618a0
Add missing quotes on errors
ocaisa Dec 20, 2023
46727cb
Merge branch '2023.06-software.eessi.io' into cuda_cuda_samples_eessi_io
Dec 20, 2023
039921b
Merge branch 'cuda_cuda_samples_eessi_io' into cuda_cuda_samples_eess…
casparvl Dec 20, 2023
a4e8de7
Merge pull request #1 from ocaisa/cuda_cuda_samples_eessi_io
casparvl Dec 20, 2023
32925fe
Error messages now refer to the scripts that need to be run to instal…
Dec 20, 2023
94a2bfe
Merge branch 'cuda_cuda_samples_eessi_io' of github.com:casparvl/soft…
Dec 20, 2023
a33a0cd
make install_scripts a bit more verbose
boegel Dec 20, 2023
c7b380d
use separate easystack file for CUDA + control order in which easysta…
boegel Dec 20, 2023
f506566
copy EasyBuild log file in case CUDA installation failed in install_c…
boegel Dec 20, 2023
e3ddacc
add additional optional options required for handling NVIDIA support …
boegel Dec 20, 2023
16ddf7f
fix typo when passing --host-injections to container script
boegel Dec 20, 2023
35d6084
correctly pass --nv to singularity command
boegel Dec 20, 2023
fd97667
use quotes when adding --nv
boegel Dec 20, 2023
1917146
comment out running of link_nvidia_host_libraries.sh script, since it…
boegel Dec 20, 2023
f80f0fc
clean up post_sanitycheck_cuda hook and inject_gpu_property function …
boegel Dec 20, 2023
2d37842
remove empty line in eessi-2023.06-eb-4.8.2-2023a.yml
boegel Dec 20, 2023
f007c40
use easyconfigs PR 19451 for installing CUDA-Samples v12.1
boegel Dec 20, 2023
70fa0f9
Ship the scripts, and keep them in a single location
ocaisa Dec 20, 2023
db0c141
Update create_lmodrc.py
ocaisa Dec 21, 2023
293b107
Update create_tarball.sh
ocaisa Dec 21, 2023
73476b2
Only copy scripts if the contents differ
ocaisa Dec 21, 2023
a333a74
Remove temporary test directory
ocaisa Dec 21, 2023
43c73c0
Get rid of copy/paste unfriendly '.'
ocaisa Dec 21, 2023
3ec3df8
Update create_tarball.sh
ocaisa Dec 21, 2023
42e3404
always append to list of files to include in tarball, to avoid overwr…
boegel Dec 21, 2023
11 changes: 4 additions & 7 deletions EESSI-install-software.sh
@@ -189,19 +189,16 @@ pr_diff=$(ls [0-9]*.diff | head -1)

 # install any additional required scripts
 # order is important: these are needed to install a full CUDA SDK in host_injections
-install_scripts_changed=$(cat ${pr_diff} | grep '^+++' | cut -f2 -d' ' | sed 's@^[a-z]/@@g' | grep '^install_scripts.sh$' > /dev/null; echo $?)
-if [ ${install_scripts_changed} == '0' ]; then
-    # for now, this just reinstalls all scripts. Note the most elegant, but works
-    ${TOPDIR}/install_scripts.sh --prefix ${EESSI_CVMFS_REPO}
-fi
+# for now, this just reinstalls all scripts. Note the most elegant, but works
+${TOPDIR}/install_scripts.sh --prefix ${EESSI_PREFIX}

 # Install full CUDA SDK in host_injections
 # Hardcode this for now, see if it works
 # TODO: We should make a nice yaml and loop over all CUDA versions in that yaml to figure out what to install
-${EESSI_CVMFS_REPO}/gpu_support/nvidia/install_cuda_host_injections.sh 12.1.1
+${EESSI_PREFIX}/gpu_support/nvidia/install_cuda_host_injections.sh -c 12.1.1 --accept-cuda-eula

 # Install drivers in host_injections
-${EESSI_CVMFS_REPO}/gpu_support/nvidia/link_nvidia_host_libraries.sh
+${EESSI_PREFIX}/gpu_support/nvidia/link_nvidia_host_libraries.sh

 # use PR patch file to determine in which easystack files stuff was added
 for easystack_file in $(cat ${pr_diff} | grep '^+++' | cut -f2 -d' ' | sed 's@^[a-z]/@@g' | grep '^easystacks/.*yml$' | egrep -v 'known-issues|missing'); do
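
The `grep '^+++' | cut | sed` pipeline that this script uses to work out which files a PR touches can be tried in isolation. The following is a minimal sketch, assuming a unified diff with `+++ b/<path>` headers. The function name `changed_files` and the sample diff are hypothetical and not part of this PR.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list the paths touched by a unified diff, using the
# same grep/cut/sed pipeline as EESSI-install-software.sh.
changed_files() {
    local diff_file="$1"
    grep '^+++' "${diff_file}" | cut -f2 -d' ' | sed 's@^[a-z]/@@g'
}

# Made-up sample diff, for illustration only
sample_diff=$(mktemp)
cat > "${sample_diff}" << 'EOF'
--- a/install_scripts.sh
+++ b/install_scripts.sh
@@ -1 +1 @@
-old
+new
--- a/easystacks/example.yml
+++ b/easystacks/example.yml
@@ -1 +1 @@
-old
+new
EOF

# Print every path that appears in a '+++' header
changed_files "${sample_diff}"

# Equivalent of the removed install_scripts_changed check, testing grep's
# exit status directly instead of capturing $? in a variable
if changed_files "${sample_diff}" | grep -q '^install_scripts.sh$'; then
    echo "install_scripts.sh was changed"
fi

rm -f "${sample_diff}"
```

Testing the exit status of `grep -q` directly avoids the `> /dev/null; echo $?` round trip, although this PR drops the check altogether and always runs `install_scripts.sh`.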
@@ -35,7 +35,8 @@ easyconfigs:
 - Boost-1.82.0-GCC-12.3.0.eb
 - netCDF-4.9.2-gompi-2023a.eb
 - FFmpeg-6.0-GCCcore-12.3.0.eb
-- CUDA-Samples-12.1-GCC-12.3.0-CUDA-12.1.1.eb
+
+- CUDA-Samples-12.1-GCC-12.3.0-CUDA-12.1.1.eb:
 # use easyconfig that only install subset of CUDA samples,
 # to circumvent problem with nvcc linking to glibc of host OS;
 # see https://github.com/easybuilders/easybuild-easyconfigs/pull/19189
6 changes: 3 additions & 3 deletions gpu_support/nvidia/link_nvidia_host_libraries.sh
@@ -36,7 +36,7 @@ if [ ${#found_paths[@]} -gt 0 ]; then
     host_ldconfig=${found_paths[0]}
 else
     error="$command_name not found in PATH or only found in paths starting with $exclude_prefix."
-    fatal_error $error
+    fatal_error "$error"
 fi

 # Make sure EESSI is initialised (doesn't matter what version)
@@ -52,7 +52,7 @@ if $nvidia_smi_command > /dev/null; then
     host_cuda_version=$(nvidia-smi -q --display=COMPUTE | grep CUDA | awk 'NF>1{print $NF}')
 else
     error="Failed to successfully execute\n $nvidia_smi_command\n"
-    fatal_error $error
+    fatal_error "$error"
 fi

 # Let's make sure the driver libraries are not already in place
@@ -71,7 +71,7 @@ if [ -e "$host_injection_driver_version_file" ]; then
         rm $host_injection_driver_dir/*
         if [ $? -ne 0 ]; then
             error="Unable to remove files under '$host_injection_driver_dir'."
-            fatal_error $error
+            fatal_error "$error"
         fi
     fi
 fi
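
The quoting fix above matters because an unquoted `$error` is split on whitespace before the function is called. The following is a minimal sketch of that effect, assuming `fatal_error` only uses its first argument; its definition is not shown in this diff, so `show_first_arg` is a hypothetical stand-in.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for fatal_error (the real function is not part of
# this diff): it prints only its first positional argument.
show_first_arg() {
    printf '%s\n' "$1"
}

error="Unable to remove files under '/tmp/example dir'."

# Unquoted: the shell splits $error into words, so $1 is just "Unable"
show_first_arg $error

# Quoted: the whole message arrives as a single argument
show_first_arg "$error"
```

With the unquoted call, everything after the first word ends up in `$2`, `$3`, and so on, which a function that only reads `$1` silently drops.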
16 changes: 8 additions & 8 deletions install_scripts.sh
@@ -13,7 +13,7 @@ POSITIONAL_ARGS=()

 while [[ $# -gt 0 ]]; do
     case $1 in
-        -o|--prefix)
+        -p|--prefix)
             INSTALL_PREFIX="$2"
             shift 2
             ;;
@@ -38,25 +38,25 @@ set -- "${POSITIONAL_ARGS[@]}"
 TOPDIR=$(dirname $(realpath $0))

 # Subdirs for generic scripts
-SCRIPTS_DIR_SOURCE=${TOPDIR}/scripts/ # Source dir
-SCRIPTS_DIR_TARGET=${INSTALL_PREFIX}/scripts/ # Target dir
+SCRIPTS_DIR_SOURCE=${TOPDIR}/scripts # Source dir
+SCRIPTS_DIR_TARGET=${INSTALL_PREFIX}/scripts # Target dir

 # Create target dir
 mkdir -p ${SCRIPTS_DIR_TARGET}

 # Copy scripts into this prefix
 for file in utils.sh; do
-    cp ${SCRIPTS_DIR_SOURCE}/${file} ${SCRIPTS_DIR_TARGET}/${file}
+    cp -u ${SCRIPTS_DIR_SOURCE}/${file} ${SCRIPTS_DIR_TARGET}/${file}
 done
 # Subdirs for GPU support
-NVIDIA_GPU_SUPPORT_DIR_SOURCE=${TOPDIR}/gpu_support/nvidia/ # Source dir
-NVIDIA_GPU_SUPPORT_DIR_TARGET=${INSTALL_PREFIX}/gpu_support/nvidia/ # Target dir
+NVIDIA_GPU_SUPPORT_DIR_SOURCE=${TOPDIR}/gpu_support/nvidia # Source dir
+NVIDIA_GPU_SUPPORT_DIR_TARGET=${INSTALL_PREFIX}/gpu_support/nvidia # Target dir
Member

I think if we are shipping scripts then the GPU stuff should appear as a subdir in there, I'll try to make a PR tonight

Contributor

I won't let that block this PR, since we have things in a working state to get CUDA-Samples installed, so that should go in a follow-up PR.

I'm actually very much in favor of not shipping the scripts in the EESSI repository just yet, and to provide instructions in the documentation to clone the software-layer repo instead, for now...

Collaborator Author

You mean /cvmfs/reponame/scripts/gpu_support? I thought about that as well. I think it is cleaner, but note that the scripts often assume other files to be present in certain relative paths. So you'll have to make sure to adjust these.

I could try to have a look in a minute...

Member

@boegel this PR is already shipping the scripts...

Member

For me, I think it's more convenient for the end user to have this self-documenting via Lmod

Contributor

And what happens if we change how we do things for a new EESSI version? The Lmod hook would then point to bad information.

Docs can be updated to have version-specific info, so that's not exactly true.
Lmod hook could be updated to point to https://eessi.io/docs/gpu/2027.09, and https://eessi.io/docs/gpu could be set up to auto-redirect to https://eessi.io/docs/gpu/2023.06.

Member

You don't think that's a lot of extra effort for little gain?

Personally, I think there are plenty of people who do not care how it works, just that it works.

Member

You can always add the docs pointers in the scripts if they're actually needed

Contributor

For some people, just getting it to work will be enough.

Not for system administrators though, they won't go and run a random script just because they're told to.
They may even want to do what needs to be done some other way (through Puppet or whatever).

Member

@ocaisa ocaisa Dec 20, 2023

Sure, but then you are sculpting things for a minority use case (sysadmins).

Anyway, we can do both: the Lmod hook can mention the script and the docs, and the users can decide.


 # Create target dir
 mkdir -p ${NVIDIA_GPU_SUPPORT_DIR_TARGET}

 # Copy files from this directory into the prefix
 # To be on the safe side, we dont do recursive copies, but we are explicitely copying each individual file we want to add
-for file in install_cuda_host_injections.sh link_nvidia_host_injections.sh; do
-    cp ${NVIDIA_GPU_SUPPORT_DIR_SOURCE}/${file} ${NVIDIA_GPU_SUPPORT_DIR_TARGET}/${file}
+for file in install_cuda_host_injections.sh link_nvidia_host_libraries.sh; do
+    cp -u ${NVIDIA_GPU_SUPPORT_DIR_SOURCE}/${file} ${NVIDIA_GPU_SUPPORT_DIR_TARGET}/${file}
 done
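
The `cp -u` used above decides by modification time, not by contents, while a later commit in this PR is titled "Only copy scripts if the contents differ". The following sketch contrasts the two approaches. It assumes GNU coreutils (`cp -u`, `touch -d`), and `copy_if_different` is a hypothetical helper that illustrates one way to compare contents. It is not the code from that commit.

```shell
#!/usr/bin/env bash
# Assumes GNU coreutils: cp -u and touch -d are not available everywhere.
src_dir=$(mktemp -d)
dst_dir=$(mktemp -d)

echo "new contents" > "${src_dir}/utils.sh"
echo "old contents" > "${dst_dir}/utils.sh"

# Make the source look older than the target
touch -d '2020-01-01 00:00:00' "${src_dir}/utils.sh"

# cp -u only copies when the source is newer (or the target is missing),
# so the differing target is left untouched here
cp -u "${src_dir}/utils.sh" "${dst_dir}/utils.sh"
cat "${dst_dir}/utils.sh"

# Hypothetical alternative: copy whenever the contents differ
# (cmp -s is silent and returns non-zero on any difference or missing file)
copy_if_different() {
    local source_file="$1" target_file="$2"
    if ! cmp -s "${source_file}" "${target_file}"; then
        cp "${source_file}" "${target_file}"
    fi
}

copy_if_different "${src_dir}/utils.sh" "${dst_dir}/utils.sh"
cat "${dst_dir}/utils.sh"

rm -rf "${src_dir}" "${dst_dir}"
```

A timestamp check is cheaper, but a fresh checkout or a touched file can make it copy too much or too little. Comparing contents avoids both cases at the cost of reading each file.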