tweak libpaths in TensorFlow easyblock by adding directory containing libnccl.so.2 #3497

Open: wants to merge 6 commits into base: develop (changes from 4 commits)
11 changes: 11 additions & 0 deletions easybuild/easyblocks/t/tensorflow.py
@@ -701,13 +701,24 @@ def configure_step(self):
                })
            else:
                raise EasyBuildError("TensorFlow has a strict dependency on cuDNN if CUDA is enabled")

            if nccl_root:
                nccl_version = get_software_version('NCCL')
                # Ignore the PKG_REVISION identifier if it exists (i.e., report 2.4.6 for 2.4.6-1 or 2.4.6-2)
                nccl_version = nccl_version.split('-')[0]
                config_env_vars.update({
                    'NCCL_INSTALL_PATH': nccl_root,
                })

                # add absolute path to libnccl.so.2 directory provided by NCCL
                # when LD_LIBRARY_PATH is filtered and LIBRARY_PATH is not
                # filtered, e.g., in an environment such as EESSI
                filtered_env_vars = build_option('filter_env_vars') or []
                if 'LD_LIBRARY_PATH' in filtered_env_vars and 'LIBRARY_PATH' not in filtered_env_vars:
                    system_libs_info_as_list = list(self.system_libs_info)
                    system_libs_info_as_list[2].append(os.path.join(nccl_root, get_software_libdir('NCCL')))
                    self.system_libs_info = tuple(system_libs_info_as_list)

@Flamefire (Contributor), Nov 8, 2024:

This part looks very fishy: I had to read the full code to verify that it is correct. We should make get_system_libs, and hence self.system_libs_info, return a named tuple instead so that this is easier to understand.
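
To illustrate the named-tuple suggestion, here is a minimal sketch. The type name `SystemLibsInfo`, the field name `names` for the first element, and the example paths are assumptions made for illustration; only the order (include paths at index 1, library paths at index 2) follows from the code discussed here.

```python
from collections import namedtuple

# Hypothetical type and field names; the easyblock currently uses a plain tuple.
SystemLibsInfo = namedtuple('SystemLibsInfo', ['names', 'cpaths', 'libpaths'])

# Made-up example values, not taken from a real installation.
info = SystemLibsInfo(
    names=['zlib'],
    cpaths=['/opt/zlib/include'],
    libpaths=['/opt/zlib/lib'],
)

# Slicing and positional unpacking keep working, so existing callers stay valid.
cpaths, libpaths = info[1:]

# Field access states the intent directly, unlike indexing with 2.
info.libpaths.append('/opt/nccl/lib')
```

Because a named tuple is still a tuple, code that indexes or slices `self.system_libs_info` would not need to change, and the list held in `libpaths` can be extended in place without converting the tuple to a list and back.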

However, from a semantic point of view this is the wrong place to add NCCL: "system libs" in the context of TensorFlow are dependencies that can be vendored in a way TF understands, see https://github.com/tensorflow/tensorflow/blob/master/third_party/systemlibs/syslibs_configure.bzl#L11

I'd rather put this into build_step, where action_env['LIBRARY_PATH'] is set. The easiest way would be to (conditionally) append to libpaths right after the line cpaths, libpaths = self.system_libs_info[1:]

This is also easier to understand due to the comment there: "Make TF find our modules. LD_LIBRARY_PATH gets automatically added by configure.py"
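
The conditional append described above could look roughly like the sketch below. It is not the change made in the pull request: the helper name `extend_libpaths_for_nccl` and its parameters are hypothetical, and the function is kept free of EasyBuild imports so that it runs on its own. In the easyblock, the inputs would presumably come from `build_option('filter_env_vars')`, `nccl_root`, and `get_software_libdir('NCCL')`.

```python
import os


def extend_libpaths_for_nccl(libpaths, filtered_env_vars, nccl_root, nccl_libdir):
    """Return a copy of libpaths, extended with the NCCL library directory
    when LD_LIBRARY_PATH is filtered but LIBRARY_PATH is not.

    Hypothetical helper for illustration only.
    """
    filtered = filtered_env_vars or []
    result = list(libpaths)
    if nccl_root and 'LD_LIBRARY_PATH' in filtered and 'LIBRARY_PATH' not in filtered:
        result.append(os.path.join(nccl_root, nccl_libdir))
    return result


# Made-up paths: LD_LIBRARY_PATH is filtered, so the NCCL directory is appended.
with_nccl = extend_libpaths_for_nccl(['/opt/zlib/lib'], ['LD_LIBRARY_PATH'], '/opt/nccl', 'lib')

# Nothing is filtered, so the list is returned unchanged.
unchanged = extend_libpaths_for_nccl(['/opt/zlib/lib'], None, '/opt/nccl', 'lib')
```

Returning a new list rather than mutating the argument leaves `self.system_libs_info` untouched, which keeps NCCL out of the "system libs" bookkeeping and limits the change to the value used for `LIBRARY_PATH`.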

Contributor Author:

Sounds right to me.

Contributor Author:

Made an attempt in f697d97

Not tested yet. Will let you know if it works or not.

Contributor Author:

The changes in f697d97 worked.


            else:
                nccl_version = '1.3'  # Use simple downloadable version
            config_env_vars.update({