{2023.06}[foss/2023a] PyTorch v2.1.2 w/ CUDA 12.1.1 #825

Open
wants to merge 12 commits into
base: 2023.06-software.eessi.io
@@ -1,3 +1,4 @@
easyconfigs:
- CUDA-12.1.1.eb
- cuDNN-8.9.2.26-CUDA-12.1.1.eb
- PyTorch-2.1.2-foss-2023a-CUDA-12.1.1.eb
27 changes: 27 additions & 0 deletions eb_hooks.py
@@ -604,6 +604,32 @@ def pre_configure_hook_LAMMPS_zen4(self, *args, **kwargs):
        raise EasyBuildError("LAMMPS-specific hook triggered for non-LAMMPS easyconfig?!")



def pre_configure_hook_pytorch_add_cupti_libdir(self, *args, **kwargs):
    """
    Pre-configure hook for PyTorch: add directory $EESSI_SOFTWARE_PATH/software/CUDA/12.1.1/extras/CUPTI/lib64 to LIBRARY_PATH
    """
    if self.name == 'PyTorch' and self.version == '2.1.2':
        if 'cudaver' in self.cfg.template_values and self.cfg.template_values['cudaver'] == '12.1.1':
Contributor

I don't see a reason to make this specific to a particular PyTorch version or CUDA version?

Collaborator Author

We only know that the failure happens in this specific case. If we apply it to other cases, we will not know whether it was necessary or not.

            _cudaver = self.cfg.template_values['cudaver']
            print_msg("pre_configure_hook_pytorch_add_cupti_libdir: CUDA version: '%s'" % _cudaver)
            _library_path = os.getenv('LIBRARY_PATH')
            print_msg("pre_configure_hook_pytorch_add_cupti_libdir: library_path: '%s'", _library_path)
            _eessi_software_path = os.getenv('EESSI_SOFTWARE_PATH')
            print_msg("pre_configure_hook_pytorch_add_cupti_libdir: eessi_software_path: '%s'", _eessi_software_path)
            _cupti_lib_dir = os.path.join(_eessi_software_path, 'software', 'CUDA', _cudaver, 'extras', 'CUPTI', 'lib64')
            print_msg("pre_configure_hook_pytorch_add_cupti_libdir: cupti_lib_dir: '%s'", _cupti_lib_dir)
            if _library_path:
                env.setvar('LIBRARY_PATH', ':'.join([_library_path, _cupti_lib_dir]))
Contributor

This seems like a bug in our CUDA installation/module, no?

I'm fine with proceeding like this for now: even if we also fix it somewhere else, this won't cause trouble. But there's probably a more general fix for this?

Collaborator Author

Right, we might find a better solution by changing the CUDA module, e.g. by adding the directory to LIBRARY_PATH through the module.

It could be a worthwhile effort to try.

Collaborator Author

@trz42 trz42 Nov 23, 2024

@boegel could it be that lib_path (at least parts of it) is missing in https://github.com/easybuilders/easybuild-easyblocks/blob/57c0eaed8dc29e223fe68a75f7bf195cca0c2d04/easybuild/easyblocks/c/cuda.py#L362

A little before that line lib_path is constructed as list ['lib64', 'extras/CUPTI/lib64', 'nvvm/lib64'], but in line 362 only ['lib64', 'stubs/lib64'] is used.
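
To make the discrepancy in the comment above concrete, here is a minimal sketch that compares the two lists quoted there. The function name `missing_subdirs` and the variable names are hypothetical and are not taken from the CUDA easyblock; only the two lists come from the comment.

```python
def missing_subdirs(constructed, used):
    """Return the entries of `constructed` that do not appear in `used`, keeping order."""
    return [subdir for subdir in constructed if subdir not in used]


# Lists as quoted in the review comment (not read from the easyblock itself).
constructed_lib_path = ['lib64', 'extras/CUPTI/lib64', 'nvvm/lib64']
used_lib_path = ['lib64', 'stubs/lib64']

print(missing_subdirs(constructed_lib_path, used_lib_path))
# ['extras/CUPTI/lib64', 'nvvm/lib64']
```

If the comment's reading of the easyblock is right, `extras/CUPTI/lib64` is one of the directories that never reaches the second list, which would be consistent with the hook having to add it by hand.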

            else:
                env.setvar('LIBRARY_PATH', _cupti_lib_dir)
            print_msg("pre_configure_hook_pytorch_add_cupti_libdir: LIBRARY_PATH: '%s'", os.getenv('LIBRARY_PATH'))
        else:
            print_msg("PyTorch/2.1.2-specific pre_configure hook triggered for non-CUDA or non-CUDA/12.1.1 easyconfig; NOT adding CUPTI lib64 dir to LIBRARY_PATH")
    else:
        raise EasyBuildError("PyTorch/2.1.2-specific pre_configure hook triggered for non-PyTorch/2.1.2 easyconfig?!")


def pre_test_hook(self, *args, **kwargs):
    """Main pre-test hook: trigger custom functions based on software name."""
    if self.name in PRE_TEST_HOOKS:
@@ -995,6 +1021,7 @@ def inject_gpu_property(ec):
    'OpenBLAS': pre_configure_hook_openblas_optarch_generic,
    'WRF': pre_configure_hook_wrf_aarch64,
    'LAMMPS': pre_configure_hook_LAMMPS_zen4,
    'PyTorch': pre_configure_hook_pytorch_add_cupti_libdir,
    'Score-P': pre_configure_hook_score_p,
}
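
The path handling in the PyTorch hook in this diff (build the CUPTI directory, then append it to a colon-separated `LIBRARY_PATH`) can be sketched without EasyBuild. This is only an illustration under stated assumptions: the helper names `cupti_lib_dir` and `append_to_search_path` are hypothetical, and the duplicate check and the handling of an unset software path go beyond what the hook itself does.

```python
import os


def cupti_lib_dir(software_path, cuda_version):
    """Build <software_path>/software/CUDA/<cuda_version>/extras/CUPTI/lib64."""
    if not software_path:
        # Assumption: fail early instead of passing None to os.path.join.
        raise ValueError("software path is not set")
    return os.path.join(software_path, 'software', 'CUDA', cuda_version,
                        'extras', 'CUPTI', 'lib64')


def append_to_search_path(current, new_dir):
    """Append new_dir to a colon-separated search path.

    `current` may be None or empty (variable unset). Empty entries are
    dropped and new_dir is not added a second time if already present.
    """
    entries = [entry for entry in (current or '').split(':') if entry]
    if new_dir not in entries:
        entries.append(new_dir)
    return ':'.join(entries)


if __name__ == '__main__':
    # Read the same variables the hook reads; print the result rather than
    # modifying the environment.
    eessi_path = os.getenv('EESSI_SOFTWARE_PATH')
    if eessi_path:
        target = cupti_lib_dir(eessi_path, '12.1.1')
        print(append_to_search_path(os.getenv('LIBRARY_PATH'), target))
```

Keeping the two helpers free of side effects makes both branches of the hook (variable set, variable unset) easy to check in isolation.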
