{2023.06}[foss/2023a] PyTorch v2.1.2 w/ CUDA 12.1.1 #825
base: 2023.06-software.eessi.io
bot: build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80
Build again after applying fix to find
bot: build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80
Also build for
bot: build repo:eessi.io-2023.06-software arch:x86_64/amd/zen3 accel:nvidia/cc80
eb_hooks.py
Outdated
if self.name == 'PyTorch' and self.version == '2.1.2':
    if 'cudaver' in self.cfg.template_values and self.cfg.template_values['cudaver'] == '12.1.1':
I don't see a reason to make this specific to a particular PyTorch version or CUDA version?
We only know that the failure happens in this specific case. If we apply it to other cases, we will not know whether it was necessary or not.
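The guard discussed above can be isolated into a small predicate, which makes the scope of the workaround explicit and easy to widen later. This is only a sketch: the function name needs_cupti_libdir_fix is hypothetical, and a plain dict stands in for self.cfg.template_values.

```python
def needs_cupti_libdir_fix(name, version, template_values):
    """Return True only for the combination where the failure was observed.

    Hypothetical helper: `template_values` is a plain dict standing in for
    the easyconfig's template values.
    """
    return (
        name == 'PyTorch'
        and version == '2.1.2'
        # .get() covers the case where 'cudaver' is not defined at all
        and template_values.get('cudaver') == '12.1.1'
    )
```

Keeping the condition in one place means that relaxing it to other versions, should they turn out to fail in the same way, is a one-line change.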
eb_hooks.py
Outdated
_cupti_lib_dir = os.path.join(_eessi_software_path, 'software', 'CUDA', _cudaver, 'extras', 'CUPTI', 'lib64')
print_msg("pre_configure_hook_pytorch_add_cupti_libdir: cupti_lib_dir: '%s'", _cupti_lib_dir)
if _library_path:
    env.setvar('LIBRARY_PATH', ':'.join([_library_path, _cupti_lib_dir]))
This seems like a bug in our CUDA installation/module, no?
I'm fine with proceeding like this for now: even if we also fix it somewhere else, this won't cause trouble. But there's probably a more general fix for this?
Right, we might find a better solution by changing the CUDA module, e.g. by adding the directory to LIBRARY_PATH through the module.
It could be a worthwhile effort to try.
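The point that the hook stays harmless even if the module later adds the same directory can be illustrated with an append that skips entries already present. This is a minimal sketch in plain Python, not EasyBuild code: the helper name append_path_entry and the example paths in the usage note are hypothetical.

```python
def append_path_entry(current, new_dir, sep=':'):
    """Append new_dir to a path-style variable value.

    Handles an unset or empty variable and skips the append when the
    directory is already listed, so applying the fix twice (hook and
    module) does not duplicate the entry.
    """
    entries = [e for e in (current or '').split(sep) if e]
    if new_dir not in entries:
        entries.append(new_dir)
    return sep.join(entries)
```

For example, append_path_entry('/a:/b', '/cupti/lib64') yields '/a:/b:/cupti/lib64', and calling it again on that result leaves the value unchanged.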
@boegel could it be that lib_path (at least parts of it) is missing in https://github.com/easybuilders/easybuild-easyblocks/blob/57c0eaed8dc29e223fe68a75f7bf195cca0c2d04/easybuild/easyblocks/c/cuda.py#L362
A little before that line, lib_path is constructed as the list ['lib64', 'extras/CUPTI/lib64', 'nvvm/lib64'], but in line 362 only ['lib64', 'stubs/lib64'] is used.
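To make the discrepancy concrete, the two lists quoted above can be compared directly. This sketch uses only the values from the comment; the function name missing_subdirs is hypothetical and nothing here is taken from the easyblock beyond those two lists.

```python
def missing_subdirs(constructed, used):
    """Return the entries of `constructed` that do not appear in `used`, in order."""
    return [d for d in constructed if d not in used]


# Values as quoted in the comment above
constructed_lib_path = ['lib64', 'extras/CUPTI/lib64', 'nvvm/lib64']
used_in_line_362 = ['lib64', 'stubs/lib64']

missing = missing_subdirs(constructed_lib_path, used_in_line_362)
print(missing)
```

The result includes extras/CUPTI/lib64, which is the directory the hook adds to LIBRARY_PATH by hand.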
Try a different approach: rebuild the CUDA module so that it prepends the directory containing the libcupti library to LIBRARY_PATH, and do not use the hook from the previous builds...
bot: build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80
New attempt to add
bot: build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80
Add missing double quotes...
bot: build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80
Repeat build with updated easyblock (see easybuilders/easybuild-easyblocks#3516) instead of
bot: build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80
Repeat build using the latest commit in easybuilders/easybuild-easyblocks#3516...
bot: build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80
bot: help
Builds
Supersedes #718