{ai,lib}[GCCcore/12.2.0,foss/2022b] PyTorch v2.1.2, NCCL v2.18.3 w/ CUDA 12.0.0 #20520

Draft · wants to merge 4 commits into base: develop

Conversation

@Flamefire (Contributor) commented May 13, 2024

(created using eb --new-pr)
This is meant as an alternative to #20155, using a newer NCCL version, since the older one currently included in foss/2022b doesn't seem to work with PyTorch 2.1.2.

Update: It seems #20155 works now, so I'm putting this one on hold.
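
For anyone wanting to reproduce this, a minimal sketch of how the easyconfigs from this PR could be built and tested with EasyBuild (a sketch only; the exact options depend on the local EasyBuild configuration, and --upload-test-report assumes GitHub credentials are set up):

# build both easyconfigs from this PR, resolving any missing dependencies
eb --from-pr 20520 --robot
# rebuild with the test step and upload a test report to this PR
eb --from-pr 20520 --rebuild --upload-test-report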

@SebastianAchilles added this to the 4.x milestone on May 14, 2024

Test report by @SebastianAchilles
FAILED
Build succeeded for 1 out of 2 (2 easyconfigs in total)
skl-rockylinux-89 - Linux Rocky Linux 8.9, x86_64, Intel(R) Core(TM) i7-10700 CPU @ 2.90GHz (skylake), 1 x NVIDIA NVIDIA RTX A4000, 550.54.15, Python 3.6.8
See https://gist.github.com/SebastianAchilles/7ddc2f02e198c9e93730651648ea6a65 for a full test report.

Test report by @SebastianAchilles
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
jsczen3g1.int.jsc-zen3.fz-juelich.de - Linux Rocky Linux 9.4, x86_64, AMD EPYC-Milan Processor (zen3), 1 x NVIDIA NVIDIA A100 80GB PCIe, 550.54.15, Python 3.9.18
See https://gist.github.com/SebastianAchilles/caa73902c24edfc4a9f09a1104e38750 for a full test report.

Test report by @SebastianAchilles
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
skl-rockylinux-89 - Linux Rocky Linux 8.9, x86_64, Intel(R) Core(TM) i7-10700 CPU @ 2.90GHz (skylake), 1 x NVIDIA NVIDIA RTX A4000, 550.54.15, Python 3.6.8
See https://gist.github.com/SebastianAchilles/c2693ff5dacd31a35769e1bca1515fc6 for a full test report.

@Flamefire (Contributor, PR author):

> Test report by @SebastianAchilles
> FAILED
> Build succeeded for 1 out of 2 (2 easyconfigs in total)
> skl-rockylinux-89 - Linux Rocky Linux 8.9, x86_64, Intel(R) Core(TM) i7-10700 CPU @ 2.90GHz (skylake), 1 x NVIDIA NVIDIA RTX A4000, 550.54.15, Python 3.6.8
> See https://gist.github.com/SebastianAchilles/7ddc2f02e198c9e93730651648ea6a65 for a full test report.

That first one failed with:

distributed/_tensor/test_dtensor_ops 1/1 failed! Received signal: SIGSEGV

I see that failure every now and then in various tests, especially test_jit*. It seems to happen randomly; I'm not sure why.

I'll do a larger repeated run for both PRs over the weekend, so I'll have the results to compare on Tuesday (Monday is a public holiday here).
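
As a rough sketch of what such a repeated run could look like (illustrative only; the iteration count is arbitrary, and the run_test.py invocation mirrors the runtest command used in the easyconfigs, assuming its -i/--include option):

# from the PyTorch test directory: repeat the suspect test to see how often the SIGSEGV shows up
cd test
for i in $(seq 1 20); do
    PYTHONUNBUFFERED=1 python run_test.py -i distributed/_tensor/test_dtensor_ops || echo "run $i failed"
done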

Test report by @Flamefire
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
i8002 - Linux Rocky Linux 8.9 (Green Obsidian), x86_64, AMD EPYC 7352 24-Core Processor (zen2), 8 x NVIDIA NVIDIA A100-SXM4-40GB, 555.42.06, Python 3.8.17
See https://gist.github.com/Flamefire/bb9a1cd446347299a3a18282e8c52f29 for a full test report.

github-actions bot commented Nov 22, 2024

Updated software NCCL-2.18.3-GCCcore-12.2.0-CUDA-12.0.0.eb

Diff against NCCL-2.22.3-GCCcore-13.3.0-CUDA-12.6.0.eb

easybuild/easyconfigs/n/NCCL/NCCL-2.22.3-GCCcore-13.3.0-CUDA-12.6.0.eb

diff --git a/easybuild/easyconfigs/n/NCCL/NCCL-2.22.3-GCCcore-13.3.0-CUDA-12.6.0.eb b/easybuild/easyconfigs/n/NCCL/NCCL-2.18.3-GCCcore-12.2.0-CUDA-12.0.0.eb
index 0534e538fa..a25e786210 100644
--- a/easybuild/easyconfigs/n/NCCL/NCCL-2.22.3-GCCcore-13.3.0-CUDA-12.6.0.eb
+++ b/easybuild/easyconfigs/n/NCCL/NCCL-2.18.3-GCCcore-12.2.0-CUDA-12.0.0.eb
@@ -1,23 +1,28 @@
 name = 'NCCL'
-version = '2.22.3'
+version = '2.18.3'
 versionsuffix = '-CUDA-%(cudaver)s'
 
 homepage = 'https://developer.nvidia.com/nccl'
 description = """The NVIDIA Collective Communications Library (NCCL) implements multi-GPU and multi-node collective
 communication primitives that are performance optimized for NVIDIA GPUs."""
 
-toolchain = {'name': 'GCCcore', 'version': '13.3.0'}
+toolchain = {'name': 'GCCcore', 'version': '12.2.0'}
 
 github_account = 'NVIDIA'
 source_urls = [GITHUB_SOURCE]
 sources = ['v%(version)s-1.tar.gz']
-checksums = ['45151629a9494460e73375281e8b0fe379141528879301899ece9b776faca024']
+patches = ['NCCL-2.16.2_fix-cpuid.patch']
+checksums = [
+    ('6477d83c9edbb34a0ebce6d751a1b32962bc6415d75d04972b676c6894ceaef9',
+     'b4f5d7d9eea2c12e32e7a06fe138b2cfc75969c6d5c473aa6f819a792db2fc96'),
+    {'NCCL-2.16.2_fix-cpuid.patch': '0459ecadcd32b2a7a000a2ce4f675afba908b2c0afabafde585330ff4f83e277'},
+]
 
-builddependencies = [('binutils', '2.42')]
+builddependencies = [('binutils', '2.39')]
 
 dependencies = [
-    ('CUDA', '12.6.0', '', SYSTEM),
-    ('UCX-CUDA', '1.16.0', versionsuffix),
+    ('CUDA', '12.0.0', '', SYSTEM),
+    ('UCX-CUDA', '1.13.1', versionsuffix),
 ]
 
 # default CUDA compute capabilities to use (override via --cuda-compute-capabilities)
Diff against NCCL-2.20.5-GCCcore-13.2.0-CUDA-12.4.0.eb

easybuild/easyconfigs/n/NCCL/NCCL-2.20.5-GCCcore-13.2.0-CUDA-12.4.0.eb

diff --git a/easybuild/easyconfigs/n/NCCL/NCCL-2.20.5-GCCcore-13.2.0-CUDA-12.4.0.eb b/easybuild/easyconfigs/n/NCCL/NCCL-2.18.3-GCCcore-12.2.0-CUDA-12.0.0.eb
index 90634952ad..a25e786210 100644
--- a/easybuild/easyconfigs/n/NCCL/NCCL-2.20.5-GCCcore-13.2.0-CUDA-12.4.0.eb
+++ b/easybuild/easyconfigs/n/NCCL/NCCL-2.18.3-GCCcore-12.2.0-CUDA-12.0.0.eb
@@ -1,23 +1,28 @@
 name = 'NCCL'
-version = '2.20.5'
+version = '2.18.3'
 versionsuffix = '-CUDA-%(cudaver)s'
 
 homepage = 'https://developer.nvidia.com/nccl'
 description = """The NVIDIA Collective Communications Library (NCCL) implements multi-GPU and multi-node collective
 communication primitives that are performance optimized for NVIDIA GPUs."""
 
-toolchain = {'name': 'GCCcore', 'version': '13.2.0'}
+toolchain = {'name': 'GCCcore', 'version': '12.2.0'}
 
 github_account = 'NVIDIA'
 source_urls = [GITHUB_SOURCE]
 sources = ['v%(version)s-1.tar.gz']
-checksums = ['d11ad65c1df3cbe4447eaddceec71569f5c0497e27b3b8369cf79f18d2b2ad8c']
+patches = ['NCCL-2.16.2_fix-cpuid.patch']
+checksums = [
+    ('6477d83c9edbb34a0ebce6d751a1b32962bc6415d75d04972b676c6894ceaef9',
+     'b4f5d7d9eea2c12e32e7a06fe138b2cfc75969c6d5c473aa6f819a792db2fc96'),
+    {'NCCL-2.16.2_fix-cpuid.patch': '0459ecadcd32b2a7a000a2ce4f675afba908b2c0afabafde585330ff4f83e277'},
+]
 
-builddependencies = [('binutils', '2.40')]
+builddependencies = [('binutils', '2.39')]
 
 dependencies = [
-    ('CUDA', '12.4.0', '', SYSTEM),
-    ('UCX-CUDA', '1.15.0', versionsuffix),
+    ('CUDA', '12.0.0', '', SYSTEM),
+    ('UCX-CUDA', '1.13.1', versionsuffix),
 ]
 
 # default CUDA compute capabilities to use (override via --cuda-compute-capabilities)
Diff against NCCL-2.16.2-GCCcore-12.2.0-CUDA-11.7.0.eb

easybuild/easyconfigs/n/NCCL/NCCL-2.16.2-GCCcore-12.2.0-CUDA-11.7.0.eb

diff --git a/easybuild/easyconfigs/n/NCCL/NCCL-2.16.2-GCCcore-12.2.0-CUDA-11.7.0.eb b/easybuild/easyconfigs/n/NCCL/NCCL-2.18.3-GCCcore-12.2.0-CUDA-12.0.0.eb
index ebbd822138..a25e786210 100644
--- a/easybuild/easyconfigs/n/NCCL/NCCL-2.16.2-GCCcore-12.2.0-CUDA-11.7.0.eb
+++ b/easybuild/easyconfigs/n/NCCL/NCCL-2.18.3-GCCcore-12.2.0-CUDA-12.0.0.eb
@@ -1,5 +1,5 @@
 name = 'NCCL'
-version = '2.16.2'
+version = '2.18.3'
 versionsuffix = '-CUDA-%(cudaver)s'
 
 homepage = 'https://developer.nvidia.com/nccl'
@@ -13,21 +13,19 @@ source_urls = [GITHUB_SOURCE]
 sources = ['v%(version)s-1.tar.gz']
 patches = ['NCCL-2.16.2_fix-cpuid.patch']
 checksums = [
-    {'v2.16.2-1.tar.gz': '7f7c738511a8876403fc574d13d48e7c250d934d755598d82e14bab12236fc64'},
+    ('6477d83c9edbb34a0ebce6d751a1b32962bc6415d75d04972b676c6894ceaef9',
+     'b4f5d7d9eea2c12e32e7a06fe138b2cfc75969c6d5c473aa6f819a792db2fc96'),
     {'NCCL-2.16.2_fix-cpuid.patch': '0459ecadcd32b2a7a000a2ce4f675afba908b2c0afabafde585330ff4f83e277'},
 ]
 
 builddependencies = [('binutils', '2.39')]
 
 dependencies = [
-    ('CUDA', '11.7.0', '', SYSTEM),
+    ('CUDA', '12.0.0', '', SYSTEM),
     ('UCX-CUDA', '1.13.1', versionsuffix),
 ]
 
-prebuildopts = "sed -i 's/NVCUFLAGS  := /NVCUFLAGS  := -allow-unsupported-compiler /' makefiles/common.mk && "
-buildopts = "VERBOSE=1"
-
 # default CUDA compute capabilities to use (override via --cuda-compute-capabilities)
-cuda_compute_capabilities = ['3.5', '5.0', '6.0', '7.0', '7.5', '8.0', '8.6']
+cuda_compute_capabilities = ['5.0', '6.0', '7.0', '7.5', '8.0', '8.6', '9.0']
 
 moduleclass = 'lib'

Updated software PyTorch-2.1.2-foss-2022b-CUDA-12.0.0.eb

Diff against PyTorch-2.1.2-foss-2023b.eb

easybuild/easyconfigs/p/PyTorch/PyTorch-2.1.2-foss-2023b.eb

diff --git a/easybuild/easyconfigs/p/PyTorch/PyTorch-2.1.2-foss-2023b.eb b/easybuild/easyconfigs/p/PyTorch/PyTorch-2.1.2-foss-2022b-CUDA-12.0.0.eb
index bce1b68aa7..9cbcda474f 100644
--- a/easybuild/easyconfigs/p/PyTorch/PyTorch-2.1.2-foss-2023b.eb
+++ b/easybuild/easyconfigs/p/PyTorch/PyTorch-2.1.2-foss-2022b-CUDA-12.0.0.eb
@@ -1,11 +1,12 @@
 name = 'PyTorch'
 version = '2.1.2'
+versionsuffix = '-CUDA-%(cudaver)s'
 
 homepage = 'https://pytorch.org/'
 description = """Tensors and Dynamic neural networks in Python with strong GPU acceleration.
 PyTorch is a deep learning framework that puts Python first."""
 
-toolchain = {'name': 'foss', 'version': '2023b'}
+toolchain = {'name': 'foss', 'version': '2022b'}
 
 source_urls = [GITHUB_RELEASE]
 sources = ['%(namelower)s-v%(version)s.tar.gz']
@@ -30,6 +31,7 @@ patches = [
     'PyTorch-2.0.1_skip-test_shuffle_reproducibility.patch',
     'PyTorch-2.0.1_skip-tests-skipped-in-subprocess.patch',
     'PyTorch-2.1.0_disable-gcc12-warning.patch',
+    'PyTorch-2.1.0_disable-cudnn-tf32-for-too-strict-tests.patch',
     'PyTorch-2.1.0_fix-bufferoverflow-in-oneDNN.patch',
     'PyTorch-2.1.0_fix-test_numpy_torch_operators.patch',
     'PyTorch-2.1.0_fix-validationError-output-test.patch',
@@ -42,13 +44,26 @@ patches = [
     'PyTorch-2.1.0_skip-test_jvp_linalg_det_singular.patch',
     'PyTorch-2.1.0_skip-test_linear_fp32-without-MKL.patch',
     'PyTorch-2.1.0_skip-test_wrap_bad.patch',
+    'PyTorch-2.1.2_add-cuda-skip-markers.patch',
+    'PyTorch-2.1.2_fix-conj-mismatch-test-failures.patch',
+    'PyTorch-2.1.2_fix-device-mesh-check.patch',
+    'PyTorch-2.1.2_fix-fsdp-tp-integration-test.patch',
+    'PyTorch-2.1.2_fix-locale-issue-in-nvrtcCompileProgram.patch',
+    'PyTorch-2.1.2_fix-test_cuda-non-x86.patch',
     'PyTorch-2.1.2_fix-test_extension_backend-without-vectorization.patch',
     'PyTorch-2.1.2_fix-test_memory_profiler.patch',
+    'PyTorch-2.1.2_fix-test_parallelize_api.patch',
     'PyTorch-2.1.2_fix-test_torchinductor-rounding.patch',
     'PyTorch-2.1.2_fix-vsx-vector-abs.patch',
     'PyTorch-2.1.2_fix-vsx-vector-div.patch',
+    'PyTorch-2.1.2_fix-with_temp_dir-decorator.patch',
+    'PyTorch-2.1.2_fix-wrong-device-mesh-size-in-tests.patch',
+    'PyTorch-2.1.2_relax-cuda-tolerances.patch',
+    'PyTorch-2.1.2_remove-nccl-backend-default-without-gpus.patch',
     'PyTorch-2.1.2_skip-cpu_repro-test-without-vectorization.patch',
-    'PyTorch-2.1.2_skip-memory-leak-test.patch',
+    'PyTorch-2.1.2_skip-failing-test_dtensor_ops-subtests.patch',
+    'PyTorch-2.1.2_skip-test_fsdp_tp_checkpoint_integration.patch',
+    'PyTorch-2.1.2_skip-xfailing-test_dtensor_ops.patch',
     'PyTorch-2.1.2_workaround_dynamo_failure_without_nnpack.patch',
 ]
 checksums = [
@@ -85,6 +100,8 @@ checksums = [
     {'PyTorch-2.0.1_skip-tests-skipped-in-subprocess.patch':
      '166c134573a95230e39b9ea09ece3ad8072f39d370c9a88fb2a1e24f6aaac2b5'},
     {'PyTorch-2.1.0_disable-gcc12-warning.patch': 'c858b8db0010f41005dc06f9a50768d0d3dc2d2d499ccbdd5faf8a518869a421'},
+    {'PyTorch-2.1.0_disable-cudnn-tf32-for-too-strict-tests.patch':
+     'd895018ebdfd46e65d9f7645444a3b4c5bbfe3d533a08db559a04be34e01e478'},
     {'PyTorch-2.1.0_fix-bufferoverflow-in-oneDNN.patch':
      'b15b1291a3c37bf6a4982cfbb3483f693acb46a67bc0912b383fd98baf540ccf'},
     {'PyTorch-2.1.0_fix-test_numpy_torch_operators.patch':
@@ -107,17 +124,40 @@ checksums = [
     {'PyTorch-2.1.0_skip-test_linear_fp32-without-MKL.patch':
      '5dcc79883b6e3ec0a281a8e110db5e0a5880de843bb05653589891f16473ead5'},
     {'PyTorch-2.1.0_skip-test_wrap_bad.patch': 'b8583125ee94e553b6f77c4ab4bfa812b89416175dc7e9b7390919f3b485cb63'},
+    {'PyTorch-2.1.2_add-cuda-skip-markers.patch': 'd007d6d0cdb533e7d01f503e9055218760123a67c1841c57585385144be18c9a'},
+    {'PyTorch-2.1.2_fix-conj-mismatch-test-failures.patch':
+     'c164357efa4ce88095376e590ba508fc1daa87161e1e59544eda56daac7f2847'},
+    {'PyTorch-2.1.2_fix-device-mesh-check.patch': 'c0efc288bf3d9a9a3c8bbd2691348a589a2677ea43880a8c987db91c8de4806b'},
+    {'PyTorch-2.1.2_fix-fsdp-tp-integration-test.patch':
+     'f583532c59f35f36998851957d501b3ac8c883884efd61bbaa308db55cb6bdcd'},
+    {'PyTorch-2.1.2_fix-locale-issue-in-nvrtcCompileProgram.patch':
+     'f7adafb4e4d3b724b93237a259797b6ed6f535f83be0e34a7b759c71c6a8ddf2'},
+    {'PyTorch-2.1.2_fix-test_cuda-non-x86.patch': '1ed76fcc87e6c50606ac286487292a3d534707068c94af74c3a5de8153fa2c2c'},
     {'PyTorch-2.1.2_fix-test_extension_backend-without-vectorization.patch':
      'cd1455495886a7d6b2d30d48736eb0103fded21e2e36de6baac719b9c52a1c92'},
     {'PyTorch-2.1.2_fix-test_memory_profiler.patch':
      '30b0c9355636c0ab3dedae02399789053825dc3835b4d7dac6e696767772b1ce'},
+    {'PyTorch-2.1.2_fix-test_parallelize_api.patch':
+     'f8387a1693af344099c806981ca38df1306d7f4847d7d44713306338384b1cfd'},
     {'PyTorch-2.1.2_fix-test_torchinductor-rounding.patch':
      'a0ef99192ee2ad1509c78a8377023d5be2b5fddb16f84063b7c9a0b53d979090'},
     {'PyTorch-2.1.2_fix-vsx-vector-abs.patch': 'd67d32407faed7dc1dbab4bba0e2f7de36c3db04560ced35c94caf8d84ade886'},
     {'PyTorch-2.1.2_fix-vsx-vector-div.patch': '11f497a6892eb49b249a15320e4218e0d7ac8ae4ce67de39e4a018a064ca1acc'},
+    {'PyTorch-2.1.2_fix-with_temp_dir-decorator.patch':
+     '90bd001e034095329277d70c6facc4026b4ce6d7f8b8d6aa81c0176eeb462eb1'},
+    {'PyTorch-2.1.2_fix-wrong-device-mesh-size-in-tests.patch':
+     '07a5e4233d02fb6348872838f4d69573c777899c6f0ea4e39ae23c08660d41e5'},
+    {'PyTorch-2.1.2_relax-cuda-tolerances.patch': '554ad09787f61080fafdb84216e711e32327aa357e2a9c40bb428eb6503dee6e'},
+    {'PyTorch-2.1.2_remove-nccl-backend-default-without-gpus.patch':
+     'e6a1efe3d127fcbf4723476a7a1c01cfcf2ccb16d1fb250f478192623e8b6a15'},
     {'PyTorch-2.1.2_skip-cpu_repro-test-without-vectorization.patch':
      '7ace835af60c58d9e0754a34c19d4b9a0c3a531f19e5d0eba8e2e49206eaa7eb'},
-    {'PyTorch-2.1.2_skip-memory-leak-test.patch': '8d9841208e8a00a498295018aead380c360cf56e500ef23ca740adb5b36de142'},
+    {'PyTorch-2.1.2_skip-failing-test_dtensor_ops-subtests.patch':
+     '6cf711bf26518550903b09ed4431de9319791e79d61aab065785d6608fd5cc88'},
+    {'PyTorch-2.1.2_skip-test_fsdp_tp_checkpoint_integration.patch':
+     '943ee92f5fd518f608a59e43fe426b9bb45d7e7ad0ba04639e516db2d61fa57d'},
+    {'PyTorch-2.1.2_skip-xfailing-test_dtensor_ops.patch':
+     '7f5befddcb006b6ab5377de6ee3c29df375c5f8ef5e42b998d35113585b983f3'},
     {'PyTorch-2.1.2_workaround_dynamo_failure_without_nnpack.patch':
      'fb96eefabf394617bbb3fbd3a7a7c1aa5991b3836edc2e5d2a30e708bfe49ba1'},
 ]
@@ -125,32 +165,35 @@ checksums = [
 osdependencies = [OS_PKG_IBVERBS_DEV]
 
 builddependencies = [
-    ('CMake', '3.27.6'),
-    ('hypothesis', '6.90.0'),
+    ('CMake', '3.24.3'),
+    ('hypothesis', '6.68.2'),
     # For tests
     ('pytest-flakefinder', '1.1.0'),
-    ('pytest-rerunfailures', '14.0'),
+    ('pytest-rerunfailures', '12.0'),
     ('pytest-shard', '0.1.2'),
 ]
 
 dependencies = [
+    ('CUDA', '12.0.0', '', SYSTEM),
+    ('cuDNN', '8.8.0.121', '-CUDA-%(cudaver)s', SYSTEM),
+    ('magma', '2.7.1', '-CUDA-%(cudaver)s'),
+    ('NCCL', '2.18.3', '-CUDA-%(cudaver)s'),
     ('Ninja', '1.11.1'),  # Required for JIT compilation of C++ extensions
-    ('Python', '3.11.5'),
-    ('Python-bundle-PyPI', '2023.10'),
-    ('protobuf', '25.3'),
-    ('protobuf-python', '4.25.3'),
-    ('pybind11', '2.11.1'),
-    ('SciPy-bundle', '2023.11'),
-    ('PyYAML', '6.0.1'),
-    ('MPFR', '4.2.1'),
-    ('GMP', '6.3.0'),
+    ('Python', '3.10.8'),
+    ('protobuf', '23.0'),
+    ('protobuf-python', '4.23.0'),
+    ('pybind11', '2.10.3'),
+    ('SciPy-bundle', '2023.02'),
+    ('PyYAML', '6.0'),
+    ('MPFR', '4.2.0'),
+    ('GMP', '6.2.1'),
     ('numactl', '2.0.16'),
-    ('FFmpeg', '6.0'),
-    ('Pillow', '10.2.0'),
-    ('expecttest', '0.2.1'),
-    ('networkx', '3.2.1'),
+    ('FFmpeg', '5.1.2'),
+    ('Pillow', '9.4.0'),
+    ('expecttest', '0.1.3'),
+    ('networkx', '3.0'),
     ('sympy', '1.12'),
-    ('Z3', '4.13.0',),
+    ('Z3', '4.12.2', '-Python-%(pyver)s'),
 ]
 
 use_pip = True
@@ -170,6 +213,16 @@ excluded_tests = {
         # intermittent failures on various systems
         # See https://github.com/easybuilders/easybuild-easyconfigs/issues/17712
         'distributed/rpc/test_tensorpipe_agent',
+        # Broken test, can't ever succeed, see https://github.com/pytorch/pytorch/issues/122184
+        'distributed/tensor/parallel/test_tp_random_state',
+        # failures on OmniPath systems, which don't support some optional InfiniBand features
+        # See https://github.com/pytorch/tensorpipe/issues/413
+        'distributed/pipeline/sync/skip/test_gpipe',
+        'distributed/pipeline/sync/skip/test_leak',
+        'distributed/pipeline/sync/test_bugs',
+        'distributed/pipeline/sync/test_inplace',
+        'distributed/pipeline/sync/test_pipe',
+        'distributed/pipeline/sync/test_transparency',
     ]
 }
 
@@ -177,8 +230,16 @@ runtest = 'cd test && PYTHONUNBUFFERED=1 %(python)s run_test.py --continue-throu
 
 # Especially test_quantization has a few corner cases that are triggered by the random input values,
 # those cannot be easily avoided, see https://github.com/pytorch/pytorch/issues/107030
+# test_nn is also prone to spurious failures: https://github.com/pytorch/pytorch/issues/118294
 # So allow a low number of tests to fail as the tests "usually" succeed
-max_failed_tests = 2
+max_failed_tests = 10
+
+# The readelf sanity check command can be taken out once the TestRPATH test from
+# https://github.com/pytorch/pytorch/pull/122318 is accepted, since it is then checked as part of the PyTorch test suite
+local_libcaffe2 = "$EBROOTPYTORCH/lib/python%%(pyshortver)s/site-packages/torch/lib/libcaffe2_nvrtc.%s" % SHLIB_EXT
+sanity_check_commands = [
+    "readelf -d %s | egrep 'RPATH|RUNPATH' | grep -v stubs" % local_libcaffe2,
+]
 
 tests = ['PyTorch-check-cpp-extension.py']
 
Diff against PyTorch-2.1.2-foss-2023a-CUDA-12.1.1.eb

easybuild/easyconfigs/p/PyTorch/PyTorch-2.1.2-foss-2023a-CUDA-12.1.1.eb

diff --git a/easybuild/easyconfigs/p/PyTorch/PyTorch-2.1.2-foss-2023a-CUDA-12.1.1.eb b/easybuild/easyconfigs/p/PyTorch/PyTorch-2.1.2-foss-2022b-CUDA-12.0.0.eb
index 65dfced170..9cbcda474f 100644
--- a/easybuild/easyconfigs/p/PyTorch/PyTorch-2.1.2-foss-2023a-CUDA-12.1.1.eb
+++ b/easybuild/easyconfigs/p/PyTorch/PyTorch-2.1.2-foss-2022b-CUDA-12.0.0.eb
@@ -6,7 +6,7 @@ homepage = 'https://pytorch.org/'
 description = """Tensors and Dynamic neural networks in Python with strong GPU acceleration.
 PyTorch is a deep learning framework that puts Python first."""
 
-toolchain = {'name': 'foss', 'version': '2023a'}
+toolchain = {'name': 'foss', 'version': '2022b'}
 
 source_urls = [GITHUB_RELEASE]
 sources = ['%(namelower)s-v%(version)s.tar.gz']
@@ -47,9 +47,12 @@ patches = [
     'PyTorch-2.1.2_add-cuda-skip-markers.patch',
     'PyTorch-2.1.2_fix-conj-mismatch-test-failures.patch',
     'PyTorch-2.1.2_fix-device-mesh-check.patch',
+    'PyTorch-2.1.2_fix-fsdp-tp-integration-test.patch',
     'PyTorch-2.1.2_fix-locale-issue-in-nvrtcCompileProgram.patch',
+    'PyTorch-2.1.2_fix-test_cuda-non-x86.patch',
     'PyTorch-2.1.2_fix-test_extension_backend-without-vectorization.patch',
     'PyTorch-2.1.2_fix-test_memory_profiler.patch',
+    'PyTorch-2.1.2_fix-test_parallelize_api.patch',
     'PyTorch-2.1.2_fix-test_torchinductor-rounding.patch',
     'PyTorch-2.1.2_fix-vsx-vector-abs.patch',
     'PyTorch-2.1.2_fix-vsx-vector-div.patch',
@@ -59,8 +62,8 @@ patches = [
     'PyTorch-2.1.2_remove-nccl-backend-default-without-gpus.patch',
     'PyTorch-2.1.2_skip-cpu_repro-test-without-vectorization.patch',
     'PyTorch-2.1.2_skip-failing-test_dtensor_ops-subtests.patch',
-    'PyTorch-2.1.2_skip-memory-leak-test.patch',
     'PyTorch-2.1.2_skip-test_fsdp_tp_checkpoint_integration.patch',
+    'PyTorch-2.1.2_skip-xfailing-test_dtensor_ops.patch',
     'PyTorch-2.1.2_workaround_dynamo_failure_without_nnpack.patch',
 ]
 checksums = [
@@ -125,12 +128,17 @@ checksums = [
     {'PyTorch-2.1.2_fix-conj-mismatch-test-failures.patch':
      'c164357efa4ce88095376e590ba508fc1daa87161e1e59544eda56daac7f2847'},
     {'PyTorch-2.1.2_fix-device-mesh-check.patch': 'c0efc288bf3d9a9a3c8bbd2691348a589a2677ea43880a8c987db91c8de4806b'},
+    {'PyTorch-2.1.2_fix-fsdp-tp-integration-test.patch':
+     'f583532c59f35f36998851957d501b3ac8c883884efd61bbaa308db55cb6bdcd'},
     {'PyTorch-2.1.2_fix-locale-issue-in-nvrtcCompileProgram.patch':
      'f7adafb4e4d3b724b93237a259797b6ed6f535f83be0e34a7b759c71c6a8ddf2'},
+    {'PyTorch-2.1.2_fix-test_cuda-non-x86.patch': '1ed76fcc87e6c50606ac286487292a3d534707068c94af74c3a5de8153fa2c2c'},
     {'PyTorch-2.1.2_fix-test_extension_backend-without-vectorization.patch':
      'cd1455495886a7d6b2d30d48736eb0103fded21e2e36de6baac719b9c52a1c92'},
     {'PyTorch-2.1.2_fix-test_memory_profiler.patch':
      '30b0c9355636c0ab3dedae02399789053825dc3835b4d7dac6e696767772b1ce'},
+    {'PyTorch-2.1.2_fix-test_parallelize_api.patch':
+     'f8387a1693af344099c806981ca38df1306d7f4847d7d44713306338384b1cfd'},
     {'PyTorch-2.1.2_fix-test_torchinductor-rounding.patch':
      'a0ef99192ee2ad1509c78a8377023d5be2b5fddb16f84063b7c9a0b53d979090'},
     {'PyTorch-2.1.2_fix-vsx-vector-abs.patch': 'd67d32407faed7dc1dbab4bba0e2f7de36c3db04560ced35c94caf8d84ade886'},
@@ -146,9 +154,10 @@ checksums = [
      '7ace835af60c58d9e0754a34c19d4b9a0c3a531f19e5d0eba8e2e49206eaa7eb'},
     {'PyTorch-2.1.2_skip-failing-test_dtensor_ops-subtests.patch':
      '6cf711bf26518550903b09ed4431de9319791e79d61aab065785d6608fd5cc88'},
-    {'PyTorch-2.1.2_skip-memory-leak-test.patch': '8d9841208e8a00a498295018aead380c360cf56e500ef23ca740adb5b36de142'},
     {'PyTorch-2.1.2_skip-test_fsdp_tp_checkpoint_integration.patch':
      '943ee92f5fd518f608a59e43fe426b9bb45d7e7ad0ba04639e516db2d61fa57d'},
+    {'PyTorch-2.1.2_skip-xfailing-test_dtensor_ops.patch':
+     '7f5befddcb006b6ab5377de6ee3c29df375c5f8ef5e42b998d35113585b983f3'},
     {'PyTorch-2.1.2_workaround_dynamo_failure_without_nnpack.patch':
      'fb96eefabf394617bbb3fbd3a7a7c1aa5991b3836edc2e5d2a30e708bfe49ba1'},
 ]
@@ -156,8 +165,8 @@ checksums = [
 osdependencies = [OS_PKG_IBVERBS_DEV]
 
 builddependencies = [
-    ('CMake', '3.26.3'),
-    ('hypothesis', '6.82.0'),
+    ('CMake', '3.24.3'),
+    ('hypothesis', '6.68.2'),
     # For tests
     ('pytest-flakefinder', '1.1.0'),
     ('pytest-rerunfailures', '12.0'),
@@ -165,27 +174,26 @@ builddependencies = [
 ]
 
 dependencies = [
-    ('CUDA', '12.1.1', '', SYSTEM),
-    ('cuDNN', '8.9.2.26', '-CUDA-%(cudaver)s', SYSTEM),
-    ('magma', '2.7.2', '-CUDA-%(cudaver)s'),
+    ('CUDA', '12.0.0', '', SYSTEM),
+    ('cuDNN', '8.8.0.121', '-CUDA-%(cudaver)s', SYSTEM),
+    ('magma', '2.7.1', '-CUDA-%(cudaver)s'),
     ('NCCL', '2.18.3', '-CUDA-%(cudaver)s'),
     ('Ninja', '1.11.1'),  # Required for JIT compilation of C++ extensions
-    ('Python', '3.11.3'),
-    ('Python-bundle-PyPI', '2023.06'),
-    ('protobuf', '24.0'),
-    ('protobuf-python', '4.24.0'),
-    ('pybind11', '2.11.1'),
-    ('SciPy-bundle', '2023.07'),
+    ('Python', '3.10.8'),
+    ('protobuf', '23.0'),
+    ('protobuf-python', '4.23.0'),
+    ('pybind11', '2.10.3'),
+    ('SciPy-bundle', '2023.02'),
     ('PyYAML', '6.0'),
     ('MPFR', '4.2.0'),
     ('GMP', '6.2.1'),
     ('numactl', '2.0.16'),
-    ('FFmpeg', '6.0'),
-    ('Pillow', '10.0.0'),
-    ('expecttest', '0.1.5'),
-    ('networkx', '3.1'),
+    ('FFmpeg', '5.1.2'),
+    ('Pillow', '9.4.0'),
+    ('expecttest', '0.1.3'),
+    ('networkx', '3.0'),
     ('sympy', '1.12'),
-    ('Z3', '4.12.2'),
+    ('Z3', '4.12.2', '-Python-%(pyver)s'),
 ]
 
 use_pip = True
@@ -224,10 +232,10 @@ runtest = 'cd test && PYTHONUNBUFFERED=1 %(python)s run_test.py --continue-throu
 # those cannot be easily avoided, see https://github.com/pytorch/pytorch/issues/107030
 # test_nn is also prone to spurious failures: https://github.com/pytorch/pytorch/issues/118294
 # So allow a low number of tests to fail as the tests "usually" succeed
-max_failed_tests = 2
+max_failed_tests = 10
 
 # The readelf sanity check command can be taken out once the TestRPATH test from
-# https://github.com/pytorch/pytorch/pull/109493 is accepted, since it is then checked as part of the PyTorch test suite
+# https://github.com/pytorch/pytorch/pull/122318 is accepted, since it is then checked as part of the PyTorch test suite
 local_libcaffe2 = "$EBROOTPYTORCH/lib/python%%(pyshortver)s/site-packages/torch/lib/libcaffe2_nvrtc.%s" % SHLIB_EXT
 sanity_check_commands = [
     "readelf -d %s | egrep 'RPATH|RUNPATH' | grep -v stubs" % local_libcaffe2,
Diff against PyTorch-2.1.2-foss-2023a.eb

easybuild/easyconfigs/p/PyTorch/PyTorch-2.1.2-foss-2023a.eb

diff --git a/easybuild/easyconfigs/p/PyTorch/PyTorch-2.1.2-foss-2023a.eb b/easybuild/easyconfigs/p/PyTorch/PyTorch-2.1.2-foss-2022b-CUDA-12.0.0.eb
index a79f709480..9cbcda474f 100644
--- a/easybuild/easyconfigs/p/PyTorch/PyTorch-2.1.2-foss-2023a.eb
+++ b/easybuild/easyconfigs/p/PyTorch/PyTorch-2.1.2-foss-2022b-CUDA-12.0.0.eb
@@ -1,11 +1,12 @@
 name = 'PyTorch'
 version = '2.1.2'
+versionsuffix = '-CUDA-%(cudaver)s'
 
 homepage = 'https://pytorch.org/'
 description = """Tensors and Dynamic neural networks in Python with strong GPU acceleration.
 PyTorch is a deep learning framework that puts Python first."""
 
-toolchain = {'name': 'foss', 'version': '2023a'}
+toolchain = {'name': 'foss', 'version': '2022b'}
 
 source_urls = [GITHUB_RELEASE]
 sources = ['%(namelower)s-v%(version)s.tar.gz']
@@ -30,6 +31,7 @@ patches = [
     'PyTorch-2.0.1_skip-test_shuffle_reproducibility.patch',
     'PyTorch-2.0.1_skip-tests-skipped-in-subprocess.patch',
     'PyTorch-2.1.0_disable-gcc12-warning.patch',
+    'PyTorch-2.1.0_disable-cudnn-tf32-for-too-strict-tests.patch',
     'PyTorch-2.1.0_fix-bufferoverflow-in-oneDNN.patch',
     'PyTorch-2.1.0_fix-test_numpy_torch_operators.patch',
     'PyTorch-2.1.0_fix-validationError-output-test.patch',
@@ -42,13 +44,26 @@ patches = [
     'PyTorch-2.1.0_skip-test_jvp_linalg_det_singular.patch',
     'PyTorch-2.1.0_skip-test_linear_fp32-without-MKL.patch',
     'PyTorch-2.1.0_skip-test_wrap_bad.patch',
+    'PyTorch-2.1.2_add-cuda-skip-markers.patch',
+    'PyTorch-2.1.2_fix-conj-mismatch-test-failures.patch',
+    'PyTorch-2.1.2_fix-device-mesh-check.patch',
+    'PyTorch-2.1.2_fix-fsdp-tp-integration-test.patch',
+    'PyTorch-2.1.2_fix-locale-issue-in-nvrtcCompileProgram.patch',
+    'PyTorch-2.1.2_fix-test_cuda-non-x86.patch',
     'PyTorch-2.1.2_fix-test_extension_backend-without-vectorization.patch',
     'PyTorch-2.1.2_fix-test_memory_profiler.patch',
+    'PyTorch-2.1.2_fix-test_parallelize_api.patch',
     'PyTorch-2.1.2_fix-test_torchinductor-rounding.patch',
     'PyTorch-2.1.2_fix-vsx-vector-abs.patch',
     'PyTorch-2.1.2_fix-vsx-vector-div.patch',
+    'PyTorch-2.1.2_fix-with_temp_dir-decorator.patch',
+    'PyTorch-2.1.2_fix-wrong-device-mesh-size-in-tests.patch',
+    'PyTorch-2.1.2_relax-cuda-tolerances.patch',
+    'PyTorch-2.1.2_remove-nccl-backend-default-without-gpus.patch',
     'PyTorch-2.1.2_skip-cpu_repro-test-without-vectorization.patch',
-    'PyTorch-2.1.2_skip-memory-leak-test.patch',
+    'PyTorch-2.1.2_skip-failing-test_dtensor_ops-subtests.patch',
+    'PyTorch-2.1.2_skip-test_fsdp_tp_checkpoint_integration.patch',
+    'PyTorch-2.1.2_skip-xfailing-test_dtensor_ops.patch',
     'PyTorch-2.1.2_workaround_dynamo_failure_without_nnpack.patch',
 ]
 checksums = [
@@ -85,6 +100,8 @@ checksums = [
     {'PyTorch-2.0.1_skip-tests-skipped-in-subprocess.patch':
      '166c134573a95230e39b9ea09ece3ad8072f39d370c9a88fb2a1e24f6aaac2b5'},
     {'PyTorch-2.1.0_disable-gcc12-warning.patch': 'c858b8db0010f41005dc06f9a50768d0d3dc2d2d499ccbdd5faf8a518869a421'},
+    {'PyTorch-2.1.0_disable-cudnn-tf32-for-too-strict-tests.patch':
+     'd895018ebdfd46e65d9f7645444a3b4c5bbfe3d533a08db559a04be34e01e478'},
     {'PyTorch-2.1.0_fix-bufferoverflow-in-oneDNN.patch':
      'b15b1291a3c37bf6a4982cfbb3483f693acb46a67bc0912b383fd98baf540ccf'},
     {'PyTorch-2.1.0_fix-test_numpy_torch_operators.patch':
@@ -107,17 +124,40 @@ checksums = [
     {'PyTorch-2.1.0_skip-test_linear_fp32-without-MKL.patch':
      '5dcc79883b6e3ec0a281a8e110db5e0a5880de843bb05653589891f16473ead5'},
     {'PyTorch-2.1.0_skip-test_wrap_bad.patch': 'b8583125ee94e553b6f77c4ab4bfa812b89416175dc7e9b7390919f3b485cb63'},
+    {'PyTorch-2.1.2_add-cuda-skip-markers.patch': 'd007d6d0cdb533e7d01f503e9055218760123a67c1841c57585385144be18c9a'},
+    {'PyTorch-2.1.2_fix-conj-mismatch-test-failures.patch':
+     'c164357efa4ce88095376e590ba508fc1daa87161e1e59544eda56daac7f2847'},
+    {'PyTorch-2.1.2_fix-device-mesh-check.patch': 'c0efc288bf3d9a9a3c8bbd2691348a589a2677ea43880a8c987db91c8de4806b'},
+    {'PyTorch-2.1.2_fix-fsdp-tp-integration-test.patch':
+     'f583532c59f35f36998851957d501b3ac8c883884efd61bbaa308db55cb6bdcd'},
+    {'PyTorch-2.1.2_fix-locale-issue-in-nvrtcCompileProgram.patch':
+     'f7adafb4e4d3b724b93237a259797b6ed6f535f83be0e34a7b759c71c6a8ddf2'},
+    {'PyTorch-2.1.2_fix-test_cuda-non-x86.patch': '1ed76fcc87e6c50606ac286487292a3d534707068c94af74c3a5de8153fa2c2c'},
     {'PyTorch-2.1.2_fix-test_extension_backend-without-vectorization.patch':
      'cd1455495886a7d6b2d30d48736eb0103fded21e2e36de6baac719b9c52a1c92'},
     {'PyTorch-2.1.2_fix-test_memory_profiler.patch':
      '30b0c9355636c0ab3dedae02399789053825dc3835b4d7dac6e696767772b1ce'},
+    {'PyTorch-2.1.2_fix-test_parallelize_api.patch':
+     'f8387a1693af344099c806981ca38df1306d7f4847d7d44713306338384b1cfd'},
     {'PyTorch-2.1.2_fix-test_torchinductor-rounding.patch':
      'a0ef99192ee2ad1509c78a8377023d5be2b5fddb16f84063b7c9a0b53d979090'},
     {'PyTorch-2.1.2_fix-vsx-vector-abs.patch': 'd67d32407faed7dc1dbab4bba0e2f7de36c3db04560ced35c94caf8d84ade886'},
     {'PyTorch-2.1.2_fix-vsx-vector-div.patch': '11f497a6892eb49b249a15320e4218e0d7ac8ae4ce67de39e4a018a064ca1acc'},
+    {'PyTorch-2.1.2_fix-with_temp_dir-decorator.patch':
+     '90bd001e034095329277d70c6facc4026b4ce6d7f8b8d6aa81c0176eeb462eb1'},
+    {'PyTorch-2.1.2_fix-wrong-device-mesh-size-in-tests.patch':
+     '07a5e4233d02fb6348872838f4d69573c777899c6f0ea4e39ae23c08660d41e5'},
+    {'PyTorch-2.1.2_relax-cuda-tolerances.patch': '554ad09787f61080fafdb84216e711e32327aa357e2a9c40bb428eb6503dee6e'},
+    {'PyTorch-2.1.2_remove-nccl-backend-default-without-gpus.patch':
+     'e6a1efe3d127fcbf4723476a7a1c01cfcf2ccb16d1fb250f478192623e8b6a15'},
     {'PyTorch-2.1.2_skip-cpu_repro-test-without-vectorization.patch':
      '7ace835af60c58d9e0754a34c19d4b9a0c3a531f19e5d0eba8e2e49206eaa7eb'},
-    {'PyTorch-2.1.2_skip-memory-leak-test.patch': '8d9841208e8a00a498295018aead380c360cf56e500ef23ca740adb5b36de142'},
+    {'PyTorch-2.1.2_skip-failing-test_dtensor_ops-subtests.patch':
+     '6cf711bf26518550903b09ed4431de9319791e79d61aab065785d6608fd5cc88'},
+    {'PyTorch-2.1.2_skip-test_fsdp_tp_checkpoint_integration.patch':
+     '943ee92f5fd518f608a59e43fe426b9bb45d7e7ad0ba04639e516db2d61fa57d'},
+    {'PyTorch-2.1.2_skip-xfailing-test_dtensor_ops.patch':
+     '7f5befddcb006b6ab5377de6ee3c29df375c5f8ef5e42b998d35113585b983f3'},
     {'PyTorch-2.1.2_workaround_dynamo_failure_without_nnpack.patch':
      'fb96eefabf394617bbb3fbd3a7a7c1aa5991b3836edc2e5d2a30e708bfe49ba1'},
 ]
@@ -125,8 +165,8 @@ checksums = [
 osdependencies = [OS_PKG_IBVERBS_DEV]
 
 builddependencies = [
-    ('CMake', '3.26.3'),
-    ('hypothesis', '6.82.0'),
+    ('CMake', '3.24.3'),
+    ('hypothesis', '6.68.2'),
     # For tests
     ('pytest-flakefinder', '1.1.0'),
     ('pytest-rerunfailures', '12.0'),
@@ -134,26 +174,30 @@ builddependencies = [
 ]
 
 dependencies = [
+    ('CUDA', '12.0.0', '', SYSTEM),
+    ('cuDNN', '8.8.0.121', '-CUDA-%(cudaver)s', SYSTEM),
+    ('magma', '2.7.1', '-CUDA-%(cudaver)s'),
+    ('NCCL', '2.18.3', '-CUDA-%(cudaver)s'),
     ('Ninja', '1.11.1'),  # Required for JIT compilation of C++ extensions
-    ('Python', '3.11.3'),
-    ('Python-bundle-PyPI', '2023.06'),
-    ('protobuf', '24.0'),
-    ('protobuf-python', '4.24.0'),
-    ('pybind11', '2.11.1'),
-    ('SciPy-bundle', '2023.07'),
+    ('Python', '3.10.8'),
+    ('protobuf', '23.0'),
+    ('protobuf-python', '4.23.0'),
+    ('pybind11', '2.10.3'),
+    ('SciPy-bundle', '2023.02'),
     ('PyYAML', '6.0'),
     ('MPFR', '4.2.0'),
     ('GMP', '6.2.1'),
     ('numactl', '2.0.16'),
-    ('FFmpeg', '6.0'),
-    ('Pillow', '10.0.0'),
-    ('expecttest', '0.1.5'),
-    ('networkx', '3.1'),
+    ('FFmpeg', '5.1.2'),
+    ('Pillow', '9.4.0'),
+    ('expecttest', '0.1.3'),
+    ('networkx', '3.0'),
     ('sympy', '1.12'),
-    ('Z3', '4.12.2',),
+    ('Z3', '4.12.2', '-Python-%(pyver)s'),
 ]
 
 use_pip = True
+buildcmd = '%(python)s setup.py build'  # Run the (long) build in the build step
 
 excluded_tests = {
     '': [
@@ -169,6 +213,16 @@ excluded_tests = {
         # intermittent failures on various systems
         # See https://github.com/easybuilders/easybuild-easyconfigs/issues/17712
         'distributed/rpc/test_tensorpipe_agent',
+        # Broken test, can't ever succeed, see https://github.com/pytorch/pytorch/issues/122184
+        'distributed/tensor/parallel/test_tp_random_state',
+        # failures on OmniPath systems, which don't support some optional InfiniBand features
+        # See https://github.com/pytorch/tensorpipe/issues/413
+        'distributed/pipeline/sync/skip/test_gpipe',
+        'distributed/pipeline/sync/skip/test_leak',
+        'distributed/pipeline/sync/test_bugs',
+        'distributed/pipeline/sync/test_inplace',
+        'distributed/pipeline/sync/test_pipe',
+        'distributed/pipeline/sync/test_transparency',
     ]
 }
 
@@ -176,8 +230,16 @@ runtest = 'cd test && PYTHONUNBUFFERED=1 %(python)s run_test.py --continue-throu
 
 # Especially test_quantization has a few corner cases that are triggered by the random input values,
 # those cannot be easily avoided, see https://github.com/pytorch/pytorch/issues/107030
+# test_nn is also prone to spurious failures: https://github.com/pytorch/pytorch/issues/118294
 # So allow a low number of tests to fail as the tests "usually" succeed
-max_failed_tests = 2
+max_failed_tests = 10
+
+# The readelf sanity check command can be taken out once the TestRPATH test from
+# https://github.com/pytorch/pytorch/pull/122318 is accepted, since it is then checked as part of the PyTorch test suite
+local_libcaffe2 = "$EBROOTPYTORCH/lib/python%%(pyshortver)s/site-packages/torch/lib/libcaffe2_nvrtc.%s" % SHLIB_EXT
+sanity_check_commands = [
+    "readelf -d %s | egrep 'RPATH|RUNPATH' | grep -v stubs" % local_libcaffe2,
+]
 
 tests = ['PyTorch-check-cpp-extension.py']
 

@Flamefire marked this pull request as draft on November 28, 2024, 14:31

Test report by @Flamefire
SUCCESS
Build succeeded for 55 out of 55 (2 easyconfigs in total)
ml30 - Linux AlmaLinux 8.7 (Stone Smilodon), POWER, 8335-GTX (power9le), 6 x NVIDIA Tesla V100-SXM2-32GB, 530.30.02, Python 3.8.13
See https://gist.github.com/Flamefire/674307e6a21da75203eea9819bec205c for a full test report.

Test report by @Flamefire
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
i8034 - Linux Rocky Linux 8.9 (Green Obsidian), x86_64, AMD EPYC 7352 24-Core Processor (zen2), 8 x NVIDIA NVIDIA A100-SXM4-40GB, 555.42.06, Python 3.8.17
See https://gist.github.com/Flamefire/0015043e032f9631948d9db5be864f2c for a full test report.

Test report by @Flamefire
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
i8003 - Linux Rocky Linux 8.9 (Green Obsidian), x86_64, AMD EPYC 7352 24-Core Processor (zen2), 8 x NVIDIA NVIDIA A100-SXM4-40GB, 555.42.06, Python 3.8.17
See https://gist.github.com/Flamefire/822b64b6fdcc8ee170fc9bbd65460c02 for a full test report.
