{ai,lib}[GCCcore/12.2.0,foss/2022b] PyTorch v2.1.2, NCCL v2.18.3 w/ CUDA 12.0.0 #20520

Draft · wants to merge 4 commits into base: develop

Conversation

@Flamefire (Contributor) commented May 13, 2024

(created using eb --new-pr)
This is meant as an alternative to #20155, using a newer NCCL version, since the older one currently included in foss/2022b doesn't seem to work with PyTorch 2.1.2.

Update: It seems #20155 works now, so I'm putting this one on hold.
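
For anyone wanting to reproduce this, a minimal sketch of how the easyconfigs from this PR could be built and tested with EasyBuild (a sketch only; the exact options depend on the local EasyBuild configuration, and --upload-test-report assumes GitHub credentials are set up):

# build both easyconfigs from this PR, resolving any missing dependencies
eb --from-pr 20520 --robot
# rebuild with the test step and upload a test report to this PR
eb --from-pr 20520 --rebuild --upload-test-report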

@SebastianAchilles added this to the 4.x milestone on May 14, 2024

Test report by @SebastianAchilles
FAILED
Build succeeded for 1 out of 2 (2 easyconfigs in total)
skl-rockylinux-89 - Linux Rocky Linux 8.9, x86_64, Intel(R) Core(TM) i7-10700 CPU @ 2.90GHz (skylake), 1 x NVIDIA NVIDIA RTX A4000, 550.54.15, Python 3.6.8
See https://gist.github.com/SebastianAchilles/7ddc2f02e198c9e93730651648ea6a65 for a full test report.

Test report by @SebastianAchilles
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
jsczen3g1.int.jsc-zen3.fz-juelich.de - Linux Rocky Linux 9.4, x86_64, AMD EPYC-Milan Processor (zen3), 1 x NVIDIA NVIDIA A100 80GB PCIe, 550.54.15, Python 3.9.18
See https://gist.github.com/SebastianAchilles/caa73902c24edfc4a9f09a1104e38750 for a full test report.

Test report by @SebastianAchilles
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
skl-rockylinux-89 - Linux Rocky Linux 8.9, x86_64, Intel(R) Core(TM) i7-10700 CPU @ 2.90GHz (skylake), 1 x NVIDIA NVIDIA RTX A4000, 550.54.15, Python 3.6.8
See https://gist.github.com/SebastianAchilles/c2693ff5dacd31a35769e1bca1515fc6 for a full test report.

@Flamefire (Contributor, PR author):

> Test report by @SebastianAchilles
> FAILED
> Build succeeded for 1 out of 2 (2 easyconfigs in total)
> skl-rockylinux-89 - Linux Rocky Linux 8.9, x86_64, Intel(R) Core(TM) i7-10700 CPU @ 2.90GHz (skylake), 1 x NVIDIA NVIDIA RTX A4000, 550.54.15, Python 3.6.8
> See https://gist.github.com/SebastianAchilles/7ddc2f02e198c9e93730651648ea6a65 for a full test report.

That first one failed with:

distributed/_tensor/test_dtensor_ops 1/1 failed! Received signal: SIGSEGV

I see that failure every now and then in various tests, especially test_jit*. It seems to happen randomly; I'm not sure why.

I'll do a larger repeated run for both PRs over the weekend, so I'll have the results to compare on Tuesday (Monday is a public holiday here).
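
As a rough sketch of what such a repeated run could look like (illustrative only; the iteration count is arbitrary, and the run_test.py invocation mirrors the runtest command used in the easyconfigs, assuming its -i/--include option):

# from the PyTorch test directory: repeat the suspect test to see how often the SIGSEGV shows up
cd test
for i in $(seq 1 20); do
    PYTHONUNBUFFERED=1 python run_test.py -i distributed/_tensor/test_dtensor_ops || echo "run $i failed"
done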

Test report by @Flamefire
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
i8002 - Linux Rocky Linux 8.9 (Green Obsidian), x86_64, AMD EPYC 7352 24-Core Processor (zen2), 8 x NVIDIA NVIDIA A100-SXM4-40GB, 555.42.06, Python 3.8.17
See https://gist.github.com/Flamefire/bb9a1cd446347299a3a18282e8c52f29 for a full test report.

github-actions bot commented Nov 22, 2024

Updated software NCCL-2.18.3-GCCcore-12.2.0-CUDA-12.0.0.eb

Diff against NCCL-2.22.3-GCCcore-13.3.0-CUDA-12.6.0.eb

easybuild/easyconfigs/n/NCCL/NCCL-2.22.3-GCCcore-13.3.0-CUDA-12.6.0.eb

diff --git a/easybuild/easyconfigs/n/NCCL/NCCL-2.22.3-GCCcore-13.3.0-CUDA-12.6.0.eb b/easybuild/easyconfigs/n/NCCL/NCCL-2.18.3-GCCcore-12.2.0-CUDA-12.0.0.eb
index 0534e538fa..a25e786210 100644
--- a/easybuild/easyconfigs/n/NCCL/NCCL-2.22.3-GCCcore-13.3.0-CUDA-12.6.0.eb
+++ b/easybuild/easyconfigs/n/NCCL/NCCL-2.18.3-GCCcore-12.2.0-CUDA-12.0.0.eb
@@ -1,23 +1,28 @@
 name = 'NCCL'
-version = '2.22.3'
+version = '2.18.3'
 versionsuffix = '-CUDA-%(cudaver)s'
 
 homepage = 'https://developer.nvidia.com/nccl'
 description = """The NVIDIA Collective Communications Library (NCCL) implements multi-GPU and multi-node collective
 communication primitives that are performance optimized for NVIDIA GPUs."""
 
-toolchain = {'name': 'GCCcore', 'version': '13.3.0'}
+toolchain = {'name': 'GCCcore', 'version': '12.2.0'}
 
 github_account = 'NVIDIA'
 source_urls = [GITHUB_SOURCE]
 sources = ['v%(version)s-1.tar.gz']
-checksums = ['45151629a9494460e73375281e8b0fe379141528879301899ece9b776faca024']
+patches = ['NCCL-2.16.2_fix-cpuid.patch']
+checksums = [
+    ('6477d83c9edbb34a0ebce6d751a1b32962bc6415d75d04972b676c6894ceaef9',
+     'b4f5d7d9eea2c12e32e7a06fe138b2cfc75969c6d5c473aa6f819a792db2fc96'),
+    {'NCCL-2.16.2_fix-cpuid.patch': '0459ecadcd32b2a7a000a2ce4f675afba908b2c0afabafde585330ff4f83e277'},
+]
 
-builddependencies = [('binutils', '2.42')]
+builddependencies = [('binutils', '2.39')]
 
 dependencies = [
-    ('CUDA', '12.6.0', '', SYSTEM),
-    ('UCX-CUDA', '1.16.0', versionsuffix),
+    ('CUDA', '12.0.0', '', SYSTEM),
+    ('UCX-CUDA', '1.13.1', versionsuffix),
 ]
 
 # default CUDA compute capabilities to use (override via --cuda-compute-capabilities)
Diff against NCCL-2.20.5-GCCcore-13.2.0-CUDA-12.4.0.eb

easybuild/easyconfigs/n/NCCL/NCCL-2.20.5-GCCcore-13.2.0-CUDA-12.4.0.eb

diff --git a/easybuild/easyconfigs/n/NCCL/NCCL-2.20.5-GCCcore-13.2.0-CUDA-12.4.0.eb b/easybuild/easyconfigs/n/NCCL/NCCL-2.18.3-GCCcore-12.2.0-CUDA-12.0.0.eb
index 90634952ad..a25e786210 100644
--- a/easybuild/easyconfigs/n/NCCL/NCCL-2.20.5-GCCcore-13.2.0-CUDA-12.4.0.eb
+++ b/easybuild/easyconfigs/n/NCCL/NCCL-2.18.3-GCCcore-12.2.0-CUDA-12.0.0.eb
@@ -1,23 +1,28 @@
 name = 'NCCL'
-version = '2.20.5'
+version = '2.18.3'
 versionsuffix = '-CUDA-%(cudaver)s'
 
 homepage = 'https://developer.nvidia.com/nccl'
 description = """The NVIDIA Collective Communications Library (NCCL) implements multi-GPU and multi-node collective
 communication primitives that are performance optimized for NVIDIA GPUs."""
 
-toolchain = {'name': 'GCCcore', 'version': '13.2.0'}
+toolchain = {'name': 'GCCcore', 'version': '12.2.0'}
 
 github_account = 'NVIDIA'
 source_urls = [GITHUB_SOURCE]
 sources = ['v%(version)s-1.tar.gz']
-checksums = ['d11ad65c1df3cbe4447eaddceec71569f5c0497e27b3b8369cf79f18d2b2ad8c']
+patches = ['NCCL-2.16.2_fix-cpuid.patch']
+checksums = [
+    ('6477d83c9edbb34a0ebce6d751a1b32962bc6415d75d04972b676c6894ceaef9',
+     'b4f5d7d9eea2c12e32e7a06fe138b2cfc75969c6d5c473aa6f819a792db2fc96'),
+    {'NCCL-2.16.2_fix-cpuid.patch': '0459ecadcd32b2a7a000a2ce4f675afba908b2c0afabafde585330ff4f83e277'},
+]
 
-builddependencies = [('binutils', '2.40')]
+builddependencies = [('binutils', '2.39')]
 
 dependencies = [
-    ('CUDA', '12.4.0', '', SYSTEM),
-    ('UCX-CUDA', '1.15.0', versionsuffix),
+    ('CUDA', '12.0.0', '', SYSTEM),
+    ('UCX-CUDA', '1.13.1', versionsuffix),
 ]
 
 # default CUDA compute capabilities to use (override via --cuda-compute-capabilities)
Diff against NCCL-2.16.2-GCCcore-12.2.0-CUDA-11.7.0.eb

easybuild/easyconfigs/n/NCCL/NCCL-2.16.2-GCCcore-12.2.0-CUDA-11.7.0.eb

diff --git a/easybuild/easyconfigs/n/NCCL/NCCL-2.16.2-GCCcore-12.2.0-CUDA-11.7.0.eb b/easybuild/easyconfigs/n/NCCL/NCCL-2.18.3-GCCcore-12.2.0-CUDA-12.0.0.eb
index ebbd822138..a25e786210 100644
--- a/easybuild/easyconfigs/n/NCCL/NCCL-2.16.2-GCCcore-12.2.0-CUDA-11.7.0.eb
+++ b/easybuild/easyconfigs/n/NCCL/NCCL-2.18.3-GCCcore-12.2.0-CUDA-12.0.0.eb
@@ -1,5 +1,5 @@
 name = 'NCCL'
-version = '2.16.2'
+version = '2.18.3'
 versionsuffix = '-CUDA-%(cudaver)s'
 
 homepage = 'https://developer.nvidia.com/nccl'
@@ -13,21 +13,19 @@ source_urls = [GITHUB_SOURCE]
 sources = ['v%(version)s-1.tar.gz']
 patches = ['NCCL-2.16.2_fix-cpuid.patch']
 checksums = [
-    {'v2.16.2-1.tar.gz': '7f7c738511a8876403fc574d13d48e7c250d934d755598d82e14bab12236fc64'},
+    ('6477d83c9edbb34a0ebce6d751a1b32962bc6415d75d04972b676c6894ceaef9',
+     'b4f5d7d9eea2c12e32e7a06fe138b2cfc75969c6d5c473aa6f819a792db2fc96'),
     {'NCCL-2.16.2_fix-cpuid.patch': '0459ecadcd32b2a7a000a2ce4f675afba908b2c0afabafde585330ff4f83e277'},
 ]
 
 builddependencies = [('binutils', '2.39')]
 
 dependencies = [
-    ('CUDA', '11.7.0', '', SYSTEM),
+    ('CUDA', '12.0.0', '', SYSTEM),
     ('UCX-CUDA', '1.13.1', versionsuffix),
 ]
 
-prebuildopts = "sed -i 's/NVCUFLAGS  := /NVCUFLAGS  := -allow-unsupported-compiler /' makefiles/common.mk && "
-buildopts = "VERBOSE=1"
-
 # default CUDA compute capabilities to use (override via --cuda-compute-capabilities)
-cuda_compute_capabilities = ['3.5', '5.0', '6.0', '7.0', '7.5', '8.0', '8.6']
+cuda_compute_capabilities = ['5.0', '6.0', '7.0', '7.5', '8.0', '8.6', '9.0']
 
 moduleclass = 'lib'

Updated software PyTorch-2.1.2-foss-2022b-CUDA-12.0.0.eb

Diff against PyTorch-2.1.2-foss-2023b.eb

easybuild/easyconfigs/p/PyTorch/PyTorch-2.1.2-foss-2023b.eb

diff --git a/easybuild/easyconfigs/p/PyTorch/PyTorch-2.1.2-foss-2023b.eb b/easybuild/easyconfigs/p/PyTorch/PyTorch-2.1.2-foss-2022b-CUDA-12.0.0.eb
index bce1b68aa7..9cbcda474f 100644
--- a/easybuild/easyconfigs/p/PyTorch/PyTorch-2.1.2-foss-2023b.eb
+++ b/easybuild/easyconfigs/p/PyTorch/PyTorch-2.1.2-foss-2022b-CUDA-12.0.0.eb
@@ -1,11 +1,12 @@
 name = 'PyTorch'
 version = '2.1.2'
+versionsuffix = '-CUDA-%(cudaver)s'
 
 homepage = 'https://pytorch.org/'
 description = """Tensors and Dynamic neural networks in Python with strong GPU acceleration.
 PyTorch is a deep learning framework that puts Python first."""
 
-toolchain = {'name': 'foss', 'version': '2023b'}
+toolchain = {'name': 'foss', 'version': '2022b'}
 
 source_urls = [GITHUB_RELEASE]
 sources = ['%(namelower)s-v%(version)s.tar.gz']
@@ -30,6 +31,7 @@ patches = [
     'PyTorch-2.0.1_skip-test_shuffle_reproducibility.patch',
     'PyTorch-2.0.1_skip-tests-skipped-in-subprocess.patch',
     'PyTorch-2.1.0_disable-gcc12-warning.patch',
+    'PyTorch-2.1.0_disable-cudnn-tf32-for-too-strict-tests.patch',
     'PyTorch-2.1.0_fix-bufferoverflow-in-oneDNN.patch',
     'PyTorch-2.1.0_fix-test_numpy_torch_operators.patch',
     'PyTorch-2.1.0_fix-validationError-output-test.patch',
@@ -42,13 +44,26 @@ patches = [
     'PyTorch-2.1.0_skip-test_jvp_linalg_det_singular.patch',
     'PyTorch-2.1.0_skip-test_linear_fp32-without-MKL.patch',
     'PyTorch-2.1.0_skip-test_wrap_bad.patch',
+    'PyTorch-2.1.2_add-cuda-skip-markers.patch',
+    'PyTorch-2.1.2_fix-conj-mismatch-test-failures.patch',
+    'PyTorch-2.1.2_fix-device-mesh-check.patch',
+    'PyTorch-2.1.2_fix-fsdp-tp-integration-test.patch',
+    'PyTorch-2.1.2_fix-locale-issue-in-nvrtcCompileProgram.patch',
+    'PyTorch-2.1.2_fix-test_cuda-non-x86.patch',
     'PyTorch-2.1.2_fix-test_extension_backend-without-vectorization.patch',
     'PyTorch-2.1.2_fix-test_memory_profiler.patch',
+    'PyTorch-2.1.2_fix-test_parallelize_api.patch',
     'PyTorch-2.1.2_fix-test_torchinductor-rounding.patch',
     'PyTorch-2.1.2_fix-vsx-vector-abs.patch',
     'PyTorch-2.1.2_fix-vsx-vector-div.patch',
+    'PyTorch-2.1.2_fix-with_temp_dir-decorator.patch',
+    'PyTorch-2.1.2_fix-wrong-device-mesh-size-in-tests.patch',
+    'PyTorch-2.1.2_relax-cuda-tolerances.patch',
+    'PyTorch-2.1.2_remove-nccl-backend-default-without-gpus.patch',
     'PyTorch-2.1.2_skip-cpu_repro-test-without-vectorization.patch',
-    'PyTorch-2.1.2_skip-memory-leak-test.patch',
+    'PyTorch-2.1.2_skip-failing-test_dtensor_ops-subtests.patch',
+    'PyTorch-2.1.2_skip-test_fsdp_tp_checkpoint_integration.patch',
+    'PyTorch-2.1.2_skip-xfailing-test_dtensor_ops.patch',
     'PyTorch-2.1.2_workaround_dynamo_failure_without_nnpack.patch',
 ]
 checksums = [
@@ -85,6 +100,8 @@ checksums = [
     {'PyTorch-2.0.1_skip-tests-skipped-in-subprocess.patch':
      '166c134573a95230e39b9ea09ece3ad8072f39d370c9a88fb2a1e24f6aaac2b5'},
     {'PyTorch-2.1.0_disable-gcc12-warning.patch': 'c858b8db0010f41005dc06f9a50768d0d3dc2d2d499ccbdd5faf8a518869a421'},
+    {'PyTorch-2.1.0_disable-cudnn-tf32-for-too-strict-tests.patch':
+     'd895018ebdfd46e65d9f7645444a3b4c5bbfe3d533a08db559a04be34e01e478'},
     {'PyTorch-2.1.0_fix-bufferoverflow-in-oneDNN.patch':
      'b15b1291a3c37bf6a4982cfbb3483f693acb46a67bc0912b383fd98baf540ccf'},
     {'PyTorch-2.1.0_fix-test_numpy_torch_operators.patch':
@@ -107,17 +124,40 @@ checksums = [
     {'PyTorch-2.1.0_skip-test_linear_fp32-without-MKL.patch':
      '5dcc79883b6e3ec0a281a8e110db5e0a5880de843bb05653589891f16473ead5'},
     {'PyTorch-2.1.0_skip-test_wrap_bad.patch': 'b8583125ee94e553b6f77c4ab4bfa812b89416175dc7e9b7390919f3b485cb63'},
+    {'PyTorch-2.1.2_add-cuda-skip-markers.patch': 'd007d6d0cdb533e7d01f503e9055218760123a67c1841c57585385144be18c9a'},
+    {'PyTorch-2.1.2_fix-conj-mismatch-test-failures.patch':
+     'c164357efa4ce88095376e590ba508fc1daa87161e1e59544eda56daac7f2847'},
+    {'PyTorch-2.1.2_fix-device-mesh-check.patch': 'c0efc288bf3d9a9a3c8bbd2691348a589a2677ea43880a8c987db91c8de4806b'},
+    {'PyTorch-2.1.2_fix-fsdp-tp-integration-test.patch':
+     'f583532c59f35f36998851957d501b3ac8c883884efd61bbaa308db55cb6bdcd'},
+    {'PyTorch-2.1.2_fix-locale-issue-in-nvrtcCompileProgram.patch':
+     'f7adafb4e4d3b724b93237a259797b6ed6f535f83be0e34a7b759c71c6a8ddf2'},
+    {'PyTorch-2.1.2_fix-test_cuda-non-x86.patch': '1ed76fcc87e6c50606ac286487292a3d534707068c94af74c3a5de8153fa2c2c'},
     {'PyTorch-2.1.2_fix-test_extension_backend-without-vectorization.patch':
      'cd1455495886a7d6b2d30d48736eb0103fded21e2e36de6baac719b9c52a1c92'},
     {'PyTorch-2.1.2_fix-test_memory_profiler.patch':
      '30b0c9355636c0ab3dedae02399789053825dc3835b4d7dac6e696767772b1ce'},
+    {'PyTorch-2.1.2_fix-test_parallelize_api.patch':
+     'f8387a1693af344099c806981ca38df1306d7f4847d7d44713306338384b1cfd'},
     {'PyTorch-2.1.2_fix-test_torchinductor-rounding.patch':
      'a0ef99192ee2ad1509c78a8377023d5be2b5fddb16f84063b7c9a0b53d979090'},
     {'PyTorch-2.1.2_fix-vsx-vector-abs.patch': 'd67d32407faed7dc1dbab4bba0e2f7de36c3db04560ced35c94caf8d84ade886'},
     {'PyTorch-2.1.2_fix-vsx-vector-div.patch': '11f497a6892eb49b249a15320e4218e0d7ac8ae4ce67de39e4a018a064ca1acc'},
+    {'PyTorch-2.1.2_fix-with_temp_dir-decorator.patch':
+     '90bd001e034095329277d70c6facc4026b4ce6d7f8b8d6aa81c0176eeb462eb1'},
+    {'PyTorch-2.1.2_fix-wrong-device-mesh-size-in-tests.patch':
+     '07a5e4233d02fb6348872838f4d69573c777899c6f0ea4e39ae23c08660d41e5'},
+    {'PyTorch-2.1.2_relax-cuda-tolerances.patch': '554ad09787f61080fafdb84216e711e32327aa357e2a9c40bb428eb6503dee6e'},
+    {'PyTorch-2.1.2_remove-nccl-backend-default-without-gpus.patch':
+     'e6a1efe3d127fcbf4723476a7a1c01cfcf2ccb16d1fb250f478192623e8b6a15'},
     {'PyTorch-2.1.2_skip-cpu_repro-test-without-vectorization.patch':
      '7ace835af60c58d9e0754a34c19d4b9a0c3a531f19e5d0eba8e2e49206eaa7eb'},
-    {'PyTorch-2.1.2_skip-memory-leak-test.patch': '8d9841208e8a00a498295018aead380c360cf56e500ef23ca740adb5b36de142'},
+    {'PyTorch-2.1.2_skip-failing-test_dtensor_ops-subtests.patch':
+     '6cf711bf26518550903b09ed4431de9319791e79d61aab065785d6608fd5cc88'},
+    {'PyTorch-2.1.2_skip-test_fsdp_tp_checkpoint_integration.patch':
+     '943ee92f5fd518f608a59e43fe426b9bb45d7e7ad0ba04639e516db2d61fa57d'},
+    {'PyTorch-2.1.2_skip-xfailing-test_dtensor_ops.patch':
+     '7f5befddcb006b6ab5377de6ee3c29df375c5f8ef5e42b998d35113585b983f3'},
     {'PyTorch-2.1.2_workaround_dynamo_failure_without_nnpack.patch':
      'fb96eefabf394617bbb3fbd3a7a7c1aa5991b3836edc2e5d2a30e708bfe49ba1'},
 ]
@@ -125,32 +165,35 @@ checksums = [
 osdependencies = [OS_PKG_IBVERBS_DEV]
 
 builddependencies = [
-    ('CMake', '3.27.6'),
-    ('hypothesis', '6.90.0'),
+    ('CMake', '3.24.3'),
+    ('hypothesis', '6.68.2'),
     # For tests
     ('pytest-flakefinder', '1.1.0'),
-    ('pytest-rerunfailures', '14.0'),
+    ('pytest-rerunfailures', '12.0'),
     ('pytest-shard', '0.1.2'),
 ]
 
 dependencies = [
+    ('CUDA', '12.0.0', '', SYSTEM),
+    ('cuDNN', '8.8.0.121', '-CUDA-%(cudaver)s', SYSTEM),
+    ('magma', '2.7.1', '-CUDA-%(cudaver)s'),
+    ('NCCL', '2.18.3', '-CUDA-%(cudaver)s'),
     ('Ninja', '1.11.1'),  # Required for JIT compilation of C++ extensions
-    ('Python', '3.11.5'),
-    ('Python-bundle-PyPI', '2023.10'),
-    ('protobuf', '25.3'),
-    ('protobuf-python', '4.25.3'),
-    ('pybind11', '2.11.1'),
-    ('SciPy-bundle', '2023.11'),
-    ('PyYAML', '6.0.1'),
-    ('MPFR', '4.2.1'),
-    ('GMP', '6.3.0'),
+    ('Python', '3.10.8'),
+    ('protobuf', '23.0'),
+    ('protobuf-python', '4.23.0'),
+    ('pybind11', '2.10.3'),
+    ('SciPy-bundle', '2023.02'),
+    ('PyYAML', '6.0'),
+    ('MPFR', '4.2.0'),
+    ('GMP', '6.2.1'),
     ('numactl', '2.0.16'),
-    ('FFmpeg', '6.0'),
-    ('Pillow', '10.2.0'),
-    ('expecttest', '0.2.1'),
-    ('networkx', '3.2.1'),
+    ('FFmpeg', '5.1.2'),
+    ('Pillow', '9.4.0'),
+    ('expecttest', '0.1.3'),
+    ('networkx', '3.0'),
     ('sympy', '1.12'),
-    ('Z3', '4.13.0',),
+    ('Z3', '4.12.2', '-Python-%(pyver)s'),
 ]
 
 use_pip = True
@@ -170,6 +213,16 @@ excluded_tests = {
         # intermittent failures on various systems
         # See https://github.com/easybuilders/easybuild-easyconfigs/issues/17712
         'distributed/rpc/test_tensorpipe_agent',
+        # Broken test, can't ever succeed, see https://github.com/pytorch/pytorch/issues/122184
+        'distributed/tensor/parallel/test_tp_random_state',
+        # failures on OmniPath systems, which don't support some optional InfiniBand features
+        # See https://github.com/pytorch/tensorpipe/issues/413
+        'distributed/pipeline/sync/skip/test_gpipe',
+        'distributed/pipeline/sync/skip/test_leak',
+        'distributed/pipeline/sync/test_bugs',
+        'distributed/pipeline/sync/test_inplace',
+        'distributed/pipeline/sync/test_pipe',
+        'distributed/pipeline/sync/test_transparency',
     ]
 }
 
@@ -177,8 +230,16 @@ runtest = 'cd test && PYTHONUNBUFFERED=1 %(python)s run_test.py --continue-throu
 
 # Especially test_quantization has a few corner cases that are triggered by the random input values,
 # those cannot be easily avoided, see https://github.com/pytorch/pytorch/issues/107030
+# test_nn is also prone to spurious failures: https://github.com/pytorch/pytorch/issues/118294
 # So allow a low number of tests to fail as the tests "usually" succeed
-max_failed_tests = 2
+max_failed_tests = 10
+
+# The readelf sanity check command can be taken out once the TestRPATH test from
+# https://github.com/pytorch/pytorch/pull/122318 is accepted, since it is then checked as part of the PyTorch test suite
+local_libcaffe2 = "$EBROOTPYTORCH/lib/python%%(pyshortver)s/site-packages/torch/lib/libcaffe2_nvrtc.%s" % SHLIB_EXT
+sanity_check_commands = [
+    "readelf -d %s | egrep 'RPATH|RUNPATH' | grep -v stubs" % local_libcaffe2,
+]
 
 tests = ['PyTorch-check-cpp-extension.py']
 
Diff against PyTorch-2.1.2-foss-2023a-CUDA-12.1.1.eb

easybuild/easyconfigs/p/PyTorch/PyTorch-2.1.2-foss-2023a-CUDA-12.1.1.eb

diff --git a/easybuild/easyconfigs/p/PyTorch/PyTorch-2.1.2-foss-2023a-CUDA-12.1.1.eb b/easybuild/easyconfigs/p/PyTorch/PyTorch-2.1.2-foss-2022b-CUDA-12.0.0.eb
index 65dfced170..9cbcda474f 100644
--- a/easybuild/easyconfigs/p/PyTorch/PyTorch-2.1.2-foss-2023a-CUDA-12.1.1.eb
+++ b/easybuild/easyconfigs/p/PyTorch/PyTorch-2.1.2-foss-2022b-CUDA-12.0.0.eb
@@ -6,7 +6,7 @@ homepage = 'https://pytorch.org/'
 description = """Tensors and Dynamic neural networks in Python with strong GPU acceleration.
 PyTorch is a deep learning framework that puts Python first."""
 
-toolchain = {'name': 'foss', 'version': '2023a'}
+toolchain = {'name': 'foss', 'version': '2022b'}
 
 source_urls = [GITHUB_RELEASE]
 sources = ['%(namelower)s-v%(version)s.tar.gz']
@@ -47,9 +47,12 @@ patches = [
     'PyTorch-2.1.2_add-cuda-skip-markers.patch',
     'PyTorch-2.1.2_fix-conj-mismatch-test-failures.patch',
     'PyTorch-2.1.2_fix-device-mesh-check.patch',
+    'PyTorch-2.1.2_fix-fsdp-tp-integration-test.patch',
     'PyTorch-2.1.2_fix-locale-issue-in-nvrtcCompileProgram.patch',
+    'PyTorch-2.1.2_fix-test_cuda-non-x86.patch',
     'PyTorch-2.1.2_fix-test_extension_backend-without-vectorization.patch',
     'PyTorch-2.1.2_fix-test_memory_profiler.patch',
+    'PyTorch-2.1.2_fix-test_parallelize_api.patch',
     'PyTorch-2.1.2_fix-test_torchinductor-rounding.patch',
     'PyTorch-2.1.2_fix-vsx-vector-abs.patch',
     'PyTorch-2.1.2_fix-vsx-vector-div.patch',
@@ -59,8 +62,8 @@ patches = [
     'PyTorch-2.1.2_remove-nccl-backend-default-without-gpus.patch',
     'PyTorch-2.1.2_skip-cpu_repro-test-without-vectorization.patch',
     'PyTorch-2.1.2_skip-failing-test_dtensor_ops-subtests.patch',
-    'PyTorch-2.1.2_skip-memory-leak-test.patch',
     'PyTorch-2.1.2_skip-test_fsdp_tp_checkpoint_integration.patch',
+    'PyTorch-2.1.2_skip-xfailing-test_dtensor_ops.patch',
     'PyTorch-2.1.2_workaround_dynamo_failure_without_nnpack.patch',
 ]
 checksums = [
@@ -125,12 +128,17 @@ checksums = [
     {'PyTorch-2.1.2_fix-conj-mismatch-test-failures.patch':
      'c164357efa4ce88095376e590ba508fc1daa87161e1e59544eda56daac7f2847'},
     {'PyTorch-2.1.2_fix-device-mesh-check.patch': 'c0efc288bf3d9a9a3c8bbd2691348a589a2677ea43880a8c987db91c8de4806b'},
+    {'PyTorch-2.1.2_fix-fsdp-tp-integration-test.patch':
+     'f583532c59f35f36998851957d501b3ac8c883884efd61bbaa308db55cb6bdcd'},
     {'PyTorch-2.1.2_fix-locale-issue-in-nvrtcCompileProgram.patch':
      'f7adafb4e4d3b724b93237a259797b6ed6f535f83be0e34a7b759c71c6a8ddf2'},
+    {'PyTorch-2.1.2_fix-test_cuda-non-x86.patch': '1ed76fcc87e6c50606ac286487292a3d534707068c94af74c3a5de8153fa2c2c'},
     {'PyTorch-2.1.2_fix-test_extension_backend-without-vectorization.patch':
      'cd1455495886a7d6b2d30d48736eb0103fded21e2e36de6baac719b9c52a1c92'},
     {'PyTorch-2.1.2_fix-test_memory_profiler.patch':
      '30b0c9355636c0ab3dedae02399789053825dc3835b4d7dac6e696767772b1ce'},
+    {'PyTorch-2.1.2_fix-test_parallelize_api.patch':
+     'f8387a1693af344099c806981ca38df1306d7f4847d7d44713306338384b1cfd'},
     {'PyTorch-2.1.2_fix-test_torchinductor-rounding.patch':
      'a0ef99192ee2ad1509c78a8377023d5be2b5fddb16f84063b7c9a0b53d979090'},
     {'PyTorch-2.1.2_fix-vsx-vector-abs.patch': 'd67d32407faed7dc1dbab4bba0e2f7de36c3db04560ced35c94caf8d84ade886'},
@@ -146,9 +154,10 @@ checksums = [
      '7ace835af60c58d9e0754a34c19d4b9a0c3a531f19e5d0eba8e2e49206eaa7eb'},
     {'PyTorch-2.1.2_skip-failing-test_dtensor_ops-subtests.patch':
      '6cf711bf26518550903b09ed4431de9319791e79d61aab065785d6608fd5cc88'},
-    {'PyTorch-2.1.2_skip-memory-leak-test.patch': '8d9841208e8a00a498295018aead380c360cf56e500ef23ca740adb5b36de142'},
     {'PyTorch-2.1.2_skip-test_fsdp_tp_checkpoint_integration.patch':
      '943ee92f5fd518f608a59e43fe426b9bb45d7e7ad0ba04639e516db2d61fa57d'},
+    {'PyTorch-2.1.2_skip-xfailing-test_dtensor_ops.patch':
+     '7f5befddcb006b6ab5377de6ee3c29df375c5f8ef5e42b998d35113585b983f3'},
     {'PyTorch-2.1.2_workaround_dynamo_failure_without_nnpack.patch':
      'fb96eefabf394617bbb3fbd3a7a7c1aa5991b3836edc2e5d2a30e708bfe49ba1'},
 ]
@@ -156,8 +165,8 @@ checksums = [
 osdependencies = [OS_PKG_IBVERBS_DEV]
 
 builddependencies = [
-    ('CMake', '3.26.3'),
-    ('hypothesis', '6.82.0'),
+    ('CMake', '3.24.3'),
+    ('hypothesis', '6.68.2'),
     # For tests
     ('pytest-flakefinder', '1.1.0'),
     ('pytest-rerunfailures', '12.0'),
@@ -165,27 +174,26 @@ builddependencies = [
 ]
 
 dependencies = [
-    ('CUDA', '12.1.1', '', SYSTEM),
-    ('cuDNN', '8.9.2.26', '-CUDA-%(cudaver)s', SYSTEM),
-    ('magma', '2.7.2', '-CUDA-%(cudaver)s'),
+    ('CUDA', '12.0.0', '', SYSTEM),
+    ('cuDNN', '8.8.0.121', '-CUDA-%(cudaver)s', SYSTEM),
+    ('magma', '2.7.1', '-CUDA-%(cudaver)s'),
     ('NCCL', '2.18.3', '-CUDA-%(cudaver)s'),
     ('Ninja', '1.11.1'),  # Required for JIT compilation of C++ extensions
-    ('Python', '3.11.3'),
-    ('Python-bundle-PyPI', '2023.06'),
-    ('protobuf', '24.0'),
-    ('protobuf-python', '4.24.0'),
-    ('pybind11', '2.11.1'),
-    ('SciPy-bundle', '2023.07'),
+    ('Python', '3.10.8'),
+    ('protobuf', '23.0'),
+    ('protobuf-python', '4.23.0'),
+    ('pybind11', '2.10.3'),
+    ('SciPy-bundle', '2023.02'),
     ('PyYAML', '6.0'),
     ('MPFR', '4.2.0'),
     ('GMP', '6.2.1'),
     ('numactl', '2.0.16'),
-    ('FFmpeg', '6.0'),
-    ('Pillow', '10.0.0'),
-    ('expecttest', '0.1.5'),
-    ('networkx', '3.1'),
+    ('FFmpeg', '5.1.2'),
+    ('Pillow', '9.4.0'),
+    ('expecttest', '0.1.3'),
+    ('networkx', '3.0'),
     ('sympy', '1.12'),
-    ('Z3', '4.12.2'),
+    ('Z3', '4.12.2', '-Python-%(pyver)s'),
 ]
 
 use_pip = True
@@ -224,10 +232,10 @@ runtest = 'cd test && PYTHONUNBUFFERED=1 %(python)s run_test.py --continue-throu
 # those cannot be easily avoided, see https://github.com/pytorch/pytorch/issues/107030
 # test_nn is also prone to spurious failures: https://github.com/pytorch/pytorch/issues/118294
 # So allow a low number of tests to fail as the tests "usually" succeed
-max_failed_tests = 2
+max_failed_tests = 10
 
 # The readelf sanity check command can be taken out once the TestRPATH test from
-# https://github.com/pytorch/pytorch/pull/109493 is accepted, since it is then checked as part of the PyTorch test suite
+# https://github.com/pytorch/pytorch/pull/122318 is accepted, since it is then checked as part of the PyTorch test suite
 local_libcaffe2 = "$EBROOTPYTORCH/lib/python%%(pyshortver)s/site-packages/torch/lib/libcaffe2_nvrtc.%s" % SHLIB_EXT
 sanity_check_commands = [
     "readelf -d %s | egrep 'RPATH|RUNPATH' | grep -v stubs" % local_libcaffe2,
Diff against PyTorch-2.1.2-foss-2023a.eb

easybuild/easyconfigs/p/PyTorch/PyTorch-2.1.2-foss-2023a.eb

diff --git a/easybuild/easyconfigs/p/PyTorch/PyTorch-2.1.2-foss-2023a.eb b/easybuild/easyconfigs/p/PyTorch/PyTorch-2.1.2-foss-2022b-CUDA-12.0.0.eb
index a79f709480..9cbcda474f 100644
--- a/easybuild/easyconfigs/p/PyTorch/PyTorch-2.1.2-foss-2023a.eb
+++ b/easybuild/easyconfigs/p/PyTorch/PyTorch-2.1.2-foss-2022b-CUDA-12.0.0.eb
@@ -1,11 +1,12 @@
 name = 'PyTorch'
 version = '2.1.2'
+versionsuffix = '-CUDA-%(cudaver)s'
 
 homepage = 'https://pytorch.org/'
 description = """Tensors and Dynamic neural networks in Python with strong GPU acceleration.
 PyTorch is a deep learning framework that puts Python first."""
 
-toolchain = {'name': 'foss', 'version': '2023a'}
+toolchain = {'name': 'foss', 'version': '2022b'}
 
 source_urls = [GITHUB_RELEASE]
 sources = ['%(namelower)s-v%(version)s.tar.gz']
@@ -30,6 +31,7 @@ patches = [
     'PyTorch-2.0.1_skip-test_shuffle_reproducibility.patch',
     'PyTorch-2.0.1_skip-tests-skipped-in-subprocess.patch',
     'PyTorch-2.1.0_disable-gcc12-warning.patch',
+    'PyTorch-2.1.0_disable-cudnn-tf32-for-too-strict-tests.patch',
     'PyTorch-2.1.0_fix-bufferoverflow-in-oneDNN.patch',
     'PyTorch-2.1.0_fix-test_numpy_torch_operators.patch',
     'PyTorch-2.1.0_fix-validationError-output-test.patch',
@@ -42,13 +44,26 @@ patches = [
     'PyTorch-2.1.0_skip-test_jvp_linalg_det_singular.patch',
     'PyTorch-2.1.0_skip-test_linear_fp32-without-MKL.patch',
     'PyTorch-2.1.0_skip-test_wrap_bad.patch',
+    'PyTorch-2.1.2_add-cuda-skip-markers.patch',
+    'PyTorch-2.1.2_fix-conj-mismatch-test-failures.patch',
+    'PyTorch-2.1.2_fix-device-mesh-check.patch',
+    'PyTorch-2.1.2_fix-fsdp-tp-integration-test.patch',
+    'PyTorch-2.1.2_fix-locale-issue-in-nvrtcCompileProgram.patch',
+    'PyTorch-2.1.2_fix-test_cuda-non-x86.patch',
     'PyTorch-2.1.2_fix-test_extension_backend-without-vectorization.patch',
     'PyTorch-2.1.2_fix-test_memory_profiler.patch',
+    'PyTorch-2.1.2_fix-test_parallelize_api.patch',
     'PyTorch-2.1.2_fix-test_torchinductor-rounding.patch',
     'PyTorch-2.1.2_fix-vsx-vector-abs.patch',
     'PyTorch-2.1.2_fix-vsx-vector-div.patch',
+    'PyTorch-2.1.2_fix-with_temp_dir-decorator.patch',
+    'PyTorch-2.1.2_fix-wrong-device-mesh-size-in-tests.patch',
+    'PyTorch-2.1.2_relax-cuda-tolerances.patch',
+    'PyTorch-2.1.2_remove-nccl-backend-default-without-gpus.patch',
     'PyTorch-2.1.2_skip-cpu_repro-test-without-vectorization.patch',
-    'PyTorch-2.1.2_skip-memory-leak-test.patch',
+    'PyTorch-2.1.2_skip-failing-test_dtensor_ops-subtests.patch',
+    'PyTorch-2.1.2_skip-test_fsdp_tp_checkpoint_integration.patch',
+    'PyTorch-2.1.2_skip-xfailing-test_dtensor_ops.patch',
     'PyTorch-2.1.2_workaround_dynamo_failure_without_nnpack.patch',
 ]
 checksums = [
@@ -85,6 +100,8 @@ checksums = [
     {'PyTorch-2.0.1_skip-tests-skipped-in-subprocess.patch':
      '166c134573a95230e39b9ea09ece3ad8072f39d370c9a88fb2a1e24f6aaac2b5'},
     {'PyTorch-2.1.0_disable-gcc12-warning.patch': 'c858b8db0010f41005dc06f9a50768d0d3dc2d2d499ccbdd5faf8a518869a421'},
+    {'PyTorch-2.1.0_disable-cudnn-tf32-for-too-strict-tests.patch':
+     'd895018ebdfd46e65d9f7645444a3b4c5bbfe3d533a08db559a04be34e01e478'},
     {'PyTorch-2.1.0_fix-bufferoverflow-in-oneDNN.patch':
      'b15b1291a3c37bf6a4982cfbb3483f693acb46a67bc0912b383fd98baf540ccf'},
     {'PyTorch-2.1.0_fix-test_numpy_torch_operators.patch':
@@ -107,17 +124,40 @@ checksums = [
     {'PyTorch-2.1.0_skip-test_linear_fp32-without-MKL.patch':
      '5dcc79883b6e3ec0a281a8e110db5e0a5880de843bb05653589891f16473ead5'},
     {'PyTorch-2.1.0_skip-test_wrap_bad.patch': 'b8583125ee94e553b6f77c4ab4bfa812b89416175dc7e9b7390919f3b485cb63'},
+    {'PyTorch-2.1.2_add-cuda-skip-markers.patch': 'd007d6d0cdb533e7d01f503e9055218760123a67c1841c57585385144be18c9a'},
+    {'PyTorch-2.1.2_fix-conj-mismatch-test-failures.patch':
+     'c164357efa4ce88095376e590ba508fc1daa87161e1e59544eda56daac7f2847'},
+    {'PyTorch-2.1.2_fix-device-mesh-check.patch': 'c0efc288bf3d9a9a3c8bbd2691348a589a2677ea43880a8c987db91c8de4806b'},
+    {'PyTorch-2.1.2_fix-fsdp-tp-integration-test.patch':
+     'f583532c59f35f36998851957d501b3ac8c883884efd61bbaa308db55cb6bdcd'},
+    {'PyTorch-2.1.2_fix-locale-issue-in-nvrtcCompileProgram.patch':
+     'f7adafb4e4d3b724b93237a259797b6ed6f535f83be0e34a7b759c71c6a8ddf2'},
+    {'PyTorch-2.1.2_fix-test_cuda-non-x86.patch': '1ed76fcc87e6c50606ac286487292a3d534707068c94af74c3a5de8153fa2c2c'},
     {'PyTorch-2.1.2_fix-test_extension_backend-without-vectorization.patch':
      'cd1455495886a7d6b2d30d48736eb0103fded21e2e36de6baac719b9c52a1c92'},
     {'PyTorch-2.1.2_fix-test_memory_profiler.patch':
      '30b0c9355636c0ab3dedae02399789053825dc3835b4d7dac6e696767772b1ce'},
+    {'PyTorch-2.1.2_fix-test_parallelize_api.patch':
+     'f8387a1693af344099c806981ca38df1306d7f4847d7d44713306338384b1cfd'},
     {'PyTorch-2.1.2_fix-test_torchinductor-rounding.patch':
      'a0ef99192ee2ad1509c78a8377023d5be2b5fddb16f84063b7c9a0b53d979090'},
     {'PyTorch-2.1.2_fix-vsx-vector-abs.patch': 'd67d32407faed7dc1dbab4bba0e2f7de36c3db04560ced35c94caf8d84ade886'},
     {'PyTorch-2.1.2_fix-vsx-vector-div.patch': '11f497a6892eb49b249a15320e4218e0d7ac8ae4ce67de39e4a018a064ca1acc'},
+    {'PyTorch-2.1.2_fix-with_temp_dir-decorator.patch':
+     '90bd001e034095329277d70c6facc4026b4ce6d7f8b8d6aa81c0176eeb462eb1'},
+    {'PyTorch-2.1.2_fix-wrong-device-mesh-size-in-tests.patch':
+     '07a5e4233d02fb6348872838f4d69573c777899c6f0ea4e39ae23c08660d41e5'},
+    {'PyTorch-2.1.2_relax-cuda-tolerances.patch': '554ad09787f61080fafdb84216e711e32327aa357e2a9c40bb428eb6503dee6e'},
+    {'PyTorch-2.1.2_remove-nccl-backend-default-without-gpus.patch':
+     'e6a1efe3d127fcbf4723476a7a1c01cfcf2ccb16d1fb250f478192623e8b6a15'},
     {'PyTorch-2.1.2_skip-cpu_repro-test-without-vectorization.patch':
      '7ace835af60c58d9e0754a34c19d4b9a0c3a531f19e5d0eba8e2e49206eaa7eb'},
-    {'PyTorch-2.1.2_skip-memory-leak-test.patch': '8d9841208e8a00a498295018aead380c360cf56e500ef23ca740adb5b36de142'},
+    {'PyTorch-2.1.2_skip-failing-test_dtensor_ops-subtests.patch':
+     '6cf711bf26518550903b09ed4431de9319791e79d61aab065785d6608fd5cc88'},
+    {'PyTorch-2.1.2_skip-test_fsdp_tp_checkpoint_integration.patch':
+     '943ee92f5fd518f608a59e43fe426b9bb45d7e7ad0ba04639e516db2d61fa57d'},
+    {'PyTorch-2.1.2_skip-xfailing-test_dtensor_ops.patch':
+     '7f5befddcb006b6ab5377de6ee3c29df375c5f8ef5e42b998d35113585b983f3'},
     {'PyTorch-2.1.2_workaround_dynamo_failure_without_nnpack.patch':
      'fb96eefabf394617bbb3fbd3a7a7c1aa5991b3836edc2e5d2a30e708bfe49ba1'},
 ]
@@ -125,8 +165,8 @@ checksums = [
 osdependencies = [OS_PKG_IBVERBS_DEV]
 
 builddependencies = [
-    ('CMake', '3.26.3'),
-    ('hypothesis', '6.82.0'),
+    ('CMake', '3.24.3'),
+    ('hypothesis', '6.68.2'),
     # For tests
     ('pytest-flakefinder', '1.1.0'),
     ('pytest-rerunfailures', '12.0'),
@@ -134,26 +174,30 @@ builddependencies = [
 ]
 
 dependencies = [
+    ('CUDA', '12.0.0', '', SYSTEM),
+    ('cuDNN', '8.8.0.121', '-CUDA-%(cudaver)s', SYSTEM),
+    ('magma', '2.7.1', '-CUDA-%(cudaver)s'),
+    ('NCCL', '2.18.3', '-CUDA-%(cudaver)s'),
     ('Ninja', '1.11.1'),  # Required for JIT compilation of C++ extensions
-    ('Python', '3.11.3'),
-    ('Python-bundle-PyPI', '2023.06'),
-    ('protobuf', '24.0'),
-    ('protobuf-python', '4.24.0'),
-    ('pybind11', '2.11.1'),
-    ('SciPy-bundle', '2023.07'),
+    ('Python', '3.10.8'),
+    ('protobuf', '23.0'),
+    ('protobuf-python', '4.23.0'),
+    ('pybind11', '2.10.3'),
+    ('SciPy-bundle', '2023.02'),
     ('PyYAML', '6.0'),
     ('MPFR', '4.2.0'),
     ('GMP', '6.2.1'),
     ('numactl', '2.0.16'),
-    ('FFmpeg', '6.0'),
-    ('Pillow', '10.0.0'),
-    ('expecttest', '0.1.5'),
-    ('networkx', '3.1'),
+    ('FFmpeg', '5.1.2'),
+    ('Pillow', '9.4.0'),
+    ('expecttest', '0.1.3'),
+    ('networkx', '3.0'),
     ('sympy', '1.12'),
-    ('Z3', '4.12.2',),
+    ('Z3', '4.12.2', '-Python-%(pyver)s'),
 ]
 
 use_pip = True
+buildcmd = '%(python)s setup.py build'  # Run the (long) build in the build step
 
 excluded_tests = {
     '': [
@@ -169,6 +213,16 @@ excluded_tests = {
         # intermittent failures on various systems
         # See https://github.com/easybuilders/easybuild-easyconfigs/issues/17712
         'distributed/rpc/test_tensorpipe_agent',
+        # Broken test, can't ever succeed, see https://github.com/pytorch/pytorch/issues/122184
+        'distributed/tensor/parallel/test_tp_random_state',
+        # failures on OmniPath systems, which don't support some optional InfiniBand features
+        # See https://github.com/pytorch/tensorpipe/issues/413
+        'distributed/pipeline/sync/skip/test_gpipe',
+        'distributed/pipeline/sync/skip/test_leak',
+        'distributed/pipeline/sync/test_bugs',
+        'distributed/pipeline/sync/test_inplace',
+        'distributed/pipeline/sync/test_pipe',
+        'distributed/pipeline/sync/test_transparency',
     ]
 }
 
@@ -176,8 +230,16 @@ runtest = 'cd test && PYTHONUNBUFFERED=1 %(python)s run_test.py --continue-throu
 
 # Especially test_quantization has a few corner cases that are triggered by the random input values,
 # those cannot be easily avoided, see https://github.com/pytorch/pytorch/issues/107030
+# test_nn is also prone to spurious failures: https://github.com/pytorch/pytorch/issues/118294
 # So allow a low number of tests to fail as the tests "usually" succeed
-max_failed_tests = 2
+max_failed_tests = 10
+
+# The readelf sanity check command can be taken out once the TestRPATH test from
+# https://github.com/pytorch/pytorch/pull/122318 is accepted, since it is then checked as part of the PyTorch test suite
+local_libcaffe2 = "$EBROOTPYTORCH/lib/python%%(pyshortver)s/site-packages/torch/lib/libcaffe2_nvrtc.%s" % SHLIB_EXT
+sanity_check_commands = [
+    "readelf -d %s | egrep 'RPATH|RUNPATH' | grep -v stubs" % local_libcaffe2,
+]
 
 tests = ['PyTorch-check-cpp-extension.py']
 

@Flamefire marked this pull request as draft on November 28, 2024, 14:31

Test report by @Flamefire
SUCCESS
Build succeeded for 55 out of 55 (2 easyconfigs in total)
ml30 - Linux AlmaLinux 8.7 (Stone Smilodon), POWER, 8335-GTX (power9le), 6 x NVIDIA Tesla V100-SXM2-32GB, 530.30.02, Python 3.8.13
See https://gist.github.com/Flamefire/674307e6a21da75203eea9819bec205c for a full test report.

Test report by @Flamefire
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
i8034 - Linux Rocky Linux 8.9 (Green Obsidian), x86_64, AMD EPYC 7352 24-Core Processor (zen2), 8 x NVIDIA NVIDIA A100-SXM4-40GB, 555.42.06, Python 3.8.17
See https://gist.github.com/Flamefire/0015043e032f9631948d9db5be864f2c for a full test report.

Test report by @Flamefire
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
i8003 - Linux Rocky Linux 8.9 (Green Obsidian), x86_64, AMD EPYC 7352 24-Core Processor (zen2), 8 x NVIDIA NVIDIA A100-SXM4-40GB, 555.42.06, Python 3.8.17
See https://gist.github.com/Flamefire/822b64b6fdcc8ee170fc9bbd65460c02 for a full test report.
