(WIP) HELINL=L (L for linker) helas mode: pre-compile templates into separate .o object files (using RDC for CUDA; still missing HIP) #978

Draft
wants to merge 56 commits into base: master

Conversation

valassi (Member) commented Aug 27, 2024

WIP on removing template/inline from helas (related to splitting kernels)

…FVs and for compiling them as separate object files (related to splitting kernels)
valassi self-assigned this Aug 27, 2024
valassi marked this pull request as draft on August 27, 2024 15:37
valassi added 20 commits August 28, 2024 10:37
…the P subdirectory (depends on npar) - build succeeds for cpp, link fails for cuda

ccache /usr/local/cuda-12.0/bin/nvcc  -I. -I../../src  -Xcompiler -O3 -gencode arch=compute_70,code=compute_70 -gencode arch=compute_70,code=sm_70 -lineinfo -use_fast_math -I/usr/local/cuda-12.0/include/ -DUSE_NVTX  -std=c++17  -ccbin /usr/lib64/ccache/g++ -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_FPTYPE2_DOUBLE -Xcompiler -fPIC -c -x cu CPPProcess.cc -o CPPProcess_cuda.o
ptxas fatal   : Unresolved extern function '_ZN9mg5amcGpu14helas_VVV1P0_1EPKdS1_S1_dddPd'
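The unresolved extern is expected at this stage: the device helas functions now live in a separate object file, so the cross-object __device__ symbol must be resolved at device-link time, which requires relocatable device code (RDC). A hedged sketch of the flags involved (not the actual cudacpp.mk recipe; arch, paths and the library name are simplified/illustrative):

# compile both translation units with relocatable device code (-dc is -rdc=true -c)
nvcc -arch=sm_70 -Xcompiler -fPIC -dc -x cu HelAmps.cc    -o HelAmps_cuda.o
nvcc -arch=sm_70 -Xcompiler -fPIC -dc -x cu CPPProcess.cc -o CPPProcess_cuda.o
# device-link step: this is where nvlink resolves mg5amcGpu::helas_* across objects
nvcc -arch=sm_70 -Xcompiler -fPIC -dlink CPPProcess_cuda.o HelAmps_cuda.o -o DeviceLink_cuda.o
# host link of the shared library, including the device-link object
g++ -shared -o libmg5amc_gg_ttx_cuda.so CPPProcess_cuda.o HelAmps_cuda.o DeviceLink_cuda.o -L${CUDA_HOME}/lib64 -lcudart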
…cuda tests succeed

The build issues some warnings, however:
nvlink warning : SM Arch ('sm_52') not found in './CPPProcess_cuda.o'
nvlink warning : SM Arch ('sm_52') not found in './HelAmps_cuda.o'
nvlink warning : SM Arch ('sm_52') not found in './CPPProcess_cuda.o'
nvlink warning : SM Arch ('sm_52') not found in './HelAmps_cuda.o'
…ption HELINL=L and '#ifdef MGONGPU_LINKER_HELAMPS'
…c++, a factor 3 slower for cuda...

./tput/teeThroughputX.sh -ggtt -makej -makeclean -inlLonly

diff -u --color tput/logs_ggtt_mad/log_ggtt_mad_d_inl0_hrd0.txt tput/logs_ggtt_mad/log_ggtt_mad_d_inlL_hrd0.txt

-Process                     = SIGMA_SM_GG_TTX_CUDA [nvcc 12.0.140 (gcc 11.3.1)] [inlineHel=0] [hardcodePARAM=0]
+Process                     = SIGMA_SM_GG_TTX_CUDA [nvcc 12.0.140 (gcc 11.3.1)] [inlineHel=L] [hardcodePARAM=0]
 Workflow summary            = CUD:DBL+THX:CURDEV+RMBDEV+MESDEV/none+NAVBRK
 FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
-EvtsPerSec[Rmb+ME]     (23) = ( 4.589473e+07                 )  sec^-1
-EvtsPerSec[MatrixElems] (3) = ( 1.164485e+08                 )  sec^-1
-EvtsPerSec[MECalcOnly] (3a) = ( 1.280951e+08                 )  sec^-1
-MeanMatrixElemValue         = ( 2.086689e+00 +- 3.413217e-03 )  GeV^0
-TOTAL       :     0.528239 sec
-INFO: No Floating Point Exceptions have been reported
-     2,222,057,027      cycles                           #    2.887 GHz
-     3,171,868,018      instructions                     #    1.43  insn per cycle
-       0.826440817 seconds time elapsed
-runNcu /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_tt.mad/SubProcesses/P1_gg_ttx/build.cuda_d_inl0_hrd0/check_cuda.exe -p 2048 256 1
-==PROF== Profiling "sigmaKin": launch__registers_per_thread 214
+EvtsPerSec[Rmb+ME]     (23) = ( 2.667135e+07                 )  sec^-1
+EvtsPerSec[MatrixElems] (3) = ( 4.116115e+07                 )  sec^-1
+EvtsPerSec[MECalcOnly] (3a) = ( 4.251573e+07                 )  sec^-1
+MeanMatrixElemValue         = ( 2.086689e+00 +- 3.413217e-03 )  GeV^0
+TOTAL       :     0.550450 sec
+INFO: No Floating Point Exceptions have been reported
+     2,272,219,097      cycles                           #    2.889 GHz
+     3,361,475,195      instructions                     #    1.48  insn per cycle
+       0.842685843 seconds time elapsed
+runNcu /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_tt.mad/SubProcesses/P1_gg_ttx/build.cuda_d_inlL_hrd0/check_cuda.exe -p 2048 256 1
+==PROF== Profiling "sigmaKin": launch__registers_per_thread 190
 ==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
…P* (the source is the same but it must be compiled in each P* separately)
valassi (Member, Author) commented Aug 28, 2024

The functionality is in principle complete, including the backport to CODEGEN. I will now run some functionality and performance tests.

git add *.mad/*/HelAmps.cc *.mad/*/*/HelAmps.cc *.sa/*/HelAmps.cc *.sa/*/*/HelAmps.cc
…ild failed?

./tput/teeThroughputX.sh -ggttggg -makej -makeclean -inlL

ccache /usr/local/cuda-12.0/bin/nvcc  -I. -I../../src  -Xcompiler -O3 -gencode arch=compute_70,code=compute_70 -gencode arch=compute_70,code=sm_70 -lineinfo -use_fast_math -I/usr/local/cuda-12.0/include/ -DUSE_NVTX  -std=c++17  -ccbin /usr/lib64/ccache/g++ -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_FPTYPE2_DOUBLE -DMGONGPU_INLINE_HELAMPS -Xcompiler -fPIC -c -x cu CPPProcess.cc -o build.cuda_d_inl1_hrd0/CPPProcess_cuda.o
nvcc error   : 'ptxas' died due to signal 9 (Kill signal)
make[2]: *** [cudacpp.mk:754: build.cuda_d_inl1_hrd0/CPPProcess_cuda.o] Error 9
make[2]: Leaving directory '/data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg'
make[1]: *** [makefile:142: build.cuda_d_inl1_hrd0/.cudacpplibs] Error 2
make[1]: Leaving directory '/data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg'
make: *** [makefile:282: bldcuda] Error 2
make: *** Waiting for unfinished jobs....
… build time is from cache

./tput/teeThroughputX.sh -ggttggg -makej -makeclean
…mode (use that from the previous run, not from cache)

./tput/teeThroughputX.sh -ggttggg -makej -makeclean
…factor x2 faster (c++? cuda?), runtime is 5-10% slower in C++, but 5-10% faster in cuda!?

./tput/teeThroughputX.sh -ggttggg -makej -makeclean -inlLonly

diff -u --color tput/logs_ggttggg_mad/log_ggttggg_mad_d_inlL_hrd0.txt  tput/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt
...
 On itscrd90.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: 1x Tesla V100S-PCIE-32GB]:
 =========================================================================
-runExe /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg/build.cuda_d_inlL_hrd0/check_cuda.exe -p 1 256 2 OMP=
+runExe /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg/build.cuda_d_inl0_hrd0/check_cuda.exe -p 1 256 2 OMP=
 INFO: The following Floating Point Exceptions will cause SIGFPE program aborts: FE_DIVBYZERO, FE_INVALID, FE_OVERFLOW
-Process                     = SIGMA_SM_GG_TTXGGG_CUDA [nvcc 12.0.140 (gcc 11.3.1)] [inlineHel=L] [hardcodePARAM=0]
+Process                     = SIGMA_SM_GG_TTXGGG_CUDA [nvcc 12.0.140 (gcc 11.3.1)] [inlineHel=0] [hardcodePARAM=0]
 Workflow summary            = CUD:DBL+THX:CURDEV+RMBDEV+MESDEV/none+NAVBRK
 FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
-EvtsPerSec[Rmb+ME]     (23) = ( 4.338149e+02                 )  sec^-1
-EvtsPerSec[MatrixElems] (3) = ( 4.338604e+02                 )  sec^-1
-EvtsPerSec[MECalcOnly] (3a) = ( 4.338867e+02                 )  sec^-1
-MeanMatrixElemValue         = ( 1.187066e-05 +- 9.825549e-06 )  GeV^-6
-TOTAL       :     2.242693 sec
-INFO: No Floating Point Exceptions have been reported
-     7,348,976,543      cycles                           #    2.902 GHz
-    16,466,315,526      instructions                     #    2.24  insn per cycle
-       2.591057214 seconds time elapsed
-runNcu /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg/build.cuda_d_inlL_hrd0/check_cuda.exe -p 1 256 1
+EvtsPerSec[Rmb+ME]     (23) = ( 4.063038e+02                 )  sec^-1
+EvtsPerSec[MatrixElems] (3) = ( 4.063437e+02                 )  sec^-1
+EvtsPerSec[MECalcOnly] (3a) = ( 4.063626e+02                 )  sec^-1
+MeanMatrixElemValue         = ( 1.187066e-05 +- 9.825549e-06 )  GeV^-6
+TOTAL       :     2.552546 sec
+INFO: No Floating Point Exceptions have been reported
+     7,969,059,552      cycles                           #    2.893 GHz
+    17,401,037,642      instructions                     #    2.18  insn per cycle
+       2.954791685 seconds time elapsed
+runNcu /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg/build.cuda_d_inl0_hrd0/check_cuda.exe -p 1 256 1
 ==PROF== Profiling "sigmaKin": launch__registers_per_thread 255
 ==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
...
 =========================================================================
-runExe /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg/build.512y_d_inlL_hrd0/check_cpp.exe -p 1 256 2 OMP=
+runExe /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg/build.512y_d_inl0_hrd0/check_cpp.exe -p 1 256 2 OMP=
 INFO: The following Floating Point Exceptions will cause SIGFPE program aborts: FE_DIVBYZERO, FE_INVALID, FE_OVERFLOW
-Process                     = SIGMA_SM_GG_TTXGGG_CPP [gcc 11.3.1] [inlineHel=L] [hardcodePARAM=0]
+Process                     = SIGMA_SM_GG_TTXGGG_CPP [gcc 11.3.1] [inlineHel=0] [hardcodePARAM=0]
 Workflow summary            = CPP:DBL+CXS:CURHST+RMBHST+MESHST/512y+CXVBRK
 FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
 Internal loops fptype_sv    = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
-EvtsPerSec[Rmb+ME]     (23) = ( 3.459662e+02                 )  sec^-1
-EvtsPerSec[MatrixElems] (3) = ( 3.460086e+02                 )  sec^-1
-EvtsPerSec[MECalcOnly] (3a) = ( 3.460086e+02                 )  sec^-1
+EvtsPerSec[Rmb+ME]     (23) = ( 3.835352e+02                 )  sec^-1
+EvtsPerSec[MatrixElems] (3) = ( 3.836003e+02                 )  sec^-1
+EvtsPerSec[MECalcOnly] (3a) = ( 3.836003e+02                 )  sec^-1
 MeanMatrixElemValue         = ( 1.187066e-05 +- 9.825549e-06 )  GeV^-6
-TOTAL       :     1.528240 sec
+TOTAL       :     1.378567 sec
 INFO: No Floating Point Exceptions have been reported
-     4,140,408,789      cycles                           #    2.703 GHz
-     9,072,597,595      instructions                     #    2.19  insn per cycle
-       1.532357792 seconds time elapsed
-=Symbols in CPPProcess_cpp.o= (~sse4:    0) (avx2:94048) (512y:   91) (512z:    0)
+     3,738,350,469      cycles                           #    2.705 GHz
+     8,514,195,736      instructions                     #    2.28  insn per cycle
+       1.382567882 seconds time elapsed
+=Symbols in CPPProcess_cpp.o= (~sse4:    0) (avx2:80619) (512y:   89) (512z:    0)
 -------------------------------------------------------------------------
…itscrd90 - all ok

STARTED  AT Thu Aug 29 09:00:35 PM CEST 2024
./tput/teeThroughputX.sh -mix -hrd -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -makeclean
ENDED(1) AT Thu Aug 29 11:03:48 PM CEST 2024 [Status=0]
./tput/teeThroughputX.sh -flt -hrd -makej -eemumu -ggtt -ggttgg -inlonly -makeclean
ENDED(2) AT Thu Aug 29 11:24:34 PM CEST 2024 [Status=0]
./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -gqttq -ggttgg -ggttggg -flt -bridge -makeclean
ENDED(3) AT Thu Aug 29 11:33:08 PM CEST 2024 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -rmbhst
ENDED(4) AT Thu Aug 29 11:35:56 PM CEST 2024 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -curhst
ENDED(5) AT Thu Aug 29 11:38:41 PM CEST 2024 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -common
ENDED(6) AT Thu Aug 29 11:41:32 PM CEST 2024 [Status=0]
./tput/teeThroughputX.sh -mix -hrd -makej -susyggtt -susyggt1t1 -smeftggtttt -heftggbb -makeclean
ENDED(7) AT Fri Aug 30 12:12:36 AM CEST 2024 [Status=0]
./tput/teeThroughputX.sh -inlLonly -mix -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -makeclean
ENDED(8) AT Fri Aug 30 12:48:22 AM CEST 2024 [Status=0]

Note: build times are reduced by a factor of 2 to 3 in inlL with respect to inl0 for complex processes like ggttggg
----------------
tput/logs_ggttggg_mad/log_ggttggg_mad_d_inlL_hrd0.txt
Preliminary build completed in 0d 00h 07m 12s
tput/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt
Preliminary build completed in 0d 00h 14m 20s
----------------
tput/logs_ggttggg_mad/log_ggttggg_mad_f_inlL_hrd0.txt
Preliminary build completed in 0d 00h 05m 39s
tput/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd0.txt
Preliminary build completed in 0d 00h 13m 34s
----------------
tput/logs_ggttggg_mad/log_ggttggg_mad_m_inlL_hrd0.txt
Preliminary build completed in 0d 00h 05m 55s
tput/logs_ggttggg_mad/log_ggttggg_mad_m_inl0_hrd0.txt
Preliminary build completed in 0d 00h 14m 56s
----------------

Note also: there is a runtime performance slowdown of around 10% in both CUDA and C++.
(I had previously observed that CUDA seemed faster, but that was with a small grid! Using a large grid, CUDA is also slower.)

diff -u --color tput/logs_ggttggg_mad/log_ggttggg_mad_d_inlL_hrd0.txt  tput/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt
 ------------------------------------------------
-Preliminary build completed in 0d 00h 07m 12s
+Preliminary build completed in 0d 00h 14m 20s
 ------------------------------------------------

(CUDA small grid, HELINL=L is 10% faster)
 On itscrd90.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: 1x Tesla V100S-PCIE-32GB]:
 =========================================================================
-runExe /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg/build.cuda_d_inlL_hrd0/check_cuda.exe -p 1 256 2 OMP=
+runExe /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg/build.cuda_d_inl0_hrd0/check_cuda.exe -p 1 256 2 OMP=
 INFO: The following Floating Point Exceptions will cause SIGFPE program aborts: FE_DIVBYZERO, FE_INVALID, FE_OVERFLOW
-Process                     = SIGMA_SM_GG_TTXGGG_CUDA [nvcc 12.0.140 (gcc 11.3.1)] [inlineHel=L] [hardcodePARAM=0]
+Process                     = SIGMA_SM_GG_TTXGGG_CUDA [nvcc 12.0.140 (gcc 11.3.1)] [inlineHel=0] [hardcodePARAM=0]
 Workflow summary            = CUD:DBL+THX:CURDEV+RMBDEV+MESDEV/none+NAVBRK
 FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
-EvtsPerSec[Rmb+ME]     (23) = ( 4.337724e+02                 )  sec^-1
-EvtsPerSec[MatrixElems] (3) = ( 4.338199e+02                 )  sec^-1
-EvtsPerSec[MECalcOnly] (3a) = ( 4.338376e+02                 )  sec^-1
-MeanMatrixElemValue         = ( 1.187066e-05 +- 9.825549e-06 )  GeV^-6
-TOTAL       :     2.243520 sec
-INFO: No Floating Point Exceptions have been reported
-     7,333,011,251      cycles                           #    2.895 GHz
-    16,571,702,127      instructions                     #    2.26  insn per cycle
-       2.591709636 seconds time elapsed
-runNcu /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg/build.cuda_d_inlL_hrd0/check_cuda.exe -p 1 256 1
+EvtsPerSec[Rmb+ME]     (23) = ( 4.074025e+02                 )  sec^-1
+EvtsPerSec[MatrixElems] (3) = ( 4.074408e+02                 )  sec^-1
+EvtsPerSec[MECalcOnly] (3a) = ( 4.074613e+02                 )  sec^-1
+MeanMatrixElemValue         = ( 1.187066e-05 +- 9.825549e-06 )  GeV^-6
+TOTAL       :     2.427313 sec
+INFO: No Floating Point Exceptions have been reported
+     8,007,770,360      cycles                           #    2.905 GHz
+    17,844,373,075      instructions                     #    2.23  insn per cycle
+       2.813382822 seconds time elapsed
+runNcu /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg/build.cuda_d_inl0_hrd0/check_cuda.exe -p 1 256 1
 ==PROF== Profiling "sigmaKin": launch__registers_per_thread 255
 ==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%

(CUDA large grid, HELINL=L is 10% slower)
-runExe /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg/build.cuda_d_inlL_hrd0/check_cuda.exe -p 64 256 1 OMP=
+runExe /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg/build.cuda_d_inl0_hrd0/check_cuda.exe -p 64 256 1 OMP=
 INFO: The following Floating Point Exceptions will cause SIGFPE program aborts: FE_DIVBYZERO, FE_INVALID, FE_OVERFLOW
-Process                     = SIGMA_SM_GG_TTXGGG_CUDA [nvcc 12.0.140 (gcc 11.3.1)] [inlineHel=L] [hardcodePARAM=0]
+Process                     = SIGMA_SM_GG_TTXGGG_CUDA [nvcc 12.0.140 (gcc 11.3.1)] [inlineHel=0] [hardcodePARAM=0]
 Workflow summary            = CUD:DBL+THX:CURDEV+RMBDEV+MESDEV/none+NAVBRK
 FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
-EvtsPerSec[Rmb+ME]     (23) = ( 8.489870e+03                 )  sec^-1
-EvtsPerSec[MatrixElems] (3) = ( 8.491766e+03                 )  sec^-1
-EvtsPerSec[MECalcOnly] (3a) = ( 8.491994e+03                 )  sec^-1
+EvtsPerSec[Rmb+ME]     (23) = ( 9.214624e+03                 )  sec^-1
+EvtsPerSec[MatrixElems] (3) = ( 9.216736e+03                 )  sec^-1
+EvtsPerSec[MECalcOnly] (3a) = ( 9.217011e+03                 )  sec^-1
 MeanMatrixElemValue         = ( 1.856249e-04 +- 8.329951e-05 )  GeV^-6
-TOTAL       :     4.301800 sec
+TOTAL       :     4.008082 sec
 INFO: No Floating Point Exceptions have been reported
-    13,363,583,535      cycles                           #    2.902 GHz
-    29,144,223,391      instructions                     #    2.18  insn per cycle
-       4.658949907 seconds time elapsed
+    12,658,170,825      cycles                           #    2.916 GHz
+    27,773,386,314      instructions                     #    2.19  insn per cycle
+       4.398692801 seconds time elapsed

(C++, HELINL=L is 10% slower)
-runExe /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg/build.512y_d_inlL_hrd0/check_cpp.exe -p 1 256 2 OMP=
+runExe /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg/build.512y_d_inl0_hrd0/check_cpp.exe -p 1 256 2 OMP=
 INFO: The following Floating Point Exceptions will cause SIGFPE program aborts: FE_DIVBYZERO, FE_INVALID, FE_OVERFLOW
-Process                     = SIGMA_SM_GG_TTXGGG_CPP [gcc 11.3.1] [inlineHel=L] [hardcodePARAM=0]
+Process                     = SIGMA_SM_GG_TTXGGG_CPP [gcc 11.3.1] [inlineHel=0] [hardcodePARAM=0]
 Workflow summary            = CPP:DBL+CXS:CURHST+RMBHST+MESHST/512y+CXVBRK
 FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
 Internal loops fptype_sv    = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
-EvtsPerSec[Rmb+ME]     (23) = ( 3.478898e+02                 )  sec^-1
-EvtsPerSec[MatrixElems] (3) = ( 3.479341e+02                 )  sec^-1
-EvtsPerSec[MECalcOnly] (3a) = ( 3.479341e+02                 )  sec^-1
+EvtsPerSec[Rmb+ME]     (23) = ( 3.848619e+02                 )  sec^-1
+EvtsPerSec[MatrixElems] (3) = ( 3.849166e+02                 )  sec^-1
+EvtsPerSec[MECalcOnly] (3a) = ( 3.849166e+02                 )  sec^-1
 MeanMatrixElemValue         = ( 1.187066e-05 +- 9.825549e-06 )  GeV^-6
-TOTAL       :     1.518979 sec
+TOTAL       :     1.373871 sec
 INFO: No Floating Point Exceptions have been reported
-     4,109,801,969      cycles                           #    2.699 GHz
-     9,072,472,376      instructions                     #    2.21  insn per cycle
-       1.523113813 seconds time elapsed
-=Symbols in CPPProcess_cpp.o= (~sse4:    0) (avx2:94048) (512y:   91) (512z:    0)
+     3,731,717,521      cycles                           #    2.710 GHz
+     8,514,052,827      instructions                     #    2.28  insn per cycle
+       1.377919646 seconds time elapsed
+=Symbols in CPPProcess_cpp.o= (~sse4:    0) (avx2:80619) (512y:   89) (512z:    0)
…n heft madgraph5#833)

STARTED  AT Fri Aug 30 12:48:22 AM CEST 2024
(SM tests)
ENDED(1) AT Fri Aug 30 05:04:05 AM CEST 2024 [Status=0]
(BSM tests)
ENDED(1) AT Fri Aug 30 05:14:35 AM CEST 2024 [Status=0]

24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inlL_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_d_inl0_hrd0.txt
1 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_m_inl0_hrd0.txt
./tmad/teeMadX.sh -ggttggg +10x -makeclean -inlLonly
STARTED AT Fri Aug 30 08:08:13 AM CEST 2024
ENDED   AT Fri Aug 30 09:40:38 AM CEST 2024

Note: both CUDA and C++ are 5-15% slower in HELINL=L than in HELINL=0.
For CUDA this can be seen both in the madevent test and in the check.exe test.

diff -u --color tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inlL_hrd0.txt tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt

(C++ madevent test, 15% slower)
-Executing ' ./build.512y_d_inlL_hrd0/madevent_cpp < /tmp/avalassi/input_ggttggg_x10_cudacpp > /tmp/avalassi/output_ggttggg_x10_cudacpp'
+Executing ' ./build.512y_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggttggg_x10_cudacpp > /tmp/avalassi/output_ggttggg_x10_cudacpp'
  [OPENMPTH] omp_get_max_threads/nproc = 1/4
  [NGOODHEL] ngoodhel/ncomb = 128/128
  [XSECTION] VECSIZE_USED = 8192
@@ -401,10 +401,10 @@
  [XSECTION] ChannelId = 1
  [XSECTION] Cross section = 2.332e-07 [2.3322993086656014E-007] fbridge_mode=1
  [UNWEIGHT] Wrote 303 events (found 1531 events)
- [COUNTERS] PROGRAM TOTAL          :  325.4847s
- [COUNTERS] Fortran Overhead ( 0 ) :    4.5005s
- [COUNTERS] CudaCpp MEs      ( 2 ) :  320.9382s for    90112 events => throughput is 2.81E+02 events/s
- [COUNTERS] CudaCpp HEL      ( 3 ) :    0.0460s
+ [COUNTERS] PROGRAM TOTAL          :  286.1989s
+ [COUNTERS] Fortran Overhead ( 0 ) :    4.4892s
+ [COUNTERS] CudaCpp MEs      ( 2 ) :  281.6678s for    90112 events => throughput is 3.20E+02 events/s
+ [COUNTERS] CudaCpp HEL      ( 3 ) :    0.0420s

(CUDA madevent test, 10% slower)
-Executing ' ./build.cuda_d_inlL_hrd0/madevent_cuda < /tmp/avalassi/input_ggttggg_x10_cudacpp > /tmp/avalassi/output_ggttggg_x10_cudacpp'
+Executing ' ./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggttggg_x10_cudacpp > /tmp/avalassi/output_ggttggg_x10_cudacpp'
  [OPENMPTH] omp_get_max_threads/nproc = 1/4
  [NGOODHEL] ngoodhel/ncomb = 128/128
  [XSECTION] VECSIZE_USED = 8192
@@ -557,10 +557,10 @@
  [XSECTION] ChannelId = 1
  [XSECTION] Cross section = 2.332e-07 [2.3322993086656006E-007] fbridge_mode=1
  [UNWEIGHT] Wrote 303 events (found 1531 events)
- [COUNTERS] PROGRAM TOTAL          :   19.6828s
- [COUNTERS] Fortran Overhead ( 0 ) :    4.9752s
- [COUNTERS] CudaCpp MEs      ( 2 ) :   13.4712s for    90112 events => throughput is 6.69E+03 events/s
- [COUNTERS] CudaCpp HEL      ( 3 ) :    1.2365s
+ [COUNTERS] PROGRAM TOTAL          :   17.9918s
+ [COUNTERS] Fortran Overhead ( 0 ) :    4.9757s
+ [COUNTERS] CudaCpp MEs      ( 2 ) :   11.9277s for    90112 events => throughput is 7.55E+03 events/s
+ [COUNTERS] CudaCpp HEL      ( 3 ) :    1.0883s

(CUDA check test with large grid, 5% slower)
 *** EXECUTE GCHECK(MAX) -p 512 32 1 ***
-Process                     = SIGMA_SM_GG_TTXGGG_CUDA [nvcc 12.0.140 (gcc 11.3.1)] [inlineHel=L] [hardcodePARAM=0]
+Process                     = SIGMA_SM_GG_TTXGGG_CUDA [nvcc 12.0.140 (gcc 11.3.1)] [inlineHel=0] [hardcodePARAM=0]
 Workflow summary            = CUD:DBL+THX:CURDEV+RMBDEV+MESDEV/none+NAVBRK
-EvtsPerSec[MECalcOnly] (3a) = ( 9.102842e+03                 )  sec^-1
+EvtsPerSec[MECalcOnly] (3a) = ( 9.584992e+03                 )  sec^-1
valassi changed the title from "WIP on removing template/inline from helas" to "(WIP) HELINL=L (L for linker) helas mode: pre-compile templates into separate .o object files (using RDC for CUDA)" on Aug 30, 2024
valassi (Member, Author) commented Sep 2, 2024

I am now adding some comments that I had started last week. I have renamed this PR and put it back into WIP. Many features are complete, but I am moving on to other things and I just want to document the status so far before I do.

(1) Description so far

Below is an update and a description before I move back to other things.

I added a new HELINL=L mode. This complements the default HELINL=0 mode and the experimental HELINL=1 mode.

HELINL=0 (default) aka "templates with moderate inlining".
This uses templated helas functions (FFVs). The templates are parametrized on the memory access classes, i.e. essentially the template specialization depends on the AOSOA format used for momenta, wavefunctions and couplings. The sigmaKin and calculate_wavefunctions functions in CPPProcess.cc use these templated FFV functions, which are then instantiated (and possibly inlined) there. The build times can be long, because the same templates are re-evaluated all over the place, but the runtime speed is good.
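As an illustration only (a hedged, self-contained sketch of the pattern, not the generated code; the class and function names below are hypothetical), the key point is that the helas-like function is a template over a memory access policy, so every translation unit that calls it re-instantiates it for its own layout:

#include <cstdio>

// Hypothetical access policies standing in for the real memory access classes
// (the real ones encode the AOSOA layout of momenta, wavefunctions and couplings).
struct AccessLayoutA { static double load( const double* buf, int i ) { return buf[i]; } };
struct AccessLayoutB { static double load( const double* buf, int i ) { return buf[4 * i]; } };

// Templated "helas-like" function: one instantiation per layout, re-compiled
// (and possibly inlined) in every translation unit that uses it, as in HELINL=0.
template<class ACCESS>
inline double helasLike( const double* w )
{
  return ACCESS::load( w, 1 ) + ACCESS::load( w, 2 );
}

int main()
{
  double buf[16] = { 0. };
  buf[1] = 1.5; buf[2] = 2.5; buf[4] = 3.5; buf[8] = 4.5;
  std::printf( "layoutA: %f, layoutB: %f\n", helasLike<AccessLayoutA>( buf ), helasLike<AccessLayoutB>( buf ) );
  return 0;
}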

HELINL=1 aka "templates with aggressive inlining".
This is the mode that I had introduced to mimic -flto, i.e. link time optimizations. The FFV functions (and others) are inlined with always_inline. This significantly increases the build times, because in practice it does the equivalent of link time optimizations while compiling CPPProcess.o. The runtime speed can get a significant boost for simple processes, where data access is important, but the speedups tend to decrease for complex processes, where arithmetic operations dominate. In a realistic madevent environment this is probably not interesting: for simple processes it could be interesting, but the ME calculation is dwarfed by the non-ME Fortran parts, so faster MEs do not help much; for complex processes, the build times simply become too large.
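For illustration (a hedged sketch: -DMGONGPU_INLINE_HELAMPS is the real flag visible in the nvcc command quoted earlier, but the INLINE_HELAS helper below is an assumption, not necessarily the macro used in the generated code), HELINL=1 essentially switches the helas functions from 'inline' to 'always_inline':

// Hedged sketch: the HELINL=1 build flag -DMGONGPU_INLINE_HELAMPS is assumed to
// drive a macro of this kind; INLINE_HELAS is a hypothetical name.
#ifdef MGONGPU_INLINE_HELAMPS
#define INLINE_HELAS inline __attribute__( ( always_inline ) )
#else
#define INLINE_HELAS inline
#endif

// The helas-like functions are then declared with INLINE_HELAS, forcing the compiler
// to inline them into the caller and effectively doing LTO-like work while compiling
// CPPProcess.o (hence the much longer build times).
INLINE_HELAS double helasLike( const double* w ) { return w[1] + w[2]; }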

HELINL=L aka "linked objects".
This is the new mode I introduced here. The FFV functions are pre-compiled, for the appropriate template specializations, into separate .o object files. A technical detail: the HelAmps.cc file is common to the whole SubProcesses directory, but it must be compiled in each P* subdirectory, because the memory access classes may be different: for instance, a subprocess with 3 final-state particles and one with 4 particles have different AOSOA layouts, hence different memory access classes. My tests so far show that the build times can improve by a factor of two, while the runtime can degrade by around 10% for complex processes. (More detailed studies should show whether it is the CUDA or the C++ build times that improve, or both.) This work goes somewhat in the direction of splitting kernels, and I had imagined it in that context, but it is not exactly the same. It may become interesting for users, especially for complex processes, and especially as long as the non-ME part is still important (e.g. in DY+3j, where the CUDA ME is around 25% of the total and the non-ME sampling is over 50%, an ME that is 10% slower is acceptable).
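Again for illustration only (a hedged, self-contained sketch of the pattern; all names are hypothetical except the MGONGPU_LINKER_HELAMPS macro and the 'helas_' symbol prefix, which appear in the build logs above), HELINL=L replaces the in-place template instantiation with a call to a plain, non-template wrapper that is defined in HelAmps.cc and compiled once per P* subdirectory:

// --- HelAmps.h (shared): a plain, non-template declaration ---
double helas_like( const double* w );

// --- HelAmps.cc (compiled in each P* subdirectory into HelAmps_cpp.o / HelAmps_cuda.o) ---
struct AccessForThisP1 { static double load( const double* buf, int i ) { return buf[2 * i]; } }; // hypothetical P*-specific layout
template<class ACCESS>
inline double helasLikeTemplate( const double* w ) { return ACCESS::load( w, 1 ) + ACCESS::load( w, 2 ); }
double helas_like( const double* w ) { return helasLikeTemplate<AccessForThisP1>( w ); } // the only instantiation

// --- CPPProcess.cc: under '#ifdef MGONGPU_LINKER_HELAMPS' it calls the linked symbol ---
// and never sees the template, so CPPProcess.o builds much faster; for CUDA the
// cross-object __device__ call is what requires RDC and the device-link step.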

(2) To do (non-exhaustive list)

This is a non-exhaustive list of pending items (unfortunately I was interrupted last week while writing this, so I may be forgetting things):

  • move the ixxx templated functions to linked mode too
  • perform a more systematic study of build times, BACKEND by BACKEND (so far I only know the 'bldall' speedups)
  • [edited] in particular, measure the build times of HelAmps.o and CPPProcess.o separately
  • test this mode on HIP (what is the rdc equivalent?)
  • (consider a mixed HELINL mode where, for instance, C++ uses the standard mode 0 but CUDA uses mode L? not sure)
  • and then the whole set of kernel-splitting ideas: separate colour from Feynman amplitudes, separate individual FFVs, etc.

…er merging

git checkout upstream/master $(git ls-tree --name-only upstream/master */CODEGEN*txt)
git checkout upstream/master tput/logs_* tmad/logs_*
Fix conflicts (essentially, add -inlL and -inlLonly options to upstream/master scripts):
- epochX/cudacpp/tmad/madX.sh
- epochX/cudacpp/tmad/teeMadX.sh
- epochX/cudacpp/tput/allTees.sh
- epochX/cudacpp/tput/teeThroughputX.sh
- epochX/cudacpp/tput/throughputX.sh
valassi (Member, Author) commented Sep 20, 2024

I updated this with the latest master, as I am doing on all PRs.

  • test this mode on HIP (what is the rdc equivalent?)

I had a LUMI shell running and tried this (after also merging in #1007 with various AMD things).

There is an -fgpu-rdc flag with which compilation succeeds, but the issues come at link time.
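For the record, the flags tried were along these lines (a hedged sketch, not the actual cudacpp.mk recipe; the library name is illustrative): HIP's counterpart of nvcc's -rdc=true is -fgpu-rdc at compile time plus --hip-link -fgpu-rdc at link time, and whether this combination works when the device code ends up inside a shared library is exactly the open issue here.

hipcc -fgpu-rdc -fPIC --offload-arch=gfx90a -c -x hip HelAmps.cc    -o HelAmps_hip.o
hipcc -fgpu-rdc -fPIC --offload-arch=gfx90a -c -x hip CPPProcess.cc -o CPPProcess_hip.o
hipcc -fgpu-rdc --hip-link -shared --offload-arch=gfx90a -o libmg5amc_gg_ttx_hip.so CPPProcess_hip.o HelAmps_hip.o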

Note that #802 is actually a 'shared object initialization failed' error.

So the status is:

  • HELINL=L works ok for C++ and (with rdc) for CUDA
  • HELINL=L does not work for HIP yet

…=L) to cuda only as it does not apply to hip

The hip compilation of CPPProcess.cc now fails as
ccache /opt/rocm-6.0.3/bin/hipcc  -I. -I../../src   -O2 --offload-arch=gfx90a -target x86_64-linux-gnu -DHIP_PLATFORM=amd -DHIP_FAST_MATH -I/opt/rocm-6.0.3/include/ -std=c++17 -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_FPTYPE2_DOUBLE -DMGONGPU_LINKER_HELAMPS  -fPIC -c -x hip CPPProcess.cc -o CPPProcess_hip.o
lld: error: undefined hidden symbol: mg5amcGpu::linker_CD_FFV1_0(double const*, double const*, double const*, double const*, double, double*)
…ompilation on hip for HELINL=L

The hip link of check_hip.exe now fails with
ccache /opt/rocm-6.0.3/bin/hipcc -o check_hip.exe ./check_sa_hip.o -L../../lib -lmg5amc_common_hip -Xlinker -rpath='$ORIGIN/../../lib'  -L../../lib -lmg5amc_gg_ttx_hip ./CommonRandomNumberKernel_hip.o ./RamboSamplingKernels_hip.o ./CurandRandomNumberKernel_hip.o ./HiprandRandomNumberKernel_hip.o  -L/opt/rocm-6.0.3/lib/ -lhiprand
ld.lld: error: undefined reference due to --no-allow-shlib-undefined: __hip_fatbin
…k_hip.exe link on hip for HELINL=L, the build succeeds but at runtime it fails

The execution fails with
./check_hip.exe -p 1 8 1
ERROR! assertGpu: 'shared object initialization failed' (303) in CPPProcess.cc:558

In addition, the hip link of fcheck_hip.exe fails with
ftn --cray-bypass-pkgconfig -craype-verbose -ffixed-line-length-132 -o fcheck_hip.exe ./fcheck_sa_fortran.o ./fsampler_hip.o -L../../lib -lmg5amc_common_hip -Xlinker -rpath='$ORIGIN/../../lib'  -lgfortran -L../../lib -lmg5amc_gg_ttx_hip ./CommonRandomNumberKernel_hip.o ./RamboSamplingKernels_hip.o -lstdc++ -L/opt/rocm-6.0.3/lib -lamdhip64
gfortran-13 -march=znver3 -D__CRAY_X86_TRENTO -D__CRAY_AMD_GFX90A -D__CRAYXT_COMPUTE_LINUX_TARGET -D__TARGET_LINUX__ -ffixed-line-length-132 -o fcheck_hip.exe ./fcheck_sa_fortran.o ./fsampler_hip.o -L../../lib -lmg5amc_common_hip -Xlinker -rpath=$ORIGIN/../../lib -lgfortran -L../../lib -lmg5amc_gg_ttx_hip ./CommonRandomNumberKernel_hip.o ./RamboSamplingKernels_hip.o -lstdc++ -L/opt/rocm-6.0.3/lib -lamdhip64 -Wl,-Bdynamic -Wl,--as-needed,-lgfortran,-lquadmath,--no-as-needed -Wl,--as-needed,-lpthread,--no-as-needed -Wl,--disable-new-dtags
/usr/lib64/gcc/x86_64-suse-linux/13/../../../../x86_64-suse-linux/bin/ld: ../../lib/libmg5amc_gg_ttx_hip.so: undefined reference to `__hip_fatbin'
…ipcc instead of gfortran to link fcheck_hip.exe: this links but it fails at runtime, will revert

Also add -ggdb for debugging. At runtime this fails with the usual madgraph5#802.
It is now clear that the failure is in gpuMemcpyToSymbol (line 558),
and the error is precisely 'shared object initialization failed'.

./fcheck_hip.exe 1 32 1
...
WARNING! Instantiate device Bridge (nevt=32, gpublocks=1, gputhreads=32, gpublocks*gputhreads=32)
ERROR! assertGpu: 'shared object initialization failed' (303) in CPPProcess.cc:558
fcheck_hip.exe: ./GpuRuntime.h:26: void assertGpu(hipError_t, const char *, int, bool): Assertion `code == gpuSuccess' failed.

Program received signal SIGABRT: Process abort signal.
Backtrace for this error:
0  0x14f947bff2e2 in ???
1  0x14f947bfe475 in ???
2  0x14f945f33dbf in ???
3  0x14f945f33d2b in ???
4  0x14f945f353e4 in ???
5  0x14f945f2bc69 in ???
6  0x14f945f2bcf1 in ???
7  0x14f947bcef96 in _Z9assertGpu10hipError_tPKcib
        at ./GpuRuntime.h:26
8  0x14f947bcef96 in _ZN9mg5amcGpu10CPPProcessC2Ebb
        at /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc:558
9  0x14f947bd2cf3 in _ZN9mg5amcGpu6BridgeIdEC2Ejjj
        at ./Bridge.h:268
10  0x14f947bd678e in fbridgecreate_
        at /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gg_tt.mad/SubProcesses/P1_gg_ttx/fbridge.cc:54
11  0x2168fd in ???
12  0x216bfe in ???
13  0x14f945f1e24c in ???
14  0x216249 in _start
        at ../sysdeps/x86_64/start.S:120
15  0xffffffffffffffff in ???
Aborted
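For context, the failing call at CPPProcess.cc:558 (frame 8 of the backtrace above) has roughly this shape (a hedged sketch; the names cIPD, tIPD and copyParamsToDevice are assumptions modelled on similar cudacpp code, not taken from this PR):

// Hedged sketch: the CPPProcess constructor copies process parameters into device
// __constant__ memory via the gpuMemcpyToSymbol abstraction mentioned above (assumed
// to come from GpuRuntime.h or a similar abstraction header in the repo). With device
// code compiled in rdc mode inside a shared library, the HIP runtime must first
// initialize that library's device code object, and it is this initialization that
// fails with 'shared object initialization failed' (303).
__device__ __constant__ double cIPD[2];         // hypothetical constant-memory parameter array
void copyParamsToDevice( const double* tIPD )   // hypothetical helper, called from the CPPProcess constructor
{
  gpuMemcpyToSymbol( cIPD, tIPD, 2 * sizeof( double ) );
}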
… hipcc to link fcheck_hip.exe

Revert "[helas] in gg_tt.mad cudacpp.mk, temporarely go back and try to use hipcc instead of gfortran to link fcheck_hip.exe: this links but it fails at runtime, will revert"
This reverts commit 988419b.

NOTE: I tried to use FC=hipcc and this also compiles the Fortran ok!
Probably it internally uses flang from LLVM (madgraph5#804).

The problem, however, is that there is no lowercase 'main' in fcheck_sa_fortran.o, only an uppercase 'MAIN_'.

Summary of the status: HELINL=L "rdc" is not supported on our AMD GPUs for now.
…y and support HELINL=L on AMD GPUs via HIP (still incomplete)
valassi changed the title from "(WIP) HELINL=L (L for linker) helas mode: pre-compile templates into separate .o object files (using RDC for CUDA)" to "(WIP) HELINL=L (L for linker) helas mode: pre-compile templates into separate .o object files (using RDC for CUDA; still missing HIP)" on Sep 21, 2024
…s from nobm_pp_ttW.mad (git add nobm_pp_ttW.mad)
…er merging

git checkout upstream/master $(git ls-tree --name-only upstream/master */CODEGEN*txt)
….00.01 fixes) into helas

Fix conflicts: epochX/cudacpp/tput/allTees.sh
valassi (Member, Author) commented Oct 5, 2024

Now including upstream/master with v1.00.00, and also the AMD and v1.00.01 patches #1014 and #1012.
