(WIP) HELINL=L (L for linker) helas mode: pre-compile templates into separate .o object files (using RDC for CUDA; still missing HIP) #978

Draft
wants to merge 56 commits into base: master

Conversation

valassi (Member) commented Aug 27, 2024

WIP on removing template/inline from helas (related to splitting kernels)

…FVs and for compiling them as separate object files (related to splitting kernels)
valassi self-assigned this Aug 27, 2024
valassi marked this pull request as draft on August 27, 2024 15:37
valassi added 20 commits August 28, 2024 10:37
…the P subdirectory (depends on npar) - build succeeds for cpp, link fails for cuda

ccache /usr/local/cuda-12.0/bin/nvcc  -I. -I../../src  -Xcompiler -O3 -gencode arch=compute_70,code=compute_70 -gencode arch=compute_70,code=sm_70 -lineinfo -use_fast_math -I/usr/local/cuda-12.0/include/ -DUSE_NVTX  -std=c++17  -ccbin /usr/lib64/ccache/g++ -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_FPTYPE2_DOUBLE -Xcompiler -fPIC -c -x cu CPPProcess.cc -o CPPProcess_cuda.o
ptxas fatal   : Unresolved extern function '_ZN9mg5amcGpu14helas_VVV1P0_1EPKdS1_S1_dddPd'
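The unresolved extern is expected at this stage: the device helas functions now live in a separate object file, so the cross-object __device__ symbol must be resolved at device-link time, which requires relocatable device code (RDC). A hedged sketch of the flags involved (not the actual cudacpp.mk recipe; arch, paths and the library name are simplified/illustrative):

# compile both translation units with relocatable device code (-dc is -rdc=true -c)
nvcc -arch=sm_70 -Xcompiler -fPIC -dc -x cu HelAmps.cc    -o HelAmps_cuda.o
nvcc -arch=sm_70 -Xcompiler -fPIC -dc -x cu CPPProcess.cc -o CPPProcess_cuda.o
# device-link step: this is where nvlink resolves mg5amcGpu::helas_* across objects
nvcc -arch=sm_70 -Xcompiler -fPIC -dlink CPPProcess_cuda.o HelAmps_cuda.o -o DeviceLink_cuda.o
# host link of the shared library, including the device-link object
g++ -shared -o libmg5amc_gg_ttx_cuda.so CPPProcess_cuda.o HelAmps_cuda.o DeviceLink_cuda.o -L${CUDA_HOME}/lib64 -lcudart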
…cuda tests succeed

The build issues some warnings, however:
nvlink warning : SM Arch ('sm_52') not found in './CPPProcess_cuda.o'
nvlink warning : SM Arch ('sm_52') not found in './HelAmps_cuda.o'
nvlink warning : SM Arch ('sm_52') not found in './CPPProcess_cuda.o'
nvlink warning : SM Arch ('sm_52') not found in './HelAmps_cuda.o'
…ption HELINL=L and '#ifdef MGONGPU_LINKER_HELAMPS'
…c++, a factor 3 slower for cuda...

./tput/teeThroughputX.sh -ggtt -makej -makeclean -inlLonly

diff -u --color tput/logs_ggtt_mad/log_ggtt_mad_d_inl0_hrd0.txt tput/logs_ggtt_mad/log_ggtt_mad_d_inlL_hrd0.txt

-Process                     = SIGMA_SM_GG_TTX_CUDA [nvcc 12.0.140 (gcc 11.3.1)] [inlineHel=0] [hardcodePARAM=0]
+Process                     = SIGMA_SM_GG_TTX_CUDA [nvcc 12.0.140 (gcc 11.3.1)] [inlineHel=L] [hardcodePARAM=0]
 Workflow summary            = CUD:DBL+THX:CURDEV+RMBDEV+MESDEV/none+NAVBRK
 FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
-EvtsPerSec[Rmb+ME]     (23) = ( 4.589473e+07                 )  sec^-1
-EvtsPerSec[MatrixElems] (3) = ( 1.164485e+08                 )  sec^-1
-EvtsPerSec[MECalcOnly] (3a) = ( 1.280951e+08                 )  sec^-1
-MeanMatrixElemValue         = ( 2.086689e+00 +- 3.413217e-03 )  GeV^0
-TOTAL       :     0.528239 sec
-INFO: No Floating Point Exceptions have been reported
-     2,222,057,027      cycles                           #    2.887 GHz
-     3,171,868,018      instructions                     #    1.43  insn per cycle
-       0.826440817 seconds time elapsed
-runNcu /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_tt.mad/SubProcesses/P1_gg_ttx/build.cuda_d_inl0_hrd0/check_cuda.exe -p 2048 256 1
-==PROF== Profiling "sigmaKin": launch__registers_per_thread 214
+EvtsPerSec[Rmb+ME]     (23) = ( 2.667135e+07                 )  sec^-1
+EvtsPerSec[MatrixElems] (3) = ( 4.116115e+07                 )  sec^-1
+EvtsPerSec[MECalcOnly] (3a) = ( 4.251573e+07                 )  sec^-1
+MeanMatrixElemValue         = ( 2.086689e+00 +- 3.413217e-03 )  GeV^0
+TOTAL       :     0.550450 sec
+INFO: No Floating Point Exceptions have been reported
+     2,272,219,097      cycles                           #    2.889 GHz
+     3,361,475,195      instructions                     #    1.48  insn per cycle
+       0.842685843 seconds time elapsed
+runNcu /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_tt.mad/SubProcesses/P1_gg_ttx/build.cuda_d_inlL_hrd0/check_cuda.exe -p 2048 256 1
+==PROF== Profiling "sigmaKin": launch__registers_per_thread 190
 ==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
…P* (the source is the same but it must be compiled in each P* separately)
valassi (Member, Author) commented Aug 28, 2024

The functionality is in principle complete, including the backport to CODEGEN. I will now run some functionality and performance tests.

git add *.mad/*/HelAmps.cc *.mad/*/*/HelAmps.cc *.sa/*/HelAmps.cc *.sa/*/*/HelAmps.cc
…ild failed?

./tput/teeThroughputX.sh -ggttggg -makej -makeclean -inlL

ccache /usr/local/cuda-12.0/bin/nvcc  -I. -I../../src  -Xcompiler -O3 -gencode arch=compute_70,code=compute_70 -gencode arch=compute_70,code=sm_70 -lineinfo -use_fast_math -I/usr/local/cuda-12.0/include/ -DUSE_NVTX  -std=c++17  -ccbin /usr/lib64/ccache/g++ -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_FPTYPE2_DOUBLE -DMGONGPU_INLINE_HELAMPS -Xcompiler -fPIC -c -x cu CPPProcess.cc -o build.cuda_d_inl1_hrd0/CPPProcess_cuda.o
nvcc error   : 'ptxas' died due to signal 9 (Kill signal)
make[2]: *** [cudacpp.mk:754: build.cuda_d_inl1_hrd0/CPPProcess_cuda.o] Error 9
make[2]: Leaving directory '/data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg'
make[1]: *** [makefile:142: build.cuda_d_inl1_hrd0/.cudacpplibs] Error 2
make[1]: Leaving directory '/data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg'
make: *** [makefile:282: bldcuda] Error 2
make: *** Waiting for unfinished jobs....
… build time is from cache

./tput/teeThroughputX.sh -ggttggg -makej -makeclean
…mode (use that from the previous run, not from cache)

./tput/teeThroughputX.sh -ggttggg -makej -makeclean
…factor x2 faster (c++? cuda?), runtime is 5-10% slower in C++, but 5-10% faster in cuda!?

./tput/teeThroughputX.sh -ggttggg -makej -makeclean -inlLonly

diff -u --color tput/logs_ggttggg_mad/log_ggttggg_mad_d_inlL_hrd0.txt  tput/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt
...
 On itscrd90.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: 1x Tesla V100S-PCIE-32GB]:
 =========================================================================
-runExe /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg/build.cuda_d_inlL_hrd0/check_cuda.exe -p 1 256 2 OMP=
+runExe /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg/build.cuda_d_inl0_hrd0/check_cuda.exe -p 1 256 2 OMP=
 INFO: The following Floating Point Exceptions will cause SIGFPE program aborts: FE_DIVBYZERO, FE_INVALID, FE_OVERFLOW
-Process                     = SIGMA_SM_GG_TTXGGG_CUDA [nvcc 12.0.140 (gcc 11.3.1)] [inlineHel=L] [hardcodePARAM=0]
+Process                     = SIGMA_SM_GG_TTXGGG_CUDA [nvcc 12.0.140 (gcc 11.3.1)] [inlineHel=0] [hardcodePARAM=0]
 Workflow summary            = CUD:DBL+THX:CURDEV+RMBDEV+MESDEV/none+NAVBRK
 FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
-EvtsPerSec[Rmb+ME]     (23) = ( 4.338149e+02                 )  sec^-1
-EvtsPerSec[MatrixElems] (3) = ( 4.338604e+02                 )  sec^-1
-EvtsPerSec[MECalcOnly] (3a) = ( 4.338867e+02                 )  sec^-1
-MeanMatrixElemValue         = ( 1.187066e-05 +- 9.825549e-06 )  GeV^-6
-TOTAL       :     2.242693 sec
-INFO: No Floating Point Exceptions have been reported
-     7,348,976,543      cycles                           #    2.902 GHz
-    16,466,315,526      instructions                     #    2.24  insn per cycle
-       2.591057214 seconds time elapsed
-runNcu /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg/build.cuda_d_inlL_hrd0/check_cuda.exe -p 1 256 1
+EvtsPerSec[Rmb+ME]     (23) = ( 4.063038e+02                 )  sec^-1
+EvtsPerSec[MatrixElems] (3) = ( 4.063437e+02                 )  sec^-1
+EvtsPerSec[MECalcOnly] (3a) = ( 4.063626e+02                 )  sec^-1
+MeanMatrixElemValue         = ( 1.187066e-05 +- 9.825549e-06 )  GeV^-6
+TOTAL       :     2.552546 sec
+INFO: No Floating Point Exceptions have been reported
+     7,969,059,552      cycles                           #    2.893 GHz
+    17,401,037,642      instructions                     #    2.18  insn per cycle
+       2.954791685 seconds time elapsed
+runNcu /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg/build.cuda_d_inl0_hrd0/check_cuda.exe -p 1 256 1
 ==PROF== Profiling "sigmaKin": launch__registers_per_thread 255
 ==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
...
 =========================================================================
-runExe /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg/build.512y_d_inlL_hrd0/check_cpp.exe -p 1 256 2 OMP=
+runExe /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg/build.512y_d_inl0_hrd0/check_cpp.exe -p 1 256 2 OMP=
 INFO: The following Floating Point Exceptions will cause SIGFPE program aborts: FE_DIVBYZERO, FE_INVALID, FE_OVERFLOW
-Process                     = SIGMA_SM_GG_TTXGGG_CPP [gcc 11.3.1] [inlineHel=L] [hardcodePARAM=0]
+Process                     = SIGMA_SM_GG_TTXGGG_CPP [gcc 11.3.1] [inlineHel=0] [hardcodePARAM=0]
 Workflow summary            = CPP:DBL+CXS:CURHST+RMBHST+MESHST/512y+CXVBRK
 FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
 Internal loops fptype_sv    = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
-EvtsPerSec[Rmb+ME]     (23) = ( 3.459662e+02                 )  sec^-1
-EvtsPerSec[MatrixElems] (3) = ( 3.460086e+02                 )  sec^-1
-EvtsPerSec[MECalcOnly] (3a) = ( 3.460086e+02                 )  sec^-1
+EvtsPerSec[Rmb+ME]     (23) = ( 3.835352e+02                 )  sec^-1
+EvtsPerSec[MatrixElems] (3) = ( 3.836003e+02                 )  sec^-1
+EvtsPerSec[MECalcOnly] (3a) = ( 3.836003e+02                 )  sec^-1
 MeanMatrixElemValue         = ( 1.187066e-05 +- 9.825549e-06 )  GeV^-6
-TOTAL       :     1.528240 sec
+TOTAL       :     1.378567 sec
 INFO: No Floating Point Exceptions have been reported
-     4,140,408,789      cycles                           #    2.703 GHz
-     9,072,597,595      instructions                     #    2.19  insn per cycle
-       1.532357792 seconds time elapsed
-=Symbols in CPPProcess_cpp.o= (~sse4:    0) (avx2:94048) (512y:   91) (512z:    0)
+     3,738,350,469      cycles                           #    2.705 GHz
+     8,514,195,736      instructions                     #    2.28  insn per cycle
+       1.382567882 seconds time elapsed
+=Symbols in CPPProcess_cpp.o= (~sse4:    0) (avx2:80619) (512y:   89) (512z:    0)
 -------------------------------------------------------------------------
…itscrd90 - all ok

STARTED  AT Thu Aug 29 09:00:35 PM CEST 2024
./tput/teeThroughputX.sh -mix -hrd -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -makeclean
ENDED(1) AT Thu Aug 29 11:03:48 PM CEST 2024 [Status=0]
./tput/teeThroughputX.sh -flt -hrd -makej -eemumu -ggtt -ggttgg -inlonly -makeclean
ENDED(2) AT Thu Aug 29 11:24:34 PM CEST 2024 [Status=0]
./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -gqttq -ggttgg -ggttggg -flt -bridge -makeclean
ENDED(3) AT Thu Aug 29 11:33:08 PM CEST 2024 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -rmbhst
ENDED(4) AT Thu Aug 29 11:35:56 PM CEST 2024 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -curhst
ENDED(5) AT Thu Aug 29 11:38:41 PM CEST 2024 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -common
ENDED(6) AT Thu Aug 29 11:41:32 PM CEST 2024 [Status=0]
./tput/teeThroughputX.sh -mix -hrd -makej -susyggtt -susyggt1t1 -smeftggtttt -heftggbb -makeclean
ENDED(7) AT Fri Aug 30 12:12:36 AM CEST 2024 [Status=0]
./tput/teeThroughputX.sh -inlLonly -mix -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -makeclean
ENDED(8) AT Fri Aug 30 12:48:22 AM CEST 2024 [Status=0]

Note: build times are reduced by a factor of 2 to 3 in inlL with respect to inl0 for complex processes like ggttggg
----------------
tput/logs_ggttggg_mad/log_ggttggg_mad_d_inlL_hrd0.txt
Preliminary build completed in 0d 00h 07m 12s
tput/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt
Preliminary build completed in 0d 00h 14m 20s
----------------
tput/logs_ggttggg_mad/log_ggttggg_mad_f_inlL_hrd0.txt
Preliminary build completed in 0d 00h 05m 39s
tput/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd0.txt
Preliminary build completed in 0d 00h 13m 34s
----------------
tput/logs_ggttggg_mad/log_ggttggg_mad_m_inlL_hrd0.txt
Preliminary build completed in 0d 00h 05m 55s
tput/logs_ggttggg_mad/log_ggttggg_mad_m_inl0_hrd0.txt
Preliminary build completed in 0d 00h 14m 56s
----------------

Note also: there is a runtime performance slowdown of around 10% in both CUDA and C++.
(I had previously observed that CUDA seemed faster, but that was with a small grid! Using a large grid, CUDA is also slower.)

diff -u --color tput/logs_ggttggg_mad/log_ggttggg_mad_d_inlL_hrd0.txt  tput/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt
 ------------------------------------------------
-Preliminary build completed in 0d 00h 07m 12s
+Preliminary build completed in 0d 00h 14m 20s
 ------------------------------------------------

(CUDA small grid, HELINL=L is 10% faster)
 On itscrd90.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: 1x Tesla V100S-PCIE-32GB]:
 =========================================================================
-runExe /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg/build.cuda_d_inlL_hrd0/check_cuda.exe -p 1 256 2 OMP=
+runExe /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg/build.cuda_d_inl0_hrd0/check_cuda.exe -p 1 256 2 OMP=
 INFO: The following Floating Point Exceptions will cause SIGFPE program aborts: FE_DIVBYZERO, FE_INVALID, FE_OVERFLOW
-Process                     = SIGMA_SM_GG_TTXGGG_CUDA [nvcc 12.0.140 (gcc 11.3.1)] [inlineHel=L] [hardcodePARAM=0]
+Process                     = SIGMA_SM_GG_TTXGGG_CUDA [nvcc 12.0.140 (gcc 11.3.1)] [inlineHel=0] [hardcodePARAM=0]
 Workflow summary            = CUD:DBL+THX:CURDEV+RMBDEV+MESDEV/none+NAVBRK
 FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
-EvtsPerSec[Rmb+ME]     (23) = ( 4.337724e+02                 )  sec^-1
-EvtsPerSec[MatrixElems] (3) = ( 4.338199e+02                 )  sec^-1
-EvtsPerSec[MECalcOnly] (3a) = ( 4.338376e+02                 )  sec^-1
-MeanMatrixElemValue         = ( 1.187066e-05 +- 9.825549e-06 )  GeV^-6
-TOTAL       :     2.243520 sec
-INFO: No Floating Point Exceptions have been reported
-     7,333,011,251      cycles                           #    2.895 GHz
-    16,571,702,127      instructions                     #    2.26  insn per cycle
-       2.591709636 seconds time elapsed
-runNcu /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg/build.cuda_d_inlL_hrd0/check_cuda.exe -p 1 256 1
+EvtsPerSec[Rmb+ME]     (23) = ( 4.074025e+02                 )  sec^-1
+EvtsPerSec[MatrixElems] (3) = ( 4.074408e+02                 )  sec^-1
+EvtsPerSec[MECalcOnly] (3a) = ( 4.074613e+02                 )  sec^-1
+MeanMatrixElemValue         = ( 1.187066e-05 +- 9.825549e-06 )  GeV^-6
+TOTAL       :     2.427313 sec
+INFO: No Floating Point Exceptions have been reported
+     8,007,770,360      cycles                           #    2.905 GHz
+    17,844,373,075      instructions                     #    2.23  insn per cycle
+       2.813382822 seconds time elapsed
+runNcu /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg/build.cuda_d_inl0_hrd0/check_cuda.exe -p 1 256 1
 ==PROF== Profiling "sigmaKin": launch__registers_per_thread 255
 ==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%

(CUDA large grid, HELINL=L is 10% slower)
-runExe /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg/build.cuda_d_inlL_hrd0/check_cuda.exe -p 64 256 1 OMP=
+runExe /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg/build.cuda_d_inl0_hrd0/check_cuda.exe -p 64 256 1 OMP=
 INFO: The following Floating Point Exceptions will cause SIGFPE program aborts: FE_DIVBYZERO, FE_INVALID, FE_OVERFLOW
-Process                     = SIGMA_SM_GG_TTXGGG_CUDA [nvcc 12.0.140 (gcc 11.3.1)] [inlineHel=L] [hardcodePARAM=0]
+Process                     = SIGMA_SM_GG_TTXGGG_CUDA [nvcc 12.0.140 (gcc 11.3.1)] [inlineHel=0] [hardcodePARAM=0]
 Workflow summary            = CUD:DBL+THX:CURDEV+RMBDEV+MESDEV/none+NAVBRK
 FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
-EvtsPerSec[Rmb+ME]     (23) = ( 8.489870e+03                 )  sec^-1
-EvtsPerSec[MatrixElems] (3) = ( 8.491766e+03                 )  sec^-1
-EvtsPerSec[MECalcOnly] (3a) = ( 8.491994e+03                 )  sec^-1
+EvtsPerSec[Rmb+ME]     (23) = ( 9.214624e+03                 )  sec^-1
+EvtsPerSec[MatrixElems] (3) = ( 9.216736e+03                 )  sec^-1
+EvtsPerSec[MECalcOnly] (3a) = ( 9.217011e+03                 )  sec^-1
 MeanMatrixElemValue         = ( 1.856249e-04 +- 8.329951e-05 )  GeV^-6
-TOTAL       :     4.301800 sec
+TOTAL       :     4.008082 sec
 INFO: No Floating Point Exceptions have been reported
-    13,363,583,535      cycles                           #    2.902 GHz
-    29,144,223,391      instructions                     #    2.18  insn per cycle
-       4.658949907 seconds time elapsed
+    12,658,170,825      cycles                           #    2.916 GHz
+    27,773,386,314      instructions                     #    2.19  insn per cycle
+       4.398692801 seconds time elapsed

(C++, HELINL=L is 10% slower)
-runExe /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg/build.512y_d_inlL_hrd0/check_cpp.exe -p 1 256 2 OMP=
+runExe /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg/build.512y_d_inl0_hrd0/check_cpp.exe -p 1 256 2 OMP=
 INFO: The following Floating Point Exceptions will cause SIGFPE program aborts: FE_DIVBYZERO, FE_INVALID, FE_OVERFLOW
-Process                     = SIGMA_SM_GG_TTXGGG_CPP [gcc 11.3.1] [inlineHel=L] [hardcodePARAM=0]
+Process                     = SIGMA_SM_GG_TTXGGG_CPP [gcc 11.3.1] [inlineHel=0] [hardcodePARAM=0]
 Workflow summary            = CPP:DBL+CXS:CURHST+RMBHST+MESHST/512y+CXVBRK
 FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
 Internal loops fptype_sv    = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
-EvtsPerSec[Rmb+ME]     (23) = ( 3.478898e+02                 )  sec^-1
-EvtsPerSec[MatrixElems] (3) = ( 3.479341e+02                 )  sec^-1
-EvtsPerSec[MECalcOnly] (3a) = ( 3.479341e+02                 )  sec^-1
+EvtsPerSec[Rmb+ME]     (23) = ( 3.848619e+02                 )  sec^-1
+EvtsPerSec[MatrixElems] (3) = ( 3.849166e+02                 )  sec^-1
+EvtsPerSec[MECalcOnly] (3a) = ( 3.849166e+02                 )  sec^-1
 MeanMatrixElemValue         = ( 1.187066e-05 +- 9.825549e-06 )  GeV^-6
-TOTAL       :     1.518979 sec
+TOTAL       :     1.373871 sec
 INFO: No Floating Point Exceptions have been reported
-     4,109,801,969      cycles                           #    2.699 GHz
-     9,072,472,376      instructions                     #    2.21  insn per cycle
-       1.523113813 seconds time elapsed
-=Symbols in CPPProcess_cpp.o= (~sse4:    0) (avx2:94048) (512y:   91) (512z:    0)
+     3,731,717,521      cycles                           #    2.710 GHz
+     8,514,052,827      instructions                     #    2.28  insn per cycle
+       1.377919646 seconds time elapsed
+=Symbols in CPPProcess_cpp.o= (~sse4:    0) (avx2:80619) (512y:   89) (512z:    0)
…n heft madgraph5#833)

STARTED  AT Fri Aug 30 12:48:22 AM CEST 2024
(SM tests)
ENDED(1) AT Fri Aug 30 05:04:05 AM CEST 2024 [Status=0]
(BSM tests)
ENDED(1) AT Fri Aug 30 05:14:35 AM CEST 2024 [Status=0]

24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inlL_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_d_inl0_hrd0.txt
1 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_m_inl0_hrd0.txt
./tmad/teeMadX.sh -ggttggg +10x -makeclean -inlLonly
STARTED AT Fri Aug 30 08:08:13 AM CEST 2024
ENDED   AT Fri Aug 30 09:40:38 AM CEST 2024

Note: both CUDA and C++ are 5-15% slower in HELINL=L than in HELINL=0.
For CUDA this can be seen both in the madevent test and in the check.exe test.

diff -u --color tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inlL_hrd0.txt tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt

(C++ madevent test, 15% slower)
-Executing ' ./build.512y_d_inlL_hrd0/madevent_cpp < /tmp/avalassi/input_ggttggg_x10_cudacpp > /tmp/avalassi/output_ggttggg_x10_cudacpp'
+Executing ' ./build.512y_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggttggg_x10_cudacpp > /tmp/avalassi/output_ggttggg_x10_cudacpp'
  [OPENMPTH] omp_get_max_threads/nproc = 1/4
  [NGOODHEL] ngoodhel/ncomb = 128/128
  [XSECTION] VECSIZE_USED = 8192
@@ -401,10 +401,10 @@
  [XSECTION] ChannelId = 1
  [XSECTION] Cross section = 2.332e-07 [2.3322993086656014E-007] fbridge_mode=1
  [UNWEIGHT] Wrote 303 events (found 1531 events)
- [COUNTERS] PROGRAM TOTAL          :  325.4847s
- [COUNTERS] Fortran Overhead ( 0 ) :    4.5005s
- [COUNTERS] CudaCpp MEs      ( 2 ) :  320.9382s for    90112 events => throughput is 2.81E+02 events/s
- [COUNTERS] CudaCpp HEL      ( 3 ) :    0.0460s
+ [COUNTERS] PROGRAM TOTAL          :  286.1989s
+ [COUNTERS] Fortran Overhead ( 0 ) :    4.4892s
+ [COUNTERS] CudaCpp MEs      ( 2 ) :  281.6678s for    90112 events => throughput is 3.20E+02 events/s
+ [COUNTERS] CudaCpp HEL      ( 3 ) :    0.0420s

(CUDA madevent test, 10% slower)
-Executing ' ./build.cuda_d_inlL_hrd0/madevent_cuda < /tmp/avalassi/input_ggttggg_x10_cudacpp > /tmp/avalassi/output_ggttggg_x10_cudacpp'
+Executing ' ./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggttggg_x10_cudacpp > /tmp/avalassi/output_ggttggg_x10_cudacpp'
  [OPENMPTH] omp_get_max_threads/nproc = 1/4
  [NGOODHEL] ngoodhel/ncomb = 128/128
  [XSECTION] VECSIZE_USED = 8192
@@ -557,10 +557,10 @@
  [XSECTION] ChannelId = 1
  [XSECTION] Cross section = 2.332e-07 [2.3322993086656006E-007] fbridge_mode=1
  [UNWEIGHT] Wrote 303 events (found 1531 events)
- [COUNTERS] PROGRAM TOTAL          :   19.6828s
- [COUNTERS] Fortran Overhead ( 0 ) :    4.9752s
- [COUNTERS] CudaCpp MEs      ( 2 ) :   13.4712s for    90112 events => throughput is 6.69E+03 events/s
- [COUNTERS] CudaCpp HEL      ( 3 ) :    1.2365s
+ [COUNTERS] PROGRAM TOTAL          :   17.9918s
+ [COUNTERS] Fortran Overhead ( 0 ) :    4.9757s
+ [COUNTERS] CudaCpp MEs      ( 2 ) :   11.9277s for    90112 events => throughput is 7.55E+03 events/s
+ [COUNTERS] CudaCpp HEL      ( 3 ) :    1.0883s

(CUDA check test with large grid, 5% slower)
 *** EXECUTE GCHECK(MAX) -p 512 32 1 ***
-Process                     = SIGMA_SM_GG_TTXGGG_CUDA [nvcc 12.0.140 (gcc 11.3.1)] [inlineHel=L] [hardcodePARAM=0]
+Process                     = SIGMA_SM_GG_TTXGGG_CUDA [nvcc 12.0.140 (gcc 11.3.1)] [inlineHel=0] [hardcodePARAM=0]
 Workflow summary            = CUD:DBL+THX:CURDEV+RMBDEV+MESDEV/none+NAVBRK
-EvtsPerSec[MECalcOnly] (3a) = ( 9.102842e+03                 )  sec^-1
+EvtsPerSec[MECalcOnly] (3a) = ( 9.584992e+03                 )  sec^-1
valassi changed the title from "WIP on removing template/inline from helas" to "(WIP) HELINL=L (L for linker) helas mode: pre-compile templates into separate .o object files (using RDC for CUDA)" on Aug 30, 2024
valassi (Member, Author) commented Sep 2, 2024

I am now adding some comments that I had started last week. I have renamed this PR and put it back into WIP. Many features are complete, but I am moving on to other things and I just want to document the status so far before I do.

(1) Description so far

Below is an update and a description before I move back to other things.

I added a new HELINL=L mode. This complements the default HELINL=0 mode and the experimental HELINL=1 mode.

HELINL=0 (default) aka "templates with moderate inlining".
This uses templated helas functions (FFVs). The templates are parametrized on the memory access classes, i.e. essentially the template specialization depends on the AOSOA format used for momenta, wavefunctions and couplings. The sigmaKin and calculate_wavefunctions functions in CPPProcess.cc use these templated FFV functions, which are then instantiated (and possibly inlined) there. The build times can be long, because the same templates are re-evaluated all over the place, but the runtime speed is good.
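As an illustration only (a hedged, self-contained sketch of the pattern, not the generated code; the class and function names below are hypothetical), the key point is that the helas-like function is a template over a memory access policy, so every translation unit that calls it re-instantiates it for its own layout:

#include <cstdio>

// Hypothetical access policies standing in for the real memory access classes
// (the real ones encode the AOSOA layout of momenta, wavefunctions and couplings).
struct AccessLayoutA { static double load( const double* buf, int i ) { return buf[i]; } };
struct AccessLayoutB { static double load( const double* buf, int i ) { return buf[4 * i]; } };

// Templated "helas-like" function: one instantiation per layout, re-compiled
// (and possibly inlined) in every translation unit that uses it, as in HELINL=0.
template<class ACCESS>
inline double helasLike( const double* w )
{
  return ACCESS::load( w, 1 ) + ACCESS::load( w, 2 );
}

int main()
{
  double buf[16] = { 0. };
  buf[1] = 1.5; buf[2] = 2.5; buf[4] = 3.5; buf[8] = 4.5;
  std::printf( "layoutA: %f, layoutB: %f\n", helasLike<AccessLayoutA>( buf ), helasLike<AccessLayoutB>( buf ) );
  return 0;
}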

HELINL=1 aka "templates with aggressive inlining".
This is the mode that I had introduced to mimic -flto, i.e. link time optimizations. The FFV functions (and others) are inlined with always_inline. This significantly increases the build times, because in practice it does the equivalent of link time optimizations while compiling CPPProcess.o. The runtime speed can get a significant boost for simple processes, where data access is important, but the speedups tend to decrease for complex processes, where arithmetic operations dominate. In a realistic madevent environment this is probably not interesting: for simple processes it could be interesting, but the ME calculation is dwarfed by the non-ME Fortran parts, so faster MEs do not help much; for complex processes, the build times simply become too large.
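For illustration (a hedged sketch: -DMGONGPU_INLINE_HELAMPS is the real flag visible in the nvcc command quoted earlier, but the INLINE_HELAS helper below is an assumption, not necessarily the macro used in the generated code), HELINL=1 essentially switches the helas functions from 'inline' to 'always_inline':

// Hedged sketch: the HELINL=1 build flag -DMGONGPU_INLINE_HELAMPS is assumed to
// drive a macro of this kind; INLINE_HELAS is a hypothetical name.
#ifdef MGONGPU_INLINE_HELAMPS
#define INLINE_HELAS inline __attribute__( ( always_inline ) )
#else
#define INLINE_HELAS inline
#endif

// The helas-like functions are then declared with INLINE_HELAS, forcing the compiler
// to inline them into the caller and effectively doing LTO-like work while compiling
// CPPProcess.o (hence the much longer build times).
INLINE_HELAS double helasLike( const double* w ) { return w[1] + w[2]; }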

HELINL=L aka "linked objects".
This is the new mode I introduced here. The FFV functions are pre-compiled, for the appropriate template specializations, into separate .o object files. A technical detail: the HelAmps.cc file is common to the whole SubProcesses directory, but it must be compiled in each P* subdirectory, because the memory access classes may be different: for instance, a subprocess with 3 final-state particles and one with 4 particles have different AOSOA layouts, hence different memory access classes. My tests so far show that the build times can improve by a factor of two, while the runtime can degrade by around 10% for complex processes. (More detailed studies should show whether it is the CUDA or the C++ build times that improve, or both.) This work goes somewhat in the direction of splitting kernels, and I had imagined it in that context, but it is not exactly the same. It may become interesting for users, especially for complex processes, and especially as long as the non-ME part is still important (e.g. in DY+3j, where the CUDA ME is around 25% of the total and the non-ME sampling is over 50%, an ME that is 10% slower is acceptable).
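Again for illustration only (a hedged, self-contained sketch of the pattern; all names are hypothetical except the MGONGPU_LINKER_HELAMPS macro and the 'helas_' symbol prefix, which appear in the build logs above), HELINL=L replaces the in-place template instantiation with a call to a plain, non-template wrapper that is defined in HelAmps.cc and compiled once per P* subdirectory:

// --- HelAmps.h (shared): a plain, non-template declaration ---
double helas_like( const double* w );

// --- HelAmps.cc (compiled in each P* subdirectory into HelAmps_cpp.o / HelAmps_cuda.o) ---
struct AccessForThisP1 { static double load( const double* buf, int i ) { return buf[2 * i]; } }; // hypothetical P*-specific layout
template<class ACCESS>
inline double helasLikeTemplate( const double* w ) { return ACCESS::load( w, 1 ) + ACCESS::load( w, 2 ); }
double helas_like( const double* w ) { return helasLikeTemplate<AccessForThisP1>( w ); } // the only instantiation

// --- CPPProcess.cc: under '#ifdef MGONGPU_LINKER_HELAMPS' it calls the linked symbol ---
// and never sees the template, so CPPProcess.o builds much faster; for CUDA the
// cross-object __device__ call is what requires RDC and the device-link step.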

(2) To do (non-exhaustive list)

This is a non-exhaustive list of pending items (unfortunately I was interrupted last week while writing this, so I may be forgetting things):

  • move the ixxx templated functions to linked mode too
  • perform a more systematic study of build times, BACKEND by BACKEND (so far I only know the 'bldall' speedups)
  • [edited] in particular, measure the build times of HelAmps.o and CPPProcess.o separately
  • test this mode on HIP (what is the rdc equivalent?)
  • (consider a mixed HELINL mode where, for instance, C++ uses the standard mode 0 but CUDA uses mode L? not sure)
  • and then the whole set of kernel-splitting ideas: separate colour from Feynman amplitudes, separate individual FFVs, etc.

…er merging

git checkout upstream/master $(git ls-tree --name-only upstream/master */CODEGEN*txt)
git checkout upstream/master tput/logs_* tmad/logs_*
Fix conflicts (essentially, add -inlL and -inlLonly options to upstream/master scripts):
- epochX/cudacpp/tmad/madX.sh
- epochX/cudacpp/tmad/teeMadX.sh
- epochX/cudacpp/tput/allTees.sh
- epochX/cudacpp/tput/teeThroughputX.sh
- epochX/cudacpp/tput/throughputX.sh
valassi (Member, Author) commented Sep 20, 2024

I updated this with the latest master, as I am doing on all PRs.

  • test this mode on HIP (what is the rdc equivalent?)

I had a LUMI shell running and tried this (after also merging in #1007 with various AMD things).

There is an -fgpu-rdc flag with which compilation succeeds, but the issues come at link time.
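For the record, the flags tried were along these lines (a hedged sketch, not the actual cudacpp.mk recipe; the library name is illustrative): HIP's counterpart of nvcc's -rdc=true is -fgpu-rdc at compile time plus --hip-link -fgpu-rdc at link time, and whether this combination works when the device code ends up inside a shared library is exactly the open issue here.

hipcc -fgpu-rdc -fPIC --offload-arch=gfx90a -c -x hip HelAmps.cc    -o HelAmps_hip.o
hipcc -fgpu-rdc -fPIC --offload-arch=gfx90a -c -x hip CPPProcess.cc -o CPPProcess_hip.o
hipcc -fgpu-rdc --hip-link -shared --offload-arch=gfx90a -o libmg5amc_gg_ttx_hip.so CPPProcess_hip.o HelAmps_hip.o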

Note that #802 is actually a 'shared object initialization failed' error.

So the status is:

  • HELINL=L works ok for C++ and (with rdc) for CUDA
  • HELINL=L does not work for HIP yet

…=L) to cuda only as it does not apply to hip

The hip compilation of CPPProcess.cc now fails as
ccache /opt/rocm-6.0.3/bin/hipcc  -I. -I../../src   -O2 --offload-arch=gfx90a -target x86_64-linux-gnu -DHIP_PLATFORM=amd -DHIP_FAST_MATH -I/opt/rocm-6.0.3/include/ -std=c++17 -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_FPTYPE2_DOUBLE -DMGONGPU_LINKER_HELAMPS  -fPIC -c -x hip CPPProcess.cc -o CPPProcess_hip.o
lld: error: undefined hidden symbol: mg5amcGpu::linker_CD_FFV1_0(double const*, double const*, double const*, double const*, double, double*)
…ompilation on hip for HELINL=L

The hip link of check_hip.exe now fails with
ccache /opt/rocm-6.0.3/bin/hipcc -o check_hip.exe ./check_sa_hip.o -L../../lib -lmg5amc_common_hip -Xlinker -rpath='$ORIGIN/../../lib'  -L../../lib -lmg5amc_gg_ttx_hip ./CommonRandomNumberKernel_hip.o ./RamboSamplingKernels_hip.o ./CurandRandomNumberKernel_hip.o ./HiprandRandomNumberKernel_hip.o  -L/opt/rocm-6.0.3/lib/ -lhiprand
ld.lld: error: undefined reference due to --no-allow-shlib-undefined: __hip_fatbin
…k_hip.exe link on hip for HELINL=L, the build succeeds but at runtime it fails

The execution fails with
./check_hip.exe -p 1 8 1
ERROR! assertGpu: 'shared object initialization failed' (303) in CPPProcess.cc:558

In addition, the hip link of fcheck_hip.exe fails with
ftn --cray-bypass-pkgconfig -craype-verbose -ffixed-line-length-132 -o fcheck_hip.exe ./fcheck_sa_fortran.o ./fsampler_hip.o -L../../lib -lmg5amc_common_hip -Xlinker -rpath='$ORIGIN/../../lib'  -lgfortran -L../../lib -lmg5amc_gg_ttx_hip ./CommonRandomNumberKernel_hip.o ./RamboSamplingKernels_hip.o -lstdc++ -L/opt/rocm-6.0.3/lib -lamdhip64
gfortran-13 -march=znver3 -D__CRAY_X86_TRENTO -D__CRAY_AMD_GFX90A -D__CRAYXT_COMPUTE_LINUX_TARGET -D__TARGET_LINUX__ -ffixed-line-length-132 -o fcheck_hip.exe ./fcheck_sa_fortran.o ./fsampler_hip.o -L../../lib -lmg5amc_common_hip -Xlinker -rpath=$ORIGIN/../../lib -lgfortran -L../../lib -lmg5amc_gg_ttx_hip ./CommonRandomNumberKernel_hip.o ./RamboSamplingKernels_hip.o -lstdc++ -L/opt/rocm-6.0.3/lib -lamdhip64 -Wl,-Bdynamic -Wl,--as-needed,-lgfortran,-lquadmath,--no-as-needed -Wl,--as-needed,-lpthread,--no-as-needed -Wl,--disable-new-dtags
/usr/lib64/gcc/x86_64-suse-linux/13/../../../../x86_64-suse-linux/bin/ld: ../../lib/libmg5amc_gg_ttx_hip.so: undefined reference to `__hip_fatbin'
…ipcc instead of gfortran to link fcheck_hip.exe: this links but it fails at runtime, will revert

Also add -ggdb for debugging. At runtime this fails with the usual madgraph5#802.
It is now clear that the failure is in gpuMemcpyToSymbol (line 558),
and the error is precisely 'shared object initialization failed'.

./fcheck_hip.exe 1 32 1
...
WARNING! Instantiate device Bridge (nevt=32, gpublocks=1, gputhreads=32, gpublocks*gputhreads=32)
ERROR! assertGpu: 'shared object initialization failed' (303) in CPPProcess.cc:558
fcheck_hip.exe: ./GpuRuntime.h:26: void assertGpu(hipError_t, const char *, int, bool): Assertion `code == gpuSuccess' failed.

Program received signal SIGABRT: Process abort signal.
Backtrace for this error:
0  0x14f947bff2e2 in ???
1  0x14f947bfe475 in ???
2  0x14f945f33dbf in ???
3  0x14f945f33d2b in ???
4  0x14f945f353e4 in ???
5  0x14f945f2bc69 in ???
6  0x14f945f2bcf1 in ???
7  0x14f947bcef96 in _Z9assertGpu10hipError_tPKcib
        at ./GpuRuntime.h:26
8  0x14f947bcef96 in _ZN9mg5amcGpu10CPPProcessC2Ebb
        at /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc:558
9  0x14f947bd2cf3 in _ZN9mg5amcGpu6BridgeIdEC2Ejjj
        at ./Bridge.h:268
10  0x14f947bd678e in fbridgecreate_
        at /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gg_tt.mad/SubProcesses/P1_gg_ttx/fbridge.cc:54
11  0x2168fd in ???
12  0x216bfe in ???
13  0x14f945f1e24c in ???
14  0x216249 in _start
        at ../sysdeps/x86_64/start.S:120
15  0xffffffffffffffff in ???
Aborted
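For context, the failing call at CPPProcess.cc:558 (frame 8 of the backtrace above) has roughly this shape (a hedged sketch; the names cIPD, tIPD and copyParamsToDevice are assumptions modelled on similar cudacpp code, not taken from this PR):

// Hedged sketch: the CPPProcess constructor copies process parameters into device
// __constant__ memory via the gpuMemcpyToSymbol abstraction mentioned above (assumed
// to come from GpuRuntime.h or a similar abstraction header in the repo). With device
// code compiled in rdc mode inside a shared library, the HIP runtime must first
// initialize that library's device code object, and it is this initialization that
// fails with 'shared object initialization failed' (303).
__device__ __constant__ double cIPD[2];         // hypothetical constant-memory parameter array
void copyParamsToDevice( const double* tIPD )   // hypothetical helper, called from the CPPProcess constructor
{
  gpuMemcpyToSymbol( cIPD, tIPD, 2 * sizeof( double ) );
}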
… hipcc to link fcheck_hip.exe

Revert "[helas] in gg_tt.mad cudacpp.mk, temporarely go back and try to use hipcc instead of gfortran to link fcheck_hip.exe: this links but it fails at runtime, will revert"
This reverts commit 988419b.

NOTE: I tried to use FC=hipcc and this also compiles the Fortran ok!
Probably it internally uses flang from LLVM (madgraph5#804).

The problem, however, is that there is no lowercase 'main' in fcheck_sa_fortran.o, only an uppercase 'MAIN_'.

Summary of the status: HELINL=L "rdc" is not supported on our AMD GPUs for now.
…y and support HELINL=L on AMD GPUs via HIP (still incomplete)
valassi changed the title from "(WIP) HELINL=L (L for linker) helas mode: pre-compile templates into separate .o object files (using RDC for CUDA)" to "(WIP) HELINL=L (L for linker) helas mode: pre-compile templates into separate .o object files (using RDC for CUDA; still missing HIP)" on Sep 21, 2024
…s from nobm_pp_ttW.mad (git add nobm_pp_ttW.mad)
…er merging

git checkout upstream/master $(git ls-tree --name-only upstream/master */CODEGEN*txt)
….00.01 fixes) into helas

Fix conflicts: epochX/cudacpp/tput/allTees.sh
valassi (Member, Author) commented Oct 5, 2024

Now including upstream/master with v1.00.00, and also the AMD and v1.00.01 patches #1014 and #1012.
