tmad test crashes in rotxxx (SIGFPE erroneous arithmetic operation) #855
Comments
In #852 (comment) Olivier suggested "you/we should compile with the C equivalent of -fbounds-check which is super usefull to spot segfault who by definition are hardware specific". I had a look but I am not sure there is an equivalent. Instead I have run valgrind, this is interesting. This is a reproducer which mimics the tmad test above, but without using tmad tests
The valgrind output includes things like
Also I have rebuilt with -O3 -g in make_opts:
The crash now prints out where it happens, it is in rotxxx
Note, rotxxx is what I had already found also in susy tests
As discussed in #826 this is again a weird optimization issue: gdb gives
This was with -O3 -g. If I use lower optimization levels, the issue disappears. As I have done with many SIGFPEs in cudacpp, I tried adding volatile
Strangely enough, this prevents the SIGFPE. But now the code seems stuck in an infinite loop?
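For readers unfamiliar with the volatile workaround, here is a minimal C++ sketch of the idea (cudacpp is C++, while the change discussed in this issue is in the Fortran file aloha_functions.f). This is not the rotxxx code: `safeScale`, its arguments and `checkSafeScale` are hypothetical names chosen only for illustration. The sketch assumes that floating-point traps are enabled, so that a division by zero executed speculatively by an optimised build could raise SIGFPE even though the source guards against it.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper, NOT the actual rotxxx code: divide x by the transverse
// length sqrt(qx^2 + qy^2), or return x unchanged when that length is zero.
double safeScale( double x, double qx, double qy )
{
  // 'volatile' makes every read and write of qt2 an observable access that
  // the compiler must perform in program order: the second read (inside
  // sqrt) cannot be hoisted above the zero check, so neither can the division.
  volatile double qt2 = qx * qx + qy * qy;
  if( qt2 == 0. ) return x; // guard: skip the division entirely
  return x / std::sqrt( qt2 );
}

// Small self-check exercising both branches.
void checkSafeScale()
{
  assert( safeScale( 3., 0., 0. ) == 3. );  // zero length: x returned as is
  assert( safeScale( 10., 3., 4. ) == 2. ); // sqrt(9+16)=5, 10/5=2
}
```

Since volatile inhibits some optimisations on the variables it marks, a modest slowdown would not be surprising, which is one reason to apply it only to the few variables involved in the guarded arithmetic.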
I tried cuda to make it faster. Again something strange, the code crashes without valgrind but does not crash with valgrind... (NB this is WITHOUT volatile)
Ok. In the cuda version, adding volatile in the Fortran removes the SIGFPE and allows the program to reach the end. So IS THIS A POSSIBLE FIX? With cpp maybe I just needed to wait? Or is this going slower? I will try to rerun more tests and leave them running. (In the meantime I will also try the susy_gg_t1t1 channel which in the past seemed problematic with SIGFPE).
… test a different iconfig
In particular: the following triggers a SIGFPE reported in madgraph5#855 (crash in rotxxx that can be fixed adding volatile?)
./tmad/madX.sh -ggttgg -iconfig 104 -makeclean
This also triggers a similar SIGFPE (initially reported in madgraph5#826)
./tmad/madX.sh -susyggt1t1 -iconfig 2 -makeclean

…SIGFPE madgraph5#855, and add volatile in aloha_functions.f to try to fix it
The SIGFPE crash madgraph5#855 does seem to disappear in
./tmad/madX.sh -ggttgg -iconfig 104 -makeclean
However, there is now a DIFFERENT issue, an lhe file mismatch between fortran and cpp (madgraph5#856)
This is probably due to the iconfig/channel mapping issue reported by Olivier in madgraph5#852

…ebug SIGFPE madgraph5#855, and add volatile in aloha_functions.f to try to fix it
The SIGFPE crash madgraph5#855 does seem to disappear in
./tmad/madX.sh -susyggt1t1 -iconfig 2 -makeclean
Then no cross section is printed also for this iconfig (same as madgraph5#826 for iconfig 1), but this is a DIFFERENT issue
…: note that SIGFPE madgraph5#855 is still fixed because volatile has been added
…adgraph5#855 and prepare codegen backport
…dgraph5#855 in rotxxx
The issue was observed and fixed in gg_ttgg (iconfig 104) and susy_gg_t1t1 (iconfig 2), the backport as usual is from gg_tt
Note that aloha_functions.f is now added to the list of files to include when preparing patch.common
./CODEGEN/generateAndCompare.sh gg_tt --mad --nopatch
git diff --no-ext-diff -R gg_tt.mad/Source/makefile gg_tt.mad/Source/dsample.f gg_tt.mad/Source/DHELAS/aloha_functions.f gg_tt.mad/Source/genps.inc gg_tt.mad/SubProcesses/makefile > CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/MG5aMC_patches/PROD/patch.common
git diff --no-ext-diff -R gg_tt.mad/bin/internal/banner.py gg_tt.mad/bin/internal/gen_ximprove.py gg_tt.mad/bin/internal/madevent_interface.py >> CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/MG5aMC_patches/PROD/patch.common
git diff --no-ext-diff -R gg_tt.mad/SubProcesses/P1_gg_ttx/auto_dsig1.f gg_tt.mad/SubProcesses/P1_gg_ttx/driver.f gg_tt.mad/SubProcesses/P1_gg_ttx/matrix1.f > CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/MG5aMC_patches/PROD/patch.P1
git checkout gg_tt.mad
…syggt1t1 to test madgraph5#855 fix while still exposing madgraph5#826 and madgraph5#856
…o fix SIGFPE madgraph5#855 in rotxxx
This is fixed by #857 by adding volatile, as I had done for similar SIGFPEs in cudacpp
…de with no volatile, to rerun tmad and expose SIGFPE madgraph5#855
git checkout upstream/master susy_gg_t1t1.mad gg_ttgg.mad

…se SIGFPE madgraph5#855 - will revert
./tmad/teeMadX.sh -mix -makeclean +10x -ggttgg -susyggt1t1

…h confirmed that SIGFPE madgraph5#855 was present and is now fixed
Revert "[tmad] temporarely rerun tmad tests for ggttgg and susyggt1t1 to expose SIGFPE madgraph5#855 - will revert"
This reverts commit 4fa1790.
Revert "[tmad] in gg_ttgg.mad and susy_gg_t1t1.mad, temporarely go back to code with no volatile, to rerun tmad and expose SIGFPE madgraph5#855"
This reverts commit 2f32ffd.
I completed my tests in PR #857 and I confirm that it fixes this issue, closing
Reopening until PR #857 is merged - or until this is otherwise clarified
…g AS-IS Olivier's patches from the latest fix_826 branch for PR madgraph5#850
The gg_ttgg test still crashes (rotxxx madgraph5#855?)
./tmad/madX.sh -ggttgg -iconfig 104 -makeclean
*** (2-none) EXECUTE MADEVENT_CPP x1 (create events.lhe) ***
Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
Backtrace for this error:
0 0x7fce5ec23860 in ???
1 0x7fce5ec22a05 in ???
2 0x7fce5e854def in ???
3 0x44b5ff in ???
4 0x4087df in ???
5 0x409848 in ???
6 0x40bb83 in ???
7 0x40d1a9 in ???
8 0x45c804 in ???
9 0x434269 in ???
10 0x40371e in ???
11 0x7fce5e83feaf in ???
12 0x7fce5e83ff5f in ???
13 0x403844 in ???
14 0xffffffffffffffff in ???
./tmad/madX.sh: line 387: 3913008 Floating point exception(core dumped) $timecmd $cmd < ${tmpin} > ${tmp}
The susy_gg_t1t1 test also still crashes (see madgraph5#826?), this looks like the same crash as ggttgg above
./tmad/madX.sh -susyggt1t1 -iconfig 2 -makeclean
*** (2-none) EXECUTE MADEVENT_CPP x1 (create events.lhe) ***
Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
Backtrace for this error:
0 0x7f9f03423860 in ???
1 0x7f9f03422a05 in ???
2 0x7f9f03054def in ???
3 0x43809f in ???
4 0x40581f in ???
5 0x4067b1 in ???
6 0x408c71 in ???
7 0x40a0a9 in ???
8 0x444fdf in ???
9 0x42bb38 in ???
10 0x40371e in ???
11 0x7f9f0303feaf in ???
12 0x7f9f0303ff5f in ???
13 0x403844 in ???
14 0xffffffffffffffff in ???
./tmad/madX.sh: line 387: 3907179 Floating point exception(core dumped) $timecmd $cmd < ${tmpin} > ${tmp}
The gqttq test also still crashes intermittently, i.e. only on the second execution (madgraph5#845?)
./tmad/teeMadX.sh -gqttq +10x -fltonly -makeclean
./tmad/teeMadX.sh -gqttq +10x -fltonly
Executing ' ./build.512z_f_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_gqttq_x1_cudacpp > /tmp/avalassi/output_gqttq_x1_cudacpp'
Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
Backtrace for this error:
0 0x7fbafa623860 in ???
1 0x7fbafa622a05 in ???
2 0x7fbafa254def in ???
3 0x7fbafad24034 in ???
4 0x7fbafa9a1575 in ???
5 0x7fbafad20c89 in ???
6 0x7fbafad2abfd in ???
7 0x7fbafad30491 in ???
8 0x43008b in ???
9 0x431c10 in ???
10 0x432d47 in ???
11 0x433b1e in ???
12 0x44a921 in ???
13 0x42ebbf in ???
14 0x40371e in ???
15 0x7fbafa23feaf in ???
16 0x7fbafa23ff5f in ???
17 0x403844 in ???
18 0xffffffffffffffff in ???
./madX.sh: line 387: 3922797 Floating point exception(core dumped) $timecmd $cmd < ${tmpin} > ${tmp}
ERROR! ' ./build.512z_f_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_gqttq_x1_cudacpp > /tmp/avalassi/output_gqttq_x1_cudacpp' failed
…g AS-IS Olivier's patches from the latest fix_826 branch for PR madgraph5#852
The gg_ttgg test still crashes (rotxxx madgraph5#855?)
./tmad/madX.sh -ggttgg -iconfig 104 -makeclean
*** (2-none) EXECUTE MADEVENT_CPP x1 (create events.lhe) ***
Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
Backtrace for this error:
0 0x7fce5ec23860 in ???
1 0x7fce5ec22a05 in ???
2 0x7fce5e854def in ???
3 0x44b5ff in ???
4 0x4087df in ???
5 0x409848 in ???
6 0x40bb83 in ???
7 0x40d1a9 in ???
8 0x45c804 in ???
9 0x434269 in ???
10 0x40371e in ???
11 0x7fce5e83feaf in ???
12 0x7fce5e83ff5f in ???
13 0x403844 in ???
14 0xffffffffffffffff in ???
./tmad/madX.sh: line 387: 3913008 Floating point exception(core dumped) $timecmd $cmd < ${tmpin} > ${tmp}
The susy_gg_t1t1 test also still crashes (see madgraph5#826?), this looks like the same crash as ggttgg above
./tmad/madX.sh -susyggt1t1 -iconfig 2 -makeclean
*** (2-none) EXECUTE MADEVENT_CPP x1 (create events.lhe) ***
Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
Backtrace for this error:
0 0x7f9f03423860 in ???
1 0x7f9f03422a05 in ???
2 0x7f9f03054def in ???
3 0x43809f in ???
4 0x40581f in ???
5 0x4067b1 in ???
6 0x408c71 in ???
7 0x40a0a9 in ???
8 0x444fdf in ???
9 0x42bb38 in ???
10 0x40371e in ???
11 0x7f9f0303feaf in ???
12 0x7f9f0303ff5f in ???
13 0x403844 in ???
14 0xffffffffffffffff in ???
./tmad/madX.sh: line 387: 3907179 Floating point exception(core dumped) $timecmd $cmd < ${tmpin} > ${tmp}
The gqttq test also still crashes intermittently, i.e. only on the second execution (madgraph5#845?)
./tmad/teeMadX.sh -gqttq +10x -fltonly -makeclean
./tmad/teeMadX.sh -gqttq +10x -fltonly
Executing ' ./build.512z_f_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_gqttq_x1_cudacpp > /tmp/avalassi/output_gqttq_x1_cudacpp'
Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
Backtrace for this error:
0 0x7fbafa623860 in ???
1 0x7fbafa622a05 in ???
2 0x7fbafa254def in ???
3 0x7fbafad24034 in ???
4 0x7fbafa9a1575 in ???
5 0x7fbafad20c89 in ???
6 0x7fbafad2abfd in ???
7 0x7fbafad30491 in ???
8 0x43008b in ???
9 0x431c10 in ???
10 0x432d47 in ???
11 0x433b1e in ???
12 0x44a921 in ???
13 0x42ebbf in ???
14 0x40371e in ???
15 0x7fbafa23feaf in ???
16 0x7fbafa23ff5f in ???
17 0x403844 in ???
18 0xffffffffffffffff in ???
./madX.sh: line 387: 3922797 Floating point exception(core dumped) $timecmd $cmd < ${tmpin} > ${tmp}
ERROR! ' ./build.512z_f_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_gqttq_x1_cudacpp > /tmp/avalassi/output_gqttq_x1_cudacpp' failed
…nd cudacpp.mk to improve the crash dumps
The susyggt1t1 test clearly crashes in rotxxx (madgraph5#855):
./tmad/madX.sh -susyggt1t1 -iconfig 2 -makeclean
*** (2-none) EXECUTE MADEVENT_CPP x1 (create events.lhe) ***
Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
Backtrace for this error:
0 0x7fb7e1223860 in ???
1 0x7fb7e1222a05 in ???
2 0x7fb7e0e54def in ???
3 0x43809f in rotxxx_ at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/susy_gg_t1t1.mad/Source/DHELAS/aloha_functions.f:1247
4 0x40581f in gentcms_ at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/susy_gg_t1t1.mad/SubProcesses/P1_gg_t1t1x/genps.f:1480
5 0x4067b1 in one_tree_ at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/susy_gg_t1t1.mad/SubProcesses/P1_gg_t1t1x/genps.f:1167
6 0x408c71 in gen_mom_ at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/susy_gg_t1t1.mad/SubProcesses/P1_gg_t1t1x/genps.f:68
7 0x40a0a9 in x_to_f_arg_ at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/susy_gg_t1t1.mad/SubProcesses/P1_gg_t1t1x/genps.f:60
8 0x444fdf in sample_full_ at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/susy_gg_t1t1.mad/Source/dsample.f:172
9 0x42bb38 in driver at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/susy_gg_t1t1.mad/SubProcesses/P1_gg_t1t1x/driver.f:256
10 0x40371e in main at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/susy_gg_t1t1.mad/SubProcesses/P1_gg_t1t1x/driver.f:301
./tmad/madX.sh: line 387: 3928626 Floating point exception(core dumped) $timecmd $cmd < ${tmpin} > ${tmp}
ERROR! ' ./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_susyggt1t1_x1_cudacpp > /tmp/avalassi/output_susyggt1t1_x1_cudacpp' failed
The ggttgg test also clearly crashes in rotxxx (madgraph5#855):
./tmad/madX.sh -ggttgg -iconfig 104 -makeclean^C
*** (2-none) EXECUTE MADEVENT_CPP x1 (create events.lhe) ***
Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
Backtrace for this error:
0 0x7fb141c23860 in ???
1 0x7fb141c22a05 in ???
2 0x7fb141854def in ???
3 0x44b5ff in rotxxx_ at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/Source/DHELAS/aloha_functions.f:1247
4 0x4087df in gentcms_ at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/genps.f:1480
5 0x409848 in one_tree_ at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/genps.f:1167
6 0x40bb83 in gen_mom_ at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/genps.f:68
7 0x40d1a9 in x_to_f_arg_ at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/genps.f:60
8 0x45c804 in sample_full_ at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/Source/dsample.f:172
9 0x434269 in driver at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/driver.f:256
10 0x40371e in main at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/driver.f:301
./tmad/madX.sh: line 387: 3933302 Floating point exception(core dumped) $timecmd $cmd < ${tmpin} > ${tmp}
ERROR! ' ./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggttgg_x1_cudacpp > /tmp/avalassi/output_ggttgg_x1_cudacpp' failed
The gqttq test instead clearly crashes in sigmaKin (madgraph5#845):
./tmad/teeMadX.sh -gqttq +10x -fltonly -makeclean
./tmad/teeMadX.sh -gqttq +10x -fltonly
Executing ' ./build.512z_f_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_gqttq_x10_cudacpp > /tmp/avalassi/output_gqttq_x10_cudacpp'
Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
Backtrace for this error:
0 0x7f607ee23860 in ???
1 0x7f607ee22a05 in ???
2 0x7f607ea54def in ???
3 0x7f607f607008 in _ZN9mg5amcCpu8sigmaKinEPKfS1_S1_S1_PfjS2_S2_PiS3_i._omp_fn.0 at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/CPPProcess.cc:1190
4 0x7f607f4ab575 in ???
5 0x7f607f603c89 in _ZN9mg5amcCpu8sigmaKinEPKfS1_S1_S1_PfjS2_S2_PiS3_i at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/CPPProcess.cc:1093
6 0x7f607f60dbfd in _ZN9mg5amcCpu23MatrixElementKernelHost21computeMatrixElementsEj at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/MatrixElementKernels.cc:115
7 0x7f607f613491 in _ZN9mg5amcCpu6BridgeIdE12cpu_sequenceEPKdS3_S3_S3_jPdPiS5_b at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/Bridge.h:390
8 0x7f607f613491 in fbridgesequence_ at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/fbridge.cc:106
9 0x43008b in smatrix1_multi_ at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/auto_dsig1.f:618
10 0x431c10 in dsig1_vec_ at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/auto_dsig1.f:445
11 0x432d47 in dsigproc_vec_ at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/auto_dsig.f:1034
12 0x433b1e in dsig_vec_ at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/auto_dsig.f:327
13 0x44a921 in sample_full_ at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/Source/dsample.f:208
14 0x42ebbf in driver at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/driver.f:256
15 0x40371e in main at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/driver.f:301
./madX.sh: line 387: 3941122 Floating point exception(core dumped) $timecmd $cmd < ${tmpin} > ${tmp}
ERROR! ' ./build.512z_f_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_gqttq_x10_cudacpp > /tmp/avalassi/output_gqttq_x10_cudacpp' failed
Conclusion: I would not merge 852 as it does not fix issues yet. Instead I would merge 857 to fix the rotxxx crash 855 using volatile, and reassess from there...
…sm to bypass known issues in tmad tests
Currently the following 12 (4 processes x 3 fptypes) issues are bypassed
- "No cross section in ${proc%.mad} for FPTYPE=d,f,m (madgraph5#826)" for susy_gg_t1t1
- "SIGFPE crash in rotxxx in ${proc%.mad} for FPTYPE=d,f,m (madgraph5#855)" for gq_ttq, pp_tt012j, nobm_pp_ttW
… will now fail on rotxx crashes madgraph5#855 and on zero cross section madgraph5#826
…gpu#855): add 'volatile' to prevent optimizations
…raph5#855 (prepare to move upstream to mg5amcnlo gpucpp)
…raph5#855 crashes in rotxxx (move this upstream as suggested by Olivier)
I changed the name of this issue to indicate that it is ONLY about rotxxx crashes. These can be fixed using 'volatile' in PR #857 and mg5amcnlo/mg5amcnlo#113. Conversely, I removed "channel/iconfig mapping issues" from the name of this issue. Those "channel/iconfig mapping issues" are behind the LHE mismatch #856 and possibly the intermittent sigmaKin crash #845.
…adgraph4gpu#855 crash in rotxxx) into gpucpp_826
…p and gpucpp_826, to allow cherry-picking Olivier's fix_826 changes (later on, will include Olivier's gpucpp_826 change into gpucpp directly)
Revert "[tmad] update mg5amcnlo to f274cab55, adding volatile to prevent madgraph5#855 crashes in rotxxx (move this upstream as suggested by Olivier)"
This reverts commit 720ae02.
Revert "[valgrind] upgrade MG5AMC to include the merge of PR madgraph5#110 and PR madgraph5#112 into the gpucpp branch"
This reverts commit 7d3dc34.
Revert "[valgrind] upgrade MG5AMC to include the workaround for uninitialised values mg5amcnlo/mg5amcnlo#111"
This reverts commit f355965.
Revert "[valgrind] upgrade MG5AMC to include the fix for memory leak mg5amcnlo/mg5amcnlo#109"
This reverts commit 7bb4142.
…t on Olivier's latest fix_826 commit d23e773 1) Note about Olivier's latest fix_826 commit d23e773 Olivier's 75c05c5 includes his initial 6 commits in fix_826: git log upstream/master --oneline -n1 0992927 (upstream/master, origin/color2, origin/actions) Merge pull request madgraph5#857 from valassi/tmad git log --oneline 0992927..75c05c5 75c05c5 Merge branch 'master_june24' into fix_826 92a8284 better comment in coloramps 2bcea76 trying to fix git issue 63494ef change to Andrea convention of naming (but removing step variable) 5b6d065 increase readibility and move from map to array 41ddc38 fix a issue for omp compilation bed2e12 try to fix the segfault on issue 826 Olivier's d23e773 is then a merge of the latest upstream/master in 75c05c5, fixing the MG5AMC conflict by setting it to 74fd166c1 git show d23e773 Merge: 75c05c5 0992927 update this branch with andrea fix in master diff --cc MG5aMC/mg5amcnlo - Subproject commit 10378b3c0971e1a241fd9dc365e592c92d1f13ba -Subproject commit f274cab55d5d983c5612ca7ab3417ee796aa1a8c ++Subproject commit 74fd166c1e22bde2dfe01b2e001ac3b177628165 2) Note that, in MG5AMC, 74fd166c1 (obsolete branch gpucpp_826) is the same as 09c96dd17 (branch gpucpp): git diff 74fd166c1 09c96dd17 [NO DIFF] git log --oneline e428e38c6..09c96dd17 09c96dd17 (origin/gpucpp) allow for second exporter to have access to all variable used in the fortran exporter 9abf6a3ad Merge pull request madgraph5#113 from valassi/valassi_volatile f274cab55 (ghav/valassi_volatile, valassi_volatile) Workaround for SIGFPE crashes in function rotxxx (madgraph5#855): add 'volatile' to prevent optimizations 0b8678984 Merge pull request madgraph5#112 from valassi/valassi_uninitialised111 18696c1cf Merge pull request madgraph5#110 from valassi/valassi_leak109 4f8fbb7f3 (ghav/valassi_uninitialised111) Workaround for issue madgraph5#111 reported by valgrind (initialise goodjet array in function setclscales in reweight.f) f6d90fa58 (ghav/valassi_leak109, valassi_leak109) Fix 
memory leak madgraph5#109 in madevent_driver.f (close file dname.mg) f9f957918 (valgrind) Fix validity time check for UFO pickle (madgraph5#97) 619f5db45 avoid that some parameter switch type when loading model git log --oneline e428e38c6..74fd166c 74fd166c1 (HEAD, origin/gpucpp_826, gpucpp_826) Merge remote-tracking branch 'origin/gpucpp' (PR madgraph5#113 for madgraph5#855 crash in rotxxx) into gpucpp_826 9abf6a3ad Merge pull request madgraph5#113 from valassi/valassi_volatile f274cab55 (ghav/valassi_volatile, valassi_volatile) Workaround for SIGFPE crashes in function rotxxx (madgraph5#855): add 'volatile' to prevent optimizations e4d9df4ab Merge remote-tracking branch 'origin/gpucpp' (PRs madgraph5#110 and madgraph5#112 for issues madgraph5#109 and madgraph5#111) into gpucpp_826 0b8678984 Merge pull request madgraph5#112 from valassi/valassi_uninitialised111 18696c1cf Merge pull request madgraph5#110 from valassi/valassi_leak109 4f8fbb7f3 (ghav/valassi_uninitialised111) Workaround for issue madgraph5#111 reported by valgrind (initialise goodjet array in function setclscales in reweight.f) f6d90fa58 (ghav/valassi_leak109, valassi_leak109) Fix memory leak madgraph5#109 in madevent_driver.f (close file dname.mg) 10378b3c0 allow for second exporter to have access to all variable used in the fortran exporter f9f957918 (valgrind) Fix validity time check for UFO pickle (madgraph5#97) 619f5db45 avoid that some parameter switch type when loading model 3) Note that color includes the following submodule updates, passing through 09c96dd17 to ba54a4153 git show --oneline upstream/master..color ../../MG5aMC/ 4b29496 [color] update MG5AMC to ba54a4153 in th egpuccp branch, with a minor fix in a comment for my icolamp patch Submodule MG5aMC/mg5amcnlo 99e064157..ba54a4153: > minor fix in a printout in my previous patch in export_cpp.py 1c2a02d [color] update MG5AMC to 99e064157, fixing bug madgraph5#856 (and related ones) about the icolamp array in coloramps.h Submodule 
MG5aMC/mg5amcnlo 09c96dd17..99e064157: > In export_cpp.py fix bug madgraph5#114 in get_icolamp_lines, resulting in different icolamp arrays for F77 and CPP (see madgraph5#873) 0a60262 [color] update MG5AMC to 09c96dd17: this is the latest gpucpp branch, now including Olivier's extra commit previously in gpucpp_826 Submodule MG5aMC/mg5amcnlo 10378b3c0...09c96dd17: > allow for second exporter to have access to all variable used in the fortran exporter > Merge pull request madgraph5#113 from valassi/valassi_volatile > Merge pull request madgraph5#112 from valassi/valassi_uninitialised111 > Merge pull request madgraph5#110 from valassi/valassi_leak109 < allow for second exporter to have access to all variable used in the fortran exporter 16ff942 try to fix the segfault on issue 826 Submodule MG5aMC/mg5amcnlo f9f957918..10378b3c0: > allow for second exporter to have access to all variable used in the fortran exporter 4b12e79 [color] temporarely downgrade back MG5AMC to the common base of gpucpp and gpucpp_826, to allow cherry-picking Olivier's fix_826 changes > Submodule MG5aMC/mg5amcnlo f274cab55..f9f957918 (rewind): < Workaround for SIGFPE crashes in function rotxxx (madgraph5#855): add 'volatile' to prevent optimizations < Merge pull request madgraph5#112 from valassi/valassi_uninitialised111 < Merge pull request madgraph5#110 from valassi/valassi_leak109 => Therefore I can simply merge origin/color into color2 and fix the MG5AMC conflict by setting it to ba54a4153 (valassi_icolamp114, before more recent changes)
Note, there is a crash #885 in master_june40 that I thought was related to this, but it most likely is unrelated (and is instead specific to master_june40)
"tmad test crashes for some iconfig (channel/iconfig mapping issues and SIGFPE erroneous arithmetic operation)"
Hi @oliviermattelaer this is a follow up to the discussions in #826 and PR #853.
I prefer to open this as a clean issue and investigate this independently of SUSY, or in any case of zero cross section #826.
In these discussions from your patch #853 I realised that we risk having a MAJOR problem not only for BSM but also for SM, namely: all of my 'tmad' tests test only iconfig=1. These were ok so far (in some cases by luck maybe), but there may be problems for a different iconfig (i.e. if we put a number different from 1 in the input_app.txt piped to madevent).
Indeed I found a crash on the first test I executed, ggttgg with iconfig=104.
This uses a slightly modified script, I will put it in a PR.
I guess that the solution goes through what you proposed in #852 and the additional modifications you and I discussed there.
(Note: the 'tlau' tests that I proposed in July last year just before my absence were supposed to test exactly this (see #711), i.e. test all possible iconfig at the same time in a user-like environment, for all processes, but using a short manageable time. I continue to think that allowing the possibility to run shorter generate_events tests is necessary to allow better testing. There was disagreement last year, I hope we can come back and agree on this).