tmad test crashes in rotxxx (SIGFPE erroneous arithmetic operation) #855

Closed
valassi opened this issue Jun 2, 2024 · 9 comments · Fixed by #857

valassi commented Jun 2, 2024

"tmad test crashes for some iconfig (channel/iconfig mapping issues and SIGFPE erroneous arithmetic operation)"

Hi @oliviermattelaer this is a follow up to the discussions in #826 and PR #853.

I prefer to open this as a clean issue and investigate it independently of SUSY, and in any case independently of the zero cross section issue #826.

In these discussions about your patch #853 I realised that we risk having a MAJOR problem not only for BSM but also for SM, namely: all of my 'tmad' tests test only iconfig=1. These were ok so far (in some cases maybe by luck), but they may fail for a different iconfig (i.e. if we put a number different from 1 in the input_app.txt piped to madevent).

Indeed I found a crash on the first test I executed, ggttgg with iconfig=104.

 ./tmad/madX.sh -ggttgg -iconfig 104
...
On itscrd90.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: 1x Tesla V100S-PCIE-32GB]:
Working directory (run): /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg

*** (1) EXECUTE MADEVENT_FORTRAN (create results.dat) ***
 [OPENMPTH] omp_get_max_threads/nproc = 1/4
 [NGOODHEL] ngoodhel/ncomb = 64/64
 [XSECTION] VECSIZE_USED = 8192
 [XSECTION] MultiChannel = TRUE
 [XSECTION] Configuration = 104
 [XSECTION] ChannelId = 112
 [XSECTION] Cross section = 0.4632 [0.46320556621222242] fbridge_mode=0
 [UNWEIGHT] Wrote 11 events (found 187 events)
 [COUNTERS] PROGRAM TOTAL          :    4.4430s
 [COUNTERS] Fortran Overhead ( 0 ) :    0.2478s
 [COUNTERS] Fortran MEs      ( 1 ) :    4.1953s for     8192 events => throughput is 1.95E+03 events/s

*** (1) EXECUTE MADEVENT_FORTRAN x1 (create events.lhe) ***
 [OPENMPTH] omp_get_max_threads/nproc = 1/4
 [NGOODHEL] ngoodhel/ncomb = 64/64
 [XSECTION] VECSIZE_USED = 8192
 [XSECTION] MultiChannel = TRUE
 [XSECTION] Configuration = 104
 [XSECTION] ChannelId = 112
 [XSECTION] Cross section = 0.4632 [0.46320556621222242] fbridge_mode=0
 [UNWEIGHT] Wrote 11 events (found 168 events)
 [COUNTERS] PROGRAM TOTAL          :    4.4488s
 [COUNTERS] Fortran Overhead ( 0 ) :    0.2487s
 [COUNTERS] Fortran MEs      ( 1 ) :    4.2002s for     8192 events => throughput is 1.95E+03 events/s

*** (2-none) EXECUTE MADEVENT_CPP x1 (create events.lhe) ***

Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.

Backtrace for this error:
#0  0x7effbd423860 in ???
#1  0x7effbd422a05 in ???
#2  0x7effbd054def in ???
#3  0x44b5ff in ???
#4  0x4087df in ???
#5  0x409848 in ???
#6  0x40bb83 in ???
#7  0x40d1a9 in ???
#8  0x45c804 in ???
#9  0x434269 in ???
#10  0x40371e in ???
#11  0x7effbd03feaf in ???
#12  0x7effbd03ff5f in ???
#13  0x403844 in ???
#14  0xffffffffffffffff in ???
./tmad/madX.sh: line 387: 780951 Floating point exception(core dumped) $timecmd $cmd < ${tmpin} > ${tmp}
ERROR! ' ./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggttgg_x1_cudacpp > /tmp/avalassi/output_ggttgg_x1_cudacpp' failed

This uses a slightly modified script, I will put it in a PR.

I guess that the solution goes through what you proposed in #852 and the additional modifications you and I discussed there.

(Note: the 'tlau' tests that I proposed in July last year just before my absence were supposed to test exactly this (see #711), i.e. test all possible iconfig at the same time in a user-like environment, for all processes, but within a short, manageable time. I continue to think that allowing shorter generate_events tests is necessary for better testing. There was disagreement last year; I hope we can come back to this and agree.)


valassi commented Jun 2, 2024

In #852 (comment) Olivier suggested "you/we should compile with the C equivalent of -fbounds-check which is super usefull to spot segfault who by definition are hardware specific". I had a look but I am not sure there is an equivalent.

Instead I ran valgrind, and the result is interesting. This is a reproducer which mimics the tmad test above, but without using the tmad scripts:

cd gg_ttgg.mad/SubProcesses/P1_gg_ttxgg
make cleanall
make -j BACKEND=cppnone -f cudacpp.mk debug
make -j BACKEND=cppnone
cat > input_cudacpp_104 << EOF
8192 1 1 ! Number of events and max and min iterations
0.000001 ! Accuracy (ignored because max iterations = min iterations)
0 ! Grid Adjustment 0=none, 2=adjust (NB if = 0, ftn26 will still be used if present)
1 ! Suppress Amplitude 1=yes (i.e. use MadEvent single-diagram enhancement)
0 ! Helicity Sum/event 0=exact
104 ! Channel number (1-N) for single-diagram enhancement multi-channel (NB used even if suppress amplitude is 0!)
EOF
./madevent_cpp < input_cudacpp_104
valgrind ./madevent_cpp < input_cudacpp_104
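
The six-line input file above differs between channels only in its last line, so it can be generated for any iconfig. A minimal sketch, assuming a hypothetical helper named write_madevent_input (it is not part of the tmad scripts) and the same settings as in the reproducer:

```shell
#!/bin/sh
# Hypothetical helper (not part of the tmad scripts): write a madevent input
# file for a given iconfig, reusing the six lines of the reproducer above.
write_madevent_input() {
  iconfig="$1"   # channel number for single-diagram enhancement
  outfile="$2"   # path of the input file to create
  cat > "$outfile" << EOF
8192 1 1 ! Number of events and max and min iterations
0.000001 ! Accuracy (ignored because max iterations = min iterations)
0 ! Grid Adjustment 0=none, 2=adjust (NB if = 0, ftn26 will still be used if present)
1 ! Suppress Amplitude 1=yes (i.e. use MadEvent single-diagram enhancement)
0 ! Helicity Sum/event 0=exact
${iconfig} ! Channel number (1-N) for single-diagram enhancement multi-channel (NB used even if suppress amplitude is 0!)
EOF
}

# Generate inputs for the two channels discussed in this issue.
for iconfig in 1 104; do
  write_madevent_input "$iconfig" "input_cudacpp_${iconfig}"
done
```

Each generated file could then be piped to the executable exactly as in the reproducer, for instance ./madevent_cpp < input_cudacpp_104.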

The valgrind output includes things like

...
==794089== Conditional jump or move depends on uninitialised value(s)
==794089==    at 0x426F03: setclscales_ (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/madevent_cpp)
==794089==    by 0x429569: update_scale_coupling_vec_ (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/madevent_cpp)
==794089==    by 0x438857: dsig_vec_ (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/madevent_cpp)
==794089==    by 0x45CC7A: sample_full_ (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/madevent_cpp)
==794089==    by 0x434269: MAIN__ (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/madevent_cpp)
==794089==    by 0x40371E: main (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/madevent_cpp)
==794089== 
==794089== Warning: client switching stacks?  SP change: 0x1ffeffeeb8 --> 0x1ffec3eb80
==794089==          to suppress, use: --max-stackframe=3932984 or greater
==794089== Invalid write of size 8
==794089==    at 0x4366D4: dsig1_vec_ (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/madevent_cpp)
==794089==    by 0x437C97: dsigproc_vec_ (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/madevent_cpp)
==794089==    by 0x4388A7: dsig_vec_ (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/madevent_cpp)
==794089==    by 0x45CC7A: sample_full_ (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/madevent_cpp)
==794089==    by 0x434269: MAIN__ (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/madevent_cpp)
==794089==    by 0x40371E: main (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/madevent_cpp)
==794089==  Address 0x1ffec3eba8 is on thread 1's stack
==794089==  in frame #0, created by dsig1_vec_ (???:)
==794089== 
==794089== Invalid write of size 8
==794089==    at 0x4366D9: dsig1_vec_ (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/madevent_cpp)
==794089==    by 0x437C97: dsigproc_vec_ (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/madevent_cpp)
==794089==    by 0x4388A7: dsig_vec_ (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/madevent_cpp)
==794089==    by 0x45CC7A: sample_full_ (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/madevent_cpp)
==794089==    by 0x434269: MAIN__ (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/madevent_cpp)
==794089==    by 0x40371E: main (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/madevent_cpp)
==794089==  Address 0x1ffec3ebb0 is on thread 1's stack
==794089==  in frame #0, created by dsig1_vec_ (???:)
...
==794089== Invalid read of size 4
==794089==    at 0x436AE5: dsig1_vec_ (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/madevent_cpp)
==794089==    by 0x437C97: dsigproc_vec_ (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/madevent_cpp)
==794089==    by 0x4388A7: dsig_vec_ (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/madevent_cpp)
==794089==    by 0x45CC7A: sample_full_ (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/madevent_cpp)
==794089==    by 0x434269: MAIN__ (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/madevent_cpp)
==794089==    by 0x40371E: main (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/madevent_cpp)
==794089==  Address 0x1ffec3ebcc is on thread 1's stack
==794089==  in frame #0, created by dsig1_vec_ (???:)
...
==794089== Invalid read of size 8
==794089==    at 0x6E032EF: memmove (vg_replace_strmem.c:1385)
==794089==    by 0x6E6D811: mg5amcCpu::Bridge<double>::cpu_sequence(double const*, double const*, double const*, double const*, unsigned int, double*, int*, int*, bool) (Bridge.h:376)
==794089==    by 0x6E6F37B: fbridgesequence_ (fbridge.cc:106)
==794089==    by 0x6E6F3F2: fbridgesequence_nomultichannel_ (fbridge.cc:132)
==794089==    by 0x4358D9: smatrix1_multi_ (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/madevent_cpp)
==794089==    by 0x436C74: dsig1_vec_ (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/madevent_cpp)
==794089==    by 0x437C97: dsigproc_vec_ (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/madevent_cpp)
==794089==    by 0x4388A7: dsig_vec_ (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/madevent_cpp)
==794089==    by 0x45CC7A: sample_full_ (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/madevent_cpp)
==794089==    by 0x434269: MAIN__ (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/madevent_cpp)
==794089==    by 0x40371E: main (in /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/madevent_cpp)
==794089==  Address 0x1ffec7eec8 is on thread 1's stack
==794089==  in frame #5, created by dsig1_vec_ (???:)
...

Also I have rebuilt with -O3 -g in make_opts:

epochX/cudacpp/gg_ttgg.mad/Source/make_opts /tmp/git-blob-ieuRtt/make_opts e4b87ee6ad40ecb97ecbb40ae1811714ce5f1b46 100644 epochX/cudacpp/gg_ttgg.mad/Source/make_opts 0000000000000000000000000000000000000000 100644
4c4,5
< GLOBAL_FLAG=-O3 -ffast-math -fbounds-check
---
> ###GLOBAL_FLAG=-O3 -ffast-math -fbounds-check
> GLOBAL_FLAG=-O3 -g -ffast-math -fbounds-check

The crash now prints out where it happens: it is in rotxxx.

Setting grid   1    0.17709E-03   1
Setting grid   2    0.17709E-03   1
Setting grid   3    0.22041E-03   1
 Transforming s_hat 1/s            9   8.8163313609467475E-004   119716.00000000000        168999999.99999997     
 Error opening symfact.dat. No permutations used.
Using random seed offsets   104 :      1
  with seed                   21
 Ranmar initialization seeds       27505        9395

Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.

Backtrace for this error:
#0  0x7f6471c23860 in ???
#1  0x7f6471c22a05 in ???
#2  0x7f6471854def in ???
#3  0x44b5ff in rotxxx_
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/Source/DHELAS/aloha_functions.f:1247
#4  0x4087df in gentcms_
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/genps.f:1480
#5  0x409848 in one_tree_
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/genps.f:1167
#6  0x40bb83 in gen_mom_
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/genps.f:68
#7  0x40d1a9 in x_to_f_arg_
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/genps.f:60
#8  0x45c804 in sample_full_
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/Source/dsample.f:172
#9  0x434269 in driver
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/driver.f:256
#10  0x40371e in main
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/driver.f:301
Floating point exception (core dumped)

Note: rotxxx is also where I had already found the crash in the susy tests, see #826 (comment).
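
When a backtrace only shows raw addresses (as in the first crash report in this issue, before rebuilding with -g), the frame addresses can be extracted from a saved log and passed to addr2line from GNU binutils. A minimal sketch, assuming a hypothetical helper named backtrace_addresses and a hypothetical log file name; file and line information is only available if the binary was built with debug symbols, and addresses inside shared libraries (0x7f...) will generally not resolve against the executable:

```shell
#!/bin/sh
# Hypothetical helper: print the frame addresses found in a gfortran-style
# backtrace (lines like "#3  0x44b5ff in ???"), one address per line.
backtrace_addresses() {
  sed -n 's/^#[0-9][0-9]*  *\(0x[0-9a-f][0-9a-f]*\) in .*/\1/p' "$1"
}

# Possible usage (commented out because it needs the real binary and log;
# crash.log is an assumed file name):
#   backtrace_addresses crash.log | xargs addr2line -f -C -e ./madevent_cpp
```

Rebuilding with -g, as done above, gives the same information directly in the backtrace, so this is mainly useful for logs that were already collected.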

valassi changed the title from "tmad test crashes for some iconfig (channel/iconfig mapping issues and SIGFPE erroneous arithmetic operation)" to "tmad test crashes for some iconfig (SIGFPE erroneous arithmetic operation: crash in rotxxx and/or channel/iconfig mapping issues?)" Jun 2, 2024

valassi commented Jun 2, 2024

As discussed in #826 this is again a weird optimization issue: gdb gives

Program received signal SIGFPE, Arithmetic exception.
rotxxx (p=..., q=..., prot=...) at aloha_functions.f:1247
1247              prot(1) = q(1)*q(3)/qq/qt*p1 -q(2)/qt*p(2) +q(1)/qq*p(3)
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.34-60.el9.x86_64 libgcc-11.3.1-4.3.el9.alma.x86_64 libgfortran-11.3.1-4.3.el9.alma.x86_64 libgomp-11.3.1-4.3.el9.alma.x86_64 libquadmath-11.3.1-4.3.el9.alma.x86_64 libstdc++-11.3.1-4.3.el9.alma.x86_64
(gdb) p qq qt p1
A syntax error in expression, near `qt p1'.
(gdb) p qq
$1 = <optimized out>
(gdb) p qt
$2 = <optimized out>
(gdb) p p1
$3 = <optimized out>

This was with -O3 -g. If I use lower optimization levels, the issue disappears.

As I have done with many SIGFPEs in cudacpp, I tried adding volatile:

--- a/epochX/cudacpp/gg_ttgg.mad/Source/DHELAS/aloha_functions.f
+++ b/epochX/cudacpp/gg_ttgg.mad/Source/DHELAS/aloha_functions.f
@@ -1201,7 +1201,7 @@ c       real    prot(0:3)      : four-momentum p in the rotated frame
 c
       implicit none
       double precision p(0:3),q(0:3),prot(0:3),qt2,qt,psgn,qq,p1
-
+      volatile qt, p1, qq
       double precision rZero, rOne
       parameter( rZero = 0.0d0, rOne = 1.0d0 )

Strangely enough, this prevents the SIGFPE. But now the code seems stuck in an infinite loop?


valassi commented Jun 2, 2024

I tried cuda to make it faster.

Again something strange, the code crashes without valgrind but does not crash with valgrind... (NB this is WITHOUT volatile)

cd gg_ttgg.mad/SubProcesses/P1_gg_ttxgg
make cleanall
make -j BACKEND=cuda -f cudacpp.mk debug
make -j BACKEND=cuda
cat > input_cudacpp_104 << EOF
8192 1 1 ! Number of events and max and min iterations
0.000001 ! Accuracy (ignored because max iterations = min iterations)
0 ! Grid Adjustment 0=none, 2=adjust (NB if = 0, ftn26 will still be used if present)
1 ! Suppress Amplitude 1=yes (i.e. use MadEvent single-diagram enhancement)
0 ! Helicity Sum/event 0=exact
104 ! Channel number (1-N) for single-diagram enhancement multi-channel (NB used even if suppress amplitude is 0!)
EOF
./madevent_cuda < input_cudacpp_104
valgrind ./madevent_cuda < input_cudacpp_104


valassi commented Jun 2, 2024

Ok. In the cuda version, adding volatile in the Fortran removes SIGFPE and allows the program to reach the end.

So IS THIS A POSSIBLE FIX?

With cpp maybe I just needed to wait? Or is this going slower? I will try to rerun more tests and leave them running.

(In the meantime I will also try the susy_gg_t1t1 channel which in the past seemed problematic with SIGFPE).
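
To tell apart a SIGFPE crash, a possible hang, and a normal end when leaving tests running, one option is to run the executable under a time limit and inspect the exit status. A minimal sketch, assuming GNU coreutils timeout is available and that SIGFPE is signal 8 (as on Linux), so that the shell reports status 136 = 128 + 8; the helper name run_classified and the 600 second limit are hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: run a command with stdin taken from a file, under a
# time limit, and print a one-word classification of the outcome.
# Usage: run_classified SECONDS INPUT_FILE COMMAND [ARGS...]
run_classified() {
  limit="$1"; input="$2"; shift 2
  timeout "$limit" "$@" < "$input" > /dev/null 2>&1
  status=$?
  case "$status" in
    0)   echo "ok" ;;               # program reached the end
    124) echo "timeout" ;;          # GNU timeout: time limit exceeded
    136) echo "sigfpe" ;;           # 128 + 8, killed by SIGFPE (assumes Linux)
    *)   echo "failed($status)" ;;  # any other non-zero exit status
  esac
}

# Possible usage (commented out because it needs the real binary):
#   run_classified 600 input_cudacpp_104 ./madevent_cpp
```

A "timeout" result would not prove an infinite loop, only that the run did not finish within the chosen limit.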

valassi added a commit to valassi/madgraph4gpu that referenced this issue Jun 2, 2024
… test a different iconfig

In particular: the following triggers a SIGFPE reported in madgraph5#855 (crash in rotxxx that can be fixed adding volatile?)
  ./tmad/madX.sh -ggttgg -iconfig 104 -makeclean

This also triggers a similar SIGFPE (initially reported in madgraph5#826)
  ./tmad/madX.sh -susyggt1t1 -iconfig 2 -makeclean
valassi added a commit to valassi/madgraph4gpu that referenced this issue Jun 2, 2024
…SIGFPE madgraph5#855, and add volatile in aloha_functions.f to try to fix it

The SIGFPE crash madgraph5#855 does seem to disappear in
  ./tmad/madX.sh -ggttgg -iconfig 104 -makeclean
However, there is now a DIFFERENT issue, an lhe file mismatch between fortran and cpp (madgraph5#856)
This is probably due to the iconfig/channel mapping issue reported by Olivier in madgraph5#852
valassi added a commit to valassi/madgraph4gpu that referenced this issue Jun 2, 2024
…ebug SIGFPE madgraph5#855, and add volatile in aloha_functions.f to try to fix it

The SIGFPE crash madgraph5#855 does seem to disappear in
  ./tmad/madX.sh -susyggt1t1 -iconfig 2 -makeclean
Then no cross section is printed also for this iconfig (same as madgraph5#826 for iconfig 1), but this is a DIFFERENT issue
valassi added a commit to valassi/madgraph4gpu that referenced this issue Jun 2, 2024
…: note that SIGFPE madgraph5#855 is still fixed because volatile has been added
valassi added a commit to valassi/madgraph4gpu that referenced this issue Jun 2, 2024
valassi added a commit to valassi/madgraph4gpu that referenced this issue Jun 2, 2024
…dgraph5#855 in rotxxx

The issue was observed and fixed in gg_ttgg (iconfig 104) and susy_gg_t1t1 (iconfig 2), the backport as usual is from gg_tt

Note that aloha_functions.f is now added to the list of files to include when preparing patch.common

./CODEGEN/generateAndCompare.sh gg_tt --mad --nopatch
git diff --no-ext-diff -R gg_tt.mad/Source/makefile gg_tt.mad/Source/dsample.f gg_tt.mad/Source/DHELAS/aloha_functions.f gg_tt.mad/Source/genps.inc gg_tt.mad/SubProcesses/makefile > CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/MG5aMC_patches/PROD/patch.common
git diff --no-ext-diff -R gg_tt.mad/bin/internal/banner.py gg_tt.mad/bin/internal/gen_ximprove.py gg_tt.mad/bin/internal/madevent_interface.py >> CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/MG5aMC_patches/PROD/patch.common
git diff --no-ext-diff -R gg_tt.mad/SubProcesses/P1_gg_ttx/auto_dsig1.f gg_tt.mad/SubProcesses/P1_gg_ttx/driver.f gg_tt.mad/SubProcesses/P1_gg_ttx/matrix1.f > CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/MG5aMC_patches/PROD/patch.P1
git checkout gg_tt.mad
valassi added a commit to valassi/madgraph4gpu that referenced this issue Jun 2, 2024
valassi added a commit to valassi/madgraph4gpu that referenced this issue Jun 2, 2024

valassi commented Jun 2, 2024

This is fixed in #857 by adding volatile, as I had done for similar SIGFPEs in cudacpp.

valassi added a commit to valassi/madgraph4gpu that referenced this issue Jun 3, 2024
…de with no volatile, to rerun tmad and expose SIGFPE madgraph5#855

git checkout upstream/master susy_gg_t1t1.mad gg_ttgg.mad
valassi added a commit to valassi/madgraph4gpu that referenced this issue Jun 3, 2024
…se SIGFPE madgraph5#855 - will revert

./tmad/teeMadX.sh -mix -makeclean +10x -ggttgg -susyggt1t1
valassi added a commit to valassi/madgraph4gpu that referenced this issue Jun 3, 2024
…h confirmed that SIGFPE madgraph5#855 was present and is now fixed

Revert "[tmad] temporarely rerun tmad tests for ggttgg and susyggt1t1 to expose SIGFPE madgraph5#855 - will revert"
This reverts commit 4fa1790.

Revert "[tmad] in gg_ttgg.mad and susy_gg_t1t1.mad, temporarely go back to code with no volatile, to rerun tmad and expose SIGFPE madgraph5#855"
This reverts commit 2f32ffd.

valassi commented Jun 3, 2024

I completed my tests in PR #857 and I confirm that it fixes this issue. Closing.


valassi commented Jun 24, 2024

Reopening until PR #857 is merged - or until this is otherwise clarified

@valassi valassi reopened this Jun 24, 2024
valassi added a commit to valassi/madgraph4gpu that referenced this issue Jun 26, 2024
… test a different iconfig

In particular: the following triggers a SIGFPE reported in madgraph5#855 (crash in rotxxx that can be fixed adding volatile?)
  ./tmad/madX.sh -ggttgg -iconfig 104 -makeclean

This also triggers a similar SIGFPE (initially reported in madgraph5#826)
  ./tmad/madX.sh -susyggt1t1 -iconfig 2 -makeclean
valassi added a commit to valassi/madgraph4gpu that referenced this issue Jun 26, 2024
…g AS-IS Olivier's patches from the latest fix_826 branch for PR madgraph5#850

The gg_ttgg test still crashes (rotxxx madgraph5#855?)
./tmad/madX.sh -ggttgg -iconfig 104 -makeclean
  *** (2-none) EXECUTE MADEVENT_CPP x1 (create events.lhe) ***
  Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
  Backtrace for this error:
   0  0x7fce5ec23860 in ???
   1  0x7fce5ec22a05 in ???
   2  0x7fce5e854def in ???
   3  0x44b5ff in ???
   4  0x4087df in ???
   5  0x409848 in ???
   6  0x40bb83 in ???
   7  0x40d1a9 in ???
   8  0x45c804 in ???
   9  0x434269 in ???
   10  0x40371e in ???
   11  0x7fce5e83feaf in ???
   12  0x7fce5e83ff5f in ???
   13  0x403844 in ???
   14  0xffffffffffffffff in ???
  ./tmad/madX.sh: line 387: 3913008 Floating point exception(core dumped) $timecmd $cmd < ${tmpin} > ${tmp}

The susy_gg_t1t1 test also still crashes (see madgraph5#826?), this looks like the same crash as ggttgg above
./tmad/madX.sh -susyggt1t1 -iconfig 2 -makeclean
  *** (2-none) EXECUTE MADEVENT_CPP x1 (create events.lhe) ***
  Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
  Backtrace for this error:
   0  0x7f9f03423860 in ???
   1  0x7f9f03422a05 in ???
   2  0x7f9f03054def in ???
   3  0x43809f in ???
   4  0x40581f in ???
   5  0x4067b1 in ???
   6  0x408c71 in ???
   7  0x40a0a9 in ???
   8  0x444fdf in ???
   9  0x42bb38 in ???
   10  0x40371e in ???
   11  0x7f9f0303feaf in ???
   12  0x7f9f0303ff5f in ???
   13  0x403844 in ???
   14  0xffffffffffffffff in ???
  ./tmad/madX.sh: line 387: 3907179 Floating point exception(core dumped) $timecmd $cmd < ${tmpin} > ${tmp}

The gqttq test also still crashes intermittently, i.e. only on the second execution (madgraph5#845?)
./tmad/teeMadX.sh -gqttq +10x -fltonly -makeclean
./tmad/teeMadX.sh -gqttq +10x -fltonly
  Executing ' ./build.512z_f_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_gqttq_x1_cudacpp > /tmp/avalassi/output_gqttq_x1_cudacpp'
  Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
  Backtrace for this error:
   0  0x7fbafa623860 in ???
   1  0x7fbafa622a05 in ???
   2  0x7fbafa254def in ???
   3  0x7fbafad24034 in ???
   4  0x7fbafa9a1575 in ???
   5  0x7fbafad20c89 in ???
   6  0x7fbafad2abfd in ???
   7  0x7fbafad30491 in ???
   8  0x43008b in ???
   9  0x431c10 in ???
   10  0x432d47 in ???
   11  0x433b1e in ???
   12  0x44a921 in ???
   13  0x42ebbf in ???
   14  0x40371e in ???
   15  0x7fbafa23feaf in ???
   16  0x7fbafa23ff5f in ???
   17  0x403844 in ???
   18  0xffffffffffffffff in ???
  ./madX.sh: line 387: 3922797 Floating point exception(core dumped) $timecmd $cmd < ${tmpin} > ${tmp}
  ERROR! ' ./build.512z_f_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_gqttq_x1_cudacpp > /tmp/avalassi/output_gqttq_x1_cudacpp' failed
valassi added a commit to valassi/madgraph4gpu that referenced this issue Jun 26, 2024
…nd cudacpp.mk to improve the crash dumps

The susyggt1t1 test clearly crashes in rotxxx (madgraph5#855):
./tmad/madX.sh -susyggt1t1 -iconfig 2 -makeclean
  *** (2-none) EXECUTE MADEVENT_CPP x1 (create events.lhe) ***
  Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
  Backtrace for this error:
   0  0x7fb7e1223860 in ???
   1  0x7fb7e1222a05 in ???
   2  0x7fb7e0e54def in ???
   3  0x43809f in rotxxx_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/susy_gg_t1t1.mad/Source/DHELAS/aloha_functions.f:1247
   4  0x40581f in gentcms_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/susy_gg_t1t1.mad/SubProcesses/P1_gg_t1t1x/genps.f:1480
   5  0x4067b1 in one_tree_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/susy_gg_t1t1.mad/SubProcesses/P1_gg_t1t1x/genps.f:1167
   6  0x408c71 in gen_mom_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/susy_gg_t1t1.mad/SubProcesses/P1_gg_t1t1x/genps.f:68
   7  0x40a0a9 in x_to_f_arg_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/susy_gg_t1t1.mad/SubProcesses/P1_gg_t1t1x/genps.f:60
   8  0x444fdf in sample_full_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/susy_gg_t1t1.mad/Source/dsample.f:172
   9  0x42bb38 in driver
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/susy_gg_t1t1.mad/SubProcesses/P1_gg_t1t1x/driver.f:256
   10  0x40371e in main
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/susy_gg_t1t1.mad/SubProcesses/P1_gg_t1t1x/driver.f:301
  ./tmad/madX.sh: line 387: 3928626 Floating point exception(core dumped) $timecmd $cmd < ${tmpin} > ${tmp}
  ERROR! ' ./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_susyggt1t1_x1_cudacpp > /tmp/avalassi/output_susyggt1t1_x1_cudacpp' failed

The ggttgg test also clearly crashes in rotxxx (madgraph5#855):
./tmad/madX.sh -ggttgg -iconfig 104 -makeclean^C
  *** (2-none) EXECUTE MADEVENT_CPP x1 (create events.lhe) ***
  Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
  Backtrace for this error:
   0  0x7fb141c23860 in ???
   1  0x7fb141c22a05 in ???
   2  0x7fb141854def in ???
   3  0x44b5ff in rotxxx_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/Source/DHELAS/aloha_functions.f:1247
   4  0x4087df in gentcms_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/genps.f:1480
   5  0x409848 in one_tree_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/genps.f:1167
   6  0x40bb83 in gen_mom_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/genps.f:68
   7  0x40d1a9 in x_to_f_arg_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/genps.f:60
   8  0x45c804 in sample_full_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/Source/dsample.f:172
   9  0x434269 in driver
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/driver.f:256
   10  0x40371e in main
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/driver.f:301
  ./tmad/madX.sh: line 387: 3933302 Floating point exception(core dumped) $timecmd $cmd < ${tmpin} > ${tmp}
  ERROR! ' ./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggttgg_x1_cudacpp > /tmp/avalassi/output_ggttgg_x1_cudacpp' failed

The gqttq test instead clearly crashes in sigmaKin (madgraph5#845):
./tmad/teeMadX.sh -gqttq +10x -fltonly -makeclean
./tmad/teeMadX.sh -gqttq +10x -fltonly
  Executing ' ./build.512z_f_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_gqttq_x10_cudacpp > /tmp/avalassi/output_gqttq_x10_cudacpp'
  Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
  Backtrace for this error:
   0  0x7f607ee23860 in ???
   1  0x7f607ee22a05 in ???
   2  0x7f607ea54def in ???
   3  0x7f607f607008 in _ZN9mg5amcCpu8sigmaKinEPKfS1_S1_S1_PfjS2_S2_PiS3_i._omp_fn.0
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/CPPProcess.cc:1190
   4  0x7f607f4ab575 in ???
   5  0x7f607f603c89 in _ZN9mg5amcCpu8sigmaKinEPKfS1_S1_S1_PfjS2_S2_PiS3_i
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/CPPProcess.cc:1093
   6  0x7f607f60dbfd in _ZN9mg5amcCpu23MatrixElementKernelHost21computeMatrixElementsEj
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/MatrixElementKernels.cc:115
   7  0x7f607f613491 in _ZN9mg5amcCpu6BridgeIdE12cpu_sequenceEPKdS3_S3_S3_jPdPiS5_b
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/Bridge.h:390
   8  0x7f607f613491 in fbridgesequence_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/fbridge.cc:106
   9  0x43008b in smatrix1_multi_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/auto_dsig1.f:618
   10  0x431c10 in dsig1_vec_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/auto_dsig1.f:445
   11  0x432d47 in dsigproc_vec_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/auto_dsig.f:1034
   12  0x433b1e in dsig_vec_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/auto_dsig.f:327
   13  0x44a921 in sample_full_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/Source/dsample.f:208
   14  0x42ebbf in driver
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/driver.f:256
   15  0x40371e in main
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/driver.f:301
  ./madX.sh: line 387: 3941122 Floating point exception(core dumped) $timecmd $cmd < ${tmpin} > ${tmp}
  ERROR! ' ./build.512z_f_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_gqttq_x10_cudacpp > /tmp/avalassi/output_gqttq_x10_cudacpp' failed
valassi added a commit to valassi/madgraph4gpu that referenced this issue Jun 26, 2024
…g AS-IS Olivier's patches from the latest fix_826 branch for PR madgraph5#852

valassi added a commit to valassi/madgraph4gpu that referenced this issue Jun 26, 2024
…nd cudacpp.mk to improve the crash dumps

The susyggt1t1 test clearly crashes in rotxxx (madgraph5#855):
./tmad/madX.sh -susyggt1t1 -iconfig 2 -makeclean
  *** (2-none) EXECUTE MADEVENT_CPP x1 (create events.lhe) ***
  Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
  Backtrace for this error:
   0  0x7fb7e1223860 in ???
   1  0x7fb7e1222a05 in ???
   2  0x7fb7e0e54def in ???
   3  0x43809f in rotxxx_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/susy_gg_t1t1.mad/Source/DHELAS/aloha_functions.f:1247
   4  0x40581f in gentcms_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/susy_gg_t1t1.mad/SubProcesses/P1_gg_t1t1x/genps.f:1480
   5  0x4067b1 in one_tree_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/susy_gg_t1t1.mad/SubProcesses/P1_gg_t1t1x/genps.f:1167
   6  0x408c71 in gen_mom_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/susy_gg_t1t1.mad/SubProcesses/P1_gg_t1t1x/genps.f:68
   7  0x40a0a9 in x_to_f_arg_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/susy_gg_t1t1.mad/SubProcesses/P1_gg_t1t1x/genps.f:60
   8  0x444fdf in sample_full_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/susy_gg_t1t1.mad/Source/dsample.f:172
   9  0x42bb38 in driver
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/susy_gg_t1t1.mad/SubProcesses/P1_gg_t1t1x/driver.f:256
   10  0x40371e in main
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/susy_gg_t1t1.mad/SubProcesses/P1_gg_t1t1x/driver.f:301
  ./tmad/madX.sh: line 387: 3928626 Floating point exception(core dumped) $timecmd $cmd < ${tmpin} > ${tmp}
  ERROR! ' ./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_susyggt1t1_x1_cudacpp > /tmp/avalassi/output_susyggt1t1_x1_cudacpp' failed

The ggttgg test also clearly crashes in rotxxx (madgraph5#855):
./tmad/madX.sh -ggttgg -iconfig 104 -makeclean
  *** (2-none) EXECUTE MADEVENT_CPP x1 (create events.lhe) ***
  Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
  Backtrace for this error:
   0  0x7fb141c23860 in ???
   1  0x7fb141c22a05 in ???
   2  0x7fb141854def in ???
   3  0x44b5ff in rotxxx_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/Source/DHELAS/aloha_functions.f:1247
   4  0x4087df in gentcms_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/genps.f:1480
   5  0x409848 in one_tree_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/genps.f:1167
   6  0x40bb83 in gen_mom_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/genps.f:68
   7  0x40d1a9 in x_to_f_arg_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/genps.f:60
   8  0x45c804 in sample_full_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/Source/dsample.f:172
   9  0x434269 in driver
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/driver.f:256
   10  0x40371e in main
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/driver.f:301
  ./tmad/madX.sh: line 387: 3933302 Floating point exception(core dumped) $timecmd $cmd < ${tmpin} > ${tmp}
  ERROR! ' ./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggttgg_x1_cudacpp > /tmp/avalassi/output_ggttgg_x1_cudacpp' failed

The gqttq test instead clearly crashes in sigmaKin (madgraph5#845):
./tmad/teeMadX.sh -gqttq +10x -fltonly -makeclean
./tmad/teeMadX.sh -gqttq +10x -fltonly
  Executing ' ./build.512z_f_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_gqttq_x10_cudacpp > /tmp/avalassi/output_gqttq_x10_cudacpp'
  Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
  Backtrace for this error:
   0  0x7f607ee23860 in ???
   1  0x7f607ee22a05 in ???
   2  0x7f607ea54def in ???
   3  0x7f607f607008 in _ZN9mg5amcCpu8sigmaKinEPKfS1_S1_S1_PfjS2_S2_PiS3_i._omp_fn.0
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/CPPProcess.cc:1190
   4  0x7f607f4ab575 in ???
   5  0x7f607f603c89 in _ZN9mg5amcCpu8sigmaKinEPKfS1_S1_S1_PfjS2_S2_PiS3_i
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/CPPProcess.cc:1093
   6  0x7f607f60dbfd in _ZN9mg5amcCpu23MatrixElementKernelHost21computeMatrixElementsEj
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/MatrixElementKernels.cc:115
   7  0x7f607f613491 in _ZN9mg5amcCpu6BridgeIdE12cpu_sequenceEPKdS3_S3_S3_jPdPiS5_b
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/Bridge.h:390
   8  0x7f607f613491 in fbridgesequence_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/fbridge.cc:106
   9  0x43008b in smatrix1_multi_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/auto_dsig1.f:618
   10  0x431c10 in dsig1_vec_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/auto_dsig1.f:445
   11  0x432d47 in dsigproc_vec_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/auto_dsig.f:1034
   12  0x433b1e in dsig_vec_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/auto_dsig.f:327
   13  0x44a921 in sample_full_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/Source/dsample.f:208
   14  0x42ebbf in driver
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/driver.f:256
   15  0x40371e in main
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/driver.f:301
  ./madX.sh: line 387: 3941122 Floating point exception(core dumped) $timecmd $cmd < ${tmpin} > ${tmp}
  ERROR! ' ./build.512z_f_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_gqttq_x10_cudacpp > /tmp/avalassi/output_gqttq_x10_cudacpp' failed

Conclusion: I would not merge 852 as it does not yet fix these issues.
Instead I would merge 857, which fixes the rotxxx crash 855 using volatile, and reassess from there...
valassi added a commit to valassi/madgraph4gpu that referenced this issue Jun 27, 2024
…sm to bypass known issues in tmad tests

Currently the following 12 (4 processes x 3 fptypes) issues are bypassed
- "No cross section in ${proc%.mad} for FPTYPE=d,f,m (madgraph5#826)" for susy_gg_t1t1
- "SIGFPE crash in rotxxx in ${proc%.mad} for FPTYPE=d,f,m (madgraph5#855)" for gq_ttq, pp_tt012j, nobm_pp_ttW
valassi added a commit to valassi/madgraph4gpu that referenced this issue Jun 27, 2024
valassi added a commit to valassi/mg5amcnlo that referenced this issue Jun 27, 2024
valassi added a commit to valassi/madgraph4gpu that referenced this issue Jun 27, 2024
valassi added a commit to valassi/madgraph4gpu that referenced this issue Jun 27, 2024
…raph5#855 crashes in rotxxx (move this upstream as suggested by Olivier)
@valassi valassi changed the title tmad test crashes for some iconfig (SIGFPE erroneous arithmetic operation: crash in rotxxx and/or channel/iconfig mapping issues?) tmad test crashes in rotxxx (SIGFPE erroneous arithmetic operation) Jun 27, 2024
@valassi
Member Author

valassi commented Jun 27, 2024

I changed the title of this issue to indicate that it is ONLY about rotxxx crashes. These can be fixed using 'volatile', as done in PR #857 and mg5amcnlo/mg5amcnlo#113.

Conversely, I removed "channel/iconfig mapping issues" from the title of this issue. Those "channel/iconfig mapping issues" are behind the LHE mismatch #856 and possibly the intermittent sigmaKin crash #845.
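
For readers unfamiliar with this kind of workaround, here is a minimal C++ sketch of the idea behind the 'volatile' fix. It is NOT the Fortran code of rotxxx in aloha_functions.f and NOT the actual patch. The function name `rotate_to_axis`, the use of 3-vectors, and the choice of which variable is marked volatile are all assumptions made for illustration. The assumed failure mode is that an optimizing compiler evaluates the divisions before the zero check that guards them, which raises SIGFPE when floating-point traps are enabled. Reading the guard value through a volatile variable is meant to stop the compiler from moving that arithmetic ahead of the check.

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Hypothetical C++ analogue of a rotxxx-style rotation (illustration only).
// Rotates the 3-vector p from a frame where the reference axis is +z
// into the frame where the reference axis points along q.
std::array<double, 3> rotate_to_axis(const std::array<double, 3>& p,
                                     const std::array<double, 3>& q)
{
  // ASSUMPTION: marking the transverse norm volatile forces a real store and
  // reload, so the divisions in the 'else' branch cannot be speculated
  // ahead of the qt2 == 0 check.
  volatile double qt2 = q[0] * q[0] + q[1] * q[1];
  if (qt2 == 0.0)
  {
    // q has no transverse component: no division is needed at all.
    if (q[2] == 0.0) return p;                // null q: leave p unchanged
    const double sgn = std::copysign(1.0, q[2]);
    return { p[0] * sgn, p[1] * sgn, p[2] * sgn };
  }
  else
  {
    const double qt2val = qt2;                // single volatile read
    const double qq = std::sqrt(qt2val + q[2] * q[2]);
    const double qt = std::sqrt(qt2val);
    assert(qq > 0.0 && qt > 0.0);             // guaranteed by the check above
    return { q[0] * q[2] / qq / qt * p[0] - q[1] / qt * p[1] + q[0] / qq * p[2],
             q[1] * q[2] / qq / qt * p[0] + q[0] / qt * p[1] + q[1] / qq * p[2],
             -qt / qq * p[0] + q[2] / qq * p[2] };
  }
}
```

With q along +z the function returns p unchanged, and with q along +x it maps the z axis onto the x axis. Neither case divides by zero. Whether the compiler really speculates the division without 'volatile' depends on the compiler version and optimization flags, so this sketch only illustrates the intent of the workaround.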

valassi added a commit to mg5amcnlo/mg5amcnlo that referenced this issue Jun 27, 2024
valassi added a commit to valassi/madgraph4gpu that referenced this issue Jun 27, 2024
valassi added a commit to valassi/madgraph4gpu that referenced this issue Jun 27, 2024
… test a different iconfig

In particular: the following triggers a SIGFPE reported in madgraph5#855 (crash in rotxxx that can be fixed adding volatile?)
  ./tmad/madX.sh -ggttgg -iconfig 104 -makeclean

This also triggers a similar SIGFPE (initially reported in madgraph5#826)
  ./tmad/madX.sh -susyggt1t1 -iconfig 2 -makeclean
valassi added a commit to valassi/madgraph4gpu that referenced this issue Jun 27, 2024
valassi added a commit to valassi/madgraph4gpu that referenced this issue Jun 28, 2024
…p and gpucpp_826, to allow cherry-picking Olivier's fix_826 changes

(later on, will include Olivier's gpucpp_826 change into gpucpp directly)

Revert "[tmad] update mg5amcnlo to f274cab55, adding volatile to prevent madgraph5#855 crashes in rotxxx (move this upstream as suggested by Olivier)"
This reverts commit 720ae02.

Revert "[valgrind] upgrade MG5AMC to include the merge of PR madgraph5#110 and PR madgraph5#112 into the gpucpp branch"
This reverts commit 7d3dc34.

Revert "[valgrind] upgrade MG5AMC to include the workaround for uninitialised values mg5amcnlo/mg5amcnlo#111"
This reverts commit f355965.

Revert "[valgrind] upgrade MG5AMC to include the fix for memory leak mg5amcnlo/mg5amcnlo#109"
This reverts commit 7bb4142.
valassi added a commit to valassi/madgraph4gpu that referenced this issue Jul 1, 2024
…t on Olivier's latest fix_826 commit d23e773

1) Note about Olivier's latest fix_826 commit d23e773

Olivier's 75c05c5 includes his initial 6 commits in fix_826:

git log upstream/master --oneline -n1
  0992927 (upstream/master, origin/color2, origin/actions) Merge pull request madgraph5#857 from valassi/tmad
git log --oneline 0992927..75c05c5
  75c05c5 Merge branch 'master_june24' into fix_826
  92a8284 better comment in coloramps
  2bcea76 trying to fix git issue
  63494ef change to Andrea convention of naming (but removing step variable)
  5b6d065 increase readibility and move from map to array
  41ddc38 fix a issue for omp compilation
  bed2e12 try to fix the segfault on issue 826

Olivier's d23e773 is then a merge of the latest upstream/master in 75c05c5, fixing the MG5AMC conflict by setting it to 74fd166c1

git show d23e773
  Merge: 75c05c5 0992927
  update this branch with andrea fix in master
  diff --cc MG5aMC/mg5amcnlo
   - Subproject commit 10378b3c0971e1a241fd9dc365e592c92d1f13ba
    -Subproject commit f274cab55d5d983c5612ca7ab3417ee796aa1a8c
   ++Subproject commit 74fd166c1e22bde2dfe01b2e001ac3b177628165

2) Note that, in MG5AMC, 74fd166c1 (obsolete branch gpucpp_826) is the same as 09c96dd17 (branch gpucpp):

git diff 74fd166c1 09c96dd17
  [NO DIFF]

git log --oneline e428e38c6..09c96dd17
  09c96dd17 (origin/gpucpp) allow for second exporter to have access to all variable used in the fortran exporter
  9abf6a3ad Merge pull request madgraph5#113 from valassi/valassi_volatile
  f274cab55 (ghav/valassi_volatile, valassi_volatile) Workaround for SIGFPE crashes in function rotxxx (madgraph5#855): add 'volatile' to prevent optimizations
  0b8678984 Merge pull request madgraph5#112 from valassi/valassi_uninitialised111
  18696c1cf Merge pull request madgraph5#110 from valassi/valassi_leak109
  4f8fbb7f3 (ghav/valassi_uninitialised111) Workaround for issue madgraph5#111 reported by valgrind (initialise goodjet array in function setclscales in reweight.f)
  f6d90fa58 (ghav/valassi_leak109, valassi_leak109) Fix memory leak madgraph5#109 in madevent_driver.f (close file dname.mg)
  f9f957918 (valgrind) Fix validity time check for UFO pickle (madgraph5#97)
  619f5db45 avoid that some parameter switch type when loading model

git log --oneline e428e38c6..74fd166c
  74fd166c1 (HEAD, origin/gpucpp_826, gpucpp_826) Merge remote-tracking branch 'origin/gpucpp' (PR madgraph5#113 for madgraph5#855 crash in rotxxx) into gpucpp_826
  9abf6a3ad Merge pull request madgraph5#113 from valassi/valassi_volatile
  f274cab55 (ghav/valassi_volatile, valassi_volatile) Workaround for SIGFPE crashes in function rotxxx (madgraph5#855): add 'volatile' to prevent optimizations
  e4d9df4ab Merge remote-tracking branch 'origin/gpucpp' (PRs madgraph5#110 and madgraph5#112 for issues madgraph5#109 and madgraph5#111) into gpucpp_826
  0b8678984 Merge pull request madgraph5#112 from valassi/valassi_uninitialised111
  18696c1cf Merge pull request madgraph5#110 from valassi/valassi_leak109
  4f8fbb7f3 (ghav/valassi_uninitialised111) Workaround for issue madgraph5#111 reported by valgrind (initialise goodjet array in function setclscales in reweight.f)
  f6d90fa58 (ghav/valassi_leak109, valassi_leak109) Fix memory leak madgraph5#109 in madevent_driver.f (close file dname.mg)
  10378b3c0 allow for second exporter to have access to all variable used in the fortran exporter
  f9f957918 (valgrind) Fix validity time check for UFO pickle (madgraph5#97)
  619f5db45 avoid that some parameter switch type when loading model

3) Note that color includes the following submodule updates, passing through 09c96dd17 to ba54a4153

git show --oneline upstream/master..color ../../MG5aMC/
  4b29496 [color] update MG5AMC to ba54a4153 in th egpuccp branch, with a minor fix in a comment for my icolamp patch
  Submodule MG5aMC/mg5amcnlo 99e064157..ba54a4153:
    > minor fix in a printout in my previous patch in export_cpp.py
  1c2a02d [color] update MG5AMC to 99e064157, fixing bug madgraph5#856 (and related ones) about the icolamp array in coloramps.h
  Submodule MG5aMC/mg5amcnlo 09c96dd17..99e064157:
    > In export_cpp.py fix bug madgraph5#114 in get_icolamp_lines, resulting in different icolamp arrays for F77 and CPP (see madgraph5#873)
  0a60262 [color] update MG5AMC to 09c96dd17: this is the latest gpucpp branch, now including Olivier's extra commit previously in gpucpp_826
  Submodule MG5aMC/mg5amcnlo 10378b3c0...09c96dd17:
    > allow for second exporter to have access to all variable used in the fortran exporter
    > Merge pull request madgraph5#113 from valassi/valassi_volatile
    > Merge pull request madgraph5#112 from valassi/valassi_uninitialised111
    > Merge pull request madgraph5#110 from valassi/valassi_leak109
    < allow for second exporter to have access to all variable used in the fortran exporter
  16ff942 try to fix the segfault on issue 826
  Submodule MG5aMC/mg5amcnlo f9f957918..10378b3c0:
    > allow for second exporter to have access to all variable used in the fortran exporter
  4b12e79 [color] temporarely downgrade back MG5AMC to the common base of gpucpp and gpucpp_826, to allow cherry-picking Olivier's fix_826 changes >
  Submodule MG5aMC/mg5amcnlo f274cab55..f9f957918 (rewind):
    < Workaround for SIGFPE crashes in function rotxxx (madgraph5#855): add 'volatile' to prevent optimizations
    < Merge pull request madgraph5#112 from valassi/valassi_uninitialised111
    < Merge pull request madgraph5#110 from valassi/valassi_leak109

=> Therefore I can simply merge origin/color into color2 and fix the MG5AMC conflict by setting it to ba54a4153 (valassi_icolamp114, before more recent changes)
@valassi
Member Author

valassi commented Jul 4, 2024

Note: there is a crash #885 in master_june40 that I thought was related to this, but it is most likely unrelated (and is instead specific to master_june40).
