(4 in pipeline) Faster RDTSC-based timers and new timer/counter APIs #1018

Draft: wants to merge 123 commits into base branch valassi_3_grid
Conversation

valassi (Member) commented Oct 7, 2024

This PR includes faster RDTSC-based timers and new timer/counter APIs. It completes #972.

This PR ("prof0") was derived from the pre-existing PR #962 ("prof") by stripping off the second part (additional profiling of non-ME Fortran components) and keeping only the first part (new RDTSC-based timers and new APIs).

The idea is that the additional profiling of non-ME Fortran components will be done at a later time in #962, which will be modified to include patches in upstream mg5amcnlo as suggested by @oliviermattelaer, rather than relying on patchMad.sh with much larger patches, as is done at present.
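For context, a minimal sketch of what an RDTSC-based timer can look like (an illustration, not the actual code of this PR; it assumes an x86-64 CPU with an invariant TSC and a GCC/Clang compiler):

#include <x86intrin.h> // __rdtsc() on GCC/Clang for x86-64
#include <chrono>
#include <cstdint>

// Calibrate TSC ticks per second once, against std::chrono::steady_clock
static double ticksPerSecond()
{
  static const double tps = []() {
    const uint64_t t0 = __rdtsc();
    const auto c0 = std::chrono::steady_clock::now();
    while( std::chrono::steady_clock::now() - c0 < std::chrono::milliseconds( 100 ) ) {} // busy-wait ~0.1s
    return ( __rdtsc() - t0 ) * 10.; // ticks per second, from ticks per ~0.1s
  }();
  return tps;
}

// A cumulative timer: each start/stop pair costs two RDTSC reads,
// typically much cheaper than two std::chrono::steady_clock calls
struct RdtscTimer
{
  uint64_t tstart = 0;
  uint64_t ttotal = 0; // accumulated ticks
  void start() { tstart = __rdtsc(); }
  void stop() { ttotal += __rdtsc() - tstart; }
  double totalSeconds() const { return ttotal / ticksPerSecond(); }
};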

valassi added 30 commits August 10, 2024 13:27
…toring of counters using maps and explicit register methods
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] PROGRAM TOTAL          :    1.4510s
 [COUNTERS] Fortran Overhead ( 0 ) :    1.3466s
 [COUNTERS] CudaCpp MEs      ( 2 ) :    0.0871s for    16384 events => throughput is 5.32E-06 events/s
 [COUNTERS] CudaCpp HEL      ( 3 ) :    0.0008s
 [COUNTERS] Fortran X2F      ( 4 ) :    0.0164s for    16399 events => throughput is 1.00E-06 events/s

./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x10_cudacpp
INFO: No Floating Point Exceptions have been reported
 [COUNTERS] PROGRAM TOTAL          :    1.9073s
 [COUNTERS] Fortran Overhead ( 0 ) :    1.2890s
 [COUNTERS] CudaCpp MEs      ( 2 ) :    0.5218s for    98304 events => throughput is 5.31E-06 events/s
 [COUNTERS] CudaCpp HEL      ( 3 ) :    0.0007s
 [COUNTERS] Fortran X2F      ( 4 ) :    0.0958s for    98371 events => throughput is 9.74E-07 events/s
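As an illustration of the map-based registration with explicit register methods described in the commit above (a sketch only; the names and signatures are assumptions, not the actual counters.cc API):

#include <map>
#include <string>

struct Counter
{
  std::string name;         // e.g. "Fortran Overhead", "CudaCpp MEs"
  double seconds = 0;       // accumulated time
  unsigned long events = 0; // accumulated event count (0 if not applicable)
};

// Counters keyed by an integer id, registered explicitly before first use
static std::map<int, Counter> counters;

void registerCounter( int id, const std::string& name )
{
  counters[id] = Counter{ name, 0., 0 };
}

void accumulateCounter( int id, double dtseconds, unsigned long nevents = 0 )
{
  Counter& c = counters[id]; // assumed to have been registered earlier
  c.seconds += dtseconds;
  c.events += nevents;
}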
…ke cleanall and rebuild)

Note: the counter itself has a huge overhead...
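(One way to quantify this overhead is to micro-benchmark the timing primitive itself; a self-contained sketch, not code from this PR, comparing the per-call cost of std::chrono::steady_clock::now() and __rdtsc():)

#include <chrono>
#include <cstdint>
#include <cstdio>
#include <x86intrin.h>

int main()
{
  const int n = 10000000;
  int64_t acc = 0; // accumulate results so the calls are not optimized away
  auto c0 = std::chrono::steady_clock::now();
  for( int i = 0; i < n; i++ ) acc += std::chrono::steady_clock::now().time_since_epoch().count();
  const std::chrono::duration<double> dchrono = std::chrono::steady_clock::now() - c0;
  c0 = std::chrono::steady_clock::now();
  for( int i = 0; i < n; i++ ) acc += (int64_t)__rdtsc();
  const std::chrono::duration<double> drdtsc = std::chrono::steady_clock::now() - c0;
  std::printf( "chrono: %.1f ns/call, rdtsc: %.1f ns/call (checksum %ld)\n",
               dchrono.count() / n * 1e9, drdtsc.count() / n * 1e9, (long)acc );
  return 0;
}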

./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] PROGRAM TOTAL          :    0.7742s
 [COUNTERS] Fortran Overhead ( 0 ) :    0.5162s
 [COUNTERS] CudaCpp MEs      ( 2 ) :    0.0906s for    16384 events => throughput is 5.53E-06 events/s
 [COUNTERS] CudaCpp HEL      ( 3 ) :    0.0007s
 [COUNTERS] Fortran X2F      ( 4 ) :    0.0174s for    16399 events => throughput is 1.06E-06 events/s
 [COUNTERS] Fortran PDF      ( 5 ) :    0.1493s for    98304 events => throughput is 1.52E-06 events/s

./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x10_cudacpp
 [COUNTERS] PROGRAM TOTAL          :    4.1335s
 [COUNTERS] Fortran Overhead ( 0 ) :    2.6717s
 [COUNTERS] CudaCpp MEs      ( 2 ) :    0.5176s for    98304 events => throughput is 5.27E-06 events/s
 [COUNTERS] CudaCpp HEL      ( 3 ) :    0.0008s
 [COUNTERS] Fortran X2F      ( 4 ) :    0.0961s for    98371 events => throughput is 9.77E-07 events/s
 [COUNTERS] Fortran PDF      ( 5 ) :    0.8474s for   589824 events => throughput is 1.44E-06 events/s
…ain, to reduce performance overhead from counters themselves

./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] PROGRAM TOTAL          :    1.4700s
 [COUNTERS] Fortran Overhead ( 0 ) :    1.2236s
 [COUNTERS] CudaCpp MEs      ( 2 ) :    0.0867s for    16384 events => throughput is 5.29E-06 events/s
 [COUNTERS] CudaCpp HEL      ( 3 ) :    0.0008s
 [COUNTERS] Fortran X2F      ( 4 ) :    0.0162s for    16399 events => throughput is 9.88E-07 events/s
 [COUNTERS] Fortran PDF      ( 5 ) :    0.1428s for    98304 events => throughput is 1.45E-06 events/s

./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x10_cudacpp
 [COUNTERS] PROGRAM TOTAL          :    1.9569s
 [COUNTERS] Fortran Overhead ( 0 ) :    0.4895s
 [COUNTERS] CudaCpp MEs      ( 2 ) :    0.5181s for    98304 events => throughput is 5.27E-06 events/s
 [COUNTERS] CudaCpp HEL      ( 3 ) :    0.0007s
 [COUNTERS] Fortran X2F      ( 4 ) :    0.0958s for    98371 events => throughput is 9.74E-07 events/s
 [COUNTERS] Fortran PDF      ( 5 ) :    0.8528s for   589824 events => throughput is 1.45E-06 events/s
…points

./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] PROGRAM TOTAL          :    0.7442s
 [COUNTERS] Fortran Overhead ( 0 ) :    0.2437s
 [COUNTERS] CudaCpp MEs      ( 2 ) :    0.0871s for    16384 events => throughput is 5.32E-06 events/s
 [COUNTERS] CudaCpp HEL      ( 3 ) :    0.0008s
 [COUNTERS] Fortran X2F      ( 4 ) :    0.0162s for    16399 events => throughput is 9.86E-07 events/s
 [COUNTERS] Fortran PDF      ( 5 ) :    0.1335s for    98304 events => throughput is 1.36E-06 events/s
 [COUNTERS] Fortran I/O      ( 6 ) :    0.2629s for    16399 events => throughput is 1.60E-05 events/s

./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x10_cudacpp
 [COUNTERS] PROGRAM TOTAL          :    1.9099s
 [COUNTERS] Fortran Overhead ( 0 ) :    0.3233s
 [COUNTERS] CudaCpp MEs      ( 2 ) :    0.5203s for    98304 events => throughput is 5.29E-06 events/s
 [COUNTERS] CudaCpp HEL      ( 3 ) :    0.0007s
 [COUNTERS] Fortran X2F      ( 4 ) :    0.0956s for    98371 events => throughput is 9.71E-07 events/s
 [COUNTERS] Fortran PDF      ( 5 ) :    0.7980s for   589824 events => throughput is 1.35E-06 events/s
 [COUNTERS] Fortran I/O      ( 6 ) :    0.1719s for    98371 events => throughput is 1.75E-06 events/s
NB: there is some hysteresis: the timing results depend on what was executed before.
For instance, x1 results may be 0.7s or 1.5s, and x10 results may be 1.5s or 4.1s; this does NOT depend on the software version! (The variation shows up mainly in the I/O counters, so it is presumably a filesystem caching effect.)

Start with x1, running it several times; eventually it gives 0.7s
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] PROGRAM TOTAL          :    0.7417s
 [COUNTERS] Fortran Overhead ( 0 ) :    0.2435s
 [COUNTERS] CudaCpp MEs      ( 2 ) :    0.0861s for    16384 events => throughput is 5.26E-06 events/s
 [COUNTERS] CudaCpp HEL      ( 3 ) :    0.0007s
 [COUNTERS] Fortran X2F      ( 4 ) :    0.0166s for    16399 events => throughput is 1.01E-06 events/s
 [COUNTERS] Fortran PDF      ( 5 ) :    0.1345s for    98304 events => throughput is 1.37E-06 events/s
 [COUNTERS] Fortran I/O      ( 6 ) :    0.2603s for    16399 events => throughput is 1.59E-05 events/s

Then the FIRST execution of x10 gives 1.9s
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x10_cudacpp
 [COUNTERS] PROGRAM TOTAL          :    1.9285s
 [COUNTERS] Fortran Overhead ( 0 ) :    0.3277s
 [COUNTERS] CudaCpp MEs      ( 2 ) :    0.5237s for    98304 events => throughput is 5.33E-06 events/s
 [COUNTERS] CudaCpp HEL      ( 3 ) :    0.0007s
 [COUNTERS] Fortran X2F      ( 4 ) :    0.0964s for    98371 events => throughput is 9.80E-07 events/s
 [COUNTERS] Fortran PDF      ( 5 ) :    0.8057s for   589824 events => throughput is 1.37E-06 events/s
 [COUNTERS] Fortran I/O      ( 6 ) :    0.1741s for    98371 events => throughput is 1.77E-06 events/s

But the SECOND execution gives 4.1s, with the big increase coming from the I/O part!
(And any subsequent execution gives the same)
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x10_cudacpp
 [COUNTERS] PROGRAM TOTAL          :    4.1048s
 [COUNTERS] Fortran Overhead ( 0 ) :    1.1119s
 [COUNTERS] CudaCpp MEs      ( 2 ) :    0.5161s for    98304 events => throughput is 5.25E-06 events/s
 [COUNTERS] CudaCpp HEL      ( 3 ) :    0.0007s
 [COUNTERS] Fortran X2F      ( 4 ) :    0.0946s for    98371 events => throughput is 9.62E-07 events/s
 [COUNTERS] Fortran PDF      ( 5 ) :    0.7954s for   589824 events => throughput is 1.35E-06 events/s
 [COUNTERS] Fortran I/O      ( 6 ) :    1.5861s for    98371 events => throughput is 1.61E-05 events/s

Now the FIRST execution of x1 gives 1.4s!
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] PROGRAM TOTAL          :    1.4677s
 [COUNTERS] Fortran Overhead ( 0 ) :    0.5601s
 [COUNTERS] CudaCpp MEs      ( 2 ) :    0.0861s for    16384 events => throughput is 5.26E-06 events/s
 [COUNTERS] CudaCpp HEL      ( 3 ) :    0.0007s
 [COUNTERS] Fortran X2F      ( 4 ) :    0.0167s for    16399 events => throughput is 1.02E-06 events/s
 [COUNTERS] Fortran PDF      ( 5 ) :    0.1338s for    98304 events => throughput is 1.36E-06 events/s
 [COUNTERS] Fortran I/O      ( 6 ) :    0.6702s for    16399 events => throughput is 4.09E-05 events/s

But the SECOND execution again gives 0.7s! And all subsequent executions do too (so we are back at the beginning of the loop above)
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] PROGRAM TOTAL          :    0.7480s
 [COUNTERS] Fortran Overhead ( 0 ) :    0.2472s
 [COUNTERS] CudaCpp MEs      ( 2 ) :    0.0870s for    16384 events => throughput is 5.31E-06 events/s
 [COUNTERS] CudaCpp HEL      ( 3 ) :    0.0008s
 [COUNTERS] Fortran X2F      ( 4 ) :    0.0166s for    16399 events => throughput is 1.01E-06 events/s
 [COUNTERS] Fortran PDF      ( 5 ) :    0.1337s for    98304 events => throughput is 1.36E-06 events/s
 [COUNTERS] Fortran I/O      ( 6 ) :    0.2628s for    16399 events => throughput is 1.60E-05 events/s

In the following, I will quote results for the second x1 and the first x10 only...
…een defined

I had done this to try to decrease the 4.1s... but in the meantime I understood that the problem is elsewhere.
In particular, this is not faster than the string comparison - will revert!
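(For reference, the two alternatives being compared are essentially the following; an illustrative sketch with assumed member names:)

#include <string>

struct Counter
{
  std::string tag;      // empty until the counter is registered
  bool defined = false; // the flag added in this commit (to be reverted)
  bool isDefinedViaFlag() const { return defined; }
  bool isDefinedViaString() const { return tag != ""; } // measured here: not slower
};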

./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] PROGRAM TOTAL          :    0.7451s
 [COUNTERS] Fortran Overhead ( 0 ) :    0.2426s
 [COUNTERS] CudaCpp MEs      ( 2 ) :    0.0875s for    16384 events => throughput is 5.34E-06 events/s
 [COUNTERS] CudaCpp HEL      ( 3 ) :    0.0007s
 [COUNTERS] Fortran X2F      ( 4 ) :    0.0170s for    16399 events => throughput is 1.04E-06 events/s
 [COUNTERS] Fortran PDF      ( 5 ) :    0.1342s for    98304 events => throughput is 1.37E-06 events/s
 [COUNTERS] Fortran I/O      ( 6 ) :    0.2631s for    16399 events => throughput is 1.60E-05 events/s

./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x10_cudacpp
 [COUNTERS] PROGRAM TOTAL          :    1.8970s
 [COUNTERS] Fortran Overhead ( 0 ) :    0.3151s
 [COUNTERS] CudaCpp MEs      ( 2 ) :    0.5182s for    98304 events => throughput is 5.27E-06 events/s
 [COUNTERS] CudaCpp HEL      ( 3 ) :    0.0007s
 [COUNTERS] Fortran X2F      ( 4 ) :    0.0952s for    98371 events => throughput is 9.67E-07 events/s
 [COUNTERS] Fortran PDF      ( 5 ) :    0.7950s for   589824 events => throughput is 1.35E-06 events/s
 [COUNTERS] Fortran I/O      ( 6 ) :    0.1729s for    98371 events => throughput is 1.76E-06 events/s
…g if a counter has been defined: use string comparison to "", it is not slower

Revert "[prof] in gg_tt.mad counters.cc add a flag showing if a counter has been defined"
This reverts commit ee6f9f5.
…BLECOUNTERS to disable individual counters

I initially wanted to use this to check whether it is the individual counters that cause the 4.1s in the x10 tests.
But in the meantime I understood that the problem is elsewhere, and that the timings depend on execution order! Will probably revert!
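(A minimal sketch of how such an environment-variable gate can be implemented; illustrative only, since this commit was later reverted, and the function names below are assumptions:)

#include <cstdlib>

// Read the environment variable once and cache the result,
// so that the check adds only one branch to each counter call
static bool countersDisabled()
{
  static const bool disabled = ( std::getenv( "CUDACPP_RUNTIME_DISABLECOUNTERS" ) != nullptr );
  return disabled;
}

void counterStart( int id )
{
  if( countersDisabled() ) return;
  // ... otherwise start the RDTSC timer for counter 'id' ...
}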

Note: the second x1 execution takes 0.7s, with or without CUDACPP_RUNTIME_DISABLECOUNTERS
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] PROGRAM TOTAL          :    0.7485s
 [COUNTERS] Fortran Overhead ( 0 ) :    0.2472s
 [COUNTERS] CudaCpp MEs      ( 2 ) :    0.0872s for    16384 events => throughput is 5.32E-06 events/s
 [COUNTERS] CudaCpp HEL      ( 3 ) :    0.0008s
 [COUNTERS] Fortran X2F      ( 4 ) :    0.0166s for    16399 events => throughput is 1.01E-06 events/s
 [COUNTERS] Fortran PDF      ( 5 ) :    0.1346s for    98304 events => throughput is 1.37E-06 events/s
 [COUNTERS] Fortran I/O      ( 6 ) :    0.2621s for    16399 events => throughput is 1.60E-05 events/s

CUDACPP_RUNTIME_DISABLECOUNTERS=1 ./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] PROGRAM TOTAL          :    0.7349s

And then the first x10 execution takes 1.9s, with or without CUDACPP_RUNTIME_DISABLECOUNTERS
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x10_cudacpp
 [COUNTERS] PROGRAM TOTAL          :    1.9127s
 [COUNTERS] Fortran Overhead ( 0 ) :    0.3268s
 [COUNTERS] CudaCpp MEs      ( 2 ) :    0.5172s for    98304 events => throughput is 5.26E-06 events/s
 [COUNTERS] CudaCpp HEL      ( 3 ) :    0.0008s
 [COUNTERS] Fortran X2F      ( 4 ) :    0.0964s for    98371 events => throughput is 9.80E-07 events/s
 [COUNTERS] Fortran PDF      ( 5 ) :    0.7992s for   589824 events => throughput is 1.36E-06 events/s
 [COUNTERS] Fortran I/O      ( 6 ) :    0.1723s for    98371 events => throughput is 1.75E-06 events/s

./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
CUDACPP_RUNTIME_DISABLECOUNTERS=1 ./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x10_cudacpp
 [COUNTERS] PROGRAM TOTAL          :    1.8511s

While the SECOND x10 execution takes 4.1s, with or without CUDACPP_RUNTIME_DISABLECOUNTERS
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x10_cudacpp
 [COUNTERS] PROGRAM TOTAL          :    4.1152s
 [COUNTERS] Fortran Overhead ( 0 ) :    1.1174s
 [COUNTERS] CudaCpp MEs      ( 2 ) :    0.5173s for    98304 events => throughput is 5.26E-06 events/s
 [COUNTERS] CudaCpp HEL      ( 3 ) :    0.0008s
 [COUNTERS] Fortran X2F      ( 4 ) :    0.0950s for    98371 events => throughput is 9.65E-07 events/s
 [COUNTERS] Fortran PDF      ( 5 ) :    0.8117s for   589824 events => throughput is 1.38E-06 events/s
 [COUNTERS] Fortran I/O      ( 6 ) :    1.5731s for    98371 events => throughput is 1.60E-05 events/s

CUDACPP_RUNTIME_DISABLECOUNTERS=1 ./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x10_cudacpp
 [COUNTERS] PROGRAM TOTAL          :    4.0680s

Will therefore revert this
…CUDACPP_RUNTIME_DISABLECOUNTERS to disable individual counters

Revert "[prof] in gg_tt.mad counters add an env variable CUDACPP_RUNTIME_DISABLECOUNTERS to disable individual counters"
This reverts commit 0681a76.
…ther and make it counter[0]

No change in the timings

./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] PROGRAM TOTAL           :    0.7531s
 [COUNTERS] Fortran Other    (  0 ) :    0.2447s
 [COUNTERS] CudaCpp MEs      (  2 ) :    0.0862s for    16384 events => throughput is 5.26E-06 events/s
 [COUNTERS] CudaCpp HEL      (  3 ) :    0.0007s
 [COUNTERS] Fortran X2F      (  4 ) :    0.0166s for    16399 events => throughput is 1.01E-06 events/s
 [COUNTERS] Fortran PDF      (  5 ) :    0.1395s for    98304 events => throughput is 1.42E-06 events/s
 [COUNTERS] Fortran I/O      (  6 ) :    0.2653s for    16399 events => throughput is 1.62E-05 events/s

./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x10_cudacpp
 [COUNTERS] PROGRAM TOTAL           :    1.9572s
 [COUNTERS] Fortran Other    (  0 ) :    0.3215s
 [COUNTERS] CudaCpp MEs      (  2 ) :    0.5202s for    98304 events => throughput is 5.29E-06 events/s
 [COUNTERS] CudaCpp HEL      (  3 ) :    0.0007s
 [COUNTERS] Fortran X2F      (  4 ) :    0.0941s for    98371 events => throughput is 9.57E-07 events/s
 [COUNTERS] Fortran PDF      (  5 ) :    0.8486s for   589824 events => throughput is 1.44E-06 events/s
 [COUNTERS] Fortran I/O      (  6 ) :    0.1720s for    98371 events => throughput is 1.75E-06 events/s
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] PROGRAM TOTAL           :    0.7543s
 [COUNTERS] Fortran Other    (  0 ) :    0.2451s
 [COUNTERS] Fortran X2F      (  1 ) :    0.0163s for    16399 events => throughput is 9.95E-07 events/s
 [COUNTERS] Fortran PDF      (  2 ) :    0.1419s for    98304 events => throughput is 1.44E-06 events/s
 [COUNTERS] Fortran I/O      (  3 ) :    0.2617s for    16399 events => throughput is 1.60E-05 events/s
 [COUNTERS] CudaCpp HEL      (  5 ) :    0.0007s
 [COUNTERS] CudaCpp MEs      (  6 ) :    0.0885s for    16384 events => throughput is 5.40E-06 events/s

./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x10_cudacpp
 [COUNTERS] PROGRAM TOTAL           :    1.9649s
 [COUNTERS] Fortran Other    (  0 ) :    0.3239s
 [COUNTERS] Fortran X2F      (  1 ) :    0.0951s for    98371 events => throughput is 9.67E-07 events/s
 [COUNTERS] Fortran PDF      (  2 ) :    0.8467s for   589824 events => throughput is 1.44E-06 events/s
 [COUNTERS] Fortran I/O      (  3 ) :    0.1783s for    98371 events => throughput is 1.81E-06 events/s
 [COUNTERS] CudaCpp HEL      (  5 ) :    0.0007s
 [COUNTERS] CudaCpp MEs      (  6 ) :    0.5202s for    98304 events => throughput is 5.29E-06 events/s
…xcluded from fortran other calculation)

./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] PROGRAM TOTAL               :    0.7510s
 [COUNTERS] Fortran Other        (  0 ) :    0.2485s
 [COUNTERS] Fortran X2F          (  1 ) :    0.0163s for    16399 events => throughput is 9.94E-07 events/s
 [COUNTERS] Fortran PDF          (  2 ) :    0.1359s for    98304 events => throughput is 1.38E-06 events/s
 [COUNTERS] Fortran I/O          (  3 ) :    0.2628s for    16399 events => throughput is 1.60E-05 events/s
 [COUNTERS] CudaCpp HEL          (  5 ) :    0.0007s
 [COUNTERS] CudaCpp MEs          (  6 ) :    0.0868s for    16384 events => throughput is 5.30E-06 events/s
 [COUNTERS] PROGRAM sample_full  ( 11 ) :    0.6822s

./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x10_cudacpp
 [COUNTERS] PROGRAM TOTAL               :    1.9135s
 [COUNTERS] Fortran Other        (  0 ) :    0.3225s
 [COUNTERS] Fortran X2F          (  1 ) :    0.0938s for    98371 events => throughput is 9.54E-07 events/s
 [COUNTERS] Fortran PDF          (  2 ) :    0.7961s for   589824 events => throughput is 1.35E-06 events/s
 [COUNTERS] Fortran I/O          (  3 ) :    0.1819s for    98371 events => throughput is 1.85E-06 events/s
 [COUNTERS] CudaCpp HEL          (  5 ) :    0.0007s
 [COUNTERS] CudaCpp MEs          (  6 ) :    0.5184s for    98304 events => throughput is 5.27E-06 events/s
 [COUNTERS] PROGRAM sample_full  ( 11 ) :    1.8445s
… that what is left is something inside sample_full

Rephrasing: PROGRAM TOTAL = sample_full + initial_I/O.
And "Fortran Other" is inside sample_full.

./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] PROGRAM TOTAL               :    0.7697s
 [COUNTERS] Fortran Other        (  0 ) :    0.1810s
 [COUNTERS] Fortran X2F          (  1 ) :    0.0166s for    16399 events => throughput is 1.01E-06 events/s
 [COUNTERS] Fortran PDF          (  2 ) :    0.1355s for    98304 events => throughput is 1.38E-06 events/s
 [COUNTERS] Fortran I/O          (  3 ) :    0.2672s for    16399 events => throughput is 1.63E-05 events/s
 [COUNTERS] CudaCpp HEL          (  5 ) :    0.0007s
 [COUNTERS] CudaCpp MEs          (  6 ) :    0.0877s for    16384 events => throughput is 5.35E-06 events/s
 [COUNTERS] Fortran initial_I/O  (  7 ) :    0.0808s
 [COUNTERS] PROGRAM sample_full  ( 11 ) :    0.6860s

./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x10_cudacpp
 [COUNTERS] PROGRAM TOTAL               :    2.0621s
 [COUNTERS] Fortran Other        (  0 ) :    0.2829s
 [COUNTERS] Fortran X2F          (  1 ) :    0.1024s for    98371 events => throughput is 1.04E-06 events/s
 [COUNTERS] Fortran PDF          (  2 ) :    0.8580s for   589824 events => throughput is 1.45E-06 events/s
 [COUNTERS] Fortran I/O          (  3 ) :    0.1838s for    98371 events => throughput is 1.87E-06 events/s
 [COUNTERS] CudaCpp HEL          (  5 ) :    0.0007s
 [COUNTERS] CudaCpp MEs          (  6 ) :    0.5532s for    98304 events => throughput is 5.63E-06 events/s
 [COUNTERS] Fortran initial_I/O  (  7 ) :    0.0811s
 [COUNTERS] PROGRAM sample_full  ( 11 ) :    1.9780s
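(Indeed the numbers above are consistent with this identity: for x1, 0.6860s sample_full + 0.0808s initial_I/O = 0.7668s, against a PROGRAM TOTAL of 0.7697s; for x10, 1.9780s + 0.0811s = 2.0591s against 2.0621s. The ~3ms residual is presumably program time outside both counters.)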
…side the function to the calling sequence in sample_full

./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] PROGRAM TOTAL               :    0.7679s
 [COUNTERS] Fortran Other        (  0 ) :    0.1849s
 [COUNTERS] Fortran X2F          (  1 ) :    0.0169s for    16399 events => throughput is 1.03E-06 events/s
 [COUNTERS] Fortran PDF          (  2 ) :    0.1380s for    98304 events => throughput is 1.40E-06 events/s
 [COUNTERS] Fortran final_I/O    (  3 ) :    0.2611s for    16399 events => throughput is 1.59E-05 events/s
 [COUNTERS] CudaCpp HEL          (  5 ) :    0.0008s
 [COUNTERS] CudaCpp MEs          (  6 ) :    0.0877s for    16384 events => throughput is 5.35E-06 events/s
 [COUNTERS] Fortran initial_I/O  (  7 ) :    0.0785s
 [COUNTERS] PROGRAM sample_full  ( 11 ) :    0.6862s

./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x10_cudacpp
 [COUNTERS] PROGRAM TOTAL               :    1.9454s
 [COUNTERS] Fortran Other        (  0 ) :    0.2618s
 [COUNTERS] Fortran X2F          (  1 ) :    0.0961s for    98371 events => throughput is 9.77E-07 events/s
 [COUNTERS] Fortran PDF          (  2 ) :    0.8161s for   589824 events => throughput is 1.38E-06 events/s
 [COUNTERS] Fortran final_I/O    (  3 ) :    0.1695s for    98371 events => throughput is 1.72E-06 events/s
 [COUNTERS] CudaCpp HEL          (  5 ) :    0.0008s
 [COUNTERS] CudaCpp MEs          (  6 ) :    0.5216s for    98304 events => throughput is 5.31E-06 events/s
 [COUNTERS] Fortran initial_I/O  (  7 ) :    0.0794s
 [COUNTERS] PROGRAM sample_full  ( 11 ) :    1.8627s
…ing (as "test12" for the moment, wip)

./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] PROGRAM TOTAL               :    0.7447s
 [COUNTERS] Fortran Other        (  0 ) :    0.1308s
 [COUNTERS] Fortran X2F          (  1 ) :    0.0163s for    16399 events => throughput is 9.93E-07 events/s
 [COUNTERS] Fortran PDF          (  2 ) :    0.1328s for    98304 events => throughput is 1.35E-06 events/s
 [COUNTERS] Fortran final_I/O    (  3 ) :    0.2614s for    16399 events => throughput is 1.59E-05 events/s
 [COUNTERS] CudaCpp HEL          (  5 ) :    0.0007s
 [COUNTERS] CudaCpp MEs          (  6 ) :    0.0878s for    16384 events => throughput is 5.36E-06 events/s
 [COUNTERS] Fortran initial_I/O  (  7 ) :    0.0649s
 [COUNTERS] PROGRAM sample_full  ( 11 ) :    0.6768s
 [COUNTERS] Fortran TEST         ( 12 ) :    0.0499s for    16384 events => throughput is 3.05E-06 events/s
…or the moment, wip)

./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] PROGRAM TOTAL               :    0.7526s
 [COUNTERS] Fortran Other        (  0 ) :    0.1163s
 [COUNTERS] Fortran X2F          (  1 ) :    0.0165s for    16399 events => throughput is 1.01E-06 events/s
 [COUNTERS] Fortran PDF          (  2 ) :    0.1428s for    98304 events => throughput is 1.45E-06 events/s
 [COUNTERS] Fortran final_I/O    (  3 ) :    0.2589s for    16399 events => throughput is 1.58E-05 events/s
 [COUNTERS] CudaCpp HEL          (  5 ) :    0.0007s
 [COUNTERS] CudaCpp MEs          (  6 ) :    0.0870s for    16384 events => throughput is 5.31E-06 events/s
 [COUNTERS] Fortran initial_I/O  (  7 ) :    0.0659s
 [COUNTERS] PROGRAM sample_full  ( 11 ) :    0.6829s
 [COUNTERS] Fortran TEST         ( 12 ) :    0.0537s for    16384 events => throughput is 3.28E-06 events/s
 [COUNTERS] Fortran TEST2        ( 13 ) :    0.0108s for    16384 events => throughput is 6.58E-07 events/s
This essentially completes the identification of all bottlenecks.
Must now clean up the timers and remove the double counting ("Fortran Other", i.e. the program total minus all the other counters, is now negative, which indicates that some code sections are counted twice)

./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] PROGRAM TOTAL               :    0.7581s
 [COUNTERS] Fortran Other        (  0 ) :   -0.0298s
 [COUNTERS] Fortran X2F          (  1 ) :    0.0168s for    16399 events => throughput is 1.02E-06 events/s
 [COUNTERS] Fortran PDF          (  2 ) :    0.1441s for    98304 events => throughput is 1.47E-06 events/s
 [COUNTERS] Fortran final_I/O    (  3 ) :    0.2627s for    16399 events => throughput is 1.60E-05 events/s
 [COUNTERS] CudaCpp HEL          (  5 ) :    0.0007s
 [COUNTERS] CudaCpp MEs          (  6 ) :    0.0882s for    16384 events => throughput is 5.38E-06 events/s
 [COUNTERS] Fortran initial_I/O  (  7 ) :    0.0656s
 [COUNTERS] PROGRAM sample_full  ( 11 ) :    0.6896s
 [COUNTERS] Fortran TEST         ( 12 ) :    0.0533s for    16384 events => throughput is 3.25E-06 events/s
 [COUNTERS] Fortran TEST2        ( 13 ) :    0.0105s for    16384 events => throughput is 6.41E-07 events/s
 [COUNTERS] Fortran TEST5        ( 16 ) :    0.1461s for    16384 events => throughput is 8.91E-06 events/s
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] PROGRAM TOTAL               :    0.7519s
 [COUNTERS] Fortran Other        (  0 ) :   -0.0299s
 [COUNTERS] Fortran X2F          (  1 ) :    0.0165s for    16399 events => throughput is 1.01E-06 events/s
 [COUNTERS] Fortran PDF          (  2 ) :    0.1421s for    98304 events => throughput is 1.45E-06 events/s
 [COUNTERS] Fortran final_I/O    (  3 ) :    0.2589s for    16399 events => throughput is 1.58E-05 events/s
 [COUNTERS] CudaCpp HEL          (  5 ) :    0.0007s
 [COUNTERS] CudaCpp MEs          (  6 ) :    0.0873s for    16384 events => throughput is 5.33E-06 events/s
 [COUNTERS] Fortran initial_I/O  (  7 ) :    0.0651s
 [COUNTERS] PROGRAM sample_full  ( 11 ) :    0.6838s
 [COUNTERS] Fortran TEST         ( 12 ) :    0.0542s for    16384 events => throughput is 3.31E-06 events/s
 [COUNTERS] Fortran TEST2        ( 13 ) :    0.0102s for    16384 events => throughput is 6.26E-07 events/s
 [COUNTERS] Fortran TEST5        ( 16 ) :    0.1467s for    16384 events => throughput is 8.95E-06 events/s
…er.f

./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] PROGRAM TOTAL               :    0.7533s
 [COUNTERS] Fortran Other        (  0 ) :   -0.0253s
 [COUNTERS] Fortran X2F          (  1 ) :    0.0165s for    16399 events => throughput is 1.00E-06 events/s
 [COUNTERS] Fortran PDF          (  2 ) :    0.1355s for    98304 events => throughput is 1.38E-06 events/s
 [COUNTERS] Fortran final_I/O    (  3 ) :    0.2633s for    16399 events => throughput is 1.61E-05 events/s
 [COUNTERS] CudaCpp HEL          (  5 ) :    0.0008s
 [COUNTERS] CudaCpp MEs          (  6 ) :    0.0897s for    16384 events => throughput is 5.48E-06 events/s
 [COUNTERS] Fortran initial_I/O  (  7 ) :    0.0649s
 [COUNTERS] PROGRAM sample_full  ( 11 ) :    0.6855s
 [COUNTERS] Fortran TEST         ( 12 ) :    0.0490s for    16384 events => throughput is 2.99E-06 events/s
 [COUNTERS] Fortran TEST2        ( 13 ) :    0.0102s for    16384 events => throughput is 6.20E-07 events/s
 [COUNTERS] Fortran TEST5        ( 16 ) :    0.1488s for    16384 events => throughput is 9.08E-06 events/s
…g1.f

This changes the overall balance: now Fortran Other is positive again.
This is because pdg2pdf is also called elsewhere (e.g. in unwgt?), in code paths that were already profiled by other counters.

./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] PROGRAM TOTAL               :    0.7551s
 [COUNTERS] Fortran Other        (  0 ) :    0.0111s
 [COUNTERS] Fortran X2F          (  1 ) :    0.0168s for    16399 events => throughput is 1.02E-06 events/s
 [COUNTERS] Fortran PDF          (  2 ) :    0.0986s for    32768 events => throughput is 3.01E-06 events/s
 [COUNTERS] Fortran final_I/O    (  3 ) :    0.2633s for    16399 events => throughput is 1.61E-05 events/s
 [COUNTERS] CudaCpp HEL          (  5 ) :    0.0007s
 [COUNTERS] CudaCpp MEs          (  6 ) :    0.0879s for    16384 events => throughput is 5.36E-06 events/s
 [COUNTERS] Fortran initial_I/O  (  7 ) :    0.0662s
 [COUNTERS] PROGRAM sample_full  ( 11 ) :    0.6862s
 [COUNTERS] Fortran TEST         ( 12 ) :    0.0515s for    16384 events => throughput is 3.14E-06 events/s
 [COUNTERS] Fortran TEST2        ( 13 ) :    0.0099s for    16384 events => throughput is 6.07E-07 events/s
 [COUNTERS] Fortran TEST5        ( 16 ) :    0.1492s for    16384 events => throughput is 9.11E-06 events/s
Now "Fortran Other" becomes negative again, there is again some double counting

./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] PROGRAM TOTAL               :    0.7511s
 [COUNTERS] Fortran Other        (  0 ) :   -0.0373s
 [COUNTERS] Fortran X2F          (  1 ) :    0.0168s for    16399 events => throughput is 1.02E-06 events/s
 [COUNTERS] Fortran PDF          (  2 ) :    0.0965s for    32768 events => throughput is 2.94E-06 events/s
 [COUNTERS] Fortran final_I/O    (  3 ) :    0.2598s for    16399 events => throughput is 1.58E-05 events/s
 [COUNTERS] CudaCpp HEL          (  5 ) :    0.0008s
 [COUNTERS] CudaCpp MEs          (  6 ) :    0.0868s for    16384 events => throughput is 5.30E-06 events/s
 [COUNTERS] Fortran initial_I/O  (  7 ) :    0.0670s
 [COUNTERS] PROGRAM sample_full  ( 11 ) :    0.6811s
 [COUNTERS] Fortran TEST         ( 12 ) :    0.0506s for    16384 events => throughput is 3.09E-06 events/s
 [COUNTERS] Fortran TEST2        ( 13 ) :    0.0099s for    16384 events => throughput is 6.01E-07 events/s
 [COUNTERS] Fortran TEST3        ( 14 ) :    0.0541s for    16384 events => throughput is 3.30E-06 events/s
 [COUNTERS] Fortran TEST5        ( 16 ) :    0.1462s for    16384 events => throughput is 8.93E-06 events/s
This makes it clearer that PROGRAM TOTAL = sample_full + initial_I/O

./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] PROGRAM TOTAL               :    0.7554s
 [COUNTERS] Fortran Other        (  0 ) :   -0.0393s
 [COUNTERS] Fortran X2F          (  1 ) :    0.0171s for    16399 events => throughput is 1.04E-06 events/s
 [COUNTERS] Fortran PDF          (  2 ) :    0.0984s for    32768 events => throughput is 3.00E-06 events/s
 [COUNTERS] Fortran final_I/O    (  3 ) :    0.2621s for    16399 events => throughput is 1.60E-05 events/s
 [COUNTERS] CudaCpp HEL          (  5 ) :    0.0007s
 [COUNTERS] CudaCpp MEs          (  6 ) :    0.0872s for    16384 events => throughput is 5.32E-06 events/s
 [COUNTERS] Fortran initial_I/O  (  7 ) :    0.0688s
 [COUNTERS] Fortran TEST         ( 12 ) :    0.0521s for    16384 events => throughput is 3.18E-06 events/s
 [COUNTERS] Fortran TEST2        ( 13 ) :    0.0100s for    16384 events => throughput is 6.08E-07 events/s
 [COUNTERS] Fortran TEST3        ( 14 ) :    0.0507s for    16384 events => throughput is 3.09E-06 events/s
 [COUNTERS] Fortran TEST5        ( 16 ) :    0.1478s for    16384 events => throughput is 9.02E-06 events/s
 [COUNTERS] PROGRAM initial_I/O  ( 19 ) :    0.0688s
 [COUNTERS] PROGRAM sample_full  ( 20 ) :    0.6838s
…grouping

./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] PROGRAM TOTAL               :    0.7428s
 [COUNTERS] Fortran Other        (  0 ) :   -0.0409s
 [COUNTERS] Fortran X2F          (  1 ) :    0.0169s for    16399 events => throughput is 1.03E-06 events/s
 [COUNTERS] Fortran PDF          (  2 ) :    0.0982s for    32768 events => throughput is 3.00E-06 events/s
 [COUNTERS] Fortran final_I/O    (  3 ) :    0.2585s for    16399 events => throughput is 1.58E-05 events/s
 [COUNTERS] CudaCpp HEL          (  5 ) :    0.0007s
 [COUNTERS] CudaCpp MEs          (  6 ) :    0.0865s for    16384 events => throughput is 5.28E-06 events/s
 [COUNTERS] Fortran initial_I/O  (  7 ) :    0.0670s
 [COUNTERS] Fortran grouping     ( 12 ) :    0.0520s for    16384 events => throughput is 3.17E-06 events/s
 [COUNTERS] Fortran scale        ( 13 ) :    0.0098s for    16384 events => throughput is 5.98E-07 events/s
 [COUNTERS] Fortran rewgt        ( 14 ) :    0.0497s for    16384 events => throughput is 3.03E-06 events/s
 [COUNTERS] Fortran unwgt        ( 16 ) :    0.1445s for    16384 events => throughput is 8.82E-06 events/s
 [COUNTERS] PROGRAM initial_I/O  ( 19 ) :    0.0670s
 [COUNTERS] PROGRAM sample_full  ( 20 ) :    0.6728s
…s, which was causing double counting and a negative Fortran Other

The problem is that select_grouping_choice calls dsigproc, which eventually calls dsig1, which includes the PDF profiling

./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] PROGRAM TOTAL               :    0.7643s
 [COUNTERS] Fortran Other        (  0 ) :    0.0111s
 [COUNTERS] Fortran X2F          (  1 ) :    0.0164s for    16399 events => throughput is 9.98E-07 events/s
 [COUNTERS] Fortran PDF          (  2 ) :    0.1013s for    32768 events => throughput is 3.09E-06 events/s
 [COUNTERS] Fortran final_I/O    (  3 ) :    0.2712s for    16399 events => throughput is 1.65E-05 events/s
 [COUNTERS] CudaCpp HEL          (  5 ) :    0.0008s
 [COUNTERS] CudaCpp MEs          (  6 ) :    0.0874s for    16384 events => throughput is 5.34E-06 events/s
 [COUNTERS] Fortran initial_I/O  (  7 ) :    0.0663s
 [COUNTERS] Fortran scale        ( 13 ) :    0.0103s for    16384 events => throughput is 6.26E-07 events/s
 [COUNTERS] Fortran rewgt        ( 14 ) :    0.0511s for    16384 events => throughput is 3.12E-06 events/s
 [COUNTERS] Fortran unwgt        ( 16 ) :    0.1484s for    16384 events => throughput is 9.06E-06 events/s
 [COUNTERS] PROGRAM initial_I/O  ( 19 ) :    0.0663s
 [COUNTERS] PROGRAM sample_full  ( 20 ) :    0.6950s
…sig1 (not only dsig1_vec), but it does not show up! - will revert

./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] PROGRAM TOTAL               :    0.7479s
 [COUNTERS] Fortran Other        (  0 ) :    0.0122s
 [COUNTERS] Fortran X2F          (  1 ) :    0.0166s for    16399 events => throughput is 1.01E-06 events/s
 [COUNTERS] Fortran PDF          (  2 ) :    0.0974s for    32768 events => throughput is 2.97E-06 events/s
 [COUNTERS] Fortran final_I/O    (  3 ) :    0.2625s for    16399 events => throughput is 1.60E-05 events/s
 [COUNTERS] CudaCpp HEL          (  5 ) :    0.0007s
 [COUNTERS] CudaCpp MEs          (  6 ) :    0.0873s for    16384 events => throughput is 5.33E-06 events/s
 [COUNTERS] Fortran initial_I/O  (  7 ) :    0.0657s
 [COUNTERS] Fortran scale        ( 13 ) :    0.0102s for    16384 events => throughput is 6.21E-07 events/s
 [COUNTERS] Fortran rewgt        ( 14 ) :    0.0494s for    16384 events => throughput is 3.01E-06 events/s
 [COUNTERS] Fortran unwgt        ( 16 ) :    0.1459s for    16384 events => throughput is 8.90E-06 events/s
 [COUNTERS] PROGRAM initial_I/O  ( 19 ) :    0.0657s
 [COUNTERS] PROGRAM sample_full  ( 20 ) :    0.6793s
Revert "[prof] in gg_tt.mad auto_dsig1.f, add profiling for matrix1 also in dsig1 (not only dsig1_vec), but it does not show up! - will revert"
This reverts commit d3165cb.
valassi added 16 commits October 4, 2024 18:35
…ging

git checkout upstream/master $(git ls-tree --name-only upstream/master */CODEGEN*txt)
…ated code except gg_tt.mad for easier merging

git checkout upstream/master $(git ls-tree --name-only upstream/master *.mad/Source/dsample.f | grep -v ^gg_tt.mad)
…also amd and v1.00.01 fixes) into prof

Fix conflicts (use upstream/master version): epochX/cudacpp/gg_tt.mad/Source/dsample.f

Will then regenerate patches from this gg_tt.mad
…/master including v1.00.00 and also amd and v1.00.01 fixes

The only files that still need to be patched are
- 2 in patch.common: Source/dsample.f, SubProcesses/makefile
- 4 in patch.P1: auto_dsig1.f, auto_dsig.f, driver.f, matrix1.f

Note: this is 3 files more than those needed in upstream/master (added Source/dsample.f, auto_dsig1.f, auto_dsig.f)

./CODEGEN/generateAndCompare.sh gg_tt --mad --nopatch
git diff --no-ext-diff -R gg_tt.mad/SubProcesses/makefile gg_tt.mad/Source/dsample.f > CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/MG5aMC_patches/PROD/patch.common
git diff --no-ext-diff -R gg_tt.mad/SubProcesses/P1_gg_ttx/auto_dsig1.f gg_tt.mad/SubProcesses/P1_gg_ttx/auto_dsig.f gg_tt.mad/SubProcesses/P1_gg_ttx/driver.f gg_tt.mad/SubProcesses/P1_gg_ttx/matrix1.f > CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/MG5aMC_patches/PROD/patch.P1
git checkout gg_tt.mad
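(Note: the -R flag reverses the diff, so each patch transforms the freshly generated, unpatched code into the committed, patched version; applying it to newly generated code reproduces the committed files.)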

(Later checked that regenerating gg_tt.mad gives no change)
git checkout grid $(git ls-tree --name-only grid */CODEGEN*txt)
Fix conflicts: epochX/cudacpp/CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/MG5aMC_patches/PROD/patch.P1 (will regenerate anyway)
…ncard/bldall/tlau) into prof - essentially no change to the version created by fixing conflicts

(NB: THIS IS THE LAST "PROF" CHANGE FOR THE MOMENT - WILL "TEMPORARILY" MOVE TO A SIMPLER "PROF0")
("PROF0" HAS THE NEW TIMERS/COUNTERS OF PROF WITH NEW APIS, BUT NOT THE ADDITIONAL PROFILING OF FORTRAN COMPONENTS)

The only files that still need to be patched are
- 2 in patch.common: Source/dsample.f, SubProcesses/makefile
- 4 in patch.P1: auto_dsig1.f, auto_dsig.f, driver.f, matrix1.f

Note: this is 3 files more than those needed in upstream/master (added Source/dsample.f, auto_dsig1.f, auto_dsig.f)

./CODEGEN/generateAndCompare.sh gg_tt --mad --nopatch
git diff --no-ext-diff -R gg_tt.mad/SubProcesses/makefile gg_tt.mad/Source/dsample.f > CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/MG5aMC_patches/PROD/patch.common
git diff --no-ext-diff -R gg_tt.mad/SubProcesses/P1_gg_ttx/auto_dsig1.f gg_tt.mad/SubProcesses/P1_gg_ttx/auto_dsig.f gg_tt.mad/SubProcesses/P1_gg_ttx/driver.f gg_tt.mad/SubProcesses/P1_gg_ttx/matrix1.f > CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/MG5aMC_patches/PROD/patch.P1
git checkout gg_tt.mad

(Later checked that regenerating gg_tt.mad gives no change)
…nal profiling of Fortran components: keep only the new timers and counters
…port the "temporary" changes in auto_dsig1.f (so that they do not need to go to patch.P1)
…after "temporarely" removing additional Fortran profiling (and modifying other CODEGEN fragments accordingly)

The only files that still need to be patched are
- 1 in patch.common: Source/genps.inc, SubProcesses/makefile
- 2 in patch.P1: driver.f, matrix1.f

(Note in particular that the 'prof0' changes over 'grid' in auto_dsig1.f are in smatrix_multi.f and output.py)

./CODEGEN/generateAndCompare.sh gg_tt --mad --nopatch
git diff --no-ext-diff -R gg_tt.mad/SubProcesses/makefile > CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/MG5aMC_patches/PROD/patch.common
git diff --no-ext-diff -R gg_tt.mad/SubProcesses/P1_gg_ttx/driver.f gg_tt.mad/SubProcesses/P1_gg_ttx/matrix1.f > CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/MG5aMC_patches/PROD/patch.P1
git checkout gg_tt.mad

(Later regenerated gg_tt.mad and checked that all is ok)
…mall-g 72h) - all ok

STARTED  AT Mon 07 Oct 2024 01:56:32 AM EEST
./tput/teeThroughputX.sh -dmf -hrd -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -makeclean  -nocuda
ENDED(1) AT Mon 07 Oct 2024 02:26:23 AM EEST [Status=0]
./tput/teeThroughputX.sh -d_f -hrd -makej -eemumu -ggtt -ggttgg -inlonly -makeclean  -nocuda
ENDED(2) AT Mon 07 Oct 2024 02:36:48 AM EEST [Status=0]
./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -gqttq -ggttgg -ggttggg -d_f -bridge -makeclean  -nocuda
ENDED(3) AT Mon 07 Oct 2024 02:44:54 AM EEST [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -rmbhst  -nocuda
ENDED(4) AT Mon 07 Oct 2024 02:46:40 AM EEST [Status=0]
SKIP './tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -common  -nocuda'
ENDED(5) AT Mon 07 Oct 2024 02:46:40 AM EEST [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -common  -nocuda
ENDED(6) AT Mon 07 Oct 2024 02:48:26 AM EEST [Status=0]
./tput/teeThroughputX.sh -dmf -hrd -makej -susyggtt -susyggt1t1 -smeftggtttt -heftggbb -makeclean  -nocuda
ENDED(7) AT Mon 07 Oct 2024 03:09:29 AM EEST [Status=0]

No errors found in logs

No FPEs or '{ }' found in logs
…cted

STARTED  AT Mon 07 Oct 2024 03:09:30 AM EEST
(SM tests)
ENDED(1) AT Mon 07 Oct 2024 05:27:02 AM EEST [Status=0]
(BSM tests)
ENDED(1) AT Mon 07 Oct 2024 05:36:08 AM EEST [Status=0]

16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_m_inl0_hrd0.txt
12 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt
12 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd0.txt
12 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_m_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_m_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_d_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_m_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_d_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_m_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_m_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_d_inl0_hrd0.txt
1 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_m_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_d_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_m_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_d_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_m_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_d_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_m_inl0_hrd0.txt
Revert "[prof0] rerun 30 tmad tests on LUMI/HIP WITH NEW TIMERS - all as expected"
This reverts commit af67440.

Revert "[prof0] rerun 96 tput builds and tests with NEW TIMERS on LUMI/HIP (small-g 72h) - all ok"
This reverts commit 24f9115.
…l ok

STARTED  AT Mon Oct  7 12:53:50 AM CEST 2024
./tput/teeThroughputX.sh -dmf -hrd -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -makeclean  -cpponly
ENDED(1) AT Mon Oct  7 01:12:41 AM CEST 2024 [Status=0]
./tput/teeThroughputX.sh -d_f -hrd -makej -eemumu -ggtt -ggttgg -inlonly -makeclean  -cpponly
ENDED(2) AT Mon Oct  7 01:19:47 AM CEST 2024 [Status=0]
./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -gqttq -ggttgg -ggttggg -d_f -bridge -makeclean  -cpponly
ENDED(3) AT Mon Oct  7 01:24:43 AM CEST 2024 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -rmbhst  -cpponly
ENDED(4) AT Mon Oct  7 01:26:11 AM CEST 2024 [Status=0]
SKIP './tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -common  -cpponly'
ENDED(5) AT Mon Oct  7 01:26:11 AM CEST 2024 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -common  -cpponly
ENDED(6) AT Mon Oct  7 01:27:39 AM CEST 2024 [Status=0]
./tput/teeThroughputX.sh -dmf -hrd -makej -susyggtt -susyggt1t1 -smeftggtttt -heftggbb -makeclean  -cpponly
ENDED(7) AT Mon Oct  7 01:36:12 AM CEST 2024 [Status=0]

No errors found in logs

No FPEs or '{ }' found in logs
valassi self-assigned this Oct 7, 2024
valassi changed the title from "Faster RDTSC-based timers and new timer/counter APIs" to "(4 in pipeline) Faster RDTSC-based timers and new timer/counter APIs" Oct 7, 2024
valassi changed the base branch from master to valassi_3_grid October 7, 2024 08:33
valassi (Member, Author) commented Oct 7, 2024

Hi @oliviermattelaer, as discussed via email, this is N=4, the last PR in the pipeline that I would like to merge.

Again, I changed this to target N=3 for easier review, but I would then merge it to master once approved.

Please let me know what you think... thanks!

Andrea

valassi linked an issue Oct 7, 2024 that may be closed by this pull request
STARTED  AT Mon Oct  7 01:36:12 AM CEST 2024
(SM tests)
ENDED(1) AT Mon Oct  7 04:33:19 AM CEST 2024 [Status=0]
(BSM tests)
ENDED(1) AT Mon Oct  7 04:38:35 AM CEST 2024 [Status=0]

20 /data/avalassi/GPU2024/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd0.txt
20 /data/avalassi/GPU2024/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd0.txt
20 /data/avalassi/GPU2024/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_m_inl0_hrd0.txt
20 /data/avalassi/GPU2024/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt
20 /data/avalassi/GPU2024/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd0.txt
20 /data/avalassi/GPU2024/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_m_inl0_hrd0.txt
20 /data/avalassi/GPU2024/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0.txt
20 /data/avalassi/GPU2024/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0.txt
20 /data/avalassi/GPU2024/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_m_inl0_hrd0.txt
20 /data/avalassi/GPU2024/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_d_inl0_hrd0.txt
20 /data/avalassi/GPU2024/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_f_inl0_hrd0.txt
20 /data/avalassi/GPU2024/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_m_inl0_hrd0.txt
20 /data/avalassi/GPU2024/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_d_inl0_hrd0.txt
20 /data/avalassi/GPU2024/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_f_inl0_hrd0.txt
20 /data/avalassi/GPU2024/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_m_inl0_hrd0.txt
20 /data/avalassi/GPU2024/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt
20 /data/avalassi/GPU2024/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0.txt
20 /data/avalassi/GPU2024/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_m_inl0_hrd0.txt
20 /data/avalassi/GPU2024/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_d_inl0_hrd0.txt
1 /data/avalassi/GPU2024/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_f_inl0_hrd0.txt
20 /data/avalassi/GPU2024/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_m_inl0_hrd0.txt
20 /data/avalassi/GPU2024/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_d_inl0_hrd0.txt
20 /data/avalassi/GPU2024/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_f_inl0_hrd0.txt
20 /data/avalassi/GPU2024/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_m_inl0_hrd0.txt
20 /data/avalassi/GPU2024/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_d_inl0_hrd0.txt
20 /data/avalassi/GPU2024/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_f_inl0_hrd0.txt
20 /data/avalassi/GPU2024/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_m_inl0_hrd0.txt
20 /data/avalassi/GPU2024/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_d_inl0_hrd0.txt
20 /data/avalassi/GPU2024/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_f_inl0_hrd0.txt
20 /data/avalassi/GPU2024/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_m_inl0_hrd0.txt
Revert "[prof0] rerun 30 tmad tests on gold91 with NEW TIMERS - all as expected"
This reverts commit 3197d60.

Revert "[prof0] rerun 96 tput builds and tests on gold91 with NEW TIMERS - all ok"
This reverts commit 6ebaa5a.
… all ok

STARTED  AT Mon Oct  7 12:57:24 AM CEST 2024
./tput/teeThroughputX.sh -dmf -hrd -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -makeclean
ENDED(1) AT Mon Oct  7 01:27:08 AM CEST 2024 [Status=0]
./tput/teeThroughputX.sh -d_f -hrd -makej -eemumu -ggtt -ggttgg -inlonly -makeclean
ENDED(2) AT Mon Oct  7 01:36:53 AM CEST 2024 [Status=0]
./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -gqttq -ggttgg -ggttggg -d_f -bridge -makeclean
ENDED(3) AT Mon Oct  7 01:45:56 AM CEST 2024 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -rmbhst
ENDED(4) AT Mon Oct  7 01:48:41 AM CEST 2024 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -curhst
ENDED(5) AT Mon Oct  7 01:51:24 AM CEST 2024 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -common
ENDED(6) AT Mon Oct  7 01:54:14 AM CEST 2024 [Status=0]
./tput/teeThroughputX.sh -dmf -hrd -makej -susyggtt -susyggt1t1 -smeftggtttt -heftggbb -makeclean
ENDED(7) AT Mon Oct  7 02:06:51 AM CEST 2024 [Status=0]

No errors found in logs

No FPEs or '{ }' found in logs
…cted (heft fail madgraph5#833)

STARTED  AT Mon Oct  7 02:06:51 AM CEST 2024
(SM tests)
ENDED(1) AT Mon Oct  7 05:53:08 AM CEST 2024 [Status=0]
(BSM tests)
ENDED(1) AT Mon Oct  7 06:03:23 AM CEST 2024 [Status=0]

24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_d_inl0_hrd0.txt
1 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_m_inl0_hrd0.txt
valassi marked this pull request as draft October 8, 2024 14:25
valassi (Member, Author) commented Oct 8, 2024

For the moment I have moved this back to draft while working on the base PRs.

Successfully merging this pull request may close this issue: Faster timers based on rdtsc instead of chrono