(4 in pipeline) Faster RDTSC-based timers and new timer/counter APIs #1018
Draft
valassi wants to merge 123 commits into madgraph5:valassi_3_grid from valassi:prof0
Conversation
…a counters namespace
…toring of counters using maps and explicit register methods
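For orientation, a map-based registry with explicit register calls, as described in this commit, could look roughly like the following C++ sketch. All function and member names here (counters_register, counters_start, counters_stop, counters_report) are illustrative assumptions, not the actual counters.cc API.

```cpp
// Minimal sketch of a map-based counter registry with explicit registration.
// The names below are illustrative assumptions, not the actual counters.cc API.
#include <chrono>
#include <cstdio>
#include <map>
#include <string>

namespace counters
{
  struct Counter
  {
    std::string name;
    double totalSeconds = 0;
    unsigned long long nEvents = 0;
    std::chrono::steady_clock::time_point startTime;
  };

  static std::map<int, Counter> registry;

  // Explicitly register a counter ID with a printable name
  void counters_register( int id, const std::string& name ) { registry[id] = Counter{ name }; }

  void counters_start( int id ) { registry[id].startTime = std::chrono::steady_clock::now(); }

  void counters_stop( int id, unsigned long long nevt )
  {
    Counter& c = registry[id];
    c.totalSeconds += std::chrono::duration<double>( std::chrono::steady_clock::now() - c.startTime ).count();
    c.nEvents += nevt;
  }

  // Dump all counters in a format similar to the [COUNTERS] lines in the logs
  void counters_report()
  {
    for( const auto& [id, c] : registry )
      std::printf( "[COUNTERS] %s ( %d ) : %9.4fs for %llu events\n", c.name.c_str(), id, c.totalSeconds, c.nEvents );
  }
}
```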
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
[COUNTERS] PROGRAM TOTAL : 1.4510s
[COUNTERS] Fortran Overhead ( 0 ) : 1.3466s
[COUNTERS] CudaCpp MEs ( 2 ) : 0.0871s for 16384 events => throughput is 5.32E-06 events/s
[COUNTERS] CudaCpp HEL ( 3 ) : 0.0008s
[COUNTERS] Fortran X2F ( 4 ) : 0.0164s for 16399 events => throughput is 1.00E-06 events/s
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x10_cudacpp
INFO: No Floating Point Exceptions have been reported
[COUNTERS] PROGRAM TOTAL : 1.9073s
[COUNTERS] Fortran Overhead ( 0 ) : 1.2890s
[COUNTERS] CudaCpp MEs ( 2 ) : 0.5218s for 98304 events => throughput is 5.31E-06 events/s
[COUNTERS] CudaCpp HEL ( 3 ) : 0.0007s
[COUNTERS] Fortran X2F ( 4 ) : 0.0958s for 98371 events => throughput is 9.74E-07 events/s
…ke cleanall and rebuild)
Note: the counter itself has a huge overhead...
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
[COUNTERS] PROGRAM TOTAL : 0.7742s
[COUNTERS] Fortran Overhead ( 0 ) : 0.5162s
[COUNTERS] CudaCpp MEs ( 2 ) : 0.0906s for 16384 events => throughput is 5.53E-06 events/s
[COUNTERS] CudaCpp HEL ( 3 ) : 0.0007s
[COUNTERS] Fortran X2F ( 4 ) : 0.0174s for 16399 events => throughput is 1.06E-06 events/s
[COUNTERS] Fortran PDF ( 5 ) : 0.1493s for 98304 events => throughput is 1.52E-06 events/s
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x10_cudacpp
[COUNTERS] PROGRAM TOTAL : 4.1335s
[COUNTERS] Fortran Overhead ( 0 ) : 2.6717s
[COUNTERS] CudaCpp MEs ( 2 ) : 0.5176s for 98304 events => throughput is 5.27E-06 events/s
[COUNTERS] CudaCpp HEL ( 3 ) : 0.0008s
[COUNTERS] Fortran X2F ( 4 ) : 0.0961s for 98371 events => throughput is 9.77E-07 events/s
[COUNTERS] Fortran PDF ( 5 ) : 0.8474s for 589824 events => throughput is 1.44E-06 events/s
…ain, to reduce performance overhead from counters themselves
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
[COUNTERS] PROGRAM TOTAL : 1.4700s
[COUNTERS] Fortran Overhead ( 0 ) : 1.2236s
[COUNTERS] CudaCpp MEs ( 2 ) : 0.0867s for 16384 events => throughput is 5.29E-06 events/s
[COUNTERS] CudaCpp HEL ( 3 ) : 0.0008s
[COUNTERS] Fortran X2F ( 4 ) : 0.0162s for 16399 events => throughput is 9.88E-07 events/s
[COUNTERS] Fortran PDF ( 5 ) : 0.1428s for 98304 events => throughput is 1.45E-06 events/s
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x10_cudacpp
[COUNTERS] PROGRAM TOTAL : 1.9569s
[COUNTERS] Fortran Overhead ( 0 ) : 0.4895s
[COUNTERS] CudaCpp MEs ( 2 ) : 0.5181s for 98304 events => throughput is 5.27E-06 events/s
[COUNTERS] CudaCpp HEL ( 3 ) : 0.0007s
[COUNTERS] Fortran X2F ( 4 ) : 0.0958s for 98371 events => throughput is 9.74E-07 events/s
[COUNTERS] Fortran PDF ( 5 ) : 0.8528s for 589824 events => throughput is 1.45E-06 events/s
…points
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
[COUNTERS] PROGRAM TOTAL : 0.7442s
[COUNTERS] Fortran Overhead ( 0 ) : 0.2437s
[COUNTERS] CudaCpp MEs ( 2 ) : 0.0871s for 16384 events => throughput is 5.32E-06 events/s
[COUNTERS] CudaCpp HEL ( 3 ) : 0.0008s
[COUNTERS] Fortran X2F ( 4 ) : 0.0162s for 16399 events => throughput is 9.86E-07 events/s
[COUNTERS] Fortran PDF ( 5 ) : 0.1335s for 98304 events => throughput is 1.36E-06 events/s
[COUNTERS] Fortran I/O ( 6 ) : 0.2629s for 16399 events => throughput is 1.60E-05 events/s
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x10_cudacpp
[COUNTERS] PROGRAM TOTAL : 1.9099s
[COUNTERS] Fortran Overhead ( 0 ) : 0.3233s
[COUNTERS] CudaCpp MEs ( 2 ) : 0.5203s for 98304 events => throughput is 5.29E-06 events/s
[COUNTERS] CudaCpp HEL ( 3 ) : 0.0007s
[COUNTERS] Fortran X2F ( 4 ) : 0.0956s for 98371 events => throughput is 9.71E-07 events/s
[COUNTERS] Fortran PDF ( 5 ) : 0.7980s for 589824 events => throughput is 1.35E-06 events/s
[COUNTERS] Fortran I/O ( 6 ) : 0.1719s for 98371 events => throughput is 1.75E-06 events/s
NB: there is some hysteresis: the timing results depend on what was executed before. For instance, x1 results may be 0.7s or 1.5s, and x10 results may be 1.5s or 4.1s; this does NOT depend on the software version!
Start with x1, several times; eventually it gives 0.7s:
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
[COUNTERS] PROGRAM TOTAL : 0.7417s
[COUNTERS] Fortran Overhead ( 0 ) : 0.2435s
[COUNTERS] CudaCpp MEs ( 2 ) : 0.0861s for 16384 events => throughput is 5.26E-06 events/s
[COUNTERS] CudaCpp HEL ( 3 ) : 0.0007s
[COUNTERS] Fortran X2F ( 4 ) : 0.0166s for 16399 events => throughput is 1.01E-06 events/s
[COUNTERS] Fortran PDF ( 5 ) : 0.1345s for 98304 events => throughput is 1.37E-06 events/s
[COUNTERS] Fortran I/O ( 6 ) : 0.2603s for 16399 events => throughput is 1.59E-05 events/s
Then the FIRST execution of x10 gives 1.9s:
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x10_cudacpp
[COUNTERS] PROGRAM TOTAL : 1.9285s
[COUNTERS] Fortran Overhead ( 0 ) : 0.3277s
[COUNTERS] CudaCpp MEs ( 2 ) : 0.5237s for 98304 events => throughput is 5.33E-06 events/s
[COUNTERS] CudaCpp HEL ( 3 ) : 0.0007s
[COUNTERS] Fortran X2F ( 4 ) : 0.0964s for 98371 events => throughput is 9.80E-07 events/s
[COUNTERS] Fortran PDF ( 5 ) : 0.8057s for 589824 events => throughput is 1.37E-06 events/s
[COUNTERS] Fortran I/O ( 6 ) : 0.1741s for 98371 events => throughput is 1.77E-06 events/s
But the SECOND execution gives 4.1s, with the big increase coming from the I/O part (and any subsequent execution also gives the same):
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x10_cudacpp
[COUNTERS] PROGRAM TOTAL : 4.1048s
[COUNTERS] Fortran Overhead ( 0 ) : 1.1119s
[COUNTERS] CudaCpp MEs ( 2 ) : 0.5161s for 98304 events => throughput is 5.25E-06 events/s
[COUNTERS] CudaCpp HEL ( 3 ) : 0.0007s
[COUNTERS] Fortran X2F ( 4 ) : 0.0946s for 98371 events => throughput is 9.62E-07 events/s
[COUNTERS] Fortran PDF ( 5 ) : 0.7954s for 589824 events => throughput is 1.35E-06 events/s
[COUNTERS] Fortran I/O ( 6 ) : 1.5861s for 98371 events => throughput is 1.61E-05 events/s
Now the FIRST execution of x1 gives 1.4s!
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
[COUNTERS] PROGRAM TOTAL : 1.4677s
[COUNTERS] Fortran Overhead ( 0 ) : 0.5601s
[COUNTERS] CudaCpp MEs ( 2 ) : 0.0861s for 16384 events => throughput is 5.26E-06 events/s
[COUNTERS] CudaCpp HEL ( 3 ) : 0.0007s
[COUNTERS] Fortran X2F ( 4 ) : 0.0167s for 16399 events => throughput is 1.02E-06 events/s
[COUNTERS] Fortran PDF ( 5 ) : 0.1338s for 98304 events => throughput is 1.36E-06 events/s
[COUNTERS] Fortran I/O ( 6 ) : 0.6702s for 16399 events => throughput is 4.09E-05 events/s
But the SECOND execution gives again 0.7s, and all subsequent executions too (so we are back at the beginning of the loop above):
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
[COUNTERS] PROGRAM TOTAL : 0.7480s
[COUNTERS] Fortran Overhead ( 0 ) : 0.2472s
[COUNTERS] CudaCpp MEs ( 2 ) : 0.0870s for 16384 events => throughput is 5.31E-06 events/s
[COUNTERS] CudaCpp HEL ( 3 ) : 0.0008s
[COUNTERS] Fortran X2F ( 4 ) : 0.0166s for 16399 events => throughput is 1.01E-06 events/s
[COUNTERS] Fortran PDF ( 5 ) : 0.1337s for 98304 events => throughput is 1.36E-06 events/s
[COUNTERS] Fortran I/O ( 6 ) : 0.2628s for 16399 events => throughput is 1.60E-05 events/s
In the following, I will quote results for the second x1 and the first x10 only...
…een defined
I had done this to try and decrease the 4.1s... but in the meantime I understood that the problem is elsewhere. In particular, this is not faster than string comparison - will revert!
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
[COUNTERS] PROGRAM TOTAL : 0.7451s
[COUNTERS] Fortran Overhead ( 0 ) : 0.2426s
[COUNTERS] CudaCpp MEs ( 2 ) : 0.0875s for 16384 events => throughput is 5.34E-06 events/s
[COUNTERS] CudaCpp HEL ( 3 ) : 0.0007s
[COUNTERS] Fortran X2F ( 4 ) : 0.0170s for 16399 events => throughput is 1.04E-06 events/s
[COUNTERS] Fortran PDF ( 5 ) : 0.1342s for 98304 events => throughput is 1.37E-06 events/s
[COUNTERS] Fortran I/O ( 6 ) : 0.2631s for 16399 events => throughput is 1.60E-05 events/s
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x10_cudacpp
[COUNTERS] PROGRAM TOTAL : 1.8970s
[COUNTERS] Fortran Overhead ( 0 ) : 0.3151s
[COUNTERS] CudaCpp MEs ( 2 ) : 0.5182s for 98304 events => throughput is 5.27E-06 events/s
[COUNTERS] CudaCpp HEL ( 3 ) : 0.0007s
[COUNTERS] Fortran X2F ( 4 ) : 0.0952s for 98371 events => throughput is 9.67E-07 events/s
[COUNTERS] Fortran PDF ( 5 ) : 0.7950s for 589824 events => throughput is 1.35E-06 events/s
[COUNTERS] Fortran I/O ( 6 ) : 0.1729s for 98371 events => throughput is 1.76E-06 events/s
…g if a counter has been defined: use string comparison to "", it is not slower
Revert "[prof] in gg_tt.mad counters.cc add a flag showing if a counter has been defined"
This reverts commit ee6f9f5.
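As a side note on the reverted flag, the two "is this counter defined" checks compared in these commits might look like the following sketch; the array-based storage and function names here are illustrative assumptions. Comparing a std::string against "" essentially only inspects the stored length, which is consistent with the observation that it is not slower than an explicit flag.

```cpp
// Sketch of the two ways of testing whether a counter slot is in use
// (storage layout and names are illustrative assumptions, not counters.cc).
#include <string>

struct Counter
{
  std::string name; // empty string means "not registered"
  bool defined = false; // the (reverted) explicit flag
  double totalSeconds = 0;
};

static Counter counters[32];

// Variant 1 (reverted): test the explicit boolean flag
inline bool isDefinedFlag( int id ) { return counters[id].defined; }

// Variant 2 (kept): compare the stored name against "" - for an empty
// comparand this reduces to a length check, so it is essentially free
inline bool isDefinedName( int id ) { return counters[id].name != ""; }
```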
…BLECOUNTERS to disable individual counters
I initially wanted to use this to check whether it is the individual counters that cause the 4.1s in the x10 tests. But in the meantime I understood that the problem is elsewhere, and that the timings depend on execution order! Will probably revert.
Note: the second x1 execution takes 0.7s, with or without CUDACPP_RUNTIME_DISABLECOUNTERS:
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
[COUNTERS] PROGRAM TOTAL : 0.7485s
[COUNTERS] Fortran Overhead ( 0 ) : 0.2472s
[COUNTERS] CudaCpp MEs ( 2 ) : 0.0872s for 16384 events => throughput is 5.32E-06 events/s
[COUNTERS] CudaCpp HEL ( 3 ) : 0.0008s
[COUNTERS] Fortran X2F ( 4 ) : 0.0166s for 16399 events => throughput is 1.01E-06 events/s
[COUNTERS] Fortran PDF ( 5 ) : 0.1346s for 98304 events => throughput is 1.37E-06 events/s
[COUNTERS] Fortran I/O ( 6 ) : 0.2621s for 16399 events => throughput is 1.60E-05 events/s
CUDACPP_RUNTIME_DISABLECOUNTERS=1 ./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
[COUNTERS] PROGRAM TOTAL : 0.7349s
And then the first x10 execution takes 1.9s, with or without CUDACPP_RUNTIME_DISABLECOUNTERS:
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x10_cudacpp
[COUNTERS] PROGRAM TOTAL : 1.9127s
[COUNTERS] Fortran Overhead ( 0 ) : 0.3268s
[COUNTERS] CudaCpp MEs ( 2 ) : 0.5172s for 98304 events => throughput is 5.26E-06 events/s
[COUNTERS] CudaCpp HEL ( 3 ) : 0.0008s
[COUNTERS] Fortran X2F ( 4 ) : 0.0964s for 98371 events => throughput is 9.80E-07 events/s
[COUNTERS] Fortran PDF ( 5 ) : 0.7992s for 589824 events => throughput is 1.36E-06 events/s
[COUNTERS] Fortran I/O ( 6 ) : 0.1723s for 98371 events => throughput is 1.75E-06 events/s
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
CUDACPP_RUNTIME_DISABLECOUNTERS=1 ./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x10_cudacpp
[COUNTERS] PROGRAM TOTAL : 1.8511s
While the SECOND x10 execution takes 4.1s, with or without CUDACPP_RUNTIME_DISABLECOUNTERS:
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x10_cudacpp
[COUNTERS] PROGRAM TOTAL : 4.1152s
[COUNTERS] Fortran Overhead ( 0 ) : 1.1174s
[COUNTERS] CudaCpp MEs ( 2 ) : 0.5173s for 98304 events => throughput is 5.26E-06 events/s
[COUNTERS] CudaCpp HEL ( 3 ) : 0.0008s
[COUNTERS] Fortran X2F ( 4 ) : 0.0950s for 98371 events => throughput is 9.65E-07 events/s
[COUNTERS] Fortran PDF ( 5 ) : 0.8117s for 589824 events => throughput is 1.38E-06 events/s
[COUNTERS] Fortran I/O ( 6 ) : 1.5731s for 98371 events => throughput is 1.60E-05 events/s
CUDACPP_RUNTIME_DISABLECOUNTERS=1 ./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x10_cudacpp
[COUNTERS] PROGRAM TOTAL : 4.0680s
Will therefore revert this.
…CUDACPP_RUNTIME_DISABLECOUNTERS to disable individual counters
Revert "[prof] in gg_tt.mad counters add an env variable CUDACPP_RUNTIME_DISABLECOUNTERS to disable individual counters"
This reverts commit 0681a76.
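The (since reverted) runtime kill switch can be illustrated with a short sketch. The environment variable name CUDACPP_RUNTIME_DISABLECOUNTERS is taken from the commits above; the surrounding function names are illustrative assumptions.

```cpp
// Sketch of a runtime kill switch for the counters, read once from the
// environment (the variable name is from the commits above; the surrounding
// code is an illustrative assumption, not the actual counters.cc).
#include <cstdlib>

namespace counters
{
  bool countersDisabled()
  {
    // getenv is evaluated only once; the result is cached in a static
    static const bool disabled = ( std::getenv( "CUDACPP_RUNTIME_DISABLECOUNTERS" ) != nullptr );
    return disabled;
  }

  void counters_start( int /*id*/ )
  {
    if( countersDisabled() ) return; // skip all timer work when disabled
    // ... otherwise start the timer for this counter as usual ...
  }
}
```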
…ther and make it counter[0]
No change in the timings.
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
[COUNTERS] PROGRAM TOTAL : 0.7531s
[COUNTERS] Fortran Other ( 0 ) : 0.2447s
[COUNTERS] CudaCpp MEs ( 2 ) : 0.0862s for 16384 events => throughput is 5.26E-06 events/s
[COUNTERS] CudaCpp HEL ( 3 ) : 0.0007s
[COUNTERS] Fortran X2F ( 4 ) : 0.0166s for 16399 events => throughput is 1.01E-06 events/s
[COUNTERS] Fortran PDF ( 5 ) : 0.1395s for 98304 events => throughput is 1.42E-06 events/s
[COUNTERS] Fortran I/O ( 6 ) : 0.2653s for 16399 events => throughput is 1.62E-05 events/s
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x10_cudacpp
[COUNTERS] PROGRAM TOTAL : 1.9572s
[COUNTERS] Fortran Other ( 0 ) : 0.3215s
[COUNTERS] CudaCpp MEs ( 2 ) : 0.5202s for 98304 events => throughput is 5.29E-06 events/s
[COUNTERS] CudaCpp HEL ( 3 ) : 0.0007s
[COUNTERS] Fortran X2F ( 4 ) : 0.0941s for 98371 events => throughput is 9.57E-07 events/s
[COUNTERS] Fortran PDF ( 5 ) : 0.8486s for 589824 events => throughput is 1.44E-06 events/s
[COUNTERS] Fortran I/O ( 6 ) : 0.1720s for 98371 events => throughput is 1.75E-06 events/s
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
[COUNTERS] PROGRAM TOTAL : 0.7543s
[COUNTERS] Fortran Other ( 0 ) : 0.2451s
[COUNTERS] Fortran X2F ( 1 ) : 0.0163s for 16399 events => throughput is 9.95E-07 events/s
[COUNTERS] Fortran PDF ( 2 ) : 0.1419s for 98304 events => throughput is 1.44E-06 events/s
[COUNTERS] Fortran I/O ( 3 ) : 0.2617s for 16399 events => throughput is 1.60E-05 events/s
[COUNTERS] CudaCpp HEL ( 5 ) : 0.0007s
[COUNTERS] CudaCpp MEs ( 6 ) : 0.0885s for 16384 events => throughput is 5.40E-06 events/s
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x10_cudacpp
[COUNTERS] PROGRAM TOTAL : 1.9649s
[COUNTERS] Fortran Other ( 0 ) : 0.3239s
[COUNTERS] Fortran X2F ( 1 ) : 0.0951s for 98371 events => throughput is 9.67E-07 events/s
[COUNTERS] Fortran PDF ( 2 ) : 0.8467s for 589824 events => throughput is 1.44E-06 events/s
[COUNTERS] Fortran I/O ( 3 ) : 0.1783s for 98371 events => throughput is 1.81E-06 events/s
[COUNTERS] CudaCpp HEL ( 5 ) : 0.0007s
[COUNTERS] CudaCpp MEs ( 6 ) : 0.5202s for 98304 events => throughput is 5.29E-06 events/s
…xcluded from fortran other calculation)
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
[COUNTERS] PROGRAM TOTAL : 0.7510s
[COUNTERS] Fortran Other ( 0 ) : 0.2485s
[COUNTERS] Fortran X2F ( 1 ) : 0.0163s for 16399 events => throughput is 9.94E-07 events/s
[COUNTERS] Fortran PDF ( 2 ) : 0.1359s for 98304 events => throughput is 1.38E-06 events/s
[COUNTERS] Fortran I/O ( 3 ) : 0.2628s for 16399 events => throughput is 1.60E-05 events/s
[COUNTERS] CudaCpp HEL ( 5 ) : 0.0007s
[COUNTERS] CudaCpp MEs ( 6 ) : 0.0868s for 16384 events => throughput is 5.30E-06 events/s
[COUNTERS] PROGRAM sample_full ( 11 ) : 0.6822s
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x10_cudacpp
[COUNTERS] PROGRAM TOTAL : 1.9135s
[COUNTERS] Fortran Other ( 0 ) : 0.3225s
[COUNTERS] Fortran X2F ( 1 ) : 0.0938s for 98371 events => throughput is 9.54E-07 events/s
[COUNTERS] Fortran PDF ( 2 ) : 0.7961s for 589824 events => throughput is 1.35E-06 events/s
[COUNTERS] Fortran I/O ( 3 ) : 0.1819s for 98371 events => throughput is 1.85E-06 events/s
[COUNTERS] CudaCpp HEL ( 5 ) : 0.0007s
[COUNTERS] CudaCpp MEs ( 6 ) : 0.5184s for 98304 events => throughput is 5.27E-06 events/s
[COUNTERS] PROGRAM sample_full ( 11 ) : 1.8445s
… that what is left is something inside sample_full
Rephrasing: PROGRAM TOTAL = sample_full + initial_I/O, and Fortran Other is inside sample_full.
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
[COUNTERS] PROGRAM TOTAL : 0.7697s
[COUNTERS] Fortran Other ( 0 ) : 0.1810s
[COUNTERS] Fortran X2F ( 1 ) : 0.0166s for 16399 events => throughput is 1.01E-06 events/s
[COUNTERS] Fortran PDF ( 2 ) : 0.1355s for 98304 events => throughput is 1.38E-06 events/s
[COUNTERS] Fortran I/O ( 3 ) : 0.2672s for 16399 events => throughput is 1.63E-05 events/s
[COUNTERS] CudaCpp HEL ( 5 ) : 0.0007s
[COUNTERS] CudaCpp MEs ( 6 ) : 0.0877s for 16384 events => throughput is 5.35E-06 events/s
[COUNTERS] Fortran initial_I/O ( 7 ) : 0.0808s
[COUNTERS] PROGRAM sample_full ( 11 ) : 0.6860s
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x10_cudacpp
[COUNTERS] PROGRAM TOTAL : 2.0621s
[COUNTERS] Fortran Other ( 0 ) : 0.2829s
[COUNTERS] Fortran X2F ( 1 ) : 0.1024s for 98371 events => throughput is 1.04E-06 events/s
[COUNTERS] Fortran PDF ( 2 ) : 0.8580s for 589824 events => throughput is 1.45E-06 events/s
[COUNTERS] Fortran I/O ( 3 ) : 0.1838s for 98371 events => throughput is 1.87E-06 events/s
[COUNTERS] CudaCpp HEL ( 5 ) : 0.0007s
[COUNTERS] CudaCpp MEs ( 6 ) : 0.5532s for 98304 events => throughput is 5.63E-06 events/s
[COUNTERS] Fortran initial_I/O ( 7 ) : 0.0811s
[COUNTERS] PROGRAM sample_full ( 11 ) : 1.9780s
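As a rough check on the x1 numbers above: sample_full (0.6860s) plus initial_I/O (0.0808s) gives 0.7668s, against a PROGRAM TOTAL of 0.7697s; the remaining ~0.003s is presumably untimed startup/teardown outside both counters.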
…side the function to the calling sequence in sample_full
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
[COUNTERS] PROGRAM TOTAL : 0.7679s
[COUNTERS] Fortran Other ( 0 ) : 0.1849s
[COUNTERS] Fortran X2F ( 1 ) : 0.0169s for 16399 events => throughput is 1.03E-06 events/s
[COUNTERS] Fortran PDF ( 2 ) : 0.1380s for 98304 events => throughput is 1.40E-06 events/s
[COUNTERS] Fortran final_I/O ( 3 ) : 0.2611s for 16399 events => throughput is 1.59E-05 events/s
[COUNTERS] CudaCpp HEL ( 5 ) : 0.0008s
[COUNTERS] CudaCpp MEs ( 6 ) : 0.0877s for 16384 events => throughput is 5.35E-06 events/s
[COUNTERS] Fortran initial_I/O ( 7 ) : 0.0785s
[COUNTERS] PROGRAM sample_full ( 11 ) : 0.6862s
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x10_cudacpp
[COUNTERS] PROGRAM TOTAL : 1.9454s
[COUNTERS] Fortran Other ( 0 ) : 0.2618s
[COUNTERS] Fortran X2F ( 1 ) : 0.0961s for 98371 events => throughput is 9.77E-07 events/s
[COUNTERS] Fortran PDF ( 2 ) : 0.8161s for 589824 events => throughput is 1.38E-06 events/s
[COUNTERS] Fortran final_I/O ( 3 ) : 0.1695s for 98371 events => throughput is 1.72E-06 events/s
[COUNTERS] CudaCpp HEL ( 5 ) : 0.0008s
[COUNTERS] CudaCpp MEs ( 6 ) : 0.5216s for 98304 events => throughput is 5.31E-06 events/s
[COUNTERS] Fortran initial_I/O ( 7 ) : 0.0794s
[COUNTERS] PROGRAM sample_full ( 11 ) : 1.8627s
…ing (as "test12" for the moment, wip)
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
[COUNTERS] PROGRAM TOTAL : 0.7447s
[COUNTERS] Fortran Other ( 0 ) : 0.1308s
[COUNTERS] Fortran X2F ( 1 ) : 0.0163s for 16399 events => throughput is 9.93E-07 events/s
[COUNTERS] Fortran PDF ( 2 ) : 0.1328s for 98304 events => throughput is 1.35E-06 events/s
[COUNTERS] Fortran final_I/O ( 3 ) : 0.2614s for 16399 events => throughput is 1.59E-05 events/s
[COUNTERS] CudaCpp HEL ( 5 ) : 0.0007s
[COUNTERS] CudaCpp MEs ( 6 ) : 0.0878s for 16384 events => throughput is 5.36E-06 events/s
[COUNTERS] Fortran initial_I/O ( 7 ) : 0.0649s
[COUNTERS] PROGRAM sample_full ( 11 ) : 0.6768s
[COUNTERS] Fortran TEST ( 12 ) : 0.0499s for 16384 events => throughput is 3.05E-06 events/s
…or the moment, wip)
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
[COUNTERS] PROGRAM TOTAL : 0.7526s
[COUNTERS] Fortran Other ( 0 ) : 0.1163s
[COUNTERS] Fortran X2F ( 1 ) : 0.0165s for 16399 events => throughput is 1.01E-06 events/s
[COUNTERS] Fortran PDF ( 2 ) : 0.1428s for 98304 events => throughput is 1.45E-06 events/s
[COUNTERS] Fortran final_I/O ( 3 ) : 0.2589s for 16399 events => throughput is 1.58E-05 events/s
[COUNTERS] CudaCpp HEL ( 5 ) : 0.0007s
[COUNTERS] CudaCpp MEs ( 6 ) : 0.0870s for 16384 events => throughput is 5.31E-06 events/s
[COUNTERS] Fortran initial_I/O ( 7 ) : 0.0659s
[COUNTERS] PROGRAM sample_full ( 11 ) : 0.6829s
[COUNTERS] Fortran TEST ( 12 ) : 0.0537s for 16384 events => throughput is 3.28E-06 events/s
[COUNTERS] Fortran TEST2 ( 13 ) : 0.0108s for 16384 events => throughput is 6.58E-07 events/s
This essentially completes the identification of all the bottlenecks. Must now clean up the timers (and remove the double counting: "Fortran Other" is now negative?).
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
[COUNTERS] PROGRAM TOTAL : 0.7581s
[COUNTERS] Fortran Other ( 0 ) : -0.0298s
[COUNTERS] Fortran X2F ( 1 ) : 0.0168s for 16399 events => throughput is 1.02E-06 events/s
[COUNTERS] Fortran PDF ( 2 ) : 0.1441s for 98304 events => throughput is 1.47E-06 events/s
[COUNTERS] Fortran final_I/O ( 3 ) : 0.2627s for 16399 events => throughput is 1.60E-05 events/s
[COUNTERS] CudaCpp HEL ( 5 ) : 0.0007s
[COUNTERS] CudaCpp MEs ( 6 ) : 0.0882s for 16384 events => throughput is 5.38E-06 events/s
[COUNTERS] Fortran initial_I/O ( 7 ) : 0.0656s
[COUNTERS] PROGRAM sample_full ( 11 ) : 0.6896s
[COUNTERS] Fortran TEST ( 12 ) : 0.0533s for 16384 events => throughput is 3.25E-06 events/s
[COUNTERS] Fortran TEST2 ( 13 ) : 0.0105s for 16384 events => throughput is 6.41E-07 events/s
[COUNTERS] Fortran TEST5 ( 16 ) : 0.1461s for 16384 events => throughput is 8.91E-06 events/s
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
[COUNTERS] PROGRAM TOTAL : 0.7519s
[COUNTERS] Fortran Other ( 0 ) : -0.0299s
[COUNTERS] Fortran X2F ( 1 ) : 0.0165s for 16399 events => throughput is 1.01E-06 events/s
[COUNTERS] Fortran PDF ( 2 ) : 0.1421s for 98304 events => throughput is 1.45E-06 events/s
[COUNTERS] Fortran final_I/O ( 3 ) : 0.2589s for 16399 events => throughput is 1.58E-05 events/s
[COUNTERS] CudaCpp HEL ( 5 ) : 0.0007s
[COUNTERS] CudaCpp MEs ( 6 ) : 0.0873s for 16384 events => throughput is 5.33E-06 events/s
[COUNTERS] Fortran initial_I/O ( 7 ) : 0.0651s
[COUNTERS] PROGRAM sample_full ( 11 ) : 0.6838s
[COUNTERS] Fortran TEST ( 12 ) : 0.0542s for 16384 events => throughput is 3.31E-06 events/s
[COUNTERS] Fortran TEST2 ( 13 ) : 0.0102s for 16384 events => throughput is 6.26E-07 events/s
[COUNTERS] Fortran TEST5 ( 16 ) : 0.1467s for 16384 events => throughput is 8.95E-06 events/s
…er.f
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
[COUNTERS] PROGRAM TOTAL : 0.7533s
[COUNTERS] Fortran Other ( 0 ) : -0.0253s
[COUNTERS] Fortran X2F ( 1 ) : 0.0165s for 16399 events => throughput is 1.00E-06 events/s
[COUNTERS] Fortran PDF ( 2 ) : 0.1355s for 98304 events => throughput is 1.38E-06 events/s
[COUNTERS] Fortran final_I/O ( 3 ) : 0.2633s for 16399 events => throughput is 1.61E-05 events/s
[COUNTERS] CudaCpp HEL ( 5 ) : 0.0008s
[COUNTERS] CudaCpp MEs ( 6 ) : 0.0897s for 16384 events => throughput is 5.48E-06 events/s
[COUNTERS] Fortran initial_I/O ( 7 ) : 0.0649s
[COUNTERS] PROGRAM sample_full ( 11 ) : 0.6855s
[COUNTERS] Fortran TEST ( 12 ) : 0.0490s for 16384 events => throughput is 2.99E-06 events/s
[COUNTERS] Fortran TEST2 ( 13 ) : 0.0102s for 16384 events => throughput is 6.20E-07 events/s
[COUNTERS] Fortran TEST5 ( 16 ) : 0.1488s for 16384 events => throughput is 9.08E-06 events/s
…g1.f
This changes the overall balance: now Fortran Other is positive again. This is because pdg2pdf is also called elsewhere (e.g. in unwgt?), where it was already being profiled.
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
[COUNTERS] PROGRAM TOTAL : 0.7551s
[COUNTERS] Fortran Other ( 0 ) : 0.0111s
[COUNTERS] Fortran X2F ( 1 ) : 0.0168s for 16399 events => throughput is 1.02E-06 events/s
[COUNTERS] Fortran PDF ( 2 ) : 0.0986s for 32768 events => throughput is 3.01E-06 events/s
[COUNTERS] Fortran final_I/O ( 3 ) : 0.2633s for 16399 events => throughput is 1.61E-05 events/s
[COUNTERS] CudaCpp HEL ( 5 ) : 0.0007s
[COUNTERS] CudaCpp MEs ( 6 ) : 0.0879s for 16384 events => throughput is 5.36E-06 events/s
[COUNTERS] Fortran initial_I/O ( 7 ) : 0.0662s
[COUNTERS] PROGRAM sample_full ( 11 ) : 0.6862s
[COUNTERS] Fortran TEST ( 12 ) : 0.0515s for 16384 events => throughput is 3.14E-06 events/s
[COUNTERS] Fortran TEST2 ( 13 ) : 0.0099s for 16384 events => throughput is 6.07E-07 events/s
[COUNTERS] Fortran TEST5 ( 16 ) : 0.1492s for 16384 events => throughput is 9.11E-06 events/s
Now "Fortran Other" becomes negative again, there is again some double counting ./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp [COUNTERS] PROGRAM TOTAL : 0.7511s [COUNTERS] Fortran Other ( 0 ) : -0.0373s [COUNTERS] Fortran X2F ( 1 ) : 0.0168s for 16399 events => throughput is 1.02E-06 events/s [COUNTERS] Fortran PDF ( 2 ) : 0.0965s for 32768 events => throughput is 2.94E-06 events/s [COUNTERS] Fortran final_I/O ( 3 ) : 0.2598s for 16399 events => throughput is 1.58E-05 events/s [COUNTERS] CudaCpp HEL ( 5 ) : 0.0008s [COUNTERS] CudaCpp MEs ( 6 ) : 0.0868s for 16384 events => throughput is 5.30E-06 events/s [COUNTERS] Fortran initial_I/O ( 7 ) : 0.0670s [COUNTERS] PROGRAM sample_full ( 11 ) : 0.6811s [COUNTERS] Fortran TEST ( 12 ) : 0.0506s for 16384 events => throughput is 3.09E-06 events/s [COUNTERS] Fortran TEST2 ( 13 ) : 0.0099s for 16384 events => throughput is 6.01E-07 events/s [COUNTERS] Fortran TEST3 ( 14 ) : 0.0541s for 16384 events => throughput is 3.30E-06 events/s [COUNTERS] Fortran TEST5 ( 16 ) : 0.1462s for 16384 events => throughput is 8.93E-06 events/s
This makes it clearer that PROGRAM TOTAL = sample_full + initial_I/O.
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
[COUNTERS] PROGRAM TOTAL : 0.7554s
[COUNTERS] Fortran Other ( 0 ) : -0.0393s
[COUNTERS] Fortran X2F ( 1 ) : 0.0171s for 16399 events => throughput is 1.04E-06 events/s
[COUNTERS] Fortran PDF ( 2 ) : 0.0984s for 32768 events => throughput is 3.00E-06 events/s
[COUNTERS] Fortran final_I/O ( 3 ) : 0.2621s for 16399 events => throughput is 1.60E-05 events/s
[COUNTERS] CudaCpp HEL ( 5 ) : 0.0007s
[COUNTERS] CudaCpp MEs ( 6 ) : 0.0872s for 16384 events => throughput is 5.32E-06 events/s
[COUNTERS] Fortran initial_I/O ( 7 ) : 0.0688s
[COUNTERS] Fortran TEST ( 12 ) : 0.0521s for 16384 events => throughput is 3.18E-06 events/s
[COUNTERS] Fortran TEST2 ( 13 ) : 0.0100s for 16384 events => throughput is 6.08E-07 events/s
[COUNTERS] Fortran TEST3 ( 14 ) : 0.0507s for 16384 events => throughput is 3.09E-06 events/s
[COUNTERS] Fortran TEST5 ( 16 ) : 0.1478s for 16384 events => throughput is 9.02E-06 events/s
[COUNTERS] PROGRAM initial_I/O ( 19 ) : 0.0688s
[COUNTERS] PROGRAM sample_full ( 20 ) : 0.6838s
…grouping
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
[COUNTERS] PROGRAM TOTAL : 0.7428s
[COUNTERS] Fortran Other ( 0 ) : -0.0409s
[COUNTERS] Fortran X2F ( 1 ) : 0.0169s for 16399 events => throughput is 1.03E-06 events/s
[COUNTERS] Fortran PDF ( 2 ) : 0.0982s for 32768 events => throughput is 3.00E-06 events/s
[COUNTERS] Fortran final_I/O ( 3 ) : 0.2585s for 16399 events => throughput is 1.58E-05 events/s
[COUNTERS] CudaCpp HEL ( 5 ) : 0.0007s
[COUNTERS] CudaCpp MEs ( 6 ) : 0.0865s for 16384 events => throughput is 5.28E-06 events/s
[COUNTERS] Fortran initial_I/O ( 7 ) : 0.0670s
[COUNTERS] Fortran grouping ( 12 ) : 0.0520s for 16384 events => throughput is 3.17E-06 events/s
[COUNTERS] Fortran scale ( 13 ) : 0.0098s for 16384 events => throughput is 5.98E-07 events/s
[COUNTERS] Fortran rewgt ( 14 ) : 0.0497s for 16384 events => throughput is 3.03E-06 events/s
[COUNTERS] Fortran unwgt ( 16 ) : 0.1445s for 16384 events => throughput is 8.82E-06 events/s
[COUNTERS] PROGRAM initial_I/O ( 19 ) : 0.0670s
[COUNTERS] PROGRAM sample_full ( 20 ) : 0.6728s
…s, which was causing double counting and a negative Fortran Other
The problem is that select_grouping_choice calls dsigproc, which eventually calls dsig1, which includes the pdf profiling.
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
[COUNTERS] PROGRAM TOTAL : 0.7643s
[COUNTERS] Fortran Other ( 0 ) : 0.0111s
[COUNTERS] Fortran X2F ( 1 ) : 0.0164s for 16399 events => throughput is 9.98E-07 events/s
[COUNTERS] Fortran PDF ( 2 ) : 0.1013s for 32768 events => throughput is 3.09E-06 events/s
[COUNTERS] Fortran final_I/O ( 3 ) : 0.2712s for 16399 events => throughput is 1.65E-05 events/s
[COUNTERS] CudaCpp HEL ( 5 ) : 0.0008s
[COUNTERS] CudaCpp MEs ( 6 ) : 0.0874s for 16384 events => throughput is 5.34E-06 events/s
[COUNTERS] Fortran initial_I/O ( 7 ) : 0.0663s
[COUNTERS] Fortran scale ( 13 ) : 0.0103s for 16384 events => throughput is 6.26E-07 events/s
[COUNTERS] Fortran rewgt ( 14 ) : 0.0511s for 16384 events => throughput is 3.12E-06 events/s
[COUNTERS] Fortran unwgt ( 16 ) : 0.1484s for 16384 events => throughput is 9.06E-06 events/s
[COUNTERS] PROGRAM initial_I/O ( 19 ) : 0.0663s
[COUNTERS] PROGRAM sample_full ( 20 ) : 0.6950s
…sig1 (not only dsig1_vec), but it does not show up! - will revert
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
[COUNTERS] PROGRAM TOTAL : 0.7479s
[COUNTERS] Fortran Other ( 0 ) : 0.0122s
[COUNTERS] Fortran X2F ( 1 ) : 0.0166s for 16399 events => throughput is 1.01E-06 events/s
[COUNTERS] Fortran PDF ( 2 ) : 0.0974s for 32768 events => throughput is 2.97E-06 events/s
[COUNTERS] Fortran final_I/O ( 3 ) : 0.2625s for 16399 events => throughput is 1.60E-05 events/s
[COUNTERS] CudaCpp HEL ( 5 ) : 0.0007s
[COUNTERS] CudaCpp MEs ( 6 ) : 0.0873s for 16384 events => throughput is 5.33E-06 events/s
[COUNTERS] Fortran initial_I/O ( 7 ) : 0.0657s
[COUNTERS] Fortran scale ( 13 ) : 0.0102s for 16384 events => throughput is 6.21E-07 events/s
[COUNTERS] Fortran rewgt ( 14 ) : 0.0494s for 16384 events => throughput is 3.01E-06 events/s
[COUNTERS] Fortran unwgt ( 16 ) : 0.1459s for 16384 events => throughput is 8.90E-06 events/s
[COUNTERS] PROGRAM initial_I/O ( 19 ) : 0.0657s
[COUNTERS] PROGRAM sample_full ( 20 ) : 0.6793s
Revert "[prof] in gg_tt.mad auto_dsig1.f, add profiling for matrix1 also in dsig1 (not only dsig1_vec), but it does not show up! - will revert" This reverts commit d3165cb.
…ging
git checkout upstream/master $(git ls-tree --name-only upstream/master */CODEGEN*txt)
…ated code except gg_tt.mad for easier merging
git checkout upstream/master $(git ls-tree --name-only upstream/master *.mad/Source/dsample.f | grep -v ^gg_tt.mad)
…also amd and v1.00.01 fixes) into prof
Fix conflicts (use upstream/master version): epochX/cudacpp/gg_tt.mad/Source/dsample.f
Will then regenerate patches from this gg_tt.mad.
…/master including v1.00.00 and also amd and v1.00.01 fixes
The only files that still need to be patched are:
- 2 in patch.common: Source/dsample.f, SubProcesses/makefile
- 4 in patch.P1: auto_dsig1.f, auto_dsig.f, driver.f, matrix1.f
Note: this is 3 files more than those needed in upstream/master (added Source/dsample.f, auto_dsig1.f, auto_dsig.f).
./CODEGEN/generateAndCompare.sh gg_tt --mad --nopatch
git diff --no-ext-diff -R gg_tt.mad/SubProcesses/makefile gg_tt.mad/Source/dsample.f > CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/MG5aMC_patches/PROD/patch.common
git diff --no-ext-diff -R gg_tt.mad/SubProcesses/P1_gg_ttx/auto_dsig1.f gg_tt.mad/SubProcesses/P1_gg_ttx/auto_dsig.f gg_tt.mad/SubProcesses/P1_gg_ttx/driver.f gg_tt.mad/SubProcesses/P1_gg_ttx/matrix1.f > CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/MG5aMC_patches/PROD/patch.P1
git checkout gg_tt.mad
(Later checked that regenerating gg_tt.mad gives no change.)
…and also amd and v1.00.01 fixes)
git checkout grid $(git ls-tree --name-only grid */CODEGEN*txt)
Fix conflicts: epochX/cudacpp/CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/MG5aMC_patches/PROD/patch.P1 (will regenerate anyway)
…ncard/bldall/tlau) into prof - essentially no change to the version created by fixing conflicts
(NB: THIS IS THE LAST "PROF" CHANGE FOR THE MOMENT - WILL "TEMPORARILY" MOVE TO A SIMPLER "PROF0")
("PROF0" HAS THE NEW TIMERS/COUNTERS OF PROF WITH NEW APIS, BUT NOT THE ADDITIONAL PROFILING OF FORTRAN COMPONENTS)
The only files that still need to be patched are:
- 2 in patch.common: Source/dsample.f, SubProcesses/makefile
- 4 in patch.P1: auto_dsig1.f, auto_dsig.f, driver.f, matrix1.f
Note: this is 3 files more than those needed in upstream/master (added Source/dsample.f, auto_dsig1.f, auto_dsig.f).
./CODEGEN/generateAndCompare.sh gg_tt --mad --nopatch
git diff --no-ext-diff -R gg_tt.mad/SubProcesses/makefile gg_tt.mad/Source/dsample.f > CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/MG5aMC_patches/PROD/patch.common
git diff --no-ext-diff -R gg_tt.mad/SubProcesses/P1_gg_ttx/auto_dsig1.f gg_tt.mad/SubProcesses/P1_gg_ttx/auto_dsig.f gg_tt.mad/SubProcesses/P1_gg_ttx/driver.f gg_tt.mad/SubProcesses/P1_gg_ttx/matrix1.f > CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/MG5aMC_patches/PROD/patch.P1
git checkout gg_tt.mad
(Later checked that regenerating gg_tt.mad gives no change.)
…nal profiling of Fortran components: keep only the new timers and counters
…port the "temporary" changes in auto_dsig1.f (so that they do not need to go to patch.P1)
…after "temporarely" removing additional Fortran profiling (and modifying other CODEGEN fragments accordingly) The only files that still need to be patched are - 1 in patch.common: Source/genps.inc, SubProcesses/makefile - 2 in patch.P1: driver.f, matrix1.f (Note in particular that the 'prof0' changes over 'grid' in auto_dsig1.f are in smatrix_multi.f and output.py) ./CODEGEN/generateAndCompare.sh gg_tt --mad --nopatch git diff --no-ext-diff -R gg_tt.mad/SubProcesses/makefile > CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/MG5aMC_patches/PROD/patch.common git diff --no-ext-diff -R gg_tt.mad/SubProcesses/P1_gg_ttx/driver.f gg_tt.mad/SubProcesses/P1_gg_ttx/matrix1.f > CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/MG5aMC_patches/PROD/patch.P1 git checkout gg_tt.mad (Later regenerated gg_tt.mad and checked that all is ok)
…y" simplification of profiling
…mall-g 72h) - all ok
STARTED AT Mon 07 Oct 2024 01:56:32 AM EEST
./tput/teeThroughputX.sh -dmf -hrd -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -makeclean -nocuda
ENDED(1) AT Mon 07 Oct 2024 02:26:23 AM EEST [Status=0]
./tput/teeThroughputX.sh -d_f -hrd -makej -eemumu -ggtt -ggttgg -inlonly -makeclean -nocuda
ENDED(2) AT Mon 07 Oct 2024 02:36:48 AM EEST [Status=0]
./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -gqttq -ggttgg -ggttggg -d_f -bridge -makeclean -nocuda
ENDED(3) AT Mon 07 Oct 2024 02:44:54 AM EEST [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -rmbhst -nocuda
ENDED(4) AT Mon 07 Oct 2024 02:46:40 AM EEST [Status=0]
SKIP './tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -common -nocuda'
ENDED(5) AT Mon 07 Oct 2024 02:46:40 AM EEST [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -common -nocuda
ENDED(6) AT Mon 07 Oct 2024 02:48:26 AM EEST [Status=0]
./tput/teeThroughputX.sh -dmf -hrd -makej -susyggtt -susyggt1t1 -smeftggtttt -heftggbb -makeclean -nocuda
ENDED(7) AT Mon 07 Oct 2024 03:09:29 AM EEST [Status=0]
No errors found in logs
No FPEs or '{ }' found in logs
…cted
STARTED AT Mon 07 Oct 2024 03:09:30 AM EEST
(SM tests) ENDED(1) AT Mon 07 Oct 2024 05:27:02 AM EEST [Status=0]
(BSM tests) ENDED(1) AT Mon 07 Oct 2024 05:36:08 AM EEST [Status=0]
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_m_inl0_hrd0.txt
12 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt
12 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd0.txt
12 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_m_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_m_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_d_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_m_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_d_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_m_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_m_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_d_inl0_hrd0.txt
1 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_m_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_d_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_m_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_d_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_m_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_d_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_m_inl0_hrd0.txt
…l ok
STARTED AT Mon Oct 7 12:53:50 AM CEST 2024
./tput/teeThroughputX.sh -dmf -hrd -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -makeclean -cpponly
ENDED(1) AT Mon Oct 7 01:12:41 AM CEST 2024 [Status=0]
./tput/teeThroughputX.sh -d_f -hrd -makej -eemumu -ggtt -ggttgg -inlonly -makeclean -cpponly
ENDED(2) AT Mon Oct 7 01:19:47 AM CEST 2024 [Status=0]
./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -gqttq -ggttgg -ggttggg -d_f -bridge -makeclean -cpponly
ENDED(3) AT Mon Oct 7 01:24:43 AM CEST 2024 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -rmbhst -cpponly
ENDED(4) AT Mon Oct 7 01:26:11 AM CEST 2024 [Status=0]
SKIP './tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -common -cpponly'
ENDED(5) AT Mon Oct 7 01:26:11 AM CEST 2024 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -common -cpponly
ENDED(6) AT Mon Oct 7 01:27:39 AM CEST 2024 [Status=0]
./tput/teeThroughputX.sh -dmf -hrd -makej -susyggtt -susyggt1t1 -smeftggtttt -heftggbb -makeclean -cpponly
ENDED(7) AT Mon Oct 7 01:36:12 AM CEST 2024 [Status=0]
No errors found in logs
No FPEs or '{ }' found in logs
valassi changed the title from "Faster RDTSC-based timers and new timer/counter APIs" to "(4 in pipeline) Faster RDTSC-based timers and new timer/counter APIs" on Oct 7, 2024.
Hi @oliviermattelaer, as discussed via email, this is N=4, the last PR in the pipeline that I would like to merge. Again, I changed this to target N=3 for easier review, but I would then merge it into master once approved. Let me know what you think, please. Thanks! Andrea
STARTED AT Mon Oct 7 01:36:12 AM CEST 2024
(SM tests) ENDED(1) AT Mon Oct 7 04:33:19 AM CEST 2024 [Status=0]
(BSM tests) ENDED(1) AT Mon Oct 7 04:38:35 AM CEST 2024 [Status=0]
20 /data/avalassi/GPU2024/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd0.txt
20 /data/avalassi/GPU2024/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd0.txt
20 /data/avalassi/GPU2024/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_m_inl0_hrd0.txt
20 /data/avalassi/GPU2024/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt
20 /data/avalassi/GPU2024/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd0.txt
20 /data/avalassi/GPU2024/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_m_inl0_hrd0.txt
20 /data/avalassi/GPU2024/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0.txt
20 /data/avalassi/GPU2024/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0.txt
20 /data/avalassi/GPU2024/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_m_inl0_hrd0.txt
20 /data/avalassi/GPU2024/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_d_inl0_hrd0.txt
20 /data/avalassi/GPU2024/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_f_inl0_hrd0.txt
20 /data/avalassi/GPU2024/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_m_inl0_hrd0.txt
20 /data/avalassi/GPU2024/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_d_inl0_hrd0.txt
20 /data/avalassi/GPU2024/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_f_inl0_hrd0.txt
20 /data/avalassi/GPU2024/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_m_inl0_hrd0.txt
20 /data/avalassi/GPU2024/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt
20 /data/avalassi/GPU2024/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0.txt
20 /data/avalassi/GPU2024/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_m_inl0_hrd0.txt
20 /data/avalassi/GPU2024/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_d_inl0_hrd0.txt
1 /data/avalassi/GPU2024/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_f_inl0_hrd0.txt
20 /data/avalassi/GPU2024/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_m_inl0_hrd0.txt
20 /data/avalassi/GPU2024/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_d_inl0_hrd0.txt
20 /data/avalassi/GPU2024/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_f_inl0_hrd0.txt
20 /data/avalassi/GPU2024/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_m_inl0_hrd0.txt
20 /data/avalassi/GPU2024/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_d_inl0_hrd0.txt
20 /data/avalassi/GPU2024/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_f_inl0_hrd0.txt
20 /data/avalassi/GPU2024/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_m_inl0_hrd0.txt
20 /data/avalassi/GPU2024/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_d_inl0_hrd0.txt
20 /data/avalassi/GPU2024/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_f_inl0_hrd0.txt
20 /data/avalassi/GPU2024/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_m_inl0_hrd0.txt
… all ok
STARTED AT Mon Oct 7 12:57:24 AM CEST 2024
./tput/teeThroughputX.sh -dmf -hrd -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -makeclean
ENDED(1) AT Mon Oct 7 01:27:08 AM CEST 2024 [Status=0]
./tput/teeThroughputX.sh -d_f -hrd -makej -eemumu -ggtt -ggttgg -inlonly -makeclean
ENDED(2) AT Mon Oct 7 01:36:53 AM CEST 2024 [Status=0]
./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -gqttq -ggttgg -ggttggg -d_f -bridge -makeclean
ENDED(3) AT Mon Oct 7 01:45:56 AM CEST 2024 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -rmbhst
ENDED(4) AT Mon Oct 7 01:48:41 AM CEST 2024 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -curhst
ENDED(5) AT Mon Oct 7 01:51:24 AM CEST 2024 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -d_f -common
ENDED(6) AT Mon Oct 7 01:54:14 AM CEST 2024 [Status=0]
./tput/teeThroughputX.sh -dmf -hrd -makej -susyggtt -susyggt1t1 -smeftggtttt -heftggbb -makeclean
ENDED(7) AT Mon Oct 7 02:06:51 AM CEST 2024 [Status=0]
No errors found in logs
No FPEs or '{ }' found in logs
…cted (heft fail madgraph5#833)
STARTED AT Mon Oct 7 02:06:51 AM CEST 2024
(SM tests) ENDED(1) AT Mon Oct 7 05:53:08 AM CEST 2024 [Status=0]
(BSM tests) ENDED(1) AT Mon Oct 7 06:03:23 AM CEST 2024 [Status=0]
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_d_inl0_hrd0.txt
1 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_m_inl0_hrd0.txt
… branch prof0 (new timers madgraph5#972) and add copyright
For the moment I have moved this back to draft while working on the base PRs.
This PR includes faster RDTSC-based timers and new timer/counter APIs. It completes #972.
This PR ("prof0") was derived from the pre-existing PR #962 ("prof"), by stripping off the second part (additional profiling of non-ME fortran components) and keeping only the first part (new RDTSC based timers and new APIs).
The idea is that the additional profiling of non-ME fortran components will be done at a later time in #962, but it will be modified to include patches in upstream mg5amcnlo as suggested by @oliviermattelaer , rather than relying on patchMad.sh with much larger patches, as is done presently.
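For reference, the core idea of an RDTSC-based timer can be sketched as follows: read the x86 time-stamp counter with the __rdtsc intrinsic at start and stop, and convert tick deltas to seconds via a one-time calibration. This is only an illustration of the technique; the class name, calibration method and API below are assumptions, not the actual code in this PR.

```cpp
// Minimal sketch of an RDTSC-based timer (illustrative only; the timer in
// this PR may be structured differently). Reading the TSC is much cheaper
// than a syscall-backed clock, but ticks must be calibrated against a wall
// clock, and a constant/invariant TSC is assumed.
#include <chrono>
#include <cstdio>
#include <x86intrin.h> // __rdtsc (GCC/Clang on x86-64)

class RdtscTimer
{
public:
  void start() { m_start = __rdtsc(); }
  double stopAndGetSeconds() { return ( __rdtsc() - m_start ) / ticksPerSecond(); }
private:
  static double ticksPerSecond()
  {
    // One-time calibration against std::chrono (an assumption: real code
    // might instead obtain the TSC frequency from the OS or CPUID)
    static const double tps = []() {
      const unsigned long long t0 = __rdtsc();
      const auto c0 = std::chrono::steady_clock::now();
      while( std::chrono::steady_clock::now() - c0 < std::chrono::milliseconds( 100 ) ) {}
      const unsigned long long t1 = __rdtsc();
      return ( t1 - t0 ) / 0.1; // ticks counted over a 100 ms window
    }();
    return tps;
  }
  unsigned long long m_start = 0;
};

int main()
{
  RdtscTimer timer;
  timer.start();
  double x = 0;
  for( int i = 0; i < 10000000; i++ ) x += 1e-7; // some work to time
  std::printf( "elapsed: %.6fs (x=%f)\n", timer.stopAndGetSeconds(), x );
  return 0;
}
```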