WIP: studies on CMS DY #946

Draft
wants to merge 316 commits into
base: master

@valassi valassi commented Aug 2, 2024

This is a WIP PR collecting studies on CMS Drell-Yan that address several issues.

@valassi valassi self-assigned this Aug 2, 2024
@valassi valassi marked this pull request as draft August 2, 2024 10:20
valassi added 28 commits August 7, 2024 13:20
./tlau/lauX.sh -fortran gg_tt.mad -fromgridpack
…one (with backend switch)

./tlau/lauX.sh -fortran gg_tt.mad -fromgridpack
…LL backends (with backend switch)

./tlau/lauX.sh -ALL gg_tt.mad -fromgridpack

What remains TODO
- instrument a better profiling of the time spent
- add events.lhe comparison madgraph5#956 (once fortran/cpp mismatch and second helicity is fixed)
…n itgold91)

CUDACPP_RUNTIME_DISABLEFPE=1 ./tlau/lauX.sh -nomakeclean -fortran pp_dy012j.mad -fromgridpack
…ne (with backend switch)

./tlau/lauX.sh -cppnone gg_tt.mad -fromgridpack
…LL backends (with backend switch)

./tlau/lauX.sh -ALL gg_tt.mad -fromgridpack

What remains TODO
- instrument a better profiling of the time spent
- add events.lhe comparison madgraph5#956 (once fortran/cpp mismatch and second helicity is fixed)
…madevent_interface.py and prepare to modify it

cp -dpr gg_tt.mad/madevent/bin/internal/madevent_interface.py MG5aMC_patches/

It must then be symlinked in gg_tt.mad/madevent/bin/internal:
ln -sf ../../../../MG5aMC_patches/madevent_interface.py .
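
The copy-then-symlink step above can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper name `move_and_symlink` is hypothetical, the demo runs in a throwaway temporary directory with a placeholder file, and it is not part of the tlau scripts.

```python
import os
import shutil
import tempfile


def move_and_symlink(original: str, patches_dir: str) -> str:
    """Copy `original` into `patches_dir` and replace it by a relative symlink.

    Hypothetical helper: returns the relative path stored in the symlink.
    """
    os.makedirs(patches_dir, exist_ok=True)
    target = os.path.join(patches_dir, os.path.basename(original))
    shutil.copy2(original, target)  # keeps timestamps, roughly like `cp -p`
    os.remove(original)
    link_value = os.path.relpath(target, start=os.path.dirname(original))
    os.symlink(link_value, original)  # roughly like `ln -sf <relative path> .`
    return link_value


# Demo in a temporary directory that mimics the layout used above
with tempfile.TemporaryDirectory() as root:
    internal = os.path.join(root, "gg_tt.mad", "madevent", "bin", "internal")
    os.makedirs(internal)
    script = os.path.join(internal, "madevent_interface.py")
    with open(script, "w") as f:
        f.write("# placeholder\n")
    link_value = move_and_symlink(script, os.path.join(root, "MG5aMC_patches"))
    link_is_symlink = os.path.islink(script)
    with open(script) as f:  # read through the symlink
        content = f.read()

print(link_value)
```

The relative link keeps the tree relocatable: the symlink still resolves if the whole directory is moved or cloned elsewhere.
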
… with two P*)

./tlau/lauX.sh -fortran gq_ttq.mad -togridpack
…-format v15 from cvmfs if a more recent version is installed madgraph5#952
…one (with backend switch)

./tlau/lauX.sh -cppnone gq_ttq.mad -fromgridpack
…extra debug printouts)

./tlau/lauX.sh -cppnone gq_ttq.mad -fromgridpack
…rnal/gen_ximprove.py

cp -dpr gg_tt.mad/madevent/bin/internal/gen_ximprove.py MG5aMC_patches
…additional debug printouts in gen_ximprove.py)

./tlau/lauX.sh -cppnone gq_ttq.mad -fromgridpack
…rnal/gen_ximprove.py

cp -dpr gg_tt.mad/madevent/bin/internal/cluster.py MG5aMC_patches
valassi added 29 commits August 23, 2024 14:28
…econds() call and go back to the old getTotalDurationSeconds
…mer overhead if CUDACPP_RUNTIME_REMOVETIMEROVERHEAD is set

However, test counters like sample_get_x need special handling
…UNTERS, remove special meaning of PROGRAM counters
…ng a TEST counter as included in a non-TEST counter, to subtract overheads
…SpaceSampling

These are the first results where timer overhead is removed. They look nice,
but the overhead should be computed in the counters.cc calls rather than in the individual timers
(this would also make more sense with respect to timermap.h, where this will not be possible; the env should be renamed, too).
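
The idea of calibrating and then subtracting timer overhead can be sketched as below. This is a minimal Python illustration, not the counters.cc or timer.h code: it assumes the overhead is estimated by timing many empty start/stop cycles and that the correction is the per-cycle cost times the number of timed calls. The function names are hypothetical.

```python
import time


def measure_cycle_overhead(n_cycles: int) -> float:
    """Return the wall-clock seconds spent in n_cycles empty start/stop pairs."""
    begin = time.perf_counter()
    for _ in range(n_cycles):
        start = time.perf_counter()  # "start" of an empty timed region
        stop = time.perf_counter()   # "stop" of the same region
        _ = stop - start             # what a timer would accumulate
    return time.perf_counter() - begin


def remove_overhead(measured_seconds: float, n_calls: int, per_cycle: float) -> float:
    """Subtract the estimated timer cost from a measured duration (assumed model)."""
    return measured_seconds - n_calls * per_cycle


n_cycles = 100_000
overhead = measure_cycle_overhead(n_cycles)
per_cycle = overhead / n_cycles
print(f"overhead: {overhead:.4f}s for {n_cycles} start/stop cycles")
```

In this model the correction grows with the number of calls, so counters that are called millions of times (such as SampleGetX) are the ones most affected.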

./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] *** USING RDTSC-BASED TIMERS (do not remove timer overhead) ***
 [COUNTERS] PROGRAM TOTAL                         :    4.4608s
 [COUNTERS] Fortran Other                  (  0 ) :    0.1171s
 [COUNTERS] Fortran Initialise(I/O)        (  1 ) :    0.0690s
 [COUNTERS] Fortran PhaseSpaceSampling     (  3 ) :    3.2317s for  1087437 events => throughput is 3.36E+05 events/s
 [COUNTERS] Fortran PDFs                   (  4 ) :    0.0917s for    32768 events => throughput is 3.57E+05 events/s
 [COUNTERS] Fortran UpdateScaleCouplings   (  5 ) :    0.1719s for    16384 events => throughput is 9.53E+04 events/s
 [COUNTERS] Fortran Reweight               (  6 ) :    0.0483s for    16384 events => throughput is 3.39E+05 events/s
 [COUNTERS] Fortran Unweight(LHE-I/O)      (  7 ) :    0.0691s for    16384 events => throughput is 2.37E+05 events/s
 [COUNTERS] Fortran SamplePutPoint         (  8 ) :    0.1276s for  1087437 events => throughput is 8.52E+06 events/s
 [COUNTERS] CudaCpp Initialise             ( 11 ) :    0.4718s
 [COUNTERS] CudaCpp Finalise               ( 12 ) :    0.0269s
 [COUNTERS] CudaCpp MEs                    ( 19 ) :    0.0357s for    16384 events => throughput is 4.59E+05 events/s
 [COUNTERS] TEST    SampleGetX             ( 21 ) :    2.3519s for 14136681 events => throughput is 6.01E+06 events/s
 [COUNTERS] OVERALL NON-MEs                ( 31 ) :    4.4251s
 [COUNTERS] OVERALL MEs                    ( 32 ) :    0.0357s for    16384 events => throughput is 4.59E+05 events/s
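
For reference, the throughput column in these logs is the number of events divided by the time. A small Python sketch that parses one `[COUNTERS]` line and recomputes it is shown below; the regular expression and the function name `parse_counter` are assumptions based only on the log format shown here.

```python
import re

# Matches lines such as
#  [COUNTERS] Fortran PDFs   (  4 ) :    0.0917s for    32768 events => ...
COUNTER_RE = re.compile(
    r"\[COUNTERS\]\s+(?P<name>.+?)\s+\(\s*(?P<id>\d+)\s*\)\s*:\s*"
    r"(?P<seconds>\d+\.\d+)s"
    r"(?:\s+for\s+(?P<events>\d+)\s+events)?"
)


def parse_counter(line):
    """Return a dict for a numbered counter line, or None if it does not match."""
    m = COUNTER_RE.search(line)
    if m is None:
        return None
    seconds = float(m["seconds"])
    events = int(m["events"]) if m["events"] else None
    throughput = events / seconds if events and seconds > 0 else None
    return {
        "name": m["name"],
        "id": int(m["id"]),
        "seconds": seconds,
        "events": events,
        "throughput": throughput,
    }


sample = (
    " [COUNTERS] Fortran PhaseSpaceSampling     (  3 ) :    3.2317s"
    " for  1087437 events => throughput is 3.36E+05 events/s"
)
parsed = parse_counter(sample)
print(f"{parsed['name']}: {parsed['throughput']:.2E} events/s")
# prints "Fortran PhaseSpaceSampling: 3.36E+05 events/s"
```

Lines without a counter id, such as PROGRAM TOTAL, are deliberately not matched by this pattern.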

CUDACPP_RUNTIME_USECHRONOTIMERS=1 \
./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] *** USING STD::CHRONO TIMERS (do not remove timer overhead) ***
 [COUNTERS] PROGRAM TOTAL                         :    5.2204s
 [COUNTERS] Fortran Other                  (  0 ) :    0.1550s
 [COUNTERS] Fortran Initialise(I/O)        (  1 ) :    0.0697s
 [COUNTERS] Fortran PhaseSpaceSampling     (  3 ) :    3.9335s for  1087437 events => throughput is 2.76E+05 events/s
 [COUNTERS] Fortran PDFs                   (  4 ) :    0.0924s for    32768 events => throughput is 3.55E+05 events/s
 [COUNTERS] Fortran UpdateScaleCouplings   (  5 ) :    0.1722s for    16384 events => throughput is 9.52E+04 events/s
 [COUNTERS] Fortran Reweight               (  6 ) :    0.0487s for    16384 events => throughput is 3.36E+05 events/s
 [COUNTERS] Fortran Unweight(LHE-I/O)      (  7 ) :    0.0689s for    16384 events => throughput is 2.38E+05 events/s
 [COUNTERS] Fortran SamplePutPoint         (  8 ) :    0.1401s for  1087437 events => throughput is 7.76E+06 events/s
 [COUNTERS] CudaCpp Initialise             ( 11 ) :    0.4779s
 [COUNTERS] CudaCpp Finalise               ( 12 ) :    0.0263s
 [COUNTERS] CudaCpp MEs                    ( 19 ) :    0.0358s for    16384 events => throughput is 4.58E+05 events/s
 [COUNTERS] TEST    SampleGetX             ( 21 ) :    2.8064s for 14136681 events => throughput is 5.04E+06 events/s
 [COUNTERS] OVERALL NON-MEs                ( 31 ) :    5.1846s
 [COUNTERS] OVERALL MEs                    ( 32 ) :    0.0358s for    16384 events => throughput is 4.58E+05 events/s

CUDACPP_RUNTIME_REMOVETIMEROVERHEAD=1 \
./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
 INFO: RdtscTimer overhead :    0.0179s for 1M start/stop cycles
 [COUNTERS] PROGRAM TOTAL+COUNTEROVERHEAD         :    4.4668s
 [COUNTERS] PROGRAM COUNTEROVERHEAD               :    0.2924s
 -------------------------------------------------------------
 [COUNTERS] *** USING RDTSC-BASED TIMERS (remove timer overhead) ***
 [COUNTERS] PROGRAM TOTAL                         :    4.1745s
 [COUNTERS] Fortran Other                  (  0 ) :    0.1190s
 [COUNTERS] Fortran Initialise(I/O)        (  1 ) :    0.0696s
 [COUNTERS] Fortran PhaseSpaceSampling     (  3 ) :    2.9612s for  1087437 events => throughput is 3.67E+05 events/s
 [COUNTERS] Fortran PDFs                   (  4 ) :    0.0913s for    32768 events => throughput is 3.59E+05 events/s
 [COUNTERS] Fortran UpdateScaleCouplings   (  5 ) :    0.1709s for    16384 events => throughput is 9.59E+04 events/s
 [COUNTERS] Fortran Reweight               (  6 ) :    0.0482s for    16384 events => throughput is 3.40E+05 events/s
 [COUNTERS] Fortran Unweight(LHE-I/O)      (  7 ) :    0.0678s for    16384 events => throughput is 2.42E+05 events/s
 [COUNTERS] Fortran SamplePutPoint         (  8 ) :    0.1125s for  1087437 events => throughput is 9.67E+06 events/s
 [COUNTERS] CudaCpp Initialise             ( 11 ) :    0.4716s
 [COUNTERS] CudaCpp Finalise               ( 12 ) :    0.0266s
 [COUNTERS] CudaCpp MEs                    ( 19 ) :    0.0358s for    16384 events => throughput is 4.58E+05 events/s
 [COUNTERS] TEST    SampleGetX             ( 21 ) :    2.0989s for 14136681 events => throughput is 6.74E+06 events/s
 [COUNTERS] OVERALL NON-MEs                ( 31 ) :    4.1387s
 [COUNTERS] OVERALL MEs                    ( 32 ) :    0.0358s for    16384 events => throughput is 4.58E+05 events/s

CUDACPP_RUNTIME_USECHRONOTIMERS=1 CUDACPP_RUNTIME_REMOVETIMEROVERHEAD=1 \
./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
 INFO: ChronoTimer overhead :    0.0489s for 1M start/stop cycles
 [COUNTERS] PROGRAM TOTAL+COUNTEROVERHEAD         :    5.2779s
 [COUNTERS] PROGRAM COUNTEROVERHEAD               :    0.7998s
 -------------------------------------------------------------
 [COUNTERS] *** USING STD::CHRONO TIMERS (remove timer overhead) ***
 [COUNTERS] PROGRAM TOTAL                         :    4.4781s
 [COUNTERS] Fortran Other                  (  0 ) :    0.1570s
 [COUNTERS] Fortran Initialise(I/O)        (  1 ) :    0.0669s
 [COUNTERS] Fortran PhaseSpaceSampling     (  3 ) :    3.2485s for  1087437 events => throughput is 3.35E+05 events/s
 [COUNTERS] Fortran PDFs                   (  4 ) :    0.0930s for    32768 events => throughput is 3.52E+05 events/s
 [COUNTERS] Fortran UpdateScaleCouplings   (  5 ) :    0.1716s for    16384 events => throughput is 9.55E+04 events/s
 [COUNTERS] Fortran Reweight               (  6 ) :    0.0474s for    16384 events => throughput is 3.46E+05 events/s
 [COUNTERS] Fortran Unweight(LHE-I/O)      (  7 ) :    0.0681s for    16384 events => throughput is 2.41E+05 events/s
 [COUNTERS] Fortran SamplePutPoint         (  8 ) :    0.0929s for  1087437 events => throughput is 1.17E+07 events/s
 [COUNTERS] CudaCpp Initialise             ( 11 ) :    0.4705s
 [COUNTERS] CudaCpp Finalise               ( 12 ) :    0.0266s
 [COUNTERS] CudaCpp MEs                    ( 19 ) :    0.0357s for    16384 events => throughput is 4.59E+05 events/s
 [COUNTERS] TEST    SampleGetX             ( 21 ) :    2.1629s for 14136681 events => throughput is 6.54E+06 events/s
 [COUNTERS] OVERALL NON-MEs                ( 31 ) :    4.4424s
 [COUNTERS] OVERALL MEs                    ( 32 ) :    0.0357s for    16384 events => throughput is 4.59E+05 events/s

CUDACPP_RUNTIME_REMOVETIMEROVERHEAD=1 CUDACPP_RUNTIME_DISABLECALLTIMERS=1 \
./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] PROGRAM TOTAL+COUNTEROVERHEAD         :    3.8210s
 [COUNTERS] PROGRAM COUNTEROVERHEAD               :    0.0000s
 -------------------------------------------------------------
 [COUNTERS] *** USING RDTSC-BASED TIMERS (remove timer overhead) ***
 [COUNTERS] PROGRAM TOTAL                         :    3.8210s

CUDACPP_RUNTIME_USECHRONOTIMERS=1 CUDACPP_RUNTIME_REMOVETIMEROVERHEAD=1 CUDACPP_RUNTIME_DISABLECALLTIMERS=1 \
./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] PROGRAM TOTAL+COUNTEROVERHEAD         :    3.8301s
 [COUNTERS] PROGRAM COUNTEROVERHEAD               :    0.0000s
 -------------------------------------------------------------
 [COUNTERS] *** USING STD::CHRONO TIMERS (remove timer overhead) ***
 [COUNTERS] PROGRAM TOTAL                         :    3.8301s
…s: this will be moved to counters alone

Revert "[prof] in gux_taptamggux.mad timer.h, add instead a getTotalOverheadSeconds() call and go back to the old getTotalDurationSeconds"
This reverts commit ad9b747.

Revert "[prof] in gux_taptamggux.mad timer.h, add the option to remove overhead from getTotalDurationSeconds calls"
This reverts commit 5c0a2ed.
…unter overhead (remove it from timer.h: there will be none for timermap.h)

Rename the env as CUDACPP_RUNTIME_REMOVECOUNTEROVERHEAD to make it clear that this is in the counters.cc infrastructure

These are the results

(1) keep overhead

./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] *** USING RDTSC-BASED TIMERS (do not remove timer overhead) ***
 [COUNTERS] PROGRAM TOTAL                         :    4.5315s
 [COUNTERS] Fortran Other                  (  0 ) :    0.1198s
 [COUNTERS] Fortran Initialise(I/O)        (  1 ) :    0.0678s
 [COUNTERS] Fortran PhaseSpaceSampling     (  3 ) :    3.2691s for  1087437 events => throughput is 3.33E+05 events/s
 [COUNTERS] Fortran PDFs                   (  4 ) :    0.1044s for    32768 events => throughput is 3.14E+05 events/s
 [COUNTERS] Fortran UpdateScaleCouplings   (  5 ) :    0.1757s for    16384 events => throughput is 9.33E+04 events/s
 [COUNTERS] Fortran Reweight               (  6 ) :    0.0543s for    16384 events => throughput is 3.02E+05 events/s
 [COUNTERS] Fortran Unweight(LHE-I/O)      (  7 ) :    0.0731s for    16384 events => throughput is 2.24E+05 events/s
 [COUNTERS] Fortran SamplePutPoint         (  8 ) :    0.1322s for  1087437 events => throughput is 8.23E+06 events/s
 [COUNTERS] CudaCpp Initialise             ( 11 ) :    0.4719s
 [COUNTERS] CudaCpp Finalise               ( 12 ) :    0.0274s
 [COUNTERS] CudaCpp MEs                    ( 19 ) :    0.0358s for    16384 events => throughput is 4.57E+05 events/s
 [COUNTERS] TEST    SampleGetX             ( 21 ) :    2.3686s for 14136681 events => throughput is 5.97E+06 events/s
 [COUNTERS] OVERALL NON-MEs                ( 31 ) :    4.4957s
 [COUNTERS] OVERALL MEs                    ( 32 ) :    0.0358s for    16384 events => throughput is 4.57E+05 events/s

CUDACPP_RUNTIME_USECHRONOTIMERS=1 \
./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] *** USING STD::CHRONO TIMERS (do not remove timer overhead) ***
 [COUNTERS] PROGRAM TOTAL                         :    5.2048s
 [COUNTERS] Fortran Other                  (  0 ) :    0.1559s
 [COUNTERS] Fortran Initialise(I/O)        (  1 ) :    0.0673s
 [COUNTERS] Fortran PhaseSpaceSampling     (  3 ) :    3.9265s for  1087437 events => throughput is 2.77E+05 events/s
 [COUNTERS] Fortran PDFs                   (  4 ) :    0.0993s for    32768 events => throughput is 3.30E+05 events/s
 [COUNTERS] Fortran UpdateScaleCouplings   (  5 ) :    0.1648s for    16384 events => throughput is 9.94E+04 events/s
 [COUNTERS] Fortran Reweight               (  6 ) :    0.0514s for    16384 events => throughput is 3.19E+05 events/s
 [COUNTERS] Fortran Unweight(LHE-I/O)      (  7 ) :    0.0700s for    16384 events => throughput is 2.34E+05 events/s
 [COUNTERS] Fortran SamplePutPoint         (  8 ) :    0.1365s for  1087437 events => throughput is 7.97E+06 events/s
 [COUNTERS] CudaCpp Initialise             ( 11 ) :    0.4711s
 [COUNTERS] CudaCpp Finalise               ( 12 ) :    0.0264s
 [COUNTERS] CudaCpp MEs                    ( 19 ) :    0.0357s for    16384 events => throughput is 4.59E+05 events/s
 [COUNTERS] TEST    SampleGetX             ( 21 ) :    2.8006s for 14136681 events => throughput is 5.05E+06 events/s
 [COUNTERS] OVERALL NON-MEs                ( 31 ) :    5.1691s
 [COUNTERS] OVERALL MEs                    ( 32 ) :    0.0357s for    16384 events => throughput is 4.59E+05 events/s

(2) remove overhead

CUDACPP_RUNTIME_REMOVECOUNTEROVERHEAD=1 \
./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
 INFO: COUNTERS overhead :    0.0331s for 1M start/stop cycles
 [COUNTERS] PROGRAM TOTAL+COUNTEROVERHEAD         :    4.5208s
 [COUNTERS] PROGRAM COUNTEROVERHEAD               :    0.5413s
 -------------------------------------------------------------
 [COUNTERS] *** USING RDTSC-BASED TIMERS (remove timer overhead) ***
 [COUNTERS] PROGRAM TOTAL                         :    3.9795s
 [COUNTERS] Fortran Other                  (  0 ) :    0.1548s
 [COUNTERS] Fortran Initialise(I/O)        (  1 ) :    0.0670s
 [COUNTERS] Fortran PhaseSpaceSampling     (  3 ) :    2.7547s for  1087437 events => throughput is 3.95E+05 events/s
 [COUNTERS] Fortran PDFs                   (  4 ) :    0.0988s for    32768 events => throughput is 3.32E+05 events/s
 [COUNTERS] Fortran UpdateScaleCouplings   (  5 ) :    0.1639s for    16384 events => throughput is 1.00E+05 events/s
 [COUNTERS] Fortran Reweight               (  6 ) :    0.0510s for    16384 events => throughput is 3.21E+05 events/s
 [COUNTERS] Fortran Unweight(LHE-I/O)      (  7 ) :    0.0674s for    16384 events => throughput is 2.43E+05 events/s
 [COUNTERS] Fortran SamplePutPoint         (  8 ) :    0.0898s for  1087437 events => throughput is 1.21E+07 events/s
 [COUNTERS] CudaCpp Initialise             ( 11 ) :    0.4700s
 [COUNTERS] CudaCpp Finalise               ( 12 ) :    0.0266s
 [COUNTERS] CudaCpp MEs                    ( 19 ) :    0.0356s for    16384 events => throughput is 4.60E+05 events/s
 [COUNTERS] TEST    SampleGetX             ( 21 ) :    1.8855s for 14136681 events => throughput is 7.50E+06 events/s
 [COUNTERS] OVERALL NON-MEs                ( 31 ) :    3.9439s
 [COUNTERS] OVERALL MEs                    ( 32 ) :    0.0356s for    16384 events => throughput is 4.60E+05 events/s

CUDACPP_RUNTIME_USECHRONOTIMERS=1 CUDACPP_RUNTIME_REMOVECOUNTEROVERHEAD=1 \
./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
 INFO: COUNTERS overhead :    0.0640s for 1M start/stop cycles
 [COUNTERS] PROGRAM TOTAL+COUNTEROVERHEAD         :    5.3491s
 [COUNTERS] PROGRAM COUNTEROVERHEAD               :    1.0455s
 -------------------------------------------------------------
 [COUNTERS] *** USING STD::CHRONO TIMERS (remove timer overhead) ***
 [COUNTERS] PROGRAM TOTAL                         :    4.3036s
 [COUNTERS] Fortran Other                  (  0 ) :    0.2216s
 [COUNTERS] Fortran Initialise(I/O)        (  1 ) :    0.0692s
 [COUNTERS] Fortran PhaseSpaceSampling     (  3 ) :    3.0230s for  1087437 events => throughput is 3.60E+05 events/s
 [COUNTERS] Fortran PDFs                   (  4 ) :    0.0992s for    32768 events => throughput is 3.30E+05 events/s
 [COUNTERS] Fortran UpdateScaleCouplings   (  5 ) :    0.1652s for    16384 events => throughput is 9.92E+04 events/s
 [COUNTERS] Fortran Reweight               (  6 ) :    0.0504s for    16384 events => throughput is 3.25E+05 events/s
 [COUNTERS] Fortran Unweight(LHE-I/O)      (  7 ) :    0.0684s for    16384 events => throughput is 2.39E+05 events/s
 [COUNTERS] Fortran SamplePutPoint         (  8 ) :    0.0716s for  1087437 events => throughput is 1.52E+07 events/s
 [COUNTERS] CudaCpp Initialise             ( 11 ) :    0.4727s
 [COUNTERS] CudaCpp Finalise               ( 12 ) :    0.0266s
 [COUNTERS] CudaCpp MEs                    ( 19 ) :    0.0357s for    16384 events => throughput is 4.59E+05 events/s
 [COUNTERS] TEST    SampleGetX             ( 21 ) :    1.9427s for 14136681 events => throughput is 7.28E+06 events/s
 [COUNTERS] OVERALL NON-MEs                ( 31 ) :    4.2679s
 [COUNTERS] OVERALL MEs                    ( 32 ) :    0.0357s for    16384 events => throughput is 4.59E+05 events/s
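
As a cross-check of the RDTSC numbers in (2), the following Python sketch redoes the arithmetic. It assumes that the total counter overhead is roughly the per-cycle cost times the number of start/stop calls, and it uses the per-counter event counts from the log as a proxy for the number of calls. Both are assumptions, so only approximate agreement is expected.

```python
# Values copied from the RDTSC log in (2)
PER_MILLION_CYCLES = 0.0331   # "COUNTERS overhead" for 1M start/stop cycles
TOTAL_WITH_OVERHEAD = 4.5208  # PROGRAM TOTAL+COUNTEROVERHEAD
REPORTED_OVERHEAD = 0.5413    # PROGRAM COUNTEROVERHEAD
REPORTED_TOTAL = 3.9795       # PROGRAM TOTAL

# Event counts per counter, used here as a proxy for start/stop calls
EVENTS = {
    "PhaseSpaceSampling": 1087437,
    "PDFs": 32768,
    "UpdateScaleCouplings": 16384,
    "Reweight": 16384,
    "Unweight": 16384,
    "SamplePutPoint": 1087437,
    "MEs": 16384,
    "SampleGetX": 14136681,
}

n_calls = sum(EVENTS.values())
estimated_overhead = n_calls * PER_MILLION_CYCLES / 1e6
corrected_total = TOTAL_WITH_OVERHEAD - REPORTED_OVERHEAD

print(f"calls={n_calls} estimated overhead={estimated_overhead:.4f}s")
print(f"corrected total={corrected_total:.4f}s (reported {REPORTED_TOTAL:.4f}s)")
```

The subtraction reproduces the reported PROGRAM TOTAL, and the estimate lands within about one percent of the reported overhead, which is consistent with the log values being rounded.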

(3) remove overhead, disable individual timers (so here the overhead is 0)

CUDACPP_RUNTIME_REMOVECOUNTEROVERHEAD=1 CUDACPP_RUNTIME_DISABLECALLTIMERS=1 \
./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
 INFO: COUNTERS overhead :    0.0039s for 1M start/stop cycles
 [COUNTERS] PROGRAM TOTAL+COUNTEROVERHEAD         :    3.7998s
 [COUNTERS] PROGRAM COUNTEROVERHEAD               :    0.0000s
 -------------------------------------------------------------
 [COUNTERS] *** USING RDTSC-BASED TIMERS (remove timer overhead) ***
 [COUNTERS] PROGRAM TOTAL                         :    3.7998s

CUDACPP_RUNTIME_USECHRONOTIMERS=1 CUDACPP_RUNTIME_REMOVECOUNTEROVERHEAD=1 CUDACPP_RUNTIME_DISABLECALLTIMERS=1 \
./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
 INFO: COUNTERS overhead :    0.0038s for 1M start/stop cycles
 [COUNTERS] PROGRAM TOTAL+COUNTEROVERHEAD         :    3.9067s
 [COUNTERS] PROGRAM COUNTEROVERHEAD               :    0.0000s
 -------------------------------------------------------------
 [COUNTERS] *** USING STD::CHRONO TIMERS (remove timer overhead) ***
 [COUNTERS] PROGRAM TOTAL                         :    3.9067s
…ter overhead

These are the results

(1) keep overhead

./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] *** USING RDTSC-BASED TIMERS (do not remove timer overhead) ***
 [COUNTERS] PROGRAM TOTAL                         :    4.4766s
 [COUNTERS] Fortran Other                  (  0 ) :    0.1202s
 [COUNTERS] Fortran Initialise(I/O)        (  1 ) :    0.0685s
 [COUNTERS] Fortran PhaseSpaceSampling     (  3 ) :    3.2400s for  1087437 events => throughput is 3.36E+05 events/s
 [COUNTERS] Fortran PDFs                   (  4 ) :    0.1007s for    32768 events => throughput is 3.25E+05 events/s
 [COUNTERS] Fortran UpdateScaleCouplings   (  5 ) :    0.1673s for    16384 events => throughput is 9.79E+04 events/s
 [COUNTERS] Fortran Reweight               (  6 ) :    0.0521s for    16384 events => throughput is 3.14E+05 events/s
 [COUNTERS] Fortran Unweight(LHE-I/O)      (  7 ) :    0.0687s for    16384 events => throughput is 2.38E+05 events/s
 [COUNTERS] Fortran SamplePutPoint         (  8 ) :    0.1237s for  1087437 events => throughput is 8.79E+06 events/s
 [COUNTERS] CudaCpp Initialise             ( 11 ) :    0.4728s
 [COUNTERS] CudaCpp Finalise               ( 12 ) :    0.0269s
 [COUNTERS] CudaCpp MEs                    ( 19 ) :    0.0357s for    16384 events => throughput is 4.59E+05 events/s
 [COUNTERS] TEST    SampleGetX             ( 21 ) :    2.3496s for 14136681 events => throughput is 6.02E+06 events/s
 [COUNTERS] OVERALL NON-MEs                ( 31 ) :    4.4409s
 [COUNTERS] OVERALL MEs                    ( 32 ) :    0.0357s for    16384 events => throughput is 4.59E+05 events/s

CUDACPP_RUNTIME_USECHRONOTIMERS=1 \
./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] *** USING STD::CHRONO TIMERS (do not remove timer overhead) ***
 [COUNTERS] PROGRAM TOTAL                         :    5.3144s
 [COUNTERS] Fortran Other                  (  0 ) :    0.1588s
 [COUNTERS] Fortran Initialise(I/O)        (  1 ) :    0.0674s
 [COUNTERS] Fortran PhaseSpaceSampling     (  3 ) :    4.0191s for  1087437 events => throughput is 2.71E+05 events/s
 [COUNTERS] Fortran PDFs                   (  4 ) :    0.0996s for    32768 events => throughput is 3.29E+05 events/s
 [COUNTERS] Fortran UpdateScaleCouplings   (  5 ) :    0.1660s for    16384 events => throughput is 9.87E+04 events/s
 [COUNTERS] Fortran Reweight               (  6 ) :    0.0508s for    16384 events => throughput is 3.22E+05 events/s
 [COUNTERS] Fortran Unweight(LHE-I/O)      (  7 ) :    0.0704s for    16384 events => throughput is 2.33E+05 events/s
 [COUNTERS] Fortran SamplePutPoint         (  8 ) :    0.1482s for  1087437 events => throughput is 7.34E+06 events/s
 [COUNTERS] CudaCpp Initialise             ( 11 ) :    0.4718s
 [COUNTERS] CudaCpp Finalise               ( 12 ) :    0.0267s
 [COUNTERS] CudaCpp MEs                    ( 19 ) :    0.0357s for    16384 events => throughput is 4.59E+05 events/s
 [COUNTERS] TEST    SampleGetX             ( 21 ) :    2.8646s for 14136681 events => throughput is 4.94E+06 events/s
 [COUNTERS] OVERALL NON-MEs                ( 31 ) :    5.2787s
 [COUNTERS] OVERALL MEs                    ( 32 ) :    0.0357s for    16384 events => throughput is 4.59E+05 events/s

(2) remove overhead

CUDACPP_RUNTIME_REMOVECOUNTEROVERHEAD=1 \
./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
 INFO: COUNTERS overhead :    0.0338s for 1M start/stop cycles
 [COUNTERS] PROGRAM TOTAL+COUNTEROVERHEAD         :    4.8244s
 [COUNTERS] PROGRAM COUNTEROVERHEAD               :    0.8905s
 -------------------------------------------------------------
 [COUNTERS] *** USING RDTSC-BASED TIMERS (remove timer overhead) ***
 [COUNTERS] PROGRAM TOTAL                         :    3.9339s
 [COUNTERS] Fortran Other                  (  0 ) :    0.2954s
 [COUNTERS] Fortran Initialise(I/O)        (  1 ) :    0.0674s
 [COUNTERS] Fortran PhaseSpaceSampling     (  3 ) :    2.7332s for  1087437 events => throughput is 3.98E+05 events/s
 [COUNTERS] Fortran PDFs                   (  4 ) :    0.1003s for    32768 events => throughput is 3.27E+05 events/s
 [COUNTERS] Fortran UpdateScaleCouplings   (  5 ) :    0.1688s for    16384 events => throughput is 9.71E+04 events/s
 [COUNTERS] Fortran Reweight               (  6 ) :    0.0507s for    16384 events => throughput is 3.23E+05 events/s
 [COUNTERS] Fortran Unweight(LHE-I/O)      (  7 ) :    0.0695s for    16384 events => throughput is 2.36E+05 events/s
 [COUNTERS] Fortran SamplePutPoint         (  8 ) :    0.0924s for  1087437 events => throughput is 1.18E+07 events/s
 [COUNTERS] CudaCpp Initialise             ( 11 ) :    0.4692s
 [COUNTERS] CudaCpp Finalise               ( 12 ) :    0.0263s
 [COUNTERS] CudaCpp MEs                    ( 19 ) :    0.0357s for    16384 events => throughput is 4.59E+05 events/s
 [COUNTERS] TEST    SampleGetX             ( 21 ) :    1.8723s for 14136681 events => throughput is 7.55E+06 events/s
 [COUNTERS] OVERALL NON-MEs                ( 31 ) :    3.8982s
 [COUNTERS] OVERALL MEs                    ( 32 ) :    0.0357s for    16384 events => throughput is 4.59E+05 events/s

CUDACPP_RUNTIME_USECHRONOTIMERS=1 CUDACPP_RUNTIME_REMOVECOUNTEROVERHEAD=1 \
./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
 INFO: COUNTERS overhead :    0.0637s for 1M start/stop cycles
 [COUNTERS] PROGRAM TOTAL+COUNTEROVERHEAD         :    5.8826s
 [COUNTERS] PROGRAM COUNTEROVERHEAD               :    1.6786s
 -------------------------------------------------------------
 [COUNTERS] *** USING STD::CHRONO TIMERS (remove timer overhead) ***
 [COUNTERS] PROGRAM TOTAL                         :    4.2040s
 [COUNTERS] Fortran Other                  (  0 ) :    0.4831s
 [COUNTERS] Fortran Initialise(I/O)        (  1 ) :    0.0691s
 [COUNTERS] Fortran PhaseSpaceSampling     (  3 ) :    2.9924s for  1087437 events => throughput is 3.63E+05 events/s
 [COUNTERS] Fortran PDFs                   (  4 ) :    0.0983s for    32768 events => throughput is 3.33E+05 events/s
 [COUNTERS] Fortran UpdateScaleCouplings   (  5 ) :    0.1669s for    16384 events => throughput is 9.81E+04 events/s
 [COUNTERS] Fortran Reweight               (  6 ) :    0.0506s for    16384 events => throughput is 3.24E+05 events/s
 [COUNTERS] Fortran Unweight(LHE-I/O)      (  7 ) :    0.0676s for    16384 events => throughput is 2.42E+05 events/s
 [COUNTERS] Fortran SamplePutPoint         (  8 ) :    0.0698s for  1087437 events => throughput is 1.56E+07 events/s
 [COUNTERS] CudaCpp Initialise             ( 11 ) :    0.4712s
 [COUNTERS] CudaCpp Finalise               ( 12 ) :    0.0267s
 [COUNTERS] CudaCpp MEs                    ( 19 ) :    0.0350s for    16384 events => throughput is 4.68E+05 events/s
 [COUNTERS] TEST    SampleGetX             ( 21 ) :    1.9227s for 14136681 events => throughput is 7.35E+06 events/s
 [COUNTERS] OVERALL NON-MEs                ( 31 ) :    4.1690s
 [COUNTERS] OVERALL MEs                    ( 32 ) :    0.0350s for    16384 events => throughput is 4.68E+05 events/s

(3) remove overhead, disable individual timers (so here the overhead is 0)

CUDACPP_RUNTIME_REMOVECOUNTEROVERHEAD=1 CUDACPP_RUNTIME_DISABLECALLTIMERS=1 \
./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
 INFO: COUNTERS overhead :    0.0333s for 1M start/stop cycles
 [COUNTERS] PROGRAM TOTAL+COUNTEROVERHEAD         :    4.1897s
 [COUNTERS] PROGRAM COUNTEROVERHEAD               :    0.3330s
 -------------------------------------------------------------
 [COUNTERS] *** USING RDTSC-BASED TIMERS (remove timer overhead) ***
 [COUNTERS] PROGRAM TOTAL                         :    3.8567s

CUDACPP_RUNTIME_USECHRONOTIMERS=1 CUDACPP_RUNTIME_REMOVECOUNTEROVERHEAD=1 CUDACPP_RUNTIME_DISABLECALLTIMERS=1 \
./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
 INFO: COUNTERS overhead :    0.0659s for 1M start/stop cycles
 [COUNTERS] PROGRAM TOTAL+COUNTEROVERHEAD         :    4.5119s
 [COUNTERS] PROGRAM COUNTEROVERHEAD               :    0.6594s
 -------------------------------------------------------------
 [COUNTERS] *** USING STD::CHRONO TIMERS (remove timer overhead) ***
 [COUNTERS] PROGRAM TOTAL                         :    3.8525s

(4) do not remove overhead, disable individual timers (remove also the overhead from the estimation of the overhead)
(this test was done on another day on the same machine and build, but the results are compatible with the previous ones)

CUDACPP_RUNTIME_DISABLECALLTIMERS=1 \
./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] *** USING RDTSC-BASED TIMERS (do not remove timer overhead) ***
 [COUNTERS] PROGRAM TOTAL                         :    3.8072s

CUDACPP_RUNTIME_USECHRONOTIMERS=1 CUDACPP_RUNTIME_DISABLECALLTIMERS=1 \
./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] *** USING STD::CHRONO TIMERS (do not remove timer overhead) ***
 [COUNTERS] PROGRAM TOTAL                         :    3.8214s
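
A rough way to read (1) against (4) is to take the difference of the PROGRAM TOTAL values with and without the individual call timers. The Python sketch below does only this subtraction on the numbers quoted above; since (4) was run on another day, the result should be taken as indicative rather than as a precise measurement.

```python
# PROGRAM TOTAL values copied from (1) and (4) above, in seconds
TOTAL_WITH_CALL_TIMERS = {"rdtsc": 4.4766, "chrono": 5.3144}
TOTAL_WITHOUT_CALL_TIMERS = {"rdtsc": 3.8072, "chrono": 3.8214}

# Apparent cost of the instrumentation for each timer type
cost = {
    kind: TOTAL_WITH_CALL_TIMERS[kind] - TOTAL_WITHOUT_CALL_TIMERS[kind]
    for kind in TOTAL_WITH_CALL_TIMERS
}

for kind, seconds in cost.items():
    print(f"{kind}: {seconds:.4f}s spent in call timers (approximate)")
```
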
…r merging

git checkout upstream/master $(git ls-tree --name-only upstream/master */CODEGEN*txt)
…Source/makefile madgraph5#980) into prof

(Checked that regenerating gg_tt.mad is all ok)
…r merging

git checkout upstream/master $(git ls-tree --name-only upstream/master */CODEGEN*txt)
…er merging

git checkout upstream/master $(git ls-tree --name-only upstream/master */CODEGEN*txt)
…adgraph5#980) into cmsdy

Fix conflicts:
- epochX/cudacpp/CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/MG5aMC_patches/PROD/patch.common (remove Source/makefile)
- epochX/cudacpp/CODEGEN/allGenerateAndCompare.sh (add processes from both branches)

(Checked that regenerating gg_tt.mad is ok)
…ier merging

git checkout upstream/master $(git ls-tree --name-only HEAD tput/logs* tmad/logs*)
…nerated code except gg_tt.mad for easier merging

git checkout upstream/master $(git ls-tree --name-only upstream/master *.mad/SubProcesses/P*/auto_dsig1.f | grep -v ^gg_tt.mad)
…dhel, for360) into prof

Fix conflicts:
- epochX/cudacpp/gg_tt.mad/SubProcesses/P1_gg_ttx/auto_dsig1.f (use upstream/master, will add back all counters as in prof)
- epochX/cudacpp/CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/MG5aMC_patches/PROD/patch.P1 (use upstream/master, will regenerate this)
- epochX/cudacpp/CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/MG5aMC_patches/PROD/patch.common (use upstream/master, will regenerate this)
…f branch before merging upstream/master (fix conflicts)
…pstream/master including june24, goodhel, for360

The only files that still need to be patched are
- 2 in patch.common: Source/dsample.f, SubProcesses/makefile
- 4 in patch.P1: auto_dsig1.f, auto_dsig.f, driver.f, matrix1.f

Note: this is 3 files more than those needed in upstream/master (added Source/dsample.f, auto_dsig1.f, auto_dsig.f)

./CODEGEN/generateAndCompare.sh gg_tt --mad --nopatch
git diff --no-ext-diff -R gg_tt.mad/SubProcesses/makefile gg_tt.mad/Source/dsample.f > CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/MG5aMC_patches/PROD/patch.common
git diff --no-ext-diff -R gg_tt.mad/SubProcesses/P1_gg_ttx/auto_dsig1.f gg_tt.mad/SubProcesses/P1_gg_ttx/auto_dsig.f gg_tt.mad/SubProcesses/P1_gg_ttx/driver.f gg_tt.mad/SubProcesses/P1_gg_ttx/matrix1.f > CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/MG5aMC_patches/PROD/patch.P1
git checkout gg_tt.mad

(Later checked that gg_tt.mad can be regenerated ok)
…' (including june24, goodhel, for360) into prof

Also add to the repo a few missing files in gux_taptamggux.mad and nobm_pp_ttW.mad
…ging

git checkout upstream/master $(git ls-tree --name-only upstream/master */CODEGEN*txt)
…ated code except gg_tt.mad for easier merging

git checkout upstream/master $(git ls-tree --name-only upstream/master *.mad/Source/dsample.f | grep -v ^gg_tt.mad)
…also amd and v1.00.01 fixes) into prof

Fix conflicts (use upstream/master version): epochX/cudacpp/gg_tt.mad/Source/dsample.f

Will then regenerate patches from this gg_tt.mad
…/master including v1.00.00 and also amd and v1.00.01 fixes

The only files that still need to be patched are
- 2 in patch.common: Source/dsample.f, SubProcesses/makefile
- 4 in patch.P1: auto_dsig1.f, auto_dsig.f, driver.f, matrix1.f

Note: this is 3 files more than those needed in upstream/master (added Source/dsample.f, auto_dsig1.f, auto_dsig.f)

./CODEGEN/generateAndCompare.sh gg_tt --mad --nopatch
git diff --no-ext-diff -R gg_tt.mad/SubProcesses/makefile gg_tt.mad/Source/dsample.f > CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/MG5aMC_patches/PROD/patch.common
git diff --no-ext-diff -R gg_tt.mad/SubProcesses/P1_gg_ttx/auto_dsig1.f gg_tt.mad/SubProcesses/P1_gg_ttx/auto_dsig.f gg_tt.mad/SubProcesses/P1_gg_ttx/driver.f gg_tt.mad/SubProcesses/P1_gg_ttx/matrix1.f > CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/MG5aMC_patches/PROD/patch.P1
git checkout gg_tt.mad

(Later checked that regenerating gg_tt.mad gives no change)
… v1.00.00 and with AMD and v1.00.01 fixes) into cmsdy

Fix conflicts:
- epochX/cudacpp/CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/MG5aMC_patches/PROD/patch.P1 (manual attempt, will regenerate anyway)
- epochX/cudacpp/CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/MG5aMC_patches/PROD/patch.common (manual attempt, will regenerate anyway)
- epochX/cudacpp/CODEGEN/recreateRefs.sh (use the prof version)
…est prof (with upstream/master v1.00.00 and AMD/v1.00.01 fixes) into cmsdy

The only files that still need to be patched are
- 2 in patch.common: Source/dsample.f, SubProcesses/makefile
- 4 in patch.P1: auto_dsig1.f, auto_dsig.f, driver.f, matrix1.f

Note: this is 3 files more than those needed in upstream/master (added Source/dsample.f, auto_dsig1.f, auto_dsig.f)

./CODEGEN/generateAndCompare.sh gg_tt --mad --nopatch
git diff --no-ext-diff -R gg_tt.mad/SubProcesses/makefile gg_tt.mad/Source/dsample.f > CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/MG5aMC_patches/PROD/patch.common
git diff --no-ext-diff -R gg_tt.mad/SubProcesses/P1_gg_ttx/auto_dsig1.f gg_tt.mad/SubProcesses/P1_gg_ttx/auto_dsig.f gg_tt.mad/SubProcesses/P1_gg_ttx/driver.f gg_tt.mad/SubProcesses/P1_gg_ttx/matrix1.f > CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/MG5aMC_patches/PROD/patch.P1
git checkout gg_tt.mad

valassi commented Oct 5, 2024

Now including the latest prof #962 (including upstream/master with v1.00.00 and with the AMD and v1.00.01 patches #1014 and #1012)

NOT including the latest grid #948... I need to streamline this stuff and make it releasable
