
Contact: [email protected]

Summary

The following contains useful notes for running the code once the SKIMs are already available. The processing chain is: BigNTuples → SKIMs → Histo files → plots → limits.

  • BigNTuples → the NTuples coming from CERN, containing all the event information
  • SKIMS → smaller NTuples in which only the information needed for our analysis is kept (this step avoids having to deal with the large size of the BigNTuples)
  • Histo files → .root files containing the histograms that are then used for plotting, limit extraction, etc.

The workflow is explained in detail below.

Prepare Big Ntuples

Clone this repository on lxplus (using release CMSSW_12_4_14_patch2). Check out the branch 124X_HH_UL and work in the NtupleProducer/test/ folder.

Submission with CRAB

  • The datasets are under NtupleProducer/test/datasets_UL18.txt (similar for other data periods). This file is picked up by NtupleProducer/test/submitAllDatasetOnCrab_LLR.py.
  • You might want to edit the script in the following places:
    • isMC flag set to True/False depending on the samples being processed
    • background MC samples potentially commented out
    • edit analyzer_LLR.py with isMC=True if needed
ssh lxplus;
cd CMSSW_10_6_29/src/;
cmsenv;
cd LLRHiggsTauTau/NtupleProducer/test/;
source /cvmfs/cms.cern.ch/crab3/crab.sh;
python2 submitAllDatasetOnCrab_LLR.py;
  • Visualize the progression with Grafana (sign in with your CERN credentials)
  • Submission outputs are stored under root://eos.grif.fr//eos/grif/cms/llr/store/user/${USER}/HHNtuples_res/ (access them with the gfal-ls tool)

Note #1: Make sure the isMC flag is the same in NtupleProducer/test/submitAllDatasetOnCrab_LLR.py and NtupleProducer/test/analyzer_LLR.py.
Note #2: Common CRAB commands: crab submit / crab submit -d <folder> / crab status
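For instance, a minimal set of commands to monitor a task and inspect its outputs (the crab_projects/ directory name is CRAB's default and an assumption here):

crab status -d crab_projects/<task_folder>    # progress of a single task
crab resubmit -d crab_projects/<task_folder>  # resubmit only the failed jobs
gfal-ls root://eos.grif.fr//eos/grif/cms/llr/store/user/${USER}/HHNtuples_res/  # list the produced NTuples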

Prepare SKIM NTuples

To launch the production of the skimmed samples we use (if you are at LLR, first run source /opt/exp_soft/cms/t3/t3setup):

bash scripts/submit_skims_UL18.sh -t <some_tag> --user bfontana

This calls scripts/makeListOnStorage.py (input file list definition) and scripts/skimNtuple.py (skimming job submission) for each sample. The configuration file is config/skims_UL18.cfg (you may need to change some of its parameters).

One can add the option -n to avoid producing the input file lists, if one is sure they are up to date. All options can be inspected with -h. One can comment out some samples in the script to avoid skimming all of them. Two text files, goodfiles.txt and badfiles.txt, are created automatically (unless empty): they store the lists of “good” and “bad/corrupted” files, as defined by scripts/check_outputs.py (called within scripts/skimNtuple.py). The list of “good” files is used in the subsequent analysis steps; this additional step avoids crashes due to corrupted skimmed files. One can understand the reason for a crash by looking at the files under the logs/ or live_logs/ directories (*.err, *.out and *.log), whose IDs match the ones listed in goodfiles.txt/badfiles.txt.
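To illustrate, a minimal way of tracking down a failure (the exact log file naming is an assumption; the IDs come from badfiles.txt as described above):

cat badfiles.txt      # find the ID of a bad/corrupted file
less logs/<id>.err    # inspect the matching .err (also check the .out and .log files)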

Note: Some logic (a simple lock, transparent to the user) was implemented to ensure no race condition happens when jobs write to the same text files.

Systematic variations

One needs to run the skimming step three times: once for the nominal values, and once each for the up and down variations. This is controlled by the smearVariation variable in the [JetSmearing] section of the skimming configuration files, called skim_*.cfg by default. The value “0” corresponds to the nominal skims, “1” to the up and “-1” to the down variation.
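For illustration, the relevant fragment of a skim_*.cfg would look as follows (INI-style key = value syntax assumed; the section and variable names are taken from the paragraph above):

[JetSmearing]
smearVariation = 0   # 0: nominal skims, 1: up variation, -1: down variation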

Once the skims are prepared, we have to run the calculation of the systematics on them. The process is exactly the same, but this time we run:

source scripts/submit_syst<year>.sh

Job resubmission

To resubmit the jobs listed in badfiles.txt, simply issue the following command:

bash scripts/submit_skims_UL18.sh -t <some_tag> --resubmit

where the tag must be the same one used in the original submission. New log files are named appropriately, to make it clear that they correspond to resubmitted jobs; old logs are not lost. The user can resubmit the jobs as many times as needed, and the most recent badfiles.txt (with a slightly different name) is picked up as input.
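As a sanity check, one can verify which list would be picked up (the exact naming of the resubmission copies is not spelled out here, hence the wildcard):

ls -t badfiles*.txt | head -1   # the most recently modified list is used as input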

Fill cfg of the samples

We fill the file config/sampleCfg_*.cfg with the paths to the various directories containing the SKIM NTuples.

  • The most recent file is config/sampleCfg_UL18.cfg
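Purely as a hypothetical illustration of what an entry might look like (the alias, key name and path below are assumptions; check the existing config/sampleCfg_UL18.cfg for the real syntax):

[TT_fullyLep]
path = /data_CMS/cms/<user>/SKIMS_UL18/SKIM_TT_fullyLep/   # directory holding the SKIM NTuples for this sample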

Fill cfg of selections

We fill selectionCfg_*.cfg with all the chosen selection criteria (the weights are also set in this configuration file). The currently most up-to-date files are:

  • config/selectionCfg_ETau_UL18_template.cfg
  • config/selectionCfg_MuTau_UL18_template.cfg
  • config/selectionCfg_TauTau_UL18_template.cfg

Fill main cfg

We fill the mainCfg_*.cfg file in which we have to specify:

  1. the samples to be included (here we have to use the aliases defined inside the sampleCfg_*.cfg file). The samples to be specified are data, signal and background
  2. the variables to be plotted (these variables must match the ones we are trying to plot inside makeFinalPlots.sh) [e.g. tauH_pt]
  3. the selections to be plotted (these selections must match the ones we are trying to plot inside makeFinalPlots.sh/py) [e.g. baseline, s1b1jresolved, s2b0jresolved, sboostedLL]
  4. in the [pp_QCD] section, the regions to be used for the QCD estimation with the ABCD method

The currently most up-to-date files are:

  • config/mainCfg_ETau_UL18.cfg
  • config/mainCfg_MuTau_UL18.cfg
  • config/mainCfg_TauTau_UL18.cfg
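Schematically, and only as a hypothetical sketch (all key names below are assumptions except the [pp_QCD] section mentioned above; refer to the files listed for the real syntax), the main cfg ties the four items together:

# hypothetical sketch, not the actual syntax
[samples]
list = data_obs, ggFRadion_M300, TT, DY   # aliases from sampleCfg_UL18.cfg (signal/background names are made up)
[variables]
list = tauH_pt
[selections]
list = baseline, s1b1jresolved, s2b0jresolved, sboostedLL
[pp_QCD]
# regions used for the QCD estimation with the ABCD method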

Histogram production

Quick summary: check `submitHistoFiller_parallel.py`, then set the options in `launchHistoFiller.py` and run it.

To launch the production of the histograms for all channels (example):

for i in "E" "Mu" "Tau"; do python scripts/submitHistoFiller.py --cfg config/mainCfg_${i}Tau_UL18.cfg --njobs 20 --tag <tag> --queue short; done

The number of jobs is set to 10 by default. The --tag option can take any value (but it must be different for each channel!); it is used solely for bookkeeping and will be reused during the plotting step. More jobs mean more output files and quicker individual jobs (up to the availability of the cluster); 20 jobs tend to complete in at most 30 minutes. The data used corresponds to what was defined in the sampleCfg_*.cfg file.
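Since the tag must differ across channels, one convention (also used later in the ttbar scale factor section) is to embed the channel in it; for example, with a made-up tag prefix:

for i in "E" "Mu" "Tau"; do python scripts/submitHistoFiller.py --cfg config/mainCfg_${i}Tau_UL18.cfg --njobs 20 --tag myTag_${i}Tau --queue short; done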

The command will create a folder under /data_CMS/cms/${USER}/HHresonant_hist/<tag>/ in which the histograms (in ROOT format), the logs and copies of the used configuration files are stored. To modify the output folder, use the --outdir option. If no errors occurred, all the logs will end with:

@@ ... saving completed, closing output file
... exiting
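A minimal sketch to check the outputs once the jobs are done (the logs/ subdirectory name within the output folder is an assumption):

ls /data_CMS/cms/${USER}/HHresonant_hist/<tag>/outPlots_*.root | wc -l   # should match --njobs
for f in /data_CMS/cms/${USER}/HHresonant_hist/<tag>/logs/*.log; do tail -1 "$f" | grep -q "exiting" || echo "check ${f}"; done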

The above command will run the production of histograms twice over the following configuration sections:

  • [merge_plots]: used for plotting only; DY histograms are usually merged, and the minor backgrounds are merged into a single “other” histogram.
  • [merge_limits]: used for the limit extraction; DY categories and minor backgrounds are kept separate.

Each job produces two ROOT files (outPlots_<id>.root and outLimits_<id>.root) containing histograms that must be merged (see the next step).

Acceptance scale due to systematic variations

The histogram production step must be repeated in order to account for acceptance changes due to the up and down variations (energy-scale-related systematics). The command is the same as above, except:

  • aim for a larger number of jobs
  • use --queue long (default) to be on the safe side
  • point to different configuration files, by default called mainCfg_ETau_UL18_syst_scales.cfg (which in turn calls selectionCfg_ETau_UL18_syst_scales.cfg), and similar for the other channels

For convenience, we report the command below:

for i in "E" "Mu" "Tau"; do python scripts/submitHistoFiller.py --cfg config/mainCfg_${i}Tau_UL18_syst_scales.cfg --njobs 50 --tag <tag>; done

Check the jobs

To check whether the jobs are still running, have finished, or have failed for some reason, launch one of the following:

condor_q # built-in solution, see manual
/opt/exp_soft/cms/t3/t3stat # wrapper at LLR
/opt/exp_soft/cms/t3/t3stat -q # gives only the queue

This gives a live view of the jobs being carried out on the cluster. Statuses:

  • R: running
  • Q: queueing (also known as “Idle” state)

If a job has some issue (for instance, it goes to “Hold” state) you can kill it with:

condor_rm <code_name_of_job>  # cancel a single job
condor_rm -name llrt3condor <username> # cancels all jobs under username
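A couple of other standard HTCondor commands are useful for diagnosis:

condor_q -hold          # list held jobs together with the hold reason
condor_history ${USER}  # inspect jobs that have already left the queue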

Combine histograms

Merge all the outPlots_*.root and outLimits_*.root files, creating the combined_outPlots.root and combined_outLimits.root files, which are used, respectively, for the production of the plots and for the extraction of the final limits. This step also runs the “harvesting”, meant to collect the histograms from three different locations (nominal, up and down variations) into a single place. Update: the harvesting step is currently commented out!

cmsenv
for i in "E" "Mu" "Tau"; do python scripts/combineFillerOutputs.py --cfg mainCfg_${i}Tau_UL18.cfg --tag <tag>; done
rm <storage>/outPlotter_*.root #optional

where --tag is the same as the tag used in the previous step. You can use --dir in case the outputs of the previous step were not stored in the default path.
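As a quick check that the merging succeeded (default output path assumed from the histogram production step):

ls /data_CMS/cms/${USER}/HHresonant_hist/<tag>/combined_out{Plots,Limits}.root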

Final plots

This step is not strictly required to produce the final results; it is however useful to visualize the signal, background and data distributions, including the variables that will later be fitted. To make the final plots we use makeFinalPlots.sh:

for i in "E" "Mu" "Tau"; do bash scripts/makeFinalPlots.sh -t <tag> -c ${i}Tau -s baseline --cfg mainCfg_${i}Tau_UL18.cfg --nodata --nosig; done

where -t points again to the same tag as before, and -c (channel) can be “ETau”, “MuTau” or “TauTau”. The options --nodata and --nosig can be added to remove the corresponding contributions from the final plots. Type -h to see all available options.

Some settings are hard-coded, like which variables to plot (they match the variables specified in the mainCfg_*.cfg). Many variables are also hard-coded in makeFinalPlots.py.

The plots are copied from their local storage to https://${EOS_USER}.web.cern.ch/${EOS_USER}/HH_Plots/${TAG}/${CHANNEL}/${BASELINE}/. With some minor HTML/PHP definitions you will be able to see the plots in your browser.
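If the copy has to be redone by hand, a hedged sketch (the local plots folder name is an assumption; /eos/user/<initial>/<username>/www is the standard area behind ${EOS_USER}.web.cern.ch):

rsync -av plots_<tag>/ /eos/user/${EOS_USER:0:1}/${EOS_USER}/www/HH_Plots/<tag>/   # local folder name is an assumption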

Limits extraction with datacards

The limit extraction is performed via a maximum likelihood fit. This paper summarizes well the statistical techniques employed. The combine tool is used; you can find its documentation here. The legacy limit extraction was done with these scripts. Currently, a single script produces all the results; it runs four steps sequentially, which can also be launched separately if wished (see below). The script was only tested in the scope of the resonant analysis; modifications will be required for the non-resonant one (scattered scripts are stored under KLUBAnalysis/nonResonantLimits).

Warning: The script assumes that all categories contain the substrings “resolved” or “boosted” due to the different signal mass ranges; changing this implies minor modifications to the source code of some shell scripts.

Inspect the run_limits.py file, namely all the variables defined at the bottom; if all looks fine, run the following:

cd ~/CMSSW_11_3_4/src/HiggsAnalysis/CombinedLimit/ # CMSSW release used by combine v9.0.0
cmsenv # pick up all combine commands
cd ~/CMSSW_11_1_9/src/KLUBAnalysis/resonantLimits/ # go back to the KLUB limit extraction folder, in this case the resonant one; do NOT run 'cmsenv'
# adjust the variables defined inside run_limits.py
python3 run_limits.py --dryrun # remove '--dryrun' to actually run the commands

Technical note: The KLUB framework uses release CMSSW_11_1_9, while combine uses a different one, depending on its version. To run the following commands (most of them depending on combine), you have to run cmsenv in the combine release folder, not in the KLUB folder.

The above runs the following steps:

1. Generate datacards

Make sure the configs for the systematics are up to date. Then run make_res_cards.sh which calls the following scripts sequentially:

  • compute_scales.py: calculates systematic acceptance scale factors; not yet supported (to be added soon)
  • prepare_histos.py: removes negative bins from the histograms and applies the scales computed in the previous step to the hadd‘ed histograms of the histogram combination step (by default called combined_outLimits.root). If the default is used, the output files are called prepared_outLimits.root and are stored in the same folder as the input. This step becomes slow when considering mHH categories (TFile::Get() calls seem to scale linearly with the number of histograms in the file); you can run run_limits.py with the -p option to skip it in case you have already run it.
  • write_res_card.py: generates the datacards per channel/category/mass point, running on the output of the previous step. The ABCD regions for the QCD estimate are generated separately. The argument --tag does not refer to the tag used in the histogram production; moreover, it should be the same for all channels. Use the same tag throughout the limits extraction.
bash make_res_cards.sh -d UL18 --channels ETau MuTau TauTau --in_tags 10Feb_ETau_UL18 10Feb_MuTau_UL18 10Feb_TauTau_UL18 --tag <tag> --var DNNoutSM_kl_1 --cfg mainCfg_ETau_UL18.cfg mainCfg_MuTau_UL18.cfg mainCfg_TauTau_UL18.cfg

2. Generate workspaces

make_workspace_res.sh

Generates workspaces for all possible combinations separately.

bash make_workspace_res.sh --tag <tag>
combine_res_categories.sh

Combines datacards from different categories, and generates workspaces for each channel/mass point

# you may want to adjust the variables defined inside
bash combine_res_categories.sh --tag <tag> --masses 250 260 --var DNNoutSM_kl_1 --signal ggFRadion --selections s1b1jresolvedInvMcut s2b0jresolvedInvMcut sboostedLLInvMcut

If, additionally, the --selections list is provided, it uses its elements as prefixes (just like in other scripts), making possible a more fine-grained choice of which selections to group.

combine_res_channels.sh

Combines datacards from all channels (generally ETau, MuTau and TauTau), and generates workspaces for each category/mass point. Creates a directory called cards_<tag>_CombChan/. Supports granular category grouping (useful when defining diHiggs mass categories, for instance) via the --selprefixes option; all categories starting with a specific prefix are grouped together (plus grouping across all channels).

combine_res_all.sh

Combines all datacards for the period, and generates workspaces for each mass point. Creates a directory called cards_<tag>_All/.

3. get_limits_res.sh

Runs combine to extract asymptotic limits for each channel/category/mass point separately, and stores the results in a log file for easy limit plotting. It can also group categories and/or channels using the --mode option.

4. plotSimple_resMass.py

Plots the final limits, taking the log files from the previous step as input. It also supports overlaying channels and/or categories via the --mode option, specifically the overlay_channels and overlay_selections flags.

DY systematics (under development)

mkdir myinference
cd myinference
  • Download the inference package (source):
git clone --recursive ssh://[email protected]:7999/hh/tools/inference.git
cd inference
source setup.sh
law index --verbose
  • Create a folder for the cards to be produced (to be run inside inference/)
cd ..
mkdir cards
cd cards
# copy the cards produced by KLUB
cd ../inference
  • Run in parallel using “stack_DY” and “remove unused shapes” (they modify the datacard files, hence copy them beforehand!)
python stack_and_clean_eTau.py
python stack_and_clean_muTau.py
python stack_and_clean_tauTau.py

ttbar Scale Factors

This section can be run independently from the bulk of the analysis. It follows what was done for the non-resonant analysis (see section 7.3 of AN-18-121). From a technical point of view, it uses many of the histogramming and data preparation scripts developed for the resonant analysis.

Produce histograms

Just like for the standard workflow, but using dedicated configuration files:

for i in "E" "Mu" "Tau"; do python scripts/submitHistoFiller.py --cfg config/mainCfg_${i}Tau_ttCR_UL18.cfg --njobs 30 --tag ttSF_${i}Tau --queue short; done

When defining --tag, please keep the <string>_<channel> structure; it is used by the next scripts. This step produces output files for histogramming and limit extraction, with the exact same structure as in the standard analysis, using a dedicated ttbar control region defined in the selectionCfg_*_ttCR_*.cfg files. Since only a few selections are applied (the ttbar control and signal regions), this step runs very quickly (~5 minutes). Make sure all the variables you store in histograms have the same binning across channels; otherwise it will not be possible to merge them later on.

Combine histograms

Again, this is very similar to the standard workflow and runs quickly on the local machine:

for i in "E" "Mu" "Tau"; do python scripts/combineFillerOutputs.py --cfg mainCfg_${i}Tau_ttCR_UL18.cfg --tag ttSF_${i}Tau; done

Plotting

Just like in the analysis flow:

for i in "E" "Mu" "Tau"; do bash scripts/makeFinalPlots.sh -t <tag> -c ${i}Tau -s ttCR_invMcut --cfg mainCfg_${i}Tau_ttCR_UL18.cfg --nosig; done

You might have to change the hard-coded variables depending on how many you included in the histograms.

Datacard writing and Fitting

We want to obtain datacards for the control region (CR) and for CR+SR (validation). The fit is done automatically and the result will be printed on screen (rate_TT).

cd ~/CMSSW_11_3_4/src/HiggsAnalysis/CombinedLimit/ # CMSSW release used by combine v9.0.0
cmsenv # pick up all combine commands
cd ~/CMSSW_11_1_9/src/KLUBAnalysis/ # go back to KLUB

bash scripts/ttSF_fit.sh --tag ttSF -d UL18 -p --dryrun

Remove --dryrun if you are sure your options are correct. Notice that --tag does not include the channel.

Apply ttSF in the analysis

This is done at the level of the histogram combination. Simply run the following, specifying the tags of the full analysis (which were the inputs to the ttbar scale factor scripts). In this example the SF is 0.785.

# combine ETau histograms with fixed ttbar
for i in "E"; do python scripts/combineFillerOutputs.py --cfg mainCfg_${i}Tau_UL18.cfg --tag SomeTag_${i}Tau --moreTT 0.785; done

# plot the histograms
for i in "E"; do bash scripts/makeFinalPlots.sh -t SFCheck_${i}Tau_2d -c ${i}Tau -s baselineInvMcut --cfg mainCfg_${i}Tau_UL18.cfg --nosig --moreTT 0.785; done

The command line option --moreTT allows testing several values without changing the main tag.
