This demo shows the submission of AGC - Analysis Grand Challenge to the REANA, using Snakemake as the workflow engine.
For a full explanation please have a look at this documentation:
The Analysis Grand Challenge (AGC) is about performing the last steps in an analysis pipeline at scale to test workflows envisioned for the HL-LHC. This includes
- columnar data extraction from large datasets,
- processing of that data (event filtering, construction of observables, evaluation of systematic uncertainties) into histograms,
- statistical model construction and statistical inference,
- relevant visualizations for these steps,
The physics analysis task is a datasets/cms-open-data-2015
).
The current reference implementation can be found in analyses/cms-open-data-ttbar
.
We are using 2015 CMS Open Data in this demonstration to showcase an analysis pipeline. The input .root
files are located in the nanoAODschema.json
.
The current coffea AGC version defines the coffea Processor, which includes a lot of the physics analysis details:
- event filtering and the calculation of observables,
- event weighting,
- calculating systematic uncertainties at the event and object level,
- filling all the information into histograms that get aggregated and ultimately returned to us by coffea.
The analysis takes the following inputs:
nanoAODschema.json
input.root
filesSnakefile
The Snakefile forttbar_analysis_reana.ipynb
The main notebook file where files are processed and analysed.file_merging.ipynb
Notebook to merge each processed.root
file in one file with unique keys.final_merging.ipynb
Notebook to merge histograms together all of
REANA supports the Snakemake workflow engine. To ensure optimal execution of the AGC ttbar workflow, we implement a two-level (multicascading) parallelization approach with Snakemake. Initially, Snakemake distributes all jobs across separate nodes, each working on a single .roo
t file for ttbar_analysis_reana.ipynb
.
Once each rule completes, the individual files are merged into one per sample.
#Here is the high level of AGC workflow
+-----------------------------------------+
| Take the CMS open data from nanoaod.json|
+-----------------------------------------+
|
|
|
v
+-----------------------------------+
|rule: Process each file in parallel|
+-----------------------------------+
|
|
|
v
+-----------------------------------------+
|rule: Merge created files for each sample|
+-----------------------------------------+
|
|
|
v
+----------------------------------------------+
|rule: Merge sample files into single histogram|
+----------------------------------------------+
To be able to rerun the AGC after some time, we need to "encapsulate the current compute environment", for example to freeze the ROOT version our analysis is using. We shall achieve this by preparing a Docker container image for our analysis steps.
We are using the modified version of the analysis-systems-base
Docker image container with additional packages, the main on is papermill which allows running Jupyter Notebooks from the command line with additional parameters.
In our case, the Dockerfile creates a conda virtual environment with all necessary packages for running the AGC analysis.
$ less environment/Dockerfile
Let's enter the environment and build it
$ cd environment/
We can build our AGC environment image and give it a name
docker.io/reanahub/reana-demo-agc-cms-ttbar-coffea
:
$ docker build -t docker.io/reanahub/reana-demo-agc-cms-ttbar-coffea .
After this, we can push the image to the DockerHub image registry:
$ docker push docker.io/reanahub/reana-demo-agc-cms-ttbar-coffea
Some data are located at the eos/public so in order to process the big amount of files, user should be authenticated with Kerberos. In our case we achieve it by setting up:
workflow:
type: snakemake
resources:
kerberos: true
file: Snakefile
If you are processing a small number of files (less than 10) you can set this option to False
.
Or you can also set the kerberos authentication via the Snakemake rules.
For deeper understanding please refer to the (REANA documentation)[https://docs.reana.io/advanced-usage/access-control/kerberos/]
The reana.yaml file describes the above analysis structure with its inputs, code, runtime environment, computational workflow steps and expected outputs:
version: 0.8.0
inputs:
files:
- ttbar_analysis_reana.ipynb
- nanoaod_inputs.json
- fix-env.sh
- corrections.json
- Snakefile
- file_merging.ipynb
- final_merging.ipynb
- prepare_workspace.py
directories:
- histograms
- utils
workflow:
type: snakemake
resources:
kerberos: true
file: Snakefile
outputs:
files:
- histograms_merged.root
We can now install the REANA command-line client, run the analysis and download the resulting plots:
$ # create new virtual environment
$ virtualenv ~/.virtualenvs/reana
$ source ~/.virtualenvs/reana/bin/activate
$ # install REANA client
$ pip install reana-client
$ # connect to some REANA cloud instance
$ export REANA_SERVER_URL=https://reana.cern.ch/
$ export REANA_ACCESS_TOKEN=XXXXXXX
$ # run AGC workflow
$ reana-client run -w reana-agc-cms-ttbar-coffea
$ # ... should be finished in around 6 minutes if you select all files (-1 for n_files_max_per_sample) in inputs.yaml
$ reana-client status
$ # list workspace files
$ reana-client ls
Please see the REANA-Client documentation for
more detailed explanation of typical reana-client
usage scenarios.
The output is created under the name histograms.root
, which can be further analyzed using various AGC tools. Below are simple figures of the collected results: