GitHub - CMSRA1/alphatwirl: A python library for summarizing event data into multi-dimensional categorical data

A python library for summarizing event data into multivariate categorical data

Description

AlphaTwirl is a python library that summarizes event data into multivariate categorical data as data frames. Event data, input to AlphaTwirl, are data with one entry (or row) for one event: for example, data in ROOT TTrees with one entry per collision event of an LHC experiment at CERN. Event data are often large (too large to be loaded in memory) because they have as many entries as events. Multivariate categorical data, the output of AlphaTwirl, have one row for one category. They are usually small (small enough to be loaded in memory) because they only have as many rows as categories. Users can, for example, import them as data frames into R and pandas, which usually load all data in memory, and can perform categorical data analyses with the rich set of data operations available in R and pandas.

Quick start

Jupyter Notebook: Quick start of AlphaTwirl with qtwirl

CHEP 2018

Tai Sakuma, "AlphaTwirl: a python library for summarizing event data into multi-dimensional categorical data", CHEP 2018, 9-13 July 2018 Sofia, Bulgaria, (indico)

Features

Input format

Event data: input data of alphatwirl are event data in general
- Event data are any data with one entry (row) for one event.
- Data in ROOT trees are typically event data
  - e.g., one entry for one proton-proton collision event
- Event data are often large because they have as many entries as events
  - e.g., they are often stored in many files in a server machine or a dedicated storage system
ROOT trees: the main input format of alphatwirl
- Flat trees: ROOT trees with only primitive types such as int and float and an array of those.
- With additional code to access each class, it is also possible to read trees with persistent objects
  - CMS EDM formats
Users can write modules to support other formats

Output format

Multivariate categorical data: output data of alphatwirl are multivariate categorical data
- They are usually small because they only have as many entries as categories.
  - Often small enough to be stored as text files in a laptop computer.
Fixed width format: text files with fixed width format have been primarily used as output format
- This format is convenient as long as the data size is small. You can browse it with a text editor. You can import it in R and pandas as a data frame.
- An example output file looks like

process htbin njetbin minChi         n
    QCD   400       2      0  8.15e+05
    QCD   400       2   0.05  3.49e+05
    QCD   400       2    0.1  1.18e+05
    QCD   400       2   0.15  3.78e+04
                   ⋮
 TTJets  1200       6   1.45      0.00
 TTJets  1200       6    1.5      0.00

There are plans to support feather.
Users can write modules to support other formats
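Because the output is whitespace-aligned text, it can be read back with very little code. The following is a minimal sketch using only the standard library; `read_table` is a hypothetical helper written for this illustration, not part of alphatwirl, and the sample rows are copied from the example file above with the middle rows omitted.

```python
import io

# Sample text copied from the example output above (middle rows omitted).
text = """\
process htbin njetbin minChi         n
    QCD   400       2      0  8.15e+05
    QCD   400       2   0.05  3.49e+05
 TTJets  1200       6    1.5      0.00
"""

def read_table(stream):
    """Parse a whitespace-aligned table into a list of dicts (hypothetical helper)."""
    header = stream.readline().split()
    rows = []
    for line in stream:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        rows.append(dict(zip(header, fields)))
    return rows

rows = read_table(io.StringIO(text))

# Total the count column "n" for each process.
totals = {}
for row in rows:
    totals[row['process']] = totals.get(row['process'], 0.0) + float(row['n'])
```

With pandas installed, `pandas.read_csv(path, sep=r'\s+')` should load the same file directly as a data frame.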

Split-apply-combine strategy

The general idea of alphatwirl is to employ the split-apply-combine strategy on event data.
- Split event data into groups by categories, apply a function to data in each group, and combine the results as a table of multivariate categorical data.
- Histograms can be created with this strategy: split data into bins, count the number of entries in each bin, and combine the results as a table.
- Summarizing events in alphatwirl is a generalization of creating histograms.
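The strategy can be illustrated without alphatwirl at all. The sketch below uses only the standard library; the event tuples and the `ht_bin` helper are made up for this illustration and do not reflect how alphatwirl is implemented.

```python
from collections import Counter

# Hypothetical events: (process, ht) pairs standing in for real event data.
events = [('QCD', 450.0), ('QCD', 820.0), ('TTJets', 430.0), ('QCD', 410.0)]

def ht_bin(ht, boundaries=(0, 200, 400, 800)):
    """Return the lower edge of the bin that contains ht."""
    lower = None
    for b in boundaries:
        if ht >= b:
            lower = b
    return lower

# split: assign each event to a category (process, htbin)
# apply: count the events in each category
counts = Counter((process, ht_bin(ht)) for process, ht in events)

# combine: one row per category, sorted so the table order is stable
table = [(process, htbin, n) for (process, htbin), n in sorted(counts.items())]
```

Each row of `table` corresponds to one category, which is why the result stays small however many events are processed.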

Keys and values

Keys: categories are defined in terms of keys
Values: values are summarized in each group defined by categories
Keys and values are attributes of the event object. They are either
- stored in the input file
- or created by scribblers

Table configuration

Tables can be configured by a list of python dictionaries.
- The example code below configures five tables

htbin = Binning(boundaries=(0, 200, 400, 800))
njetbin = Binning(boundaries=(1, 2, 3, 4, 5))
tblcfg = [
  dict(keyAttrNames=('mht40', ),
       binnings=(Round(10, 0), ),
       keyOutColumnNames=('mht', )),
  dict(keyAttrNames=('ht40', 'mht40'),
       binnings=(htbin, Round(10, 0)),
       keyOutColumnNames=('ht', 'mht')),
  dict(keyAttrNames=('ht40', 'nJet40', 'mht40'),
       binnings=(htbin, njetbin, Round(10, 0)),
       keyOutColumnNames=('ht', 'njet', 'mht')),
  dict(keyAttrNames=('ht40', 'jet_pt'),
       binnings=(htbin, RoundLog(0.1, 100)),
       keyIndices=(None, 0),
       keyOutColumnNames=('ht', 'jet_pt')),
  dict(keyAttrNames=('ht40', 'jet_pt'),
       binnings=(htbin, RoundLog(0.1, 100)),
       keyIndices=(None, '*'),
       keyOutColumnNames=('ht', 'jet_pt')),
]

A more complex example

dict(
    keyAttrNames=('ieta', 'iphi', 'depth', 'QIE10_index'),
    keyIndices=('(*)', '\\1', '\\1', '\\1'),
    binnings=(echo, echo, echo, echo),
    valAttrNames=('QIE10_energy', ),
    valIndices=('\\1', ),
    keyOutColumnNames=('ieta', 'iphi', 'depth', 'idxQIE10'),
    valOutColumnNames=('energy', ),
    summaryClass=alphatwirl.Summary.Sum
)
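To make the meaning of one such dictionary concrete, here is a rough sketch of how a configuration with scalar keys only might be turned into a table of counts. It is not alphatwirl's implementation: `round_to` is a stand-in for the binning classes, `summarize` is a hypothetical function, and the events are simple namespace objects invented for the example.

```python
from collections import defaultdict
from types import SimpleNamespace

def round_to(width):
    """Stand-in binning: lower edge of an equal-width bin, boundary at 0."""
    return lambda x: (x // width) * width

# Same shape as the dictionaries above, but with stand-in binning callables.
cfg = dict(keyAttrNames=('ht40', 'mht40'),
           binnings=(round_to(200), round_to(10)),
           keyOutColumnNames=('ht', 'mht'))

def summarize(events, cfg):
    """Count events per combination of binned key values (hypothetical)."""
    counts = defaultdict(int)
    for ev in events:
        key = tuple(
            binning(getattr(ev, name))
            for name, binning in zip(cfg['keyAttrNames'], cfg['binnings'])
        )
        counts[key] += 1
    header = cfg['keyOutColumnNames'] + ('n',)
    rows = [key + (n,) for key, n in sorted(counts.items())]
    return header, rows

events = [SimpleNamespace(ht40=450, mht40=23),
          SimpleNamespace(ht40=470, mht40=28),
          SimpleNamespace(ht40=810, mht40=105)]
header, rows = summarize(events, cfg)
```

The key attribute names select what to read from each event, the binnings map the values to categories, and the output column names label the resulting table.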

Indices

Variables are scalars or arrays. Indices specify elements of an array.
Indices can be flexibly configured
- a simple example:
  dict(keyAttrNames=('ht40', 'jet_pt'), keyIndices=(None, 0), ⋯ )
  ht40 is scalar; the index is None. jet_pt is an array; 0 specifies the first element of jet_pt.
- inclusive:
  dict(keyAttrNames=('ht40', 'jet_pt'), keyIndices=(None, '*'), ⋯ )
  '*' means all elements: ht40 is paired with each element of jet_pt.
- all combinations:
  dict(keyAttrNames=('jet_pt', 'muon_pt'), keyIndices=('*', '*'), ⋯ )
  all combinations of jet_pt and muon_pt
- back reference:
  dict(keyAttrNames=('jet_pt', 'jet_eta'), keyIndices=('(*)', '\\1'), ⋯ )
  pairs of jet_pt and jet_eta with the same index. The parentheses in '(*)' indicate that the index is to be remembered. '\\1' refers to the index in the first pair of parentheses.
- a more complex example:
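  dict(keyAttrNames=('jet_pt', 'jet_eta', 'muon_pt', 'muon_eta'), keyIndices=('(*)', '\\1', '(*)', '\\2'), ⋯ )
  '\\2' refers to the index in the second pair of parentheses, so jet_pt and jet_eta share one index and muon_pt and muon_eta share another.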
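The index rules above can be summarized in a short sketch. The function `expand` below is hypothetical and only mimics the behaviour described in this section; skipping rows whose index is out of range is an assumption of the sketch, not something stated above.

```python
import itertools
import re

def expand(values, indices):
    """Yield tuples of key values selected by `indices` (hypothetical helper).

    values  : one entry per key; a scalar or a list
    indices : None, an int, '*', '(*)', or a back reference such as '\\1'
    """
    free = []    # positions of keys that loop over all elements
    groups = []  # positions of keys with '(*)', in order of appearance
    for pos, idx in enumerate(indices):
        if idx in ('*', '(*)'):
            free.append(pos)
            if idx == '(*)':
                groups.append(pos)

    ranges = [range(len(values[pos])) for pos in free]
    for combo in itertools.product(*ranges):
        chosen = dict(zip(free, combo))  # key position -> element index
        row = []
        try:
            for pos, idx in enumerate(indices):
                if idx is None:
                    row.append(values[pos])              # scalar
                elif isinstance(idx, int):
                    row.append(values[pos][idx])         # fixed element
                elif idx in ('*', '(*)'):
                    row.append(values[pos][chosen[pos]])
                else:
                    # back reference: reuse the index remembered by '(*)'
                    ref = int(re.fullmatch(r'\\(\d+)', idx).group(1))
                    row.append(values[pos][chosen[groups[ref - 1]]])
        except IndexError:
            continue  # assumption: skip when the element does not exist
        yield tuple(row)

jet_pt, jet_eta = [100.0, 50.0], [0.1, -1.2]
pairs = list(expand([jet_pt, jet_eta], ('(*)', '\\1')))
```

Here `pairs` holds one tuple per jet, pairing each jet_pt with the jet_eta at the same index, whereas `('*', '*')` would produce every combination of the two arrays.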

Binnings

Four binning classes are implemented
- Binning: bin boundaries are manually specified by a user
  Binning(boundaries=(0, 200, 400, 800))
- Round: equal bin width
  Round(10, 0)
  10 is the bin width and 0 is a boundary. The lower edge of a bin is included. The upper edge belongs to the next bin.
- RoundLog: equal bin width in logarithm
  RoundLog(0.1, 100)
- Echo: the value itself
  Echo(0.1, 100)
Users can write their own custom binning classes
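The arithmetic behind the first three binnings can be sketched as plain functions that return the lower edge of the bin containing a value. The function names are hypothetical and the code is not taken from alphatwirl; in particular, the treatment of values outside the manually specified boundaries is an assumption of this sketch.

```python
import bisect
import math

def round_bin(x, width, boundary):
    """Lower edge of the equal-width bin containing x (cf. Round)."""
    return math.floor((x - boundary) / width) * width + boundary

def round_log_bin(x, width, boundary):
    """Lower edge of the bin containing x, equal width in log10 (cf. RoundLog)."""
    n = math.floor((math.log10(x) - math.log10(boundary)) / width)
    return 10 ** (n * width + math.log10(boundary))

def boundaries_bin(x, boundaries):
    """Lower edge of the manually specified bin containing x (cf. Binning).

    Assumption of this sketch: values below the first boundary give None,
    and values at or above the last boundary fall in a final overflow bin.
    """
    i = bisect.bisect_right(boundaries, x)
    return boundaries[i - 1] if i > 0 else None

edge_linear = round_bin(25, 10, 0)                    # bin [20, 30)
edge_manual = boundaries_bin(250, (0, 200, 400, 800)) # bin [200, 400)
edge_log = round_log_bin(150, 0.1, 100)               # bin starting at 10**2.1
```

In each case the lower edge is included in the bin and the upper edge belongs to the next bin, matching the description of Round above.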

Scribblers

If variables necessary for table configuration or event selection are not in the input file, users can write scribblers to create them on the fly
The variables stored in the input files and the variables created by scribblers can be used as keys and values in the same way in the table configuration and event selection
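As an illustration, a scribbler might look like the sketch below. The `begin` and `event` method names, the list-valued attributes (suggested by expressions such as `ev.ht[0]` in the event selection), and the `HTScribbler` class itself are assumptions made for this example; the event object is a simple stand-in.

```python
from types import SimpleNamespace

class HTScribbler:
    """Hypothetical scribbler: adds `ht40`, the sum of jet pT above 40."""

    def begin(self, event):
        # Attach a list once; it is refilled in place for every event.
        self.ht40 = []
        event.ht40 = self.ht40

    def event(self, event):
        self.ht40[:] = [sum(pt for pt in event.jet_pt if pt >= 40.0)]

# Stand-in for the event object that a real reader would provide.
ev = SimpleNamespace(jet_pt=[])
scribbler = HTScribbler()
scribbler.begin(ev)

results = []
for jet_pt in ([120.0, 60.0, 35.0], [45.0], []):
    ev.jet_pt[:] = jet_pt  # pretend the reader loaded the next event
    scribbler.event(ev)
    results.append(ev.ht40[0])
```

After the scribbler has run, `ev.ht40` can be used in the same way as an attribute read from the input file.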

Event selection

Conditions of event selections can be specified by nested tuples and dictionaries.

dict(All=(
  'ev : ev.ht[0] >= 400',
  'ev : ev.mht[0] >= 200',
  dict(Any=(
    'ev : ev.nJet[0] == 1',
    dict(All=(
      'ev : ev.nJet[0] >= 2',
      'ev : ev.minChi[0] >= 0.7'))
))))

A nested combination of all and any
- All: all conditions need to be met
- Any: at least one of the conditions needs to be met
Users can write their own implementations of All and Any to add functionality, for example, to count the number of events that satisfy each condition
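One way such a nested configuration could be evaluated is sketched below. The function `build_selection` is hypothetical; it treats each string as the body of a lambda expression and evaluates it with `eval`, which is an assumption of this sketch and should only ever be applied to trusted configuration.

```python
from types import SimpleNamespace

def build_selection(cfg):
    """Turn a nested All/Any configuration into a function of an event."""
    if isinstance(cfg, str):
        # 'ev : ev.ht[0] >= 400' is read as the body of a lambda expression.
        return eval('lambda ' + cfg)
    if isinstance(cfg, dict) and len(cfg) == 1:
        (name, items), = cfg.items()
        selections = [build_selection(item) for item in items]
        if name == 'All':
            return lambda ev: all(s(ev) for s in selections)
        if name == 'Any':
            return lambda ev: any(s(ev) for s in selections)
    raise ValueError('unrecognized selection: {!r}'.format(cfg))

selection_cfg = dict(All=(
    'ev : ev.ht[0] >= 400',
    'ev : ev.mht[0] >= 200',
    dict(Any=(
        'ev : ev.nJet[0] == 1',
        dict(All=(
            'ev : ev.nJet[0] >= 2',
            'ev : ev.minChi[0] >= 0.7',
        )),
    )),
))
select = build_selection(selection_cfg)

# Stand-in event with list-valued attributes, as in the conditions above.
ev = SimpleNamespace(ht=[500.0], mht=[250.0], nJet=[3], minChi=[0.8])
passed = select(ev)
```

Replacing the two lambdas for All and Any with small classes would be a natural place to add counters for the number of events passing each condition.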

Dependency injection

Classes in alphatwirl generally operate on abstract classes. (In python, abstract classes don't actually need to exist; duck typing is used instead.)
Particular implementations of most operations are determined at run time: input formats, output formats, a concurrency method, event selections, object selections, categorization, event summarizing methods, summary collecting methods, delivery methods, and even progress bars.
- Furthermore, each particular implementation doesn’t generally depend on the framework either. In fact, the same event selection code can be used in Heppy.
Particular implementations are specified by configuration.
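The idea can be shown with a toy example. Nothing below is alphatwirl code: `run`, `CountSummary` and `SumSummary` are hypothetical names, chosen only to show a loop whose behaviour is decided by the objects passed in at run time.

```python
class CountSummary:
    """One possible summarizer: counts events."""

    def __init__(self):
        self.n = 0

    def add(self, event):
        self.n += 1

class SumSummary:
    """Another summarizer with the same interface: sums one value."""

    def __init__(self, attr):
        self.attr = attr
        self.total = 0.0

    def add(self, event):
        self.total += event[self.attr]

def run(events, select, summary):
    """Loop over events; the behaviour comes entirely from the injected objects."""
    for event in events:
        if select(event):
            summary.add(event)
    return summary

events = [{'ht': 450.0}, {'ht': 150.0}, {'ht': 900.0}]
counted = run(events, lambda ev: ev['ht'] >= 400, CountSummary())
summed = run(events, lambda ev: True, SumSummary('ht'))
```

`run` never checks the type of `summary`; any object with an `add` method works, which is the duck typing mentioned above.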

Fast branch access by addresses

Although alphatwirl uses PyROOT, it does not access branches as attributes of a tree object. Instead, it uses SetBranchAddress(), which is much faster: it can be more than ten times faster.

Multiprocessing

Multiprocessing can be used to concurrently process events
Progress bars grow in parallel on the terminal screen to indicate the progress of each process.

  25.10% ::::::::::                               |      753 /     3000 |:  WJetsToLNu_HT1200to2500_madgraph 
  30.47% ::::::::::::                             |      914 /     3000 |:  WJetsToLNu_HT1200to2500_madgraph 
  29.30% :::::::::::                              |      879 /     3000 |:  WJetsToLNu_HT1200to2500_madgraph 
  85.40% ::::::::::::::::::::::::::::::::::       |      854 /     1000 |:  WJetsToLNu_HT1200to2500_madgraph 
  27.57% :::::::::::                              |      827 /     3000 |:  WJetsToLNu_HT2500toInf_madgraphM 
  25.47% ::::::::::                               |      764 /     3000 |:  WJetsToLNu_HT2500toInf_madgraphM 
  79.60% :::::::::::::::::::::::::::::::          |      796 /     1000 |:  WJetsToLNu_HT2500toInf_madgraphM 
  25.50% ::::::::::                               |      765 /     3000 |:  WJetsToLNu_HT2500toInf_madgraphM

Batch system

Instead of multiprocessing, a batch system can also be used
Currently, the interface to HTCondor is implemented.
Users can write modules to use other batch systems.
While jobs are running in a batch system, the main process is running in the foreground, monitoring the progress of the jobs, and collecting the results as the jobs finish.
Failed jobs are automatically resubmitted.
Jobs can be split in terms of the number of input files and events.
- one input file can be split into multiple jobs
- one job can include multiple input files
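The two splitting cases can be illustrated with a small sketch. The function `split_into_jobs` is hypothetical and much simpler than whatever alphatwirl does internally; it assumes the number of events in each file is already known and caps only the number of events per job.

```python
def split_into_jobs(files, max_events_per_job):
    """Group (path, nevents) pairs into jobs of at most max_events_per_job.

    Each job is a list of (path, start, length) chunks, so one file may be
    split across several jobs and one job may cover several files.
    """
    jobs = []
    current = []
    room = max_events_per_job
    for path, nevents in files:
        start = 0
        while start < nevents:
            length = min(room, nevents - start)
            current.append((path, start, length))
            start += length
            room -= length
            if room == 0:
                # The job is full; start a new one.
                jobs.append(current)
                current = []
                room = max_events_per_job
    if current:
        jobs.append(current)
    return jobs

# Hypothetical file names and event counts.
files = [('a.root', 250), ('b.root', 50), ('c.root', 30)]
jobs = split_into_jobs(files, max_events_per_job=100)
```

In this example `a.root` is spread over three jobs, and the third job also picks up all of `b.root`.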
