A python library for summarizing event data into multivariate categorical data
AlphaTwirl is a Python library that summarizes event data into multivariate categorical data as data frames. Event data, the input to AlphaTwirl, are data with one entry (or row) per event: for example, data in ROOT TTrees with one entry per collision event of an LHC experiment at CERN. Event data are often large, too large to be loaded in memory, because they have as many entries as events. Multivariate categorical data, the output of AlphaTwirl, have one row per category. They are usually small, small enough to be loaded in memory, because they only have as many rows as categories. Users can, for example, import them as data frames into R and pandas, which usually load all data in memory, and perform categorical data analyses with the rich set of data operations available in R and pandas.
- Jupyter Notebook: Quick start of AlphaTwirl with qtwirl
- Tai Sakuma, "AlphaTwirl: a python library for summarizing event data into multi-dimensional categorical data", CHEP 2018, 9-13 July 2018 Sofia, Bulgaria, (indico)
- Event data: input data of alphatwirl are event data in general
- Event data are any data with one entry (row) for one event.
- Data in ROOT trees are typically event data
- e.g., one entry for one proton-proton collision event
- Event data are often large because they have as many entries as events
- e.g., they are often stored in many files on a server machine or a dedicated storage system
- ROOT trees: the main input format of alphatwirl
- Flat trees: ROOT trees with only primitive types, such as int and float, and arrays of those.
- With additional code to access each class, it is also possible to read trees with persistent objects
- Users can write modules to support other formats
- Multivariate categorical data: output data of alphatwirl are multivariate categorical data
- They are usually small because they only have as many entries as categories.
- Often small enough to be stored as text files on a laptop computer.
- Fixed width format: text files in fixed width format have been the primary output format
process htbin njetbin minChi        n
QCD       400       2      0 8.15e+05
QCD       400       2   0.05 3.49e+05
QCD       400       2    0.1 1.18e+05
QCD       400       2   0.15 3.78e+04
⋮
TTJets   1200       6   1.45     0.00
TTJets   1200       6    1.5     0.00
- There are plans to support feather.
- Users can write modules to support other formats
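Because the output is a small whitespace-separated table, it can be loaded into pandas in one call. The following is a minimal sketch, assuming the file layout shown above; the rows are copied from that example, and the in-memory string stands in for a file on disk.

```python
import io

import pandas as pd

# A few rows in the fixed-width layout shown above (copied from the example).
text = """\
process htbin njetbin minChi        n
QCD       400       2      0 8.15e+05
QCD       400       2   0.05 3.49e+05
TTJets   1200       6    1.5 0.00
"""

# Columns are separated by runs of whitespace, so a regex separator is enough.
df = pd.read_csv(io.StringIO(text), sep=r"\s+")

# Once loaded, the usual data frame operations apply, e.g. totals per process.
total_by_process = df.groupby("process")["n"].sum()
```

For a real file, the `io.StringIO(text)` argument would be replaced by the path to the output file.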
- The general idea of alphatwirl is to employ the split-apply-combine strategy on event data.
- Split event data into groups by categories, apply a function to data in each group, and combine the results as a table of multivariate categorical data.
- Histograms can be created with this strategy: split data into bins, count the number of entries in each bin, and combine the results as a table.
- Summarizing events in alphatwirl is a generalization of creating histograms.
- Keys: categories are defined in terms of keys
- Values: values are summarized in each group defined by categories
- Keys and values are attributes of the event object. They are either
- stored in the input file
- or created by scribblers
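The split-apply-combine idea with keys and values can be sketched in plain Python. This is only an illustration of the strategy, not AlphaTwirl's API: the function `summarize`, the event attributes `ht`, `njet`, and `weight`, and the bin width of 200 are all hypothetical.

```python
from collections import defaultdict
from types import SimpleNamespace


def summarize(events, key_funcs, val_func):
    """Sum val_func(event) within each category defined by key_funcs."""
    table = defaultdict(float)
    for ev in events:                      # events are read one at a time
        key = tuple(f(ev) for f in key_funcs)  # split: find the category
        table[key] += val_func(ev)         # apply: accumulate within the group
    return sorted(table.items())           # combine: one row per category


# Hypothetical events; real event data would be streamed from input files.
events = [
    SimpleNamespace(ht=250.0, njet=2, weight=1.0),
    SimpleNamespace(ht=270.0, njet=2, weight=0.5),
    SimpleNamespace(ht=450.0, njet=3, weight=2.0),
]


def ht_bin(ev):
    """Lower edge of a 200-wide ht bin (an illustrative choice)."""
    return int(ev.ht // 200) * 200


rows = summarize(events, (ht_bin, lambda ev: ev.njet), lambda ev: ev.weight)
```

Only the small table is kept in memory; the events themselves are visited once and discarded, which is why the input can be much larger than the output.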
- Tables can be configured by a list of python dictionaries.
- The example code below configures five tables
from alphatwirl.binning import Binning, Round, RoundLog
htbin = Binning(boundaries=(0, 200, 400, 800))
njetbin = Binning(boundaries=(1, 2, 3, 4, 5))
tblcfg = [
    dict(keyAttrNames=('mht40', ),
         binnings=(Round(10, 0), ),
         keyOutColumnNames=('mht', )),
    dict(keyAttrNames=('ht40', 'mht40'),
         binnings=(htbin, Round(10, 0)),
         keyOutColumnNames=('ht', 'mht')),
    dict(keyAttrNames=('ht40', 'nJet40', 'mht40'),
         binnings=(htbin, njetbin, Round(10, 0)),
         keyOutColumnNames=('ht', 'njet', 'mht')),
    dict(keyAttrNames=('ht40', 'jet_pt'),
         binnings=(htbin, RoundLog(0.1, 100)),
         keyIndices=(None, 0),
         keyOutColumnNames=('ht', 'jet_pt')),
    dict(keyAttrNames=('ht40', 'jet_pt'),
         binnings=(htbin, RoundLog(0.1, 100)),
         keyIndices=(None, '*'),
         keyOutColumnNames=('ht', 'jet_pt')),
]
- A more complex example
dict(
keyAttrNames=('ieta', 'iphi', 'depth', 'QIE10_index'),
keyIndices=('(*)', '\\1', '\\1', '\\1'),
binnings=(echo, echo, echo, echo),
valAttrNames=('QIE10_energy', ),
valIndices=('\\1', ),
keyOutColumnNames=('ieta', 'iphi', 'depth', 'idxQIE10'),
valOutColumnNames=('energy', ),
summaryClass=alphatwirl.Summary.Sum
)
- Variables are scalars or arrays. Indices specify elements of an array.
- Indices can be flexibly configured
- a simple example:
dict(keyAttrNames=('ht40', 'jet_pt'), keyIndices=(None, 0), ⋯ )
ht40 is a scalar; the index is None. jet_pt is an array; 0 specifies the first element of jet_pt.
- inclusive:
dict(keyAttrNames=('ht40', 'jet_pt'), keyIndices=(None, '*'), ⋯ )
'*' means all elements: all pairs of ht40 and an element of jet_pt.
- all combinations:
dict(keyAttrNames=('jet_pt', 'muon_pt'), keyIndices=('*', '*'), ⋯ )
all combinations of jet_pt and muon_pt.
- back reference:
dict(keyAttrNames=('jet_pt', 'jet_eta'), keyIndices=('(*)', '\\1'), ⋯ )
pairs of jet_pt and jet_eta with the same index. The parentheses in '(*)' indicate that the index is to be remembered. '\\1' refers to the index in the first pair of parentheses.
- a more complex example:
dict(keyAttrNames=('jet_pt', 'jet_eta', 'muon_pt', 'muon_eta'), keyIndices=('(*)', '\\1', '(*)', '\\2'), ⋯ )
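To make the index patterns concrete, the sketch below spells out in plain Python which key tuples each pattern selects for a single event. It is an illustration of the semantics described above, not how alphatwirl implements them; the sample values are made up, and the reading of the last pattern (jets paired with themselves, muons paired with themselves, then all jet-muon combinations) is an assumption.

```python
from itertools import product

# Hypothetical content of one event.
ht40 = 450.0
jet_pt = [120.0, 80.0, 35.0]
jet_eta = [0.4, -1.2, 2.1]
muon_pt = [25.0, 12.0]
muon_eta = [1.1, -0.3]

# keyIndices=(None, 0): the scalar with the first element of the array.
first = [(ht40, jet_pt[0])]

# keyIndices=(None, '*'): the scalar paired with every element.
inclusive = [(ht40, pt) for pt in jet_pt]

# keyIndices=('*', '*'): all combinations of the two arrays.
combinations = list(product(jet_pt, muon_pt))

# keyIndices=('(*)', '\\1'): elements of the two arrays at the same index.
same_index = list(zip(jet_pt, jet_eta))

# keyIndices=('(*)', '\\1', '(*)', '\\2'): assumed to mean each jet's
# (pt, eta) combined with each muon's (pt, eta).
jets_and_muons = [
    (jpt, jeta, mpt, meta)
    for (jpt, jeta), (mpt, meta) in product(
        zip(jet_pt, jet_eta), zip(muon_pt, muon_eta)
    )
]
```

One event can therefore contribute several rows to a table: three with the inclusive pattern here, six with all combinations.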
- Four binning classes are implemented
- Binning: bin boundaries are manually specified by a user
Binning(boundaries=(0, 200, 400, 800))
- Round: equal bin width
Round(10, 0)
10 is the bin width and 0 is a boundary. The lower edge of a bin is included. The upper edge belongs to the next bin.
- RoundLog: equal bin width in logarithm
RoundLog(0.1, 100)
- Echo: the value itself
Echo(0.1, 100)
- Users can write their own custom binning classes
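The equal-width rule described for Round and RoundLog can be written out as a short sketch. The functions `round_bin` and `round_log_bin` are hypothetical stand-ins, not alphatwirl's classes, and the use of a base-10 logarithm for the logarithmic case is an assumption.

```python
import math


def round_bin(value, width, boundary=0.0):
    """Return the lower edge of the equal-width bin that contains value.

    The lower edge is included in the bin; the upper edge belongs to the
    next bin.
    """
    return math.floor((value - boundary) / width) * width + boundary


def round_log_bin(value, width, boundary):
    """Same rule applied to log10(value); base 10 is an assumption here."""
    if value <= 0:
        return None  # the logarithm is undefined, so no bin is assigned
    log_edge = round_bin(math.log10(value), width, math.log10(boundary))
    return 10 ** log_edge


edge_inside = round_bin(25, 10, 0)      # 25 is in the bin starting at 20
edge_on_lower = round_bin(20, 10, 0)    # a lower edge belongs to its own bin
edge_negative = round_bin(-5, 10, 0)    # bins extend below the boundary
log_edge = round_log_bin(150, 0.1, 100)
```

A custom binning along these lines only needs to map a value to a bin label; floating-point rounding at bin edges is not handled in this sketch.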
- If variables necessary for table configuration or event selection are not in the input file, users can write scribblers to create them on the fly
- The variables stored in the input files and the variables created by scribblers can be used as keys and values in the same way in the table configuration and event selection
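A scribbler could look roughly like the sketch below. The method names `begin` and `event`, and the convention of exposing the new variable as a one-element list on the event object, are assumptions made for illustration; the class `HtScribbler` and the variable names are hypothetical.

```python
from types import SimpleNamespace


class HtScribbler:
    """Hypothetical scribbler that adds ev.ht, the scalar sum of ev.jet_pt."""

    def begin(self, event):
        # Called once before the event loop: attach the output container.
        self.ht = []
        event.ht = self.ht

    def event(self, event):
        # Called for every event: refill the container in place.
        event.ht = self.ht
        self.ht[:] = [sum(event.jet_pt)]


# A stand-in for the event object that the framework would provide.
ev = SimpleNamespace(jet_pt=[120.0, 80.0, 35.0])

scribbler = HtScribbler()
scribbler.begin(ev)
scribbler.event(ev)
```

After the scribbler has run, `ev.ht[0]` can be used in a table configuration or an event selection in the same way as a variable read from the input file.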
- Conditions of event selections can be specified by nested tuples and dictionaries.
dict(All=(
    'ev : ev.ht[0] >= 400',
    'ev : ev.mht[0] >= 200',
    dict(Any=(
        'ev : ev.nJet[0] == 1',
        dict(All=(
            'ev : ev.nJet[0] >= 2',
            'ev : ev.minChi[0] >= 0.7'))
    ))))
- A nested combination of All and Any
- All: all conditions need to be met
- Any: at least one of the conditions needs to be met
- Users can write their own implementations of All and Any to add functionality, for example, to count the number of events that satisfy each condition
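One way such a nested configuration could be turned into a callable is sketched below. The function `build_selection` is hypothetical and is not alphatwirl's implementation; it assumes each string is the body of a lambda expression and that each dictionary has exactly one key, either All or Any.

```python
from types import SimpleNamespace


def build_selection(cfg):
    """Turn a nested All/Any configuration into a function of an event."""
    if isinstance(cfg, str):
        # 'ev : ev.ht[0] >= 400' becomes 'lambda ev : ev.ht[0] >= 400'.
        # eval is acceptable only because the configuration is trusted code.
        return eval("lambda " + cfg)
    if isinstance(cfg, dict):
        (name, sub_cfgs), = cfg.items()  # exactly one key: 'All' or 'Any'
        children = [build_selection(c) for c in sub_cfgs]
        combine = {"All": all, "Any": any}[name]
        return lambda ev: combine(child(ev) for child in children)
    raise TypeError("unsupported condition: {!r}".format(cfg))


# The configuration from the example above.
cfg = dict(All=(
    'ev : ev.ht[0] >= 400',
    'ev : ev.mht[0] >= 200',
    dict(Any=(
        'ev : ev.nJet[0] == 1',
        dict(All=(
            'ev : ev.nJet[0] >= 2',
            'ev : ev.minChi[0] >= 0.7'))
    ))))

select = build_selection(cfg)

# Hypothetical events; each variable is a one-element array as in the config.
passing = SimpleNamespace(ht=[500], mht=[250], nJet=[3], minChi=[0.8])
failing = SimpleNamespace(ht=[500], mht=[250], nJet=[3], minChi=[0.5])
```

Replacing `all` and `any` with classes that keep counters would be one way to record how many events satisfy each condition.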
- Classes in alphatwirl generally operate on abstract classes (in Python, abstract classes don’t actually need to exist; duck typing is used instead).
- Particular implementations of most operations are determined at run time: input formats, output formats, a concurrency method, event selections, object selections, categorization, event summarizing methods, summary collecting methods, delivery methods, and even progress bars.
- Furthermore, each particular implementation doesn’t generally depend on the framework either. In fact, the same event selection code can be used in Heppy.
- Particular implementations are specified by configuration.
- Although alphatwirl uses PyROOT, it does not access branches as attributes of a tree object. It uses SetBranchAddress() instead, which is much faster: it can be more than ten times faster.
- Multiprocessing can be used to concurrently process events
- Progress bars grow in parallel on terminal screen to indicate the progress of each process.
25.10% :::::::::: | 753 / 3000 |: WJetsToLNu_HT1200to2500_madgraph
30.47% :::::::::::: | 914 / 3000 |: WJetsToLNu_HT1200to2500_madgraph
29.30% ::::::::::: | 879 / 3000 |: WJetsToLNu_HT1200to2500_madgraph
85.40% :::::::::::::::::::::::::::::::::: | 854 / 1000 |: WJetsToLNu_HT1200to2500_madgraph
27.57% ::::::::::: | 827 / 3000 |: WJetsToLNu_HT2500toInf_madgraphM
25.47% :::::::::: | 764 / 3000 |: WJetsToLNu_HT2500toInf_madgraphM
79.60% ::::::::::::::::::::::::::::::: | 796 / 1000 |: WJetsToLNu_HT2500toInf_madgraphM
25.50% :::::::::: | 765 / 3000 |: WJetsToLNu_HT2500toInf_madgraphM
- Instead of multiprocessing, a batch system can also be used
- Currently, the interface to HTCondor is implemented.
- Users can write modules to use other batch systems.
- While jobs are running in a batch system, the main process is running in the foreground, monitoring the progress of the jobs, and collecting the results as the jobs finish.
- Failed jobs are automatically resubmitted.
- Jobs can be split in terms of the number of input files and events.
- one input file can be split into multiple jobs
- one job can include multiple input files
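The two splitting rules can be illustrated with a small greedy sketch. The function `split_into_jobs`, its parameters, and the file names are hypothetical; this is not alphatwirl's splitting code, only one way to satisfy both limits at once.

```python
def split_into_jobs(files, max_events_per_job, max_files_per_job):
    """Group (path, n_events) pairs into jobs.

    Each job is a list of (path, start, n_events) chunks. A large file is
    spread over several jobs; small files are combined into one job. The
    file limit counts chunks, so a partial file counts as one file.
    """
    jobs = []
    current, current_events = [], 0
    for path, n_events in files:
        start = 0
        while start < n_events:
            room = max_events_per_job - current_events
            n = min(room, n_events - start)
            current.append((path, start, n))
            current_events += n
            start += n
            # Close the job as soon as either limit is reached.
            if (current_events == max_events_per_job
                    or len(current) == max_files_per_job):
                jobs.append(current)
                current, current_events = [], 0
    if current:
        jobs.append(current)
    return jobs


# Hypothetical input: one large file and two small ones.
files = [("a.root", 2500), ("b.root", 300), ("c.root", 400)]
jobs = split_into_jobs(files, max_events_per_job=1000, max_files_per_job=2)
```

With these inputs the first file is spread over three jobs, and its last 500 events share a job with the second file.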