MLaaS4HEP is a set of Python-based modules that support reading HEP data and streaming them to the ML framework of the user's choice for training. It consists of three independent layers:
- data streaming layer to handle remote data, see reader.py
- data training layer to train ML models on given HEP data, see workflow.py
- data inference layer, see tfaas_client.py
The general architecture of MLaaS4HEP is shown in the diagram below. Even though this architecture was originally developed for dealing with HEP ROOT files, we have extended it to other data formats. So far the following data formats are supported: JSON, CSV, Parquet, and ROOT. The former three can be read from the local file system or HDFS, while the latter (ROOT) format allows reading ROOT files from the local file system or remotely via the XrootD protocol.
Pre-trained models can easily be uploaded to the TFaaS inference server, which serves them to clients.
MLaaS4HEP relies on third-party libraries to support reading different data formats. The main ones are:
- pyarrow for reading data from HDFS file system
- uproot for reading ROOT files
- numpy, pandas for data representation
- modin for fast pandas support
- numba for speeding up individual functions
For ML modeling you may use your favorite framework, e.g. Keras, TensorFlow, scikit-learn, or PyTorch. We suggest using anaconda to install these dependencies:
# to install pyarrow, uproot
conda install -c conda-forge pyarrow uproot numba scikit-learn
# to install pytorch
conda install -c pytorch pytorch
# to install TensorFlow, Keras, NumPy, Pandas
conda install keras numpy pandas
The easiest way to install and run MLaaS4HEP and TFaaS is to use pre-built Docker images:
# run MLaaS4HEP docker container
docker run veknet/mlaas4hep
# run TFaaS docker container
docker run veknet/tfaas
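To query TFaaS from the host machine, the container port should also be published. A sketch, assuming TFaaS listens on port 8083 (the port used in the client examples below):
# publish the (assumed) TFaaS port 8083 to the host
docker run --rm -p 8083:8083 veknet/tfaas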
The MLaaS4HEP Python repository provides two base modules to read and manipulate HEP ROOT files. The reader.py module defines a DataReader class which can read either local or remote ROOT files (via XrootD), while the workflow.py module provides a basic DataGenerator class which can be used with any ML framework to read HEP ROOT data in chunks. Both modules are based on the uproot framework.
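Since both modules build on uproot, the kind of chunked access they rely on can be sketched with the uproot.iterate API (a minimal illustration; the file path and branch names below are illustrative):
import uproot

# iterate over the Events tree in chunks of 1000 events; the path and
# branch names are illustrative
for chunk in uproot.iterate(
        "/opt/cms/data/Tau_Run2017F-31Mar2018-v1_NANOAOD.root:Events",
        ["Tau_pt", "Tau_eta"],
        step_size=1000):
    print(len(chunk))  # each chunk holds at most 1000 events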
Basic usage
# setup the proper environment, e.g.
# export PYTHONPATH=/path/src/python # path to MLaaS4HEP python framework
# export PATH=/path/bin:$PATH # path to MLaaS4HEP binaries
# get help and option description
reader --help
# here is a concrete example of reading local ROOT file:
reader --fin=/opt/cms/data/Tau_Run2017F-31Mar2018-v1_NANOAOD.root --info --verbose=1 --nevts=2000
# here is an example of reading remote ROOT file:
reader --fin=root://cms-xrd-global.cern.ch//store/data/Run2017F/Tau/NANOAOD/31Mar2018-v1/20000/6C6F7EAE-7880-E811-82C1-008CFA165F28.root --verbose=1 --nevts=2000 --info
# both of the aforementioned commands produce output similar to the following
First pass: 2000 events, 35.4363200665 sec, shape (2316,) 648 branches: flat 232 jagged
VMEM used: 960.479232 (MB) SWAP used: 0.0 (MB)
Number of events : 1131872
# flat branches : 648
... # followed by a long list of ROOT branches found along with their dimensionality
TrigObj_pt values in [5.03515625, 1999.75] range, dim=21
More examples of using uproot can be found in its documentation and tutorials.
HEP data are stored in the ROOT data format. The DataReader class provides access to ROOT files and various APIs to access the HEP data.
A simple example can be found in workflow.py, which executes a full HEP ML workflow, i.e. it can read remote files and perform the training of ML models on HEP ROOT files.
If you clone the repo and set up your PYTHONPATH, you should be able to run it as simply as:
# setup the proper environment, e.g.
# export PYTHONPATH=/path/src/python # path to MLaaS4HEP python framework
# export PATH=/path/bin:$PATH # path to MLaaS4HEP binaries
workflow --help
# run the code with list of LFNs from files.txt and using labels file labels.txt
workflow --files=files.txt --labels=labels.txt
# run pytorch example
workflow --files=files.txt --labels=labels.txt --model=ex_pytorch.py
# run keras example
workflow --files=files.txt --labels=labels.txt --model=ex_keras.py
# cat files.txt
#dasgoclient -query="file dataset=/Tau/Run2018C-14Sep2018_ver3-v1/NANOAOD"
/store/data/Run2018C/Tau/NANOAOD/14Sep2018_ver3-v1/60000/069A01AD-A9D0-7C4E-8940-FA5990EDFFCE.root
/store/data/Run2018C/Tau/NANOAOD/14Sep2018_ver3-v1/60000/577AF166-478C-1F40-8E10-044AA4BC0576.root
/store/data/Run2018C/Tau/NANOAOD/14Sep2018_ver3-v1/60000/9A661A77-58AC-0245-A442-8093D48A6551.root
/store/data/Run2018C/Tau/NANOAOD/14Sep2018_ver3-v1/60000/C226A004-077B-7E41-AFB3-6AFB38D1A63B.root
/store/data/Run2018C/Tau/NANOAOD/14Sep2018_ver3-v1/60000/D1E05C97-DB14-3941-86E8-C510D602C0B9.root
/store/data/Run2018C/Tau/NANOAOD/14Sep2018_ver3-v1/60000/6FA4CC7C-8982-DE4C-BEED-C90413312B35.root
/store/data/Run2018C/Tau/NANOAOD/14Sep2018_ver3-v1/60000/282E0083-6B41-1F42-B665-973DF8805DE3.root
# cat labels.txt
1
0
1
0
1
1
1
# run keras example and save the model into an external file
workflow --files=files.txt --labels=labels.txt --model=ex_keras.py --fout=model.pb
The workflow.py module relies on two JSON files: one containing parameters for reading ROOT files, and another with the specification of ROOT branches. The latter is generated by reading the ROOT file itself.
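For illustration, the parameter file might look like the sketch below. The key names are assumptions (nevts and verbose mirror the reader options shown earlier, the rest are illustrative); consult the repository for the canonical parameter file:
{
  "nevts": 10000,
  "chunk_size": 1000,
  "epochs": 5,
  "batch_size": 256,
  "branch": "Events",
  "verbose": 1
}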
You may also use workflow.py with other data formats, e.g. CSV, JSON, or Parquet, to train your model. The procedure is identical to dealing with HEP ROOT files.
# prepare your files.txt and labels.txt files, e.g. here we show example
# of using json gzipped files located on HDFS
cat files.txt
hdfs:///path/file1.json.gz
hdfs:///path/file2.json.gz
# optionally define your preprocessing function, see example in ex_preproc.py
# run workflow with your set of files, labels, model and preprocessing function
# and save it into model.pb file
workflow --files=files.txt --labels=labels.txt --model=ex_keras.py --preproc=ex_preproc.py --fout=model.pb
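A preprocessing module can optionally be passed via --preproc, as in the command above. Below is a minimal sketch of such a module; the function name and signature are assumptions, so consult ex_preproc.py in the repository for the actual contract:
# hypothetical preprocessing module; the function name and signature
# MLaaS4HEP expects are defined by ex_preproc.py in the repository
import pandas as pd

def preprocessing(df):
    "Illustrative preprocessing: drop missing values, normalize numeric columns"
    df = df.dropna()
    cols = df.select_dtypes(include="number").columns
    df[cols] = (df[cols] - df[cols].mean()) / df[cols].std()
    return df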
A more comprehensive example is provided in the repository.
We provide full code, hep_resnet.py, as a basic model based on a ResNet implementation. It can classify images from HEP events, e.g.
hep_resnet.py --fdir=/path/hep_images --flabels=labels.csv --epochs=200 --mdir=models
Here we supply the input directory /path/hep_images, which contains HEP images in a train folder, along with a labels.csv file which provides the labels. The model runs for 200 epochs and saves the Keras/TF model into the models output directory.
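For reference, a plausible input layout is sketched below; only the train folder and labels.csv names come from the description above, the image names are illustrative:
/path/hep_images/
├── train/           # HEP event images, e.g. evt_0001.png, evt_0002.png, ...
└── labels.csv       # labels for the images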
We provide the inference server in the separate TFaaS repository, which contains a full set of instructions on how to build and set it up.
To access your ML model in the TFaaS inference server you only need the HTTP protocol. Please see the TFaaS repository for more information.
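For illustration, a prediction request over plain HTTP could look like the sketch below; the /json endpoint name is an assumption here, so check the TFaaS repository for the actual routes:
# hypothetical prediction call; the /json endpoint name is an assumption
curl -X POST -H "Content-Type: application/json" -d @input.json http://localhost:8083/json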
For convenience we also provide a pure Python client to perform all necessary actions against the TFaaS server. Here is a short description of the available APIs:
# setup url to point to your TFaaS server
url=http://localhost:8083
# create upload json file, which should include
# fully qualified model file name
# fully qualified labels file name
# model name you want to assign to your model file
# fully qualified parameters json file name
# For example, here is a sample of upload json file
{
"model": "/path/model_0228.pb",
"labels": "/path/labels.txt",
"name": "model_name",
"params":"/path/params.json"
}
# upload given model to the server
tfaas_client.py --url=$url --upload=upload.json
# list existing models in TFaaS server
tfaas_client.py --url=$url --models
# delete given model in TFaaS server
tfaas_client.py --url=$url --delete=model_name
# prepare input json file for querying model predictions
# here is an example of such file
{"keys":["attribute1", "attribute2"], values: [1.0, -2.0]}
# get predictions from TFaaS server
tfaas_client.py --url=$url --predict=input.json
# get image predictions from TFaaS server
# here we refer to the ImageModel model uploaded to TFaaS
tfaas_client.py --url=$url --image=/path/file.png --model=ImageModel
Please use the following publication for citation: DOI: 10.1007/s41781-021-00061-3