This is the official repository used in the paper OmniLearn: A Method to Simultaneously Facilitate All Jet Physics Tasks. If you find this repository useful for your own work please cite the paper:
@article{Mikuni:2024qsr,
author = "Mikuni, Vinicius and Nachman, Benjamin",
title = "{OmniLearn: A Method to Simultaneously Facilitate All Jet Physics Tasks}",
eprint = "2404.16091",
archivePrefix = "arXiv",
primaryClass = "hep-ph",
month = "4",
year = "2024"
}
The list of packages needed to train and evaluate the model is in the requirements.txt file. Alternatively, you can use the Docker container found in this link.
We recommend using the Docker container.
Use the --folder flag with the path of the downloaded data to preprocess each file. See below for dataset-specific instructions.
OmniLearn is trained using the JetClass dataset. All files can be downloaded directly from zenodo or fetched using the dataloader scripts provided by the dataset authors in the original repository.
cd preprocessing
python preprocess_jetclass.py --sample [train/test/val] --folder FOLDER
This command creates the new files in the folder given by the --folder flag.
Note that these files can be several GB, so save them in a folder with plenty of free storage space.
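Since the JetClass script has to be called once per sample, a small driver can save some typing. The sketch below only assembles the command shown above for each of the three samples; the function names and the dry_run switch are illustrative and not part of the repository.

```python
import subprocess
import sys


def jetclass_commands(folder, samples=("train", "test", "val")):
    """Build one preprocess_jetclass.py command per sample."""
    return [
        [sys.executable, "preprocess_jetclass.py",
         "--sample", sample, "--folder", folder]
        for sample in samples
    ]


def run_all(folder, dry_run=True):
    """Print each command and execute it unless dry_run is set."""
    for cmd in jetclass_commands(folder):
        print(" ".join(cmd))
        if not dry_run:
            # Must be launched from inside the preprocessing directory.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # FOLDER is a placeholder for the path of the downloaded data.
    run_all("FOLDER", dry_run=True)
```

Run it from inside the preprocessing directory and set dry_run=False once the printed commands look right.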
Files can be downloaded from zenodo and further preprocessed using the script:
cd preprocessing
python preprocess_top.py --sample [train.h5/test.h5/val.h5] --folder FOLDER
Files can either be downloaded from zenodo or loaded directly through the EnergyFlow package with the commands:
cd preprocessing
python preprocess_qg.py --folder FOLDER
The 2011A release of the CMS Open Data can also be loaded using the EnergyFlow package as part of the MOD dataset using the following script:
cd preprocessing
python preprocess_cms.py --folder FOLDER
The ATLAS Top Tagging Dataset can be downloaded from the following link and preprocessed using the following script:
cd preprocessing
python preprocess_atlas.py --folder FOLDER --sample [train.h5/test.h5]
The DIS dataset uses simulations created by the H1 Collaboration, which are currently proprietary. If you are interested in this application, send me a message and we can work through it!
The JetNet datasets can be downloaded from zenodo with 30 particles and 150 particles and similarly preprocessed with script:
cd preprocessing
python preprocess_jetnet.py --folder FOLDER --label [30/150]
OmniFold files can be downloaded directly from zenodo or obtained through the EnergyFlow package. The processing script is:
cd preprocessing
python preprocess_omnifold.py --folder FOLDER --sample [pythia/herwig]
The dataset can be downloaded from zenodo
To perform the clustering and selection of the two leading jets, Jupyter Notebooks are provided to save the files in the format needed for the OmniLearn training. Additional preprocessing to split the files into training/test/validation in the signal region and sidebands can be used:
cd preprocessing
python preprocess_lhco.py --folder FOLDER
You can either train OmniLearn from scratch or take advantage of the pre-trained checkpoint and skip this step. To train from scratch, run:
cd scripts
python train.py --dataset jetclass --lr 3e-5 --layer_scale --local --mode all
Since the training dataset is large, this step may take hours or days depending on the amount of resources available.
We provide the scripts used in the paper to adapt OmniLearn to each individual task, along with evaluation scripts to derive the results shown in the paper. Note that training any algorithm from scratch can also be accomplished by omitting the --fine_tune flag. We provide the trained checkpoints from OmniLearn in the checkpoint folder. Copy the checkpoint folder to the dataset folder before running the fine-tuning step.
These datasets can all be fine-tuned using the same scripts used to train OmniLearn with commands:
cd scripts
python train.py --dataset top --layer_scale --local --mode classifier --warm_epoch 3 --epoch 10 --stop_epoch 3 --batch 256 --wd 0.1 --fine_tune
python train.py --dataset cms --lr 3e-5 --layer_scale --local --mode classifier --warm_epoch 3 --epoch 40 --stop_epoch 3 --batch 256 --wd 0.0001 --fine_tune
python train.py --dataset qg --lr 3e-5 --layer_scale --local --mode classifier --warm_epoch 3 --epoch 20 --stop_epoch 3 --batch 256 --wd 0.1 --fine_tune
python train.py --dataset h1 --lr 3e-5 --layer_scale --local --mode classifier --warm_epoch 3 --epoch 10 --stop_epoch 3 --batch 256 --lr_factor 2 --fine_tune
For the ATLAS Top Tagging dataset we need to modify the loss function to include the event weights, requiring a different script:
cd scripts
python train_atlas.py --dataset [atlas/atlas_small] --layer_scale --local --fine_tune
The evaluation for all basic classifiers is carried out using the evaluation script:
python evaluate_classifiers.py --batch 1000 --local --layer_scale --dataset [top,qg,cms,atlas,h1] [--load] --fine_tune
The evaluation code outputs the metrics described in the paper: AUC, accuracy, and signal efficiency at fixed background efficiency values. The evaluation script also runs the network and saves the predictions to an npy file. If you need to evaluate the same model again, you can load the npy file and skip the full evaluation by providing the --load flag.
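For readers who want to cross-check these numbers on saved predictions, the three metrics can be computed with a few lines of NumPy. This is a minimal sketch for a binary classifier, not the code in evaluate_classifiers.py: the function name, the 0.5 accuracy threshold, and the default background efficiency are assumptions made for illustration.

```python
import numpy as np


def binary_metrics(y_true, y_score, bkg_eff=0.3):
    """Return (accuracy, AUC, signal efficiency at fixed background efficiency).

    y_true holds 0 (background) or 1 (signal); y_score is the signal probability.
    """
    y_true = np.asarray(y_true).astype(bool)
    y_score = np.asarray(y_score, dtype=float)
    sig, bkg = y_score[y_true], y_score[~y_true]

    # Accuracy with a fixed 0.5 decision threshold (an assumption).
    accuracy = np.mean((y_score > 0.5) == y_true)

    # AUC as the probability that a random signal event scores above a
    # random background event, counting ties as one half.
    # Builds an (n_sig, n_bkg) matrix, so use it on modest sample sizes.
    diff = sig[:, None] - bkg[None, :]
    auc = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

    # Threshold above which a fraction bkg_eff of the background passes.
    threshold = np.quantile(bkg, 1.0 - bkg_eff)
    sig_eff = np.mean(sig > threshold)
    return accuracy, auc, sig_eff
```

The pairwise AUC is chosen for clarity rather than speed; for large test sets a rank-based implementation such as sklearn.metrics.roc_auc_score is the more practical choice.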
You can run the full iterative OmniFold training using the command:
python train_omnifold.py --layer_scale --local --num_iter 5 --fine_tune
The number of iterations is set by the --num_iter flag. The evaluation is performed with the script:
python evaluate_omnifold.py --local --layer_scale [--reco] --num_iter 5
The --reco flag is used to load step 1 of iteration 0 of the algorithm, equivalent to the event reweighting results shown in the paper.
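The event reweighting mentioned above relies on the standard likelihood-ratio trick: a classifier trained to separate data from simulation is converted into per-event weights p / (1 - p). The snippet below is a minimal sketch of that conversion, assuming balanced classes; it is not taken from train_omnifold.py, and the function name and the eps clipping are illustrative choices.

```python
import numpy as np


def classifier_to_weights(prob_data, prior_weights=None, eps=1e-7):
    """Convert classifier outputs p(data | x) into weights p / (1 - p).

    prior_weights, if given, are the weights from the previous step and
    are multiplied by the new ratio.
    """
    # Clip to avoid division by zero for over-confident predictions.
    p = np.clip(np.asarray(prob_data, dtype=float), eps, 1.0 - eps)
    ratio = p / (1.0 - p)
    if prior_weights is None:
        return ratio
    return np.asarray(prior_weights, dtype=float) * ratio
```

An output of 0.5 leaves an event unchanged, while outputs above 0.5 increase its weight.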
To adapt OmniLearn to the JetNet data you can use the script:
python train_jetnet.py --local --layer_scale --dataset [jetnet30/jetnet150] --fine_tune
The JetNet dataset evaluation is performed in two steps. In the first step we only generate samples. In the second step we load the generated data and run it through the scripts from the official JetNet repo to determine the performance. Clone the original repo first if you are interested in the metrics. The generation is done with the following command:
python evaluate_jetnet.py --dataset [jetnet30/jetnet150] --layer_scale --local --fine_tune --sample
The LHCO training is performed in two steps. First we adapt the generative model using background events present in the sidebands and use it to generate background events in the signal region. Then we adapt the classifier to separate data in the signal region from the generated background events. The generative model is trained with the following commands:
cd scripts
python train_lhco.py --local --layer_scale --fine_tune
After training, background events are generated with:
python evaluate_lhco.py --layer_scale --local [--SR] --sample --fine_tune
The same script can create plots of the generated distributions when called without the --sample flag. You can also generate predictions for the sidebands by omitting the --SR flag.
From the generated samples, you can train the classifier with the commands:
python classify_lhco.py --local --layer_scale [--ideal] [--SR] --nsig NSIG --fine_tune
The --ideal flag trains a classifier using true background events, used in the weakly-supervised results displayed in the paper. The --nsig flag determines the number of injected signal events to consider during the training. The evaluation of the classifier is done with:
python evaluate_classifiers_lhco.py --batch 100 --local --layer_scale [--ideal] --nsig NSIG --fine_tune
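To make the classifier step more concrete, the sketch below shows one way a weakly-supervised training sample could be assembled: signal-region data (background plus a number of injected signal events) is labeled 1 and the generated background is labeled 0. It assumes that --nsig counts injected signal events and is not the code in classify_lhco.py; all names are hypothetical.

```python
import numpy as np


def build_weakly_supervised_sample(background, signal, generated, nsig, seed=0):
    """Mix nsig signal events into the background and label data vs. generated.

    Returns (x, y) with y = 1 for data and y = 0 for generated background.
    """
    rng = np.random.default_rng(seed)
    # Pick nsig distinct signal events to inject into the data sample.
    picked = rng.choice(len(signal), size=nsig, replace=False)
    data = np.concatenate([background, signal[picked]], axis=0)

    x = np.concatenate([data, generated], axis=0)
    y = np.concatenate([np.ones(len(data)), np.zeros(len(generated))])

    # Shuffle so that batches contain both classes.
    perm = rng.permutation(len(x))
    return x[perm], y[perm]
```

With --ideal, the generated array would be replaced by true background events, which gives an upper bound on what the generated sample can achieve.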
All training scripts, evaluation scripts, and generation scripts (for generative models) can be run in parallel using multiple GPUs. This is accomplished using Horovod. Horovod can either be installed separately or as part of the requirements package, and it already comes with the Docker container linked at the top of the repository. On the Perlmutter supercomputer, one can use the SLURM submission system to run the scripts with multiple GPUs using the command
srun --mpi=pmi2 shifter python [train.py]
for example.
You can easily include new datasets and benefit from the OmniLearn training! For any task, the first step is to prepare the data in the same format as the data used in this work. This means creating an .h5 file that contains 3 groups of data:
data
: Contains the point clouds used as inputs. The expected shape is (N,P,F), where N is the total number of events, P is the maximum number of particles to be used, and F is the number of features per particle. For best results use the same set of features used to train OmniLearn. These features can be found in any of the preprocessing scripts under the preprocessing folder. Note that OmniLearn does not enforce a fixed number of particles and can adapt to any dataset.
jet
: Contains the kinematic information for jets. The expected shape is (N,J), with J the number of features per jet. For best results, we use J = 4 with features (jet pT, jet eta, jet mass, particle multiplicity), but any other combination may still work well.
pid
: Contains the class labels for the classification task. For best results we one-hot encode the labels. This can be done directly during preprocessing or later inside the Dataloader class.
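The layout described above can be written with h5py in a few lines. The following is a minimal sketch under stated assumptions: the function name is hypothetical, the dtypes are a guess, and the labels are one-hot encoded at write time, which is one of the two options mentioned above.

```python
import h5py
import numpy as np


def write_omnilearn_file(path, data, jet, labels, num_classes):
    """Write point clouds, jet features, and one-hot labels to an .h5 file."""
    data = np.asarray(data, dtype=np.float32)    # (N, P, F) point clouds
    jet = np.asarray(jet, dtype=np.float32)      # (N, J) jet kinematics
    labels = np.asarray(labels, dtype=np.int64)  # (N,) integer class labels

    if data.ndim != 3 or jet.ndim != 2 or labels.ndim != 1:
        raise ValueError("expected shapes (N,P,F), (N,J) and (N,)")
    if not (data.shape[0] == jet.shape[0] == labels.shape[0]):
        raise ValueError("data, jet and labels must have the same N")

    # One-hot encode: row i of the identity matrix represents class i.
    pid = np.eye(num_classes, dtype=np.float32)[labels]

    with h5py.File(path, "w") as fh:
        fh.create_dataset("data", data=data)
        fh.create_dataset("jet", data=jet)
        fh.create_dataset("pid", data=pid)
```

Jets with fewer than P particles need to be padded up to P before writing; zero padding is a common choice, but check the preprocessing scripts for the convention used in this repository.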
With the dataset created, the next step is to add a subclassed dataset object to the utils.py file.
For simple applications, look at the TopDataLoader subclassed dataloader for an example. Note that we use a preprocessing that shifts the mean of the data to 0 and sets the standard deviation to 1, based on the JetClass data. If your dataset has similar ranges for each feature, keeping the JetClass preprocessing will give better results. You can also override the preprocessing parameters in case your data has different features or very different ranges. Look at the H1DataLoader for an example. You can also plot the input distributions to verify that each feature is consistent by running:
python plotter.py --dataset yourdata
after adding your new dataloader object to the plotter.py script. Note that the plots show the inputs after normalization.
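If you decide to override the preprocessing parameters, the new means and standard deviations have to be computed from your own data. The sketch below shows one way to do this with NumPy; it assumes that padded particle slots are all zeros and should stay zero after normalization, which may differ from the convention in utils.py. The function names are hypothetical.

```python
import numpy as np


def fit_stats(points):
    """Per-feature mean and standard deviation over real (non-padded) particles."""
    points = np.asarray(points, dtype=np.float32)
    # A slot counts as padding when all of its features are zero (assumption).
    mask = np.any(points != 0, axis=-1)
    real = points[mask]  # shape (n_real, F)
    return real.mean(axis=0), real.std(axis=0)


def standardize(points, mean, std):
    """Shift features to zero mean and unit standard deviation, keeping padding at zero."""
    points = np.asarray(points, dtype=np.float32)
    mask = np.any(points != 0, axis=-1, keepdims=True)
    return np.where(mask, (points - mean) / std, 0.0)
```

Excluding the padding from the statistics matters for datasets with low particle multiplicity, where zeros would otherwise dominate the mean and shrink the standard deviation.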
The minimum addition needed to train your OmniLearn-powered classifier is to add your dataloader to the train.py script and start the training with the command:
python train.py --dataset YOURDATA --layer_scale --local --mode classifier --warm_epoch 3 --epoch 10 --stop_epoch 3 --batch 256 --wd 0.1 --fine_tune
Tune the number of epochs and early stopping based on the size of your dataset (smaller datasets start overtraining quickly, so not many epochs are needed). If you need to include event weights in the training procedure, look at the ATLAS Top Tagging training script train_atlas.py, where we modify the classifier loss function to handle the event weights.
For generative models, the same script can be used in case you only need the point cloud generation. This can be accomplished by calling
python train.py --dataset YOURDATA --layer_scale --local --mode generator --warm_epoch 3 --epoch 100 --stop_epoch 30 --batch 256 --wd 0.1 --fine_tune
Similarly, change the number of epochs and early stopping based on the size of your dataset. In case you need the jet generation to be determined simultaneously during training (as is the case for both the LHCO and JetNet datasets), you can create a new subclassed model that loads OmniLearn as part of the model. In this case, take a look at the train_jetnet.py script, where we load a subclassed model named PET_jetnet stored in the PET_jetnet.py file.
The evaluation of the trained classifier can be carried out directly from the script
python evaluate_classifiers.py --batch 1000 --local --layer_scale --dataset YOURDATA [--load] --fine_tune
after including your dataloader in the script. For data generation, you can create your scripts based on the evaluation scripts for jetnet to accomplish the sampling. Additional plotting functionality can be used to compare generated samples with data. Look at the evaluation scripts for the LHCO application for more details.