A profiling tool for the preprocessing stage of machine learning pipelines
We introduce Lotus, a profiling tool for machine learning (ML) preprocessing pipelines defined using PyTorch's DataLoader.
Lotus is an easy-to-use, low overhead, and visualization-ready profiler specialized for the widely used PyTorch DataLoader preprocessing library.
- [Nov 2024] [Slides] Talk - Leveraging Lotus to evaluate CPU SKUs for AI/ML servers @ HotInfra 2024 (co-located with SOSP'24)
- [Oct 2024] [Slides] Talk - Lotus presented to Intel Processor Architecture Research (PAR) Lab
- [Sep 2024] [PDF] Paper - Lotus accepted to HotInfra 2024 (co-located with SOSP'24)!
- [Sep 2024] [Slides] Talk - Lotus: Evaluate your ML preprocessing pipelines at framework and CPU arch-level @ IISWC 2024!
- [Aug 2024] Lotus won a 🏆 Best Paper Nomination in IISWC 2024!
- [Aug 2024] Lotus artifact won Available, Reviewed, and Reproduced badges according to IEEE Badges!
- [Jul 2024] [PDF] Paper - Lotus accepted to IISWC 2024!
- About Lotus
- Cite Lotus
- Replicate IISWC24 paper experiments
- Get Lotus
- Use Lotus
- Concrete examples
- Limitations of Lotus
- Acknowledgment
- License
- Contact
Lotus employs two novel approaches:
- LotusTrace - An instrumentation methodology for the PyTorch library, which enables fine-grained elapsed time profiling with minimal time and storage overheads.
- LotusMap - A mapping methodology to reconstruct a mapping between Python functions and the underlying C++ functions they call, effectively linking high-level Python functions with low-level hardware counters.
Above combination is powerful as it allows enables users to better reason about their pipeline’s performance, both at the level of preprocessing operations and their performance on hardware resource usage.
@INPROCEEDINGS{lotus-iiswc24,
title={{Lotus: Characterization of Machine Learning Preprocessing Pipelines via Framework and Hardware Profiling}},
author={Bachkaniwala, Rajveer and Lanka, Harshith and Rong, Kexin and Gavrilovska, Ada},
booktitle={2024 IEEE International Symposium on Workload Characterization (IISWC)},
year={2024}
}
@INPROCEEDINGS{lotus-hotinfra24,
title={{Lotus: Characterize Architecture Level CPU-based Preprocessing in Machine Learning Pipelines}},
author={Bachkaniwala, Rajveer and Lanka, Harshith and Rong, Kexin and Gavrilovska, Ada},
booktitle={The 2nd Workshop on Hot Topics in System Infrastructure (HotInfra’24), co-located with SOSP’24, November 3, 2024, Austin, TX, USA},
year={2024}
}
For replicating the key experiments in our paper presented at the 2024 IEEE International Symposium on Workload Characterization (IISWC'24), refer to the SETUP.md
and REPLICATE.md
files. You can also refer to the appendix of our paper.
-
Clone this repository
-
Get submodules:
git submodule update --init --recursive
-
Create a conda environment
conda create -n Lotus python=3.10 conda activate Lotus
-
Install Intel VTune from here and activate it as Intel descsribes.
Note: we used
Intel(R) VTune(TM) Profiler 2023.2.0 (build 626047)
-
Install AMD uProf from here
Note: we used
AMDuProfCLI Version 4.0.341.0
-
Follow the LotusTrace build instructions in
code/LotusTrace/README.md
-
Follow the itt-python build instructions in
code/itt-python/README.md
-
Follow the amduprofile-python build instructions in
code/amdprofilecontrol-python/README.md
-
That's it!
LotusTrace can be enabled by simply passing a custom_log_file
to be used by LotusTrace using keywords log_transform_elapsed_time
and log_file
as shown below:
import torchvision.transforms as transforms
import torchvision.datasets as datasets
custom_log_file = <To use our instrumentation>
train_dataset = datasets.ImageFolder(
traindir,
transforms.Compose([
transforms.RandomResizedCrop(224),
transforms.RandomHorizontalFlip(),
transforms.ToTensor(),
transforms.Normalize(mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225])
], log_transform_elapsed_time=custom_log_file),
log_file=custom_log_file
)
train_loader = torch.utils.data.DataLoader(
train_dataset,
batch_size=args.batch_size,
shuffle=(train_sampler is None) and args.shuffle,
num_workers=args.workers,
pin_memory=True,
sampler=train_sampler,
)
But, what if you have a custom dataset?
We do support LotusTrace for custom datasets as well check below instance:
import torchvision.transforms as transforms
log_file = <To use our instrumentation>
transforms = transforms.Compose([
op1(), op2(), op3(), op4()],
log_transform_elapsed_time=log_file)
class CustomDataset:
def __init__(self, log_file = None, transforms):
...
self.log_file = log_file # If None, then no logging
self.transforms = transforms # A Compose object
...
def __getitem__(self, index):
...
data,label = self.transforms(index) # Calls Compose's __call__()
...
return data, label
dataset = CustomDataset(log_file = log_file, transforms = transforms)
You simply need to add self.log_file
and self.transforms
variable in __init__
function of your custom dataset object as shown above.
Moreover, you need to structure the code such that you use torchvision's Compose
class' object to perform preprocessing operations as shown in self.transforms(index)
line. That's it!
The trace generated by LotusTrace will be stored in the directory of the log_file
as mentioned in How to use LotusTrace. To generate a visualization ready trace from LotusTrace's trace run the below command:
python code/visualize_LotusTrace_trace/visualization_augmenter.py \
--LotusTrace_trace_dir <LotusTrace_trace_dir> \
--coarse \
--output_LotusTrace_viz_file <viz_file_path>
Note: --coarse
option is great option for a quick high level view. Visualization trace will be stored in the same directory as <LotusTrace_trace_dir>
. You can open this trace in your chrome browser with URL set to chrome://tracing/
and simply upload the file using Load
button.
For more options:
python code/visualize_LotusTrace_trace/visualization_augmenter.py \
--help
Below is an example of how to write a python file called RandomResizedCrop.py such that using LotusMap's method can be applied to collect the mapping:
import torchvision.transforms as t
from PIL import Image
import time,itt
# increase PIL image open size
Image.MAX_IMAGE_PIXELS = 1000000000
image_file = "<path to image>"
for i in range(5):
# Open the image
image = Image.open(image_file)
# convert to RGB like torch's pil_loader
image = image.convert('RGB') # Responisble for Loader operation
# Define the desired crop size
crop_size = 224 # Define this as needed
time.sleep(1) # sleep for 1 sec
if i == 4: # Delay collection to prevent cold start
itt.resume()
image = t.RandomResizedCrop(crop_size)(image)
if i == 4:
itt.detach()
Now, run below commands to collect mapping:
vtune -collect hotspots -start-paused \
-result-dir <your_vtune_result_dir> \
-- python RandomResizedCrop.py
vtune -report hotspots \
-result-dir <your_vtune_result_dir> \
-format csv \
-csv-delimiter comma \
-report-output RandomResizedCrop.csv
RandomResizedCrop.csv
contains the C/C++ functions mapped to RandomResizedCrop
operation.
Below is an example of how to write a python file called RandomResizedCrop.py such that using LotusMap's method can be applied to collect the mapping:
import torchvision.transforms as t
from PIL import Image
import time, amdprofilecontrol as amd
# increase PIL image open size
Image.MAX_IMAGE_PIXELS = 1000000000
image_file = "<path to image>"
for i in range(5):
# Open the image
image = Image.open(image_file)
# convert to RGB like torch's pil_loader
image = image.convert('RGB') # Responisble for Loader operation
# Define the desired crop size
crop_size = 224 # Define this as needed
time.sleep(1) # sleep for 1 sec
if i == 4: # Delay collection to prevent cold start
amd.resume(1)
image = t.RandomResizedCrop(crop_size)(image)
if i == 4:
amd.pause(1)
Now, run below commands to collect mapping:
AMDuProfCLI collect --config tbp --start-paused \
--output-dir <your_uprof_result_dir> \
python RandomResizedCrop.py
AMDuProfCLI report \
--input-dir <your_uprof_generated_result_dir> \
--report-output RandomResizedCrop.csv \
--cutoff 100 -f csv #can be set to more than 100
RandomResizedCrop.csv
contains the C/C++ functions mapped to RandomResizedCrop
operation.
Note: For completeness, checkout our paper to navigate how to correctly use LotusMap methodology.
An example of how to enable LotusTrace facilitated logging for an image classification task has been described in code/image_classification/code/pytorch_main.py
, we add the snippet below for the same:
normalize = transforms.Normalize(
mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]
)
train_dataset = datasets.ImageFolder(
traindir,
transforms.Compose(
[
transforms.RandomResizedCrop(224),
transforms.RandomHorizontalFlip(),
transforms.ToTensor(),
normalize,
],
log_transform_elapsed_time=args.log_train_file,
),
log_file=args.log_train_file,
)
Notice that the user simply has to pass the same log file to be used by LotusTrace using keywords log_transform_elapsed_time
and log_file
.
We provide 6 examples of how to use LotusMap in code/image_classification/LotusMap
directory. Please check the code for more details.
Similar to other tools in the past which do not claim to be perfect, we follow the same tradition with Lotus:
- No current support for multi-node setting
- No current support for DDP setting
- LotusMap is approximate, checkout our paper for additional information
We claim issues 1 and 2 as a limitation as we simply have not tested the system in these settings yet.
The lotus image is from "Image by Sketchepedia on Freepik"
Click here.
Name: Rajveer Bachkaniwala
Email: rr [at] gatech [dot] edu