This repository is dedicated to practicing standard MLOps techniques in an easy-to-understand way.
The "Bike Rentals" dataset is used for scripts in this repository. This dataset contains daily counts of rented bicycles from the bicycle rental company Capital-Bikeshare in Washington D.C., along with weather and seasonal information. The goal is to predict how many bikes will be rented depending on the weather and the day. The train split info is
```
Index: 8645 entries, 0 to 8644
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   season      8645 non-null   category
 1   month       8645 non-null   int64
 2   hour        8645 non-null   int64
 3   holiday     8645 non-null   category
 4   weekday     8645 non-null   int64
 5   workingday  8645 non-null   category
 6   weather     8645 non-null   category
 7   temp        8645 non-null   float64
 8   feel_temp   8645 non-null   float64
 9   humidity    8645 non-null   float64
 10  windspeed   8645 non-null   float64
```
First of all, clone the repository:
```
git clone https://github.com/TopCoder2K/mlops-course.git
```
To set up only the necessary dependencies, run the following:
```
poetry install --without dev
```
If you want to use `pre-commit` and `dvc`, install all the dependencies:
```
poetry install
```
To fetch the preprocessed train and test splits of the dataset, run:
```
poetry run dvc pull
```
The command should download two .csv files from my GDrive and place them inside the `mlopscourse/data/` directory.
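After pulling, you can quickly sanity-check a split with a snippet like the one below. The exact `.csv` file name inside `mlopscourse/data/` is an assumption here; use whichever files `dvc pull` actually downloaded (also note that `pandas.read_csv` will not restore the `category` dtypes shown above by itself).

```python
# A quick sanity check of the pulled data (the file name is an assumption).
import pandas as pd

df = pd.read_csv("mlopscourse/data/train_split.csv")
df.info()          # compare with the train split summary shown above
print(df.head())
```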
If you want to train the chosen model and save it afterwards, place its configuration file in the `configs` directory and run:
```
poetry run python3 commands.py train --config_name [config_name_without_extension]
```
The available models are `rf` (Random Forest from the scikit-learn library) and `cb` (Yandex's CatBoost), so an example with CatBoost would be the following:
```
poetry run python3 commands.py train --config_name cb_config
```
**N.B.** Do not forget to set `logging.mlflow.tracking_uri` before the launch. The logs are saved in the default directory, `mlruns`. If you are using the standard MLflow server, run it before training with `poetry run mlflow ui`.
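For a quick check that the tracking URI from the config actually points at a running MLflow server, a minimal sketch is shown below; the URI value is only an example (the default of `mlflow ui`), not necessarily the repository's setting.

```python
# A minimal sanity check that the configured tracking URI is reachable.
# The URI below is an example value; use whatever you put into
# logging.mlflow.tracking_uri in your config.
import mlflow

mlflow.set_tracking_uri("http://127.0.0.1:5000")
print(mlflow.search_experiments())  # raises if the server is unreachable
```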
If you want to infer a previously trained model, make sure you've placed the checkpoint in `checkpoints/` and the configuration file in `configs/`, then run:
```
poetry run python3 commands.py infer --config_name [config_name_without_extension]
```
**Warning!** This feature works stably only with the CatBoost model. Predictions of the ONNX version of the Random Forest differ from those of the original model (see this). Moreover, I was not able to infer the ONNX version with MLflow (although everything worked fine with the `mlflow.sklearn` flavour, as you can see in the `hw2` version of the repository).
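The kind of discrepancy mentioned above is easy to reproduce outside the repository: tree ensembles are converted to float32 during ONNX export, so predictions can drift slightly. A minimal sketch on a toy model (not the repository's pipeline):

```python
# Toy reproduction of the sklearn-vs-ONNX prediction drift for tree ensembles.
import numpy as np
import onnxruntime as ort
from skl2onnx import to_onnx
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=11, random_state=0)
rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# The conversion casts the trees to float32.
onx = to_onnx(rf, X[:1].astype(np.float32))
sess = ort.InferenceSession(onx.SerializeToString(), providers=["CPUExecutionProvider"])
input_name = sess.get_inputs()[0].name

skl_pred = rf.predict(X)
onnx_pred = sess.run(None, {input_name: X.astype(np.float32)})[0].ravel()
print("max abs diff:", np.abs(skl_pred - onnx_pred).max())  # typically non-zero
```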
In order to deploy a trained model, run:
```
poetry run mlflow models serve -p 5001 -m checkpoints/mlflow_[model_type]_ckpt/ --env-manager=local
```
where `[model_type]` is `cb` or `rf`.
After this, it is possible to send requests to the model. I've created a script that generates the correct JSON containing the first example from the training set, but the JSON itself is already in the repository, so you can skip this step. If you want to generate the JSON yourself, run:
```
poetry run python3 create_example_request.py create_example_request
```
Send a request to the deployed model using the generated JSON:
```
curl http://127.0.0.1:5001/invocations -H 'Content-Type: application/json' -d @example_request.json
```
The model should reply with something like this:
```
{"predictions": [31.22848957148021]}
```
Since there are problems with the ONNX version of the Random Forest model, this part is done only for the CatBoost model.
The container with Triton had the following system configuration:
- OS: Ubuntu 22.04.3 LTS
- CPU: 12th Gen Intel(R) Core(TM) i7-12700H
- vCPU: 10
- RAM: 15.29GiB
Run the following to deploy the model:
```
docker build -t triton_with_catboost:latest mlopscourse/triton/
docker run -it --rm --cpus 10 -v ./mlopscourse/triton/model_repository:/models -v ./mlopscourse/triton/assets:/assets -p 8000:8000 -p 8001:8001 -p 8002:8002 triton_with_catboost:latest
```
From this point you are inside the container:
```
cd mlops-course
tritonserver --model-repository /models
```
Test the model:
```
poetry run python3 mlopscourse/triton/client.py
```
The client compares the predicted output against a hardcoded reference value and should print:
```
Predicted: 31.22848957148021
The test is passed!
```
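If you just want to poke the server by hand, a minimal sketch using the official `tritonclient` package (installing `tritonclient[http]` is assumed) is shown below; it only checks liveness and prints the model metadata, so it does not depend on the exact tensor names defined in `config.pbtxt`.

```python
# Minimal health/metadata check of the deployed Triton model.
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
assert client.is_server_live()
assert client.is_model_ready("catboost")
print(client.get_model_metadata("catboost"))  # lists the input/output tensors
```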
To benchmark the server and find a good configuration, I've used the Triton SDK container
```
docker run -it --rm --net=host nvcr.io/nvidia/tritonserver:23.12-py3-sdk
```
(which ships `perf_analyzer`, used for the measurements below) together with `docker stats [container_id]` to monitor CPU usage.
Without any optimizations:
```
Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 681.19 infer/sec, latency 1467 usec
Concurrency: 2, throughput: 851.402 infer/sec, latency 2348 usec
Concurrency: 3, throughput: 854.636 infer/sec, latency 3509 usec
Concurrency: 4, throughput: 821.657 infer/sec, latency 4867 usec
Concurrency: 5, throughput: 848.47 infer/sec, latency 5891 usec
```
and the approximate CPU usage by the container with the model for `concurrency == 5` is 140%.
A notable problem here is the queue. Compare `concurrency == 1` and `concurrency == 5`:
```
1: Avg request latency: 1208 usec (overhead 1 usec + queue 19 usec + compute input 15 usec + compute infer 1153 usec + compute output 19 usec)
5: Avg request latency: 5631 usec (overhead 2 usec + queue 4467 usec + compute input 15 usec + compute infer 1128 usec + compute output 18 usec)
```
So, it seems the provision of input data has to be optimized somehow. Unfortunately, this hasn't been covered in the course, so let's try to optimize only the inference.
I will consider two options here: multiple model instances (the `count` field of `instance_group` in `config.pbtxt`) and dynamic batching. Let's start with the first one.

With `count: 2`, I've gotten:
```
Concurrency: 1, throughput: 574.325 infer/sec, latency 1740 usec
Concurrency: 2, throughput: 1376.2 infer/sec, latency 1452 usec
Concurrency: 3, throughput: 1608.26 infer/sec, latency 1864 usec
Concurrency: 4, throughput: 1634.76 infer/sec, latency 2446 usec
Concurrency: 5, throughput: 1642.57 infer/sec, latency 3043 usec
```
and
```
1: Avg request latency: 1425 usec (overhead 2 usec + queue 54 usec + compute input 18 usec + compute infer 1324 usec + compute output 25 usec)
5: Avg request latency: 2857 usec (overhead 2 usec + queue 1655 usec + compute input 15 usec + compute infer 1163 usec + compute output 20 usec)
```
and the approximate CPU usage by the container with the model for `concurrency == 5` is 285%.

With `count: 3`, I've gotten:
```
Concurrency: 1, throughput: 491.298 infer/sec, latency 2034 usec
Concurrency: 2, throughput: 1247.34 infer/sec, latency 1602 usec
Concurrency: 3, throughput: 2019.29 infer/sec, latency 1484 usec
Concurrency: 4, throughput: 2356.69 infer/sec, latency 1696 usec
Concurrency: 5, throughput: 2385.19 infer/sec, latency 2095 usec
```
and
```
1: Avg request latency: 1640 usec (overhead 3 usec + queue 61 usec + compute input 20 usec + compute infer 1528 usec + compute output 27 usec)
5: Avg request latency: 1922 usec (overhead 2 usec + queue 682 usec + compute input 18 usec + compute infer 1199 usec + compute output 20 usec)
```
and the approximate CPU usage by the container with the model for `concurrency == 5` is 450%.
Conclusions:
- The queue consumes a lot of time compared to the model's compute time.
- Increasing the model instance count from $1$ to $N$ for `concurrency == 5` results in $\approx 150\% \cdot N$ CPU usage, $\approx 800 \cdot N$ infer/sec throughput, and the queue latency divided by some $k > 1$.
With dynamic batching `{ max_queue_delay_microseconds: 1000 }` I've gotten:
```
Concurrency: 1, throughput: 304.009 infer/sec, latency 3288 usec
Concurrency: 2, throughput: 601.324 infer/sec, latency 3324 usec
Concurrency: 3, throughput: 877.651 infer/sec, latency 3416 usec
Concurrency: 4, throughput: 1140.14 infer/sec, latency 3507 usec
Concurrency: 5, throughput: 1346.32 infer/sec, latency 3712 usec
```
and
```
1: Avg request latency: 2790 usec (overhead 3 usec + queue 1211 usec + compute input 22 usec + compute infer 1528 usec + compute output 26 usec)
5: Avg request latency: 3242 usec (overhead 5 usec + queue 1145 usec + compute input 64 usec + compute infer 1979 usec + compute output 48 usec)
```
and the approximate CPU usage by the container with the model for `concurrency == 5` is 80%.

With dynamic batching `{ max_queue_delay_microseconds: 500 }` I've gotten:
```
Concurrency: 1, throughput: 409.087 infer/sec, latency 2443 usec
Concurrency: 2, throughput: 757.904 infer/sec, latency 2637 usec
Concurrency: 3, throughput: 1136.69 infer/sec, latency 2638 usec
Concurrency: 4, throughput: 1414.94 infer/sec, latency 2826 usec
Concurrency: 5, throughput: 1659.14 infer/sec, latency 3013 usec
```
and
```
1: Avg request latency: 2042 usec (overhead 2 usec + queue 671 usec + compute input 21 usec + compute infer 1323 usec + compute output 24 usec)
5: Avg request latency: 2718 usec (overhead 3 usec + queue 1196 usec + compute input 38 usec + compute infer 1444 usec + compute output 36 usec)
```
and the approximate CPU usage by the container with the model for `concurrency == 5` is 143%.

With dynamic batching `{ max_queue_delay_microseconds: 100 }` I've gotten:
```
Concurrency: 1, throughput: 563.679 infer/sec, latency 1773 usec
Concurrency: 2, throughput: 852.412 infer/sec, latency 2345 usec
Concurrency: 3, throughput: 1072.07 infer/sec, latency 2797 usec
Concurrency: 4, throughput: 1429.04 infer/sec, latency 2798 usec
Concurrency: 5, throughput: 1624.53 infer/sec, latency 3076 usec
```
and
```
1: Avg request latency: 1462 usec (overhead 2 usec + queue 195 usec + compute input 19 usec + compute infer 1222 usec + compute output 22 usec)
5: Avg request latency: 2790 usec (overhead 4 usec + queue 1418 usec + compute input 31 usec + compute infer 1304 usec + compute output 33 usec)
```
and the approximate CPU usage by the container with the model for `concurrency == 5` is 147%.

With dynamic batching `{ max_queue_delay_microseconds: 2000 }` I've gotten:
```
Concurrency: 1, throughput: 206.951 infer/sec, latency 4830 usec
Concurrency: 2, throughput: 419.314 infer/sec, latency 4768 usec
Concurrency: 3, throughput: 596.492 infer/sec, latency 5027 usec
Concurrency: 4, throughput: 779.399 infer/sec, latency 5130 usec
Concurrency: 5, throughput: 975.283 infer/sec, latency 5125 usec
```
and
```
1: Avg request latency: 4258 usec (overhead 3 usec + queue 2238 usec + compute input 28 usec + compute infer 1956 usec + compute output 32 usec)
5: Avg request latency: 4546 usec (overhead 5 usec + queue 2218 usec + compute input 67 usec + compute infer 2207 usec + compute output 48 usec)
```
and the approximate CPU usage by the container with the model for `concurrency == 5` is 62%.
Conclusions:
- Dynamic batching can significantly lower CPU usage (2 times less for `concurrency == 5` with `max_queue_delay_microseconds: 1000`!).
- It seems optimal to set `max_queue_delay_microseconds: 500`, since decreasing the delay further shows no improvement.
Since I've allocated 10 vCPUs, let's set `count: 6` and `{ max_queue_delay_microseconds: 500 }`:
```
Concurrency: 5, throughput: 1100.77 infer/sec, latency 4540 usec
```
and
```
5: Avg request latency: 3972 usec (overhead 7 usec + queue 676 usec + compute input 86 usec + compute infer 3137 usec + compute output 65 usec)
```
and the approximate CPU usage by the container with the model is 103%. Why did this happen? It seems that the concurrency is too small, so there is no benefit from setting `count > 1`... Let's try `concurrency == 30`:
```
Concurrency: 30, throughput: 5393.48 infer/sec, latency 5560 usec
```
and
```
Avg request latency: 5192 usec (overhead 26 usec + queue 460 usec + compute input 320 usec + compute infer 4210 usec + compute output 175 usec)
```
and the approximate CPU usage by the container with the model is 200%. We still have a lot of room in terms of vCPU usage. I've played around with the concurrency and found `concurrency == 150` to be optimal:
```
Concurrency: 150, throughput: 18625.9 infer/sec, latency 8044 usec
```
and
```
Avg request latency: 7561 usec (overhead 38 usec + queue 733 usec + compute input 326 usec + compute infer 6180 usec + compute output 284 usec)
```
and the approximate CPU usage by the container with the model is 800%. With the initial configuration I've gotten:
```
Concurrency: 150, throughput: 844.372 infer/sec, latency 176648 usec
```
and
```
Avg request latency: 176013 usec (overhead 3 usec + queue 174844 usec + compute input 18 usec + compute infer 1124 usec + compute output 23 usec)
```
and the approximate CPU usage by the container with the model is 143%.
Conclusion: with the best configuration at `concurrency == 150` we get 22x the throughput and 0.04x the latency compared to the initial configuration.
- The system configuration is given at the top of this section.
- The task description is at the top of the `README.md`.
- My `model_repository/` is the following:
  ```
  mlopscourse/triton/model_repository/
  └── catboost
      ├── 1
      │   ├── model.py
      │   └── __pycache__
      │       ├── model.cpython-310.pyc
      │       └── model.cpython-38.pyc
      └── config.pbtxt
  ```
- The experiments and the motivation for the best configuration are given above.
- The throughput and latency comparison is given above.

**N.B.** Since I used the Python backend, there is no special script for model conversion.
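For reference, a rough sketch of what a Triton Python-backend `model.py` for a CatBoost regressor can look like is given below. This is not the repository's `model.py`: the tensor names (`INPUT`, `OUTPUT`), the checkpoint path under `/assets`, and the single dense float32 input are assumptions made for illustration only.

```python
# Sketch of a Triton Python-backend model for CatBoost (names/paths assumed).
import numpy as np
import triton_python_backend_utils as pb_utils
from catboost import CatBoostRegressor


class TritonPythonModel:
    def initialize(self, args):
        # Load the trained CatBoost checkpoint mounted into the container
        # (the file name under /assets is an assumption).
        self.model = CatBoostRegressor()
        self.model.load_model("/assets/catboost_ckpt.cbm")

    def execute(self, requests):
        responses = []
        for request in requests:
            # "INPUT" / "OUTPUT" must match the names declared in config.pbtxt.
            features = pb_utils.get_input_tensor_by_name(request, "INPUT").as_numpy()
            preds = self.model.predict(features).astype(np.float32)
            out = pb_utils.Tensor("OUTPUT", preds)
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses
```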