Image source: GitHub - allegroai/clearml-serving
This setup orchestrates a scalable ML serving infrastructure using ClearML, integrating Kafka for message streaming, ZooKeeper for Kafka coordination, Prometheus for monitoring, Alertmanager for alerts, Grafana for visualization, and Triton for optimized model serving.
- Set up your ClearML Server or use the free tier hosting
- Set up local access (if you haven't already); see instructions here
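If local access isn't configured yet, the usual route is the ClearML CLI wizard (a minimal sketch, assuming the clearml package is already installed; it prompts for the credentials you create in the ClearML web app under your workspace settings):
clearml-init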
pip install clearml-serving  # you probably did this already
clearml-serving create --name "aiorhu demo"
- The new serving service UID should be printed
New Serving Service created: id=ooooahah12345
Let's look at this in ClearML.
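You can also confirm the new service from the CLI (a quick sanity check; this lists the registered serving services and their IDs):
clearml-serving list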
- Edit clearml-serving/docker/docker-compose-triton.yml
- Find the variable:
CLEARML_EXTRA_PYTHON_PACKAGES
and add the packages your model needs. We'll add ours here:
CLEARML_EXTRA_PYTHON_PACKAGES: ${CLEARML_EXTRA_PYTHON_PACKAGES:-textstat empath torch transformers nltk openai datasets diffusers benepar spacy sentence_transformers optuna interpret markdown bs4}
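For orientation, the variable sits in the environment block of the serving container in that compose file; the layout is roughly this (a sketch based on the clearml-serving repo, trimmed to the relevant keys, so check your copy of the file):
services:
  clearml-serving-inference:
    environment:
      CLEARML_EXTRA_PYTHON_PACKAGES: ${CLEARML_EXTRA_PYTHON_PACKAGES:-textstat empath torch transformers nltk openai datasets diffusers benepar spacy sentence_transformers optuna interpret markdown bs4}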
- Edit the environment variables file (docker/example.env) with your clearml-server credentials and Serving Service UID. For example, you should have something like
CLEARML_WEB_HOST="https://app.clear.ml"
CLEARML_API_HOST="https://api.clear.ml"
CLEARML_FILES_HOST="https://files.clear.ml"
CLEARML_API_ACCESS_KEY="<access_key_here>"
CLEARML_API_SECRET_KEY="<secret_key_here>"
CLEARML_SERVING_TASK_ID="<serving_service_id_here>"
- Spin up the containers using docker-compose (or, if running on Kubernetes, use the clearml-serving Helm Chart)
- We are deploying a PyTorch model, so we want NVIDIA Triton Inference Server (https://developer.nvidia.com/triton-inference-server). It is built for GPUs, but it will also run on a CPU dev machine (my laptop in this case). In production, Kubernetes and the Helm charts are the way to go: https://github.com/allegroai/clearml-helm-charts
cd docker && docker-compose --env-file example.env -f docker-compose-triton.yml up
Notice: Any model registered with the "Triton" engine will run its pre/post-processing code on the inference service container, while the model inference itself is executed on the Triton Engine container.
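For context, registering a model against the Triton engine looks roughly like this. This is a sketch following the clearml-serving PyTorch example; the endpoint name, preprocess script, model/project names, and tensor names/shapes/types are placeholders you would replace for your own model:
clearml-serving --id ooooahah12345 model add --engine triton --endpoint "my_pytorch_model" \
  --preprocess "preprocess.py" --name "my trained model" --project "aiorhu demo" \
  --input-size 1 28 28 --input-name "INPUT__0" --input-type float32 \
  --output-size -1 10 --output-name "OUTPUT__0" --output-type float32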
Let's review what we did.