Azimuth LLM

This repository contains a Helm chart for deploying Large Language Models (LLMs) on Kubernetes. It is developed primarily for use as a pre-packaged application within Azimuth but is structured so that it can, in principle, be deployed on any Kubernetes cluster with at least one GPU node.

Azimuth App

This app is provided as part of a standard Azimuth deployment, so no specific steps are required to use it other than access to an up-to-date Azimuth deployment.

Manual Deployment

Alternatively, to set up the Helm repository and manually install this chart on an existing Kubernetes cluster, run

helm repo add <chosen-repo-name> https://stackhpc.github.io/azimuth-llm/
helm repo update
helm install <installation-name> <chosen-repo-name>/azimuth-llm --version <version>

where version is the full name of the published version for the specified commit (e.g. 0.1.0-dev.0.main.125). To see the latest published version, see this page.
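The published versions can also be listed from the command line once the repository has been added. The following is a minimal sketch, assuming the same `<chosen-repo-name>` alias used above. The `--devel` flag is included because versions such as 0.1.0-dev.0.main.125 are pre-releases, which `helm search repo` hides by default.

```shell
# Refresh the local index, then list all published chart versions,
# including pre-release (dev) builds.
helm repo update
helm search repo <chosen-repo-name>/azimuth-llm --devel --versions
```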

Customisation

The chart/values.yaml file documents the various customisation options which are available. In order to access the LLM from outside the Kubernetes cluster, the API and/or UI service types may be changed to

api:
  service:
    type: LoadBalancer
    zenith:
      enabled: false
ui:
  service:
    type: LoadBalancer
    zenith:
      enabled: false

[!WARNING] Exposing the services in this way provides no authentication mechanism and anyone with access to the load balancer IPs will be able to query the language model. It is up to you to secure the running service as appropriate for your use case. In contrast, when deployed via Azimuth, authentication is provided via the standard Azimuth Identity Provider mechanisms and the authenticated services are exposed via Zenith.
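One way to apply overrides such as the LoadBalancer settings above is to save them to a values file and pass that file to Helm. This is a sketch only: the file name `custom-values.yaml` is an arbitrary choice for illustration, and the other placeholders match those used in the installation commands.

```shell
# Install the chart (or upgrade an existing release) with the overrides
# stored in custom-values.yaml (hypothetical file name).
helm upgrade --install <installation-name> <chosen-repo-name>/azimuth-llm \
  --version <version> \
  --values custom-values.yaml

# Once the load balancers are provisioned, their addresses appear
# in the EXTERNAL-IP column.
kubectl get services --namespace <helm-release-namespace>
```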

Both the web-based interface and the backend OpenAI-compatible vLLM API server can also optionally be exposed using Kubernetes Ingress. See the ingress section in chart/values.yaml for the available config options.

Tested Models

The application uses vLLM for model serving, so any of the vLLM supported models should work. Since vLLM pulls the model files directly from HuggingFace, it is likely that some other models will also be compatible with vLLM, but mileage may vary between models and model architectures. If a model is incompatible with vLLM then the API pod will likely enter a CrashLoopBackOff state and any relevant error information will be found in the API pod logs. These logs can be viewed with

kubectl -n <helm-release-namespace> logs deploy/<helm-release-name>-api

If you suspect that a given error is caused by a problem with this Helm chart rather than by upstream vLLM support, then please open an issue.
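A quick way to check whether a model has loaded is to look at the pod status and then query the API. The sketch below assumes the API is reachable at `<api-address>`, a hypothetical placeholder for however the service has been exposed (for example a load balancer IP). The `/v1/models` path is part of the OpenAI-compatible API served by vLLM.

```shell
# A STATUS of CrashLoopBackOff on the API pod suggests the model failed to load.
kubectl -n <helm-release-namespace> get pods

# List the models currently served by the API.
curl http://<api-address>/v1/models
```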

Monitoring

The LLM chart integrates with kube-prometheus-stack by creating a ServiceMonitor resource and installing two custom Grafana dashboards as Kubernetes ConfigMaps. If the target cluster has an existing kube-prometheus-stack deployment which is appropriately configured to watch all namespaces for new Grafana dashboards, the LLM dashboards will automatically appear in Grafana's dashboard list.

To disable the monitoring integrations, set the api.monitoring.enabled value to false.
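For an existing release, the value can be changed on the command line. This is a sketch using the same placeholders as the installation commands; `--reuse-values` is used here on the assumption that the release's other settings should be kept as they are.

```shell
# Disable the ServiceMonitor and Grafana dashboards for an existing release,
# keeping all other previously supplied values.
helm upgrade <installation-name> <chosen-repo-name>/azimuth-llm \
  --version <version> \
  --reuse-values \
  --set api.monitoring.enabled=false
```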

Components

The Helm chart consists of the following components:

A backend web API which runs vLLM's OpenAI-compatible web server.
A choice of frontend web-apps built using Gradio (see web-apps). Each web interface is available as a pre-built container image hosted on ghcr.io and can be configured for each Helm release by changing the ui.image section of the chart values.
