This is an example of deploying and scaling Riva Speech Skills on Azure Cloud's Azure Kubernetes Service (AKS) with Traefik-based load balancing. It includes the following steps:
- Creating the AKS cluster
- Deploying the Riva API service
- Deploying the Traefik edge router
- Creating the IngressRoute to handle incoming requests
- Deploying a sample client
- Scaling the cluster
Before continuing, ensure you have:
- An Azure account with the appropriate user/role privileges to manage AKS
- The Azure command-line tool (az), configured for your account
- Access to NGC and the associated command-line interface
- The cluster management tools az, helm, and kubectl
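Before creating any resources, it can help to confirm that the three command-line tools are actually on your PATH. The following is a minimal sketch; `missing_tools` is a hypothetical helper name, not part of az, helm, or kubectl.

```shell
# Hypothetical helper: print the name of every given tool that is not on PATH.
# Prints nothing when all tools are found.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Example usage (commented out so that sourcing this file has no side effects):
# missing_tools az helm kubectl
```

An empty result means all three tools were found; any name printed needs to be installed before continuing.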
The cluster contains three separate node pools:
- rivaserver: A GPU-equipped node where the main Riva service runs. It uses Standard_NC8as_T4_v3 instances, each with a Tesla T4 GPU, which provide good value and sufficient capacity for many applications.
- loadbalancer: A general-purpose compute node for the Traefik load balancer, using a Standard_D4s_v3 instance.
- rivaclient: A general-purpose node with a Standard_D8s_v3 instance for client applications accessing the Riva service.
- Create an Azure resource group. All the resources created for the AKS cluster will be part of this resource group:

    AKS_RESOURCE_GROUP=riva-resource
    # Set a unique name for your cluster.
    AKS_CLUSTER_NAME=riva-aks
    az group create --name ${AKS_RESOURCE_GROUP} --location eastus
- Create the AKS cluster. This takes some time because Azure provisions the nodes and sets up the Kubernetes control plane in the background.

    az aks create --resource-group ${AKS_RESOURCE_GROUP} --name ${AKS_CLUSTER_NAME} --node-count 1 --generate-ssh-keys
- After the cluster creation is complete, pull the cluster config to the local machine so that kubectl can connect to the cluster:

    az aks get-credentials --resource-group ${AKS_RESOURCE_GROUP} --name ${AKS_CLUSTER_NAME} --admin
- Verify that you can connect to the cluster using kubectl. You should see nodes and pods running.

    kubectl get nodes
    kubectl get po -A
- Create the three node pools for GPU workers, load balancers, and clients:
  - GPU Linux workers:

      az aks nodepool add --name rivaserver --resource-group ${AKS_RESOURCE_GROUP} --cluster-name ${AKS_CLUSTER_NAME} --node-vm-size Standard_NC8as_T4_v3 --node-count 1 --labels role=workers

  - CPU Linux load balancers:

      az aks nodepool add --name loadbalancer --resource-group ${AKS_RESOURCE_GROUP} --cluster-name ${AKS_CLUSTER_NAME} --node-vm-size Standard_D4s_v3 --node-count 1 --labels role=loadbalancers

  - CPU Linux clients:

      az aks nodepool add --name rivaclient --resource-group ${AKS_RESOURCE_GROUP} --cluster-name ${AKS_CLUSTER_NAME} --node-vm-size Standard_D8s_v3 --node-count 1 --labels role=clients
- Verify that the newly added nodes now appear in the Kubernetes cluster.

    kubectl get nodes --show-labels
    kubectl get nodes --selector role=workers
    kubectl get nodes --selector role=clients
    kubectl get nodes --selector role=loadbalancers
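New nodes can appear in the list before they are ready to schedule pods. As a hedged sketch, `kubectl wait` can block until every node with a given role label reports the Ready condition. The function name `wait_for_role` and the 600-second default timeout are assumptions chosen for illustration.

```shell
# Hypothetical helper: block until every node carrying the given role label is Ready.
# $1: value of the "role" label (workers, loadbalancers, or clients)
# $2: optional timeout (assumed default: 600s)
wait_for_role() {
    role="$1"
    timeout="${2:-600s}"
    kubectl wait --for=condition=Ready nodes --selector "role=${role}" --timeout="${timeout}"
}

# Example usage (commented out so that sourcing this file has no side effects):
# wait_for_role workers
```

Note that `kubectl wait` reports an error if no node matches the selector, which usually means the node pool has not finished provisioning.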
The Riva Speech Skills Helm chart is designed to automate deployment to a Kubernetes cluster. After downloading the Helm chart, minor adjustments will adapt the chart to the way Riva will be used in the remainder of this tutorial.
- Download and untar the Riva API Helm chart. Replace VERSION_TAG with the specific version needed.

    export NGC_CLI_API_KEY=<your NGC API key>
    export VERSION_TAG="{VersionNum}"
    helm fetch https://helm.ngc.nvidia.com/nvidia/riva/charts/riva-api-${VERSION_TAG}.tgz --username='$oauthtoken' --password=$NGC_CLI_API_KEY
    tar -xvzf riva-api-${VERSION_TAG}.tgz
- In the riva-api folder, modify the following files:
  - values.yaml
    - In modelRepoGenerator.ngcModelConfigs, comment or uncomment specific models or languages, as needed.
    - Change service.type from LoadBalancer to ClusterIP. This directly exposes the service only to other services within the cluster, such as the proxy service to be installed below.
    - Set persistentVolumeClaim.usePVC to true, persistentVolumeClaim.storageClassName to azurefile, and persistentVolumeClaim.storageAccessMode to ReadWriteOnce. This stores the Riva models in the created Persistent Volume.
  - templates/deployment.yaml
    - Add a node selector constraint to ensure that Riva is only deployed on the correct GPU resources. In spec.template.spec, add:

        nodeSelector:
          kubernetes.azure.com/agentpool: rivaserver
- Install the NVIDIA GPU device plugin, which Azure does not install by default:

    helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
    helm repo update
    helm install \
        --generate-name \
        --set failOnInitError=false \
        nvdp/nvidia-device-plugin \
        --namespace nvidia-device-plugin \
        --create-namespace
- Verify the GPU plugin installation with either of the following commands:

    kubectl get pod -A | grep nvidia

  or

    kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
- Ensure you are in a working directory with riva-api as a subdirectory, then install the Riva Helm chart. You can explicitly override variables from the values.yaml file, such as the modelRepoGenerator.modelDeployKey settings.

    helm install riva-api riva-api/ \
        --set ngcCredentials.password=`echo -n $NGC_CLI_API_KEY | base64 -w0` \
        --set modelRepoGenerator.modelDeployKey=`echo -n tlt_encode | base64 -w0`
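The two backtick expressions in the install command base64-encode the NGC API key and the model deploy key. As a rough sketch, the same install can be wrapped in a small function. The names `b64` and `install_riva` are hypothetical, and the extra `--set service.type=ClusterIP` flag is included only to illustrate overriding a values.yaml key from the command line instead of editing the file. The node selector lives in templates/deployment.yaml, so it still has to be edited there.

```shell
# Hypothetical helper: encode a string as single-line base64.
# Piping through tr avoids the GNU-only "-w0" flag, so this also works with BSD base64.
b64() {
    printf '%s' "$1" | base64 | tr -d '\n'
}

# Hypothetical wrapper around the install command from this step.
# Assumes NGC_CLI_API_KEY is set and riva-api/ is a subdirectory of the working directory.
install_riva() {
    helm install riva-api riva-api/ \
        --set ngcCredentials.password="$(b64 "$NGC_CLI_API_KEY")" \
        --set modelRepoGenerator.modelDeployKey="$(b64 tlt_encode)" \
        --set service.type=ClusterIP
}

# Example usage (commented out so that sourcing this file has no side effects):
# install_riva
```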
- The Helm chart runs two containers in order: a riva-model-init container that downloads and deploys the models, followed by a riva-speech-api container to start the speech service API. Depending on the number of models, the initial model deployment could take an hour or more. To monitor the deployment, use kubectl to describe the riva-api pod and watch the container logs.

    export pod=`kubectl get pods | cut -d " " -f 1 | grep riva-api`
    kubectl describe pod $pod
    kubectl logs -f $pod -c riva-model-init
    kubectl logs -f $pod -c riva-speech-api
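The pod lookup in this step returns every matching name, which becomes ambiguous once more than one riva-api pod exists. The sketch below shows one possible alternative: a helper that picks the first matching pod, and a helper that blocks until the deployment has rolled out. The function names and the 90-minute default timeout are assumptions. The deployment name riva-api is taken from the scaling step later in this tutorial.

```shell
# Hypothetical helper: print the name of the first pod whose name starts with the given prefix.
first_pod() {
    kubectl get pods -o name | sed -n "s|^pod/\($1.*\)|\1|p" | head -n 1
}

# Hypothetical helper: block until the riva-api deployment has finished rolling out.
# $1: optional timeout (assumed default: 90m, since model deployment can take an hour or more)
wait_for_riva() {
    kubectl rollout status deployment/riva-api --timeout="${1:-90m}"
}

# Example usage (commented out so that sourcing this file has no side effects):
# pod="$(first_pod riva-api)"
# kubectl logs -f "$pod" -c riva-model-init
# wait_for_riva
```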
Now that the Riva service is running, the cluster needs a mechanism to route requests into Riva.
In the default values.yaml of the riva-api Helm chart, service.type was set to LoadBalancer, which would have automatically created an Azure load balancer to direct traffic into the Riva service. Instead, the open-source Traefik edge router will serve this purpose.
- Download and untar the Traefik Helm chart.

    helm repo add traefik https://helm.traefik.io/traefik
    helm repo update
    helm fetch traefik/traefik
    tar -zxvf traefik-*.tgz
- Modify the traefik/values.yaml file.
  - Change service.type from LoadBalancer to ClusterIP. This exposes the service on a cluster-internal IP.
  - Set nodeSelector to { kubernetes.azure.com/agentpool: loadbalancer }. Similar to what you did for the Riva API service, this tells the Traefik service to run on the loadbalancer node pool.
- Deploy the modified traefik Helm chart.

    helm install traefik traefik/
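As an alternative sketch, the same two values can be passed on the command line rather than edited into traefik/values.yaml. The function name `install_traefik` is hypothetical. The dots inside the node label key are escaped with backslashes so that Helm treats the whole label as a single key rather than as nested keys.

```shell
# Hypothetical wrapper: install the unpacked Traefik chart with the two overrides
# from this section passed as --set flags instead of edited into values.yaml.
install_traefik() {
    helm install traefik traefik/ \
        --set service.type=ClusterIP \
        --set 'nodeSelector.kubernetes\.azure\.com/agentpool=loadbalancer'
}

# Example usage (commented out so that sourcing this file has no side effects):
# install_traefik
```

Editing the file keeps the configuration under version control, while the flags are convenient for quick experiments. Either approach should produce the same release.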
An IngressRoute enables the Traefik load balancer to recognize incoming requests and distribute them across multiple riva-api services.
When you deployed the traefik Helm chart above, Kubernetes automatically created a local DNS entry for that service: traefik.default.svc.cluster.local. The IngressRoute definition below matches this DNS entry and directs requests to the riva-api service. You can modify the entries to support a different DNS arrangement, depending on your requirements.
- Create the following riva-ingress.yaml file:

    apiVersion: traefik.containo.us/v1alpha1
    kind: IngressRoute
    metadata:
      name: riva-ingressroute
    spec:
      entryPoints:
        - web
      routes:
        - match: "Host(`traefik.default.svc.cluster.local`)"
          kind: Rule
          services:
            - name: riva-api
              port: 50051
              scheme: h2c
- Deploy the IngressRoute.

    kubectl apply -f riva-ingress.yaml
The Riva service is now able to serve gRPC requests from within the cluster at the address traefik.default.svc.cluster.local. If you are planning to deploy your own client application in the cluster to communicate with Riva, you can send requests to that address. In the next section, you will deploy a Riva sample client and use it to test the deployment.
Riva provides a container with a set of pre-built sample clients to test the Riva services. The clients are also available on GitHub for those interested in adapting them.
- Create the client-deployment.yaml file that defines the deployment and contains the following:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: riva-client
      labels:
        app: "rivaasrclient"
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: "rivaasrclient"
      template:
        metadata:
          labels:
            app: "rivaasrclient"
        spec:
          nodeSelector:
            kubernetes.azure.com/agentpool: rivaclient
          imagePullSecrets:
            - name: imagepullsecret
          containers:
            - name: riva-client
              image: "nvcr.io/{NgcOrg}/{NgcTeam}/riva-speech:{VersionNum}"
              command: ["/bin/bash"]
              args: ["-c", "while true; do sleep 5; done"]
- Deploy the client service.

    kubectl apply -f client-deployment.yaml
- Connect to the client pod.

    export cpod=`kubectl get pods | cut -d " " -f 1 | grep riva-client`
    kubectl exec --stdin --tty $cpod -- /bin/bash
- From inside the shell of the client pod, run the sample ASR client on an example .wav file. Specify the traefik.default.svc.cluster.local endpoint, with port 80, as the service address.

    riva_streaming_asr_client \
        --audio_file=wav/en-US_sample.wav \
        --automatic_punctuation=true \
        --riva_uri=traefik.default.svc.cluster.local:80
As deployed above, the AKS cluster provisions only a single GPU node. While a single GPU can handle a large volume of requests, the cluster can easily be scaled out with more nodes.
- Scale the GPU node pool to the desired number of compute nodes (2 in this case).

    az aks nodepool scale --name rivaserver --resource-group ${AKS_RESOURCE_GROUP} --cluster-name ${AKS_CLUSTER_NAME} --node-count 2
- Scale the riva-api deployment to use the additional nodes.

    kubectl scale deployments/riva-api --replicas=2
As with the original riva-api deployment, each replica pod downloads and initializes the necessary models prior to starting the Riva service.
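The two scaling commands use the same count, so they can be combined. The following is a minimal sketch that assumes one riva-api replica per GPU node, as in the example above. The function name `scale_riva` is hypothetical, and AKS_RESOURCE_GROUP and AKS_CLUSTER_NAME are expected to be set as they were earlier in this tutorial.

```shell
# Hypothetical helper: scale the GPU node pool and the riva-api deployment together.
# $1: desired number of GPU nodes, which is also used as the replica count
scale_riva() {
    count="$1"
    # Reject anything that is not a non-empty string of digits.
    case "$count" in
        ''|*[!0-9]*)
            echo "usage: scale_riva <node-count>" >&2
            return 1
            ;;
    esac
    # Only scale the deployment if the node pool was scaled successfully.
    az aks nodepool scale --name rivaserver \
        --resource-group "${AKS_RESOURCE_GROUP}" \
        --cluster-name "${AKS_CLUSTER_NAME}" \
        --node-count "${count}" &&
    kubectl scale deployments/riva-api --replicas="${count}"
}

# Example usage (commented out so that sourcing this file has no side effects):
# scale_riva 2
```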