feat: Addon for nvidia device plugin (#995)
Co-authored-by: Bryant Biggs <[email protected]>
askulkarni2 and bryantbiggs authored Sep 26, 2022
1 parent 36c403b commit 632a698
Showing 11 changed files with 182 additions and 0 deletions.
48 changes: 48 additions & 0 deletions docs/add-ons/nvidia-device-plugin.md
@@ -0,0 +1,48 @@
# NVIDIA Device Plugin

The NVIDIA device plugin for Kubernetes is a DaemonSet that allows you to automatically:

* Expose the number of GPUs on each node of your cluster
* Keep track of the health of your GPUs
* Run GPU-enabled containers in your Kubernetes cluster


For complete project documentation, please visit the [NVIDIA Device Plugin](https://github.com/NVIDIA/k8s-device-plugin#readme).

Additionally, refer to this AWS [blog](https://aws.amazon.com/blogs/compute/running-gpu-accelerated-kubernetes-workloads-on-p3-and-p2-ec2-instances-with-amazon-eks/) for more information on how the add-on can be tested.

## Usage

Deploy the NVIDIA device plugin by enabling the add-on as follows.

```hcl
enable_nvidia_device_plugin = true
```
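
For context, the flag is an input of the `kubernetes-addons` module. The sketch below shows one way it might sit in a module block. The `source` path is assumed from the repository layout in this commit, and the `module.eks_blueprints` reference is a hypothetical name for an existing cluster module, so adjust both to your setup.

```hcl
module "eks_blueprints_kubernetes_addons" {
  # Assumed source path, based on the modules/kubernetes-addons directory in this repository.
  source = "github.com/aws-ia/terraform-aws-eks-blueprints//modules/kubernetes-addons"

  # Hypothetical reference to a cluster created elsewhere in your configuration.
  eks_cluster_id = module.eks_blueprints.eks_cluster_id

  # Deploys the NVIDIA device plugin Helm chart with the module's default settings.
  enable_nvidia_device_plugin = true
}
```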

You can optionally customize the Helm chart via the following configuration.

```hcl
enable_nvidia_device_plugin = true
# Optional nvidia_device_plugin_helm_config
nvidia_device_plugin_helm_config = {
  name       = "nvidia-device-plugin"
  chart      = "nvidia-device-plugin"
  repository = "https://nvidia.github.io/k8s-device-plugin"
  version    = "0.12.3"
  namespace  = "nvidia-device-plugin"
  values = [templatefile("${path.module}/values.yaml", {
    ...
  })]
}
```
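
The add-on's `locals.tf` merges this input over its defaults with `merge(local.default_helm_config, var.helm_config)`, so it should be enough to set only the keys you want to change. A minimal sketch, in which the namespace name is a hypothetical value chosen for illustration:

```hcl
enable_nvidia_device_plugin = true

nvidia_device_plugin_helm_config = {
  # Only this key overrides the default; name, chart, repository, and
  # version fall back to the values defined in the add-on's locals.tf.
  namespace = "gpu-system" # hypothetical namespace name
}
```

Since `merge` is shallow and later arguments win, any key supplied here replaces the default of the same name as a whole.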

### GitOps Configuration
The following properties are made available for use when managing the add-on via GitOps.

Refer to [locals.tf](https://github.com/aws-ia/terraform-aws-eks-blueprints/blob/main/modules/kubernetes-addons/nvidia-device-plugin/locals.tf) for the latest configuration. The ArgoCD add-on repository for GitOps is located [here](https://github.com/aws-samples/eks-blueprints-add-ons/blob/main/chart/values.yaml).

```hcl
argocd_gitops_config = {
  enable = true
}
```
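
The root `kubernetes-addons` module passes `var.argocd_manage_add_ons` to this add-on as `manage_via_gitops`, and the add-on only outputs `argocd_gitops_config` when that flag is `true`. A minimal sketch of the inputs involved, assuming ArgoCD itself is already set up elsewhere in your configuration:

```hcl
# Enable the add-on and hand its lifecycle to ArgoCD.
enable_nvidia_device_plugin = true

# With this set, the add-on outputs argocd_gitops_config = { enable = true };
# when it is false, the output is null.
argocd_manage_add_ons = true
```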
3 changes: 3 additions & 0 deletions modules/kubernetes-addons/README.md
@@ -61,6 +61,7 @@
| <a name="module_kyverno"></a> [kyverno](#module\_kyverno) | ./kyverno | n/a |
| <a name="module_local_volume_provisioner"></a> [local\_volume\_provisioner](#module\_local\_volume\_provisioner) | ./local-volume-provisioner | n/a |
| <a name="module_metrics_server"></a> [metrics\_server](#module\_metrics\_server) | ./metrics-server | n/a |
| <a name="module_nvidia_device_plugin"></a> [nvidia\_device\_plugin](#module\_nvidia\_device\_plugin) | ./nvidia-device-plugin | n/a |
| <a name="module_ondat"></a> [ondat](#module\_ondat) | ondat/ondat-addon/eksblueprints | 0.1.1 |
| <a name="module_opentelemetry_operator"></a> [opentelemetry\_operator](#module\_opentelemetry\_operator) | ./opentelemetry-operator | n/a |
| <a name="module_prometheus"></a> [prometheus](#module\_prometheus) | ./prometheus | n/a |
@@ -196,6 +197,7 @@
| <a name="input_enable_kyverno_policy_reporter"></a> [enable\_kyverno\_policy\_reporter](#input\_enable\_kyverno\_policy\_reporter) | Enable Kyverno UI. Requires `enable_kyverno` to be `true` | `bool` | `false` | no |
| <a name="input_enable_local_volume_provisioner"></a> [enable\_local\_volume\_provisioner](#input\_enable\_local\_volume\_provisioner) | Enable Local volume provisioner add-on | `bool` | `false` | no |
| <a name="input_enable_metrics_server"></a> [enable\_metrics\_server](#input\_enable\_metrics\_server) | Enable metrics server add-on | `bool` | `false` | no |
| <a name="input_enable_nvidia_device_plugin"></a> [enable\_nvidia\_device\_plugin](#input\_enable\_nvidia\_device\_plugin) | Enable NVIDIA device plugin add-on | `bool` | `false` | no |
| <a name="input_enable_ondat"></a> [enable\_ondat](#input\_enable\_ondat) | Enable Ondat add-on | `bool` | `false` | no |
| <a name="input_enable_opentelemetry_operator"></a> [enable\_opentelemetry\_operator](#input\_enable\_opentelemetry\_operator) | Enable opentelemetry operator add-on | `bool` | `false` | no |
| <a name="input_enable_prometheus"></a> [enable\_prometheus](#input\_enable\_prometheus) | Enable Community Prometheus add-on | `bool` | `false` | no |
@@ -243,6 +245,7 @@
| <a name="input_kyverno_policy_reporter_helm_config"></a> [kyverno\_policy\_reporter\_helm\_config](#input\_kyverno\_policy\_reporter\_helm\_config) | Kyverno UI Helm Chart config | `any` | `{}` | no |
| <a name="input_local_volume_provisioner_helm_config"></a> [local\_volume\_provisioner\_helm\_config](#input\_local\_volume\_provisioner\_helm\_config) | Local volume provisioner Helm Chart config | `any` | `{}` | no |
| <a name="input_metrics_server_helm_config"></a> [metrics\_server\_helm\_config](#input\_metrics\_server\_helm\_config) | Metrics Server Helm Chart config | `any` | `{}` | no |
| <a name="input_nvidia_device_plugin_helm_config"></a> [nvidia\_device\_plugin\_helm\_config](#input\_nvidia\_device\_plugin\_helm\_config) | NVIDIA device plugin Helm Chart config | `any` | `{}` | no |
| <a name="input_ondat_admin_password"></a> [ondat\_admin\_password](#input\_ondat\_admin\_password) | Password for Ondat admin user | `string` | `"storageos"` | no |
| <a name="input_ondat_admin_username"></a> [ondat\_admin\_username](#input\_ondat\_admin\_username) | Username for Ondat admin user | `string` | `"storageos"` | no |
| <a name="input_ondat_create_cluster"></a> [ondat\_create\_cluster](#input\_ondat\_create\_cluster) | Create cluster resources | `bool` | `true` | no |
1 change: 1 addition & 0 deletions modules/kubernetes-addons/locals.tf
@@ -43,6 +43,7 @@ locals {
    kyverno                 = var.enable_kyverno ? { enable = true } : null
    kyverno_policies        = var.enable_kyverno ? { enable = true } : null
    kyverno_policy_reporter = var.enable_kyverno ? { enable = true } : null
    nvidiaDevicePlugin      = var.enable_nvidia_device_plugin ? module.nvidia_device_plugin[0].argocd_gitops_config : null
  }

  addon_context = {
8 changes: 8 additions & 0 deletions modules/kubernetes-addons/main.tf
@@ -631,4 +631,12 @@ module "local_volume_provisioner" {
  addon_context = local.addon_context
}

module "nvidia_device_plugin" {
  count             = var.enable_nvidia_device_plugin ? 1 : 0
  source            = "./nvidia-device-plugin"
  helm_config       = var.nvidia_device_plugin_helm_config
  manage_via_gitops = var.argocd_manage_add_ons
  addon_context     = local.addon_context
}

45 changes: 45 additions & 0 deletions modules/kubernetes-addons/nvidia-device-plugin/README.md
@@ -0,0 +1,45 @@
# NVIDIA Device Plugin

The NVIDIA device plugin for Kubernetes is a DaemonSet that allows you to automatically:

* Expose the number of GPUs on each node of your cluster
* Keep track of the health of your GPUs
* Run GPU-enabled containers in your Kubernetes cluster

Read the add-on [docs](../../../docs/add-ons/nvidia-device-plugin.md) for more details.

<!-- BEGINNING OF PRE-COMMIT-TERRAFORM DOCS HOOK -->
## Requirements

| Name | Version |
|------|---------|
| <a name="requirement_terraform"></a> [terraform](#requirement\_terraform) | >= 1.0.0 |

## Providers

No providers.

## Modules

| Name | Source | Version |
|------|--------|---------|
| <a name="module_helm_addon"></a> [helm\_addon](#module\_helm\_addon) | ../helm-addon | n/a |

## Resources

No resources.

## Inputs

| Name | Description | Type | Default | Required |
|------|-------------|------|---------|:--------:|
| <a name="input_addon_context"></a> [addon\_context](#input\_addon\_context) | Input configuration for the addon | <pre>object({<br> aws_caller_identity_account_id = string<br> aws_caller_identity_arn = string<br> aws_eks_cluster_endpoint = string<br> aws_partition_id = string<br> aws_region_name = string<br> eks_cluster_id = string<br> eks_oidc_issuer_url = string<br> eks_oidc_provider_arn = string<br> tags = map(string)<br> irsa_iam_role_path = string<br> irsa_iam_permissions_boundary = string<br> })</pre> | n/a | yes |
| <a name="input_helm_config"></a> [helm\_config](#input\_helm\_config) | Helm provider config for the add-on | `any` | `{}` | no |
| <a name="input_manage_via_gitops"></a> [manage\_via\_gitops](#input\_manage\_via\_gitops) | Determines if the add-on should be managed via GitOps. | `bool` | `false` | no |

## Outputs

| Name | Description |
|------|-------------|
| <a name="output_argocd_gitops_config"></a> [argocd\_gitops\_config](#output\_argocd\_gitops\_config) | Configuration used for managing the add-on with ArgoCD |
<!-- END OF PRE-COMMIT-TERRAFORM DOCS HOOK -->
23 changes: 23 additions & 0 deletions modules/kubernetes-addons/nvidia-device-plugin/locals.tf
@@ -0,0 +1,23 @@
locals {
  name    = "nvidia-device-plugin"
  version = "0.12.3"

  default_helm_config = {
    name             = local.name
    chart            = local.name
    repository       = "https://nvidia.github.io/k8s-device-plugin"
    version          = local.version
    namespace        = local.name
    description      = "nvidia-device-plugin Helm Chart deployment configuration"
    create_namespace = true
  }

  helm_config = merge(
    local.default_helm_config,
    var.helm_config
  )

  argocd_gitops_config = {
    enable = true
  }
}
6 changes: 6 additions & 0 deletions modules/kubernetes-addons/nvidia-device-plugin/main.tf
@@ -0,0 +1,6 @@
module "helm_addon" {
  source            = "../helm-addon"
  manage_via_gitops = var.manage_via_gitops
  helm_config       = local.helm_config
  addon_context     = var.addon_context
}
4 changes: 4 additions & 0 deletions modules/kubernetes-addons/nvidia-device-plugin/outputs.tf
@@ -0,0 +1,4 @@
output "argocd_gitops_config" {
  description = "Configuration used for managing the add-on with ArgoCD"
  value       = var.manage_via_gitops ? local.argocd_gitops_config : null
}
28 changes: 28 additions & 0 deletions modules/kubernetes-addons/nvidia-device-plugin/variables.tf
@@ -0,0 +1,28 @@
variable "helm_config" {
  description = "Helm provider config for the add-on"
  type        = any
  default     = {}
}

variable "manage_via_gitops" {
  description = "Determines if the add-on should be managed via GitOps."
  type        = bool
  default     = false
}

variable "addon_context" {
  description = "Input configuration for the addon"
  type = object({
    aws_caller_identity_account_id = string
    aws_caller_identity_arn        = string
    aws_eks_cluster_endpoint       = string
    aws_partition_id               = string
    aws_region_name                = string
    eks_cluster_id                 = string
    eks_oidc_issuer_url            = string
    eks_oidc_provider_arn          = string
    tags                           = map(string)
    irsa_iam_role_path             = string
    irsa_iam_permissions_boundary  = string
  })
}
3 changes: 3 additions & 0 deletions modules/kubernetes-addons/nvidia-device-plugin/versions.tf
@@ -0,0 +1,3 @@
terraform {
  required_version = ">= 1.0.0"
}
13 changes: 13 additions & 0 deletions modules/kubernetes-addons/variables.tf
@@ -1220,3 +1220,16 @@ variable "local_volume_provisioner_helm_config" {
  type        = any
  default     = {}
}

#-----------NVIDIA DEVICE PLUGIN-----------------------
variable "enable_nvidia_device_plugin" {
  description = "Enable NVIDIA device plugin add-on"
  type        = bool
  default     = false
}

variable "nvidia_device_plugin_helm_config" {
  description = "NVIDIA device plugin Helm Chart config"
  type        = any
  default     = {}
}
