Add GPU/TPU addonplacementscore to each node in the managed clusters.
Develop the add-on in addontemplate mode.

Signed-off-by: z1ens <[email protected]>
z1ens committed Jul 10, 2024
1 parent 9498948 commit 83f79ab
Showing 29 changed files with 649 additions and 845 deletions.
4 changes: 4 additions & 0 deletions .gitignore
@@ -13,3 +13,7 @@

# Dependency directories (remove the comment below to include it)
# vendor/

# Ignore macOS and IDE specific files
.DS_Store
.idea/
4 changes: 4 additions & 0 deletions resource-usage-collect-addon/.gitignore
@@ -14,3 +14,7 @@ bin/

# Dependency directories (remove the comment below to include it)
vendor/

# Ignore macOS and IDE specific files
.DS_Store
.idea/
2 changes: 1 addition & 1 deletion resource-usage-collect-addon/Dockerfile
@@ -1,4 +1,4 @@
FROM golang:1.21 AS builder
FROM golang:1.22 AS builder
WORKDIR /go/src/open-cluster-management.io/addon-contrib/resource-usage-collect
COPY . .
ENV GO_PACKAGE open-cluster-management.io/addon-contrib/resource-usage-collect
4 changes: 2 additions & 2 deletions resource-usage-collect-addon/Makefile
@@ -20,7 +20,7 @@ PWD=$(shell pwd)
# Image URL to use all building/pushing image targets;
GO_BUILD_PACKAGES :=./pkg/...
IMAGE ?= resource-usage-collect-addon
IMAGE_REGISTRY ?= quay.io/haoqing
IMAGE_REGISTRY ?= zheshen
IMAGE_TAG ?= latest
IMAGE_NAME ?= $(IMAGE_REGISTRY)/$(IMAGE):$(IMAGE_TAG)

@@ -43,7 +43,7 @@ vet: ## Run go vet against code.
##@ Build
.PHONY: build
build: fmt vet ## Build manager binary.
GOFLAGS="" go build -o addon ./pkg/addon/main.go ./pkg/addon/controller.go
GOFLAGS="" go build -o addon ./pkg/main.go

.PHONY: images
images: ## Build addon binary.
85 changes: 51 additions & 34 deletions resource-usage-collect-addon/README.md
@@ -1,59 +1,72 @@

# Prototype of extensible scheduling using resource usage.
We already support [extensible placement scheduling](https://github.com/open-cluster-management-io/enhancements/blob/main/enhancements/sig-architecture/32-extensiblescheduling/32-extensiblescheduling.md), which allows use of [addonplacementscore](https://github.com/open-cluster-management-io/enhancements/blob/main/enhancements/sig-architecture/32-extensiblescheduling/32-extensiblescheduling.md#addonplacementscore-api) to select clusters, but we lack an addonplacementscore that contains cluster resource usage information.
# Resource usage collect addon
## Background
Open Cluster Management already supports [extensible placement scheduling](https://github.com/open-cluster-management-io/enhancements/blob/main/enhancements/sig-architecture/32-extensiblescheduling/32-extensiblescheduling.md), which allows users to use an [addonplacementscore](https://github.com/open-cluster-management-io/enhancements/blob/main/enhancements/sig-architecture/32-extensiblescheduling/32-extensiblescheduling.md#addonplacementscore-api) to select clusters under certain conditions.

In this repo, I developed an addon through the addon-framework. This addon is mainly used to collect resource usage information on the cluster and generate an addonplacementscore under the cluster namespace of the hub.
The basic idea of `addonPlacementScore` is that the addon agent, installed on the managed cluster, collects information about that cluster and calculates a score. These scores can be used when selecting or comparing multiple clusters.
With the rapid advancement of artificial intelligence, an increasing number of developers need to schedule and plan workloads based on available resources to achieve better performance and save resources.

More details refer to [Extend the multicluster scheduling capabilities with placement](https://open-cluster-management.io/scenarios/extend-multicluster-scheduling-capabilities/)
This repository introduces an addon that collects resource usage information from the managed clusters and calculates an `addonPlacementScore`; users can then select clusters based on that score using a `Placement`.
A possible use case: as a developer, I want to deploy my workload on the cluster that has the most GPU resources available. This addon is developed in `addonTemplate` mode.
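For orientation only (this object is created by the addon agent and is not a file in this repository), the sketch below shows roughly what such a score looks like and how to read it. The resource name `resource-usage-score` and the `gpuAvailable` score name are the ones used later in this README; the example value is made up.

```bash
# Illustration: each managed cluster's namespace on the hub holds an
# AddOnPlacementScore maintained by the agent, shaped roughly like:
#
#   apiVersion: cluster.open-cluster-management.io/v1alpha1
#   kind: AddOnPlacementScore
#   metadata:
#     name: resource-usage-score
#     namespace: cluster1
#   status:
#     scores:
#       - name: gpuAvailable   # consumed by the Placement example below
#         value: 70            # the API bounds score values to [-100, 100]
#
# Read it on the hub with:
kubectl -n cluster1 get addonplacementscore resource-usage-score -o yaml
```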

# Quickstart
## Prepare
You need at least two running Kubernetes clusters: one is the hub cluster, the other is a managed cluster.

You can create an OCM environment by running the command below, which will create a hub and two managed clusters for you.
For more details, please refer to:
- Extensible scheduling: [Extend the multicluster scheduling capabilities with placement](https://open-cluster-management.io/scenarios/extend-multicluster-scheduling-capabilities/)
- Add-on: [What-is-an-addon](https://open-cluster-management.io/concepts/addon/#what-is-an-add-on)
- Placement: [What-is-a-placement](https://open-cluster-management.io/concepts/placement/#select-clusters-in-managedclusterset)
- Addon template: [Enhancement:addontemplate](https://github.com/open-cluster-management-io/enhancements/tree/main/enhancements/sig-architecture/82-addon-template)

```bash
curl -sSL https://raw.githubusercontent.com/open-cluster-management-io/OCM/main/solutions/setup-dev-environment/local-up.sh | bash
```
# Quickstart
## Prerequisite
1. Follow the instructions on the [OCM official website](https://open-cluster-management.io/getting-started/quick-start/) to install the `clusteradm` command-line tool and set up a hub (manager) cluster and two managed clusters (a rough kind-based sketch follows this list).
If you are using a different Kubernetes distribution, follow the instructions in [Set-hub-and-managed-cluster](https://open-cluster-management.io/getting-started/quick-start/#setup-hub-and-managed-cluster).
2. The `kubectl` command-line tool installed.
3. [Docker](https://www.docker.com/) installed.
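Assuming you are using kind, the sketch below shows roughly how such an environment can be created; the cluster names and the kind-specific join flag are illustrative, and the [OCM quick start](https://open-cluster-management.io/getting-started/quick-start/) remains the authoritative reference.

```bash
# Rough sketch of a local setup with kind (names and flags are assumptions).
kind create cluster --name hub
kind create cluster --name cluster1
kind create cluster --name cluster2

# Initialize the hub; this prints a `clusteradm join` command containing a hub token.
kubectl config use-context kind-hub
clusteradm init --wait

# For each managed cluster, run the printed join command against that cluster's
# kubeconfig context, roughly:
#   clusteradm join --hub-token <token> --hub-apiserver <hub-api-url> \
#     --cluster-name cluster1 --wait --force-internal-endpoint-lookup
# Then accept the registrations back on the hub:
clusteradm accept --clusters cluster1,cluster2
```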

## Deploy

Set environment variables.
**Export the `kubeconfig` file of your hub cluster.**

```bash
export KUBECONFIG=</path/to/hub_cluster/kubeconfig> # export KUBECONFIG=~/.kube/config
```

Build the docker image to run the sample AddOn.
**Build the Docker image to run the resource-usage-collect addon.**

```bash
# build image
export IMAGE_NAME=quay.io/haoqing/resource-usage-collect-addon:latest
export IMAGE_NAME=zheshen/resource-usage-collect-addon-template:latest
make images
```

If you are using kind, load the image into the kind cluster.
**If you are using kind, load the image into your hub cluster.**

```bash
kind load docker-image $IMAGE_NAME --name cluster_name # kind load docker-image $IMAGE_NAME --name hub
```

Then deploy the example AddOn controller on the hub cluster.
**On the hub cluster, deploy the addon.**

```bash
make deploy
```

On the hub cluster, verify the resource-usage-collect-controller pod is running.
```bash
$ kubectl get pods -n open-cluster-management | grep resource-usage-collect-controller
resource-usage-collect-controller-55c58bbc5-t45dh 1/1 Running 0 71s
```
## What's Next

## What is next
If deployed successfully:

After the deployment is complete, the addon will create an addonplacementscore in each managed cluster's namespace on the hub.
On the hub cluster, you can see the `addonTemplate` and check the `managedClusterAddOn` status.
```bash
$ kubectl get addontemplate
NAME ADDON NAME
resource-usage-collect resource-usage-collect

$ kubectl get mca -A
NAMESPACE NAME AVAILABLE DEGRADED PROGRESSING
cluster1 resource-usage-collect True False
cluster2 resource-usage-collect True False
```

After a short while, on the hub cluster, an `addonPlacementScore` for each managed cluster will be generated.
```bash
$ kubectl config use-context kind-hub
$ kubectl get addonplacementscore -A
@@ -62,21 +75,20 @@ cluster1 resource-usage-score 3m23s
cluster2 resource-usage-score 3m24s
```
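To look at the individual score values for one cluster, the sketch below prints only the reported scores; score names such as `cpuAvailable` and `gpuAvailable` depend on the agent implementation.

```bash
# Print only the scores reported for cluster1.
kubectl -n cluster1 get addonplacementscore resource-usage-score \
  -o jsonpath='{.status.scores}'
```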

### For example

Select a cluster with more available CPU.
### Use Placement to select clusters
Consider this example use case: As a developer, I want to select a cluster with the most available GPU resources and deploy a job on it.

Bind the default ManagedClusterSet to the default namespace.
```bash
clusteradm clusterset bind default --namespace default
```

Users can create a `Placement` to select the one cluster that has the most GPU resources available.
```bash
cat <<EOF | kubectl apply -f -
apiVersion: cluster.open-cluster-management.io/v1beta1
kind: Placement
metadata:
  name: placement
  name: placement1
  namespace: default
spec:
  numberOfClusters: 1
@@ -87,18 +99,23 @@ spec:
          type: AddOn
          addOn:
            resourceName: resource-usage-score
            scoreName: cpuAvailable
            scoreName: gpuAvailable
        weight: 1
EOF
```
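To confirm which cluster was selected, you can inspect the placement's `PlacementDecision`; the label selector below follows the usual OCM convention and is shown as a sketch.

```bash
# The decision object is created in the placement's namespace (default here).
kubectl -n default get placementdecisions \
  -l cluster.open-cluster-management.io/placement=placement1 -o yaml
```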

After the `placement` is created, a user who wants to deploy a job to the selected cluster can use the `clusteradm` command to combine these two steps.
```bash
kubectl get placementdecisions -A
```
```bash
clusteradm create work my-first-work -f work1.yaml --placement default/placement1
```
The work will then be deployed to the selected cluster. Users can observe changes in the `addonPlacementScore` once the job consumes GPU resources.
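`work1.yaml` is not included in this commit; as a purely hypothetical example, it could contain a Job that consumes one GPU (the `nvidia.com/gpu` resource name assumes the NVIDIA device plugin is installed on the managed cluster, and the image is only illustrative).

```bash
# Hypothetical work1.yaml: a Job requesting one GPU on the selected cluster.
cat <<EOF > work1.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-job
  namespace: default
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: cuda-smoke-test
          image: nvidia/cuda:12.4.1-base-ubuntu22.04
          command: ["nvidia-smi"]
          resources:
            limits:
              nvidia.com/gpu: 1
EOF
```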

# Clean up
# Uninstall the addon

```bash
# clean up this addon
make undeploy
```

### Troubleshoot
1. If `make deploy` does not work, there may be a stale auto-generated `kustomization_tmp.yaml.tmp` file; delete it and rerun the command (see the sketch below).
Also make sure you are in the hub cluster context, check the `kustomization.yaml` file, and delete the section under `configMapGenerator` (if one exists).
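A minimal sketch of that recovery step (the location of the generated file is not fixed, hence the `find`):

```bash
# Remove the stale generated kustomize file and retry the deployment.
find . -name 'kustomization_tmp.yaml.tmp' -delete
make deploy
```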
12 changes: 6 additions & 6 deletions resource-usage-collect-addon/deploy/kustomization.yaml
@@ -1,13 +1,13 @@
namespace: open-cluster-management

resources:
- resources/clusterrole.yaml
- resources/controller.yaml
- resources/clustermanagementaddon.yaml
- resources/addon-template.yaml
- resources/cluster-management-addon.yaml
- resources/cluster-role.yaml
- resources/managed-cluster-set-binding.yaml
- resources/placement.yaml

images:
  - name: example-addon-image
    newName: quay.io/open-cluster-management/resource-usage-collect-addon
    newName: zheshen/resource-usage-collect-addon
    newTag: latest
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
100 changes: 100 additions & 0 deletions resource-usage-collect-addon/deploy/resources/addon-template.yaml
@@ -0,0 +1,100 @@
apiVersion: addon.open-cluster-management.io/v1alpha1
kind: AddOnTemplate
metadata:
  name: resource-usage-collect
spec:
  addonName: resource-usage-collect
  agentSpec:
    workload:
      manifests:
        - kind: ClusterRole
          apiVersion: rbac.authorization.k8s.io/v1
          metadata:
            name: resource-usage-collect-agent
          rules:
            - apiGroups: [ "" ]
              resources: [ "nodes", "configmaps", "pods", "events" ]
              verbs: [ "get", "list", "watch", "create", "update", "delete", "deletecollection", "patch" ]
            - apiGroups: [ "coordination.k8s.io" ]
              resources: [ "leases" ]
              verbs: [ "create", "get", "list", "update", "watch", "patch" ]
        - kind: ClusterRoleBinding
          apiVersion: rbac.authorization.k8s.io/v1
          metadata:
            name: resource-usage-collect-agent
          roleRef:
            apiGroup: rbac.authorization.k8s.io
            kind: ClusterRole
            name: resource-usage-collect-agent
          subjects:
            - kind: ServiceAccount
              name: resource-usage-collect-agent-sa
              namespace: open-cluster-management-agent-addon
        - kind: Deployment
          apiVersion: apps/v1
          metadata:
            name: resource-usage-collect-agent
            namespace: open-cluster-management-agent-addon
            labels:
              app: resource-usage-collect-agent
          spec:
            replicas: 1
            selector:
              matchLabels:
                app: resource-usage-collect-agent
            template:
              metadata:
                labels:
                  app: resource-usage-collect-agent
              spec:
                serviceAccount: resource-usage-collect-agent-sa
                containers:
                  - name: resource-usage-collect-agent
                    image: zheshen/resource-usage-collect-addon:latest
                    imagePullPolicy: Always
                    args:
                      - "/addon"
                      - "agent"
                      - "--hub-kubeconfig={{HUB_KUBECONFIG}}"
                      - "--cluster-name={{CLUSTER_NAME}}"
                      - "--addon-namespace=open-cluster-management-agent-addon"
        - kind: Role
          apiVersion: rbac.authorization.k8s.io/v1
          metadata:
            name: open-cluster-management:resource-usage-collect:agent
            namespace: open-cluster-management-agent-addon
          rules:
            - apiGroups: [ "" ]
              resources: [ "nodes", "configmaps", "pods" ]
              verbs: [ "get", "list", "watch" ]
            - apiGroups: [ "cluster.open-cluster-management.io" ]
              resources: [ "addonplacementscores" ]
              verbs: [ "get", "list", "watch", "create", "update", "delete", "deletecollection", "patch" ]
            - apiGroups: [ "cluster.open-cluster-management.io" ]
              resources: [ "addonplacementscores/status" ]
              verbs: [ "update", "patch" ]
        - kind: RoleBinding
          apiVersion: rbac.authorization.k8s.io/v1
          metadata:
            name: open-cluster-management:resource-usage-collect:agent
            namespace: open-cluster-management-agent-addon
          roleRef:
            apiGroup: rbac.authorization.k8s.io
            kind: Role
            name: open-cluster-management:resource-usage-collect:agent
          subjects:
            - kind: ServiceAccount
              name: resource-usage-collect-agent-sa
              namespace: open-cluster-management-agent-addon
        - kind: ServiceAccount
          apiVersion: v1
          metadata:
            name: resource-usage-collect-agent-sa
            namespace: open-cluster-management-agent-addon
  registration:
    - type: KubeClient
      kubeClient:
        hubPermissions:
          - type: CurrentCluster
            currentCluster:
              clusterRoleName: open-cluster-management:resource-usage-collect:agent
@@ -0,0 +1,17 @@
apiVersion: addon.open-cluster-management.io/v1alpha1
kind: ClusterManagementAddOn
metadata:
  name: resource-usage-collect
spec:
  addOnMeta:
    displayName: resource-usage-collect
  supportedConfigs:
    - group: addon.open-cluster-management.io
      resource: addontemplates
      defaultConfig:
        name: resource-usage-collect
  installStrategy:
    type: Placements
    placements:
      - name: placement-all
        namespace: default
31 changes: 31 additions & 0 deletions resource-usage-collect-addon/deploy/resources/cluster-role.yaml
@@ -0,0 +1,31 @@
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: open-cluster-management:resource-usage-collect:agent
rules:
  - apiGroups: [ "" ]
    resources: [ "nodes", "configmaps", "pods", "events" ]
    verbs: [ "get", "list", "watch", "create", "update", "delete", "deletecollection", "patch" ]
  - apiGroups: [ "coordination.k8s.io" ]
    resources: [ "leases" ]
    verbs: [ "create", "get", "list", "update", "watch", "patch" ]
  - apiGroups: [ "cluster.open-cluster-management.io" ]
    resources: [ "addonplacementscores" ]
    verbs: [ "get", "list", "watch", "create", "update", "delete", "deletecollection", "patch" ]
  - apiGroups: [ "cluster.open-cluster-management.io" ]
    resources: [ "addonplacementscores/status" ]
    verbs: [ "update", "patch" ]

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: open-cluster-management-addon-manager-resource-usage-collect
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: open-cluster-management:resource-usage-collect:agent
subjects:
  - kind: ServiceAccount
    name: addon-manager-controller-sa
    namespace: open-cluster-management-hub
