✨ Operator managed quota. #199

Merged: jortel merged 3 commits into konveyor:master on Aug 30, 2024

Conversation

@jortel (Contributor) commented Aug 15, 2024

No description provided.

Signed-off-by: Jeff Ortel <[email protected]>
@jortel changed the title ✨ Add enhancement. → ✨ Operator manged quota. Aug 15, 2024
@jortel added this to the v0.6.0 milestone Aug 15, 2024
Signed-off-by: Jeff Ortel <[email protected]>
@aufi (Member) left a comment


Looks good, just added a few notes.


## Design Details

### Test Plan
@aufi (Member):

Some kind of scale test should probably be defined to prove the quota helps with handling higher loads.

@aufi (Member):

Also, do we have any upper / lower bounds? 99%, 100% would probably give basically no protection, while 1%, 2% etc. might make it just about impossible to schedule anything.

@jortel (Contributor, Author):

Perhaps, but it is hard to say. 2% of a huge cluster may be enough. On the flip side, 100% is essentially worthless but not broken.

@jortel (Contributor, Author) commented Aug 28, 2024:

> Some kind of scale test should probably be defined to prove the quota helps with handling higher loads.

This would be very complicated.

(Resolved review thread on enhancements/operator-quota/README.md)
@jortel changed the title ✨ Operator manged quota. → ✨ Operator managed quota. Aug 21, 2024
@jmontleon (Member) commented Aug 21, 2024

The logic (mostly for me later, if we decide to do this) to get total CPUs on worker nodes is something like this. This should be easy enough to reproduce with Ansible.

$ tot_cpu=0; for n in $(oc get node --selector='node-role.kubernetes.io/worker=' -o name); do node_cpu=$(oc get $n -o jsonpath='{.status.capacity.cpu}'); ((tot_cpu+=node_cpu)); done; echo $tot_cpu
16

And memory (in Ki):

tot_mem=0; for n in $(oc get node --selector='node-role.kubernetes.io/worker=' -o name); do node_mem=$(oc get $n -o jsonpath='{.status.capacity.memory}'); ((tot_mem+=$(echo $node_mem | sed 's/Ki//') )); done; echo $tot_mem
65825548

https://access.redhat.com/solutions/6968485
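
For illustration, the totals above could be turned into quota values by applying a percentage; a rough sketch (the 50% figure, the quota name, and use of the konveyor-tackle namespace are assumptions for the example, not something defined by this PR):

pct=50                                   # example percentage; the real value would come from configuration
quota_cpu=$((tot_cpu * pct / 100))       # tot_cpu and tot_mem computed as above
quota_mem="$((tot_mem * pct / 100))Ki"
oc create quota task-quota -n konveyor-tackle --hard=limits.cpu=${quota_cpu},limits.memory=${quota_mem}

oc create quota is shorthand for creating a ResourceQuota object with the given hard limits.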

@shawn-hurley (Contributor) left a comment

Overall I like the idea. I agree with others that there should be documentation around best practices and that it should probably default to off.

I had one other concern, but I think it is manageable, and may even be a concern for a different time.


## Proposal

Enhance the operator to optionally manage a Resource Quota in the namespace.
@shawn-hurley (Contributor):

One thing that I would be concerned about, but I think there are options around this, is this scenario:

  1. Pods are created up to the MAX of the resource quota.
  2. The hub pod is killed and is now unable to be scheduled.
  3. This leads to konveyor being taken down for some period of time (as the task pods complete, space will free up, IIUC).

I think one way to get around this is to have the hub/ui pods use a higher priority class, have all the task pods use another priority class, and apply the resource quota to just the task pods using the resource quota's priority class scope.

Just a thought, and it may be a premature optimization, but it is a concern.
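
For reference, the kind of priority-class-scoped quota being suggested might look roughly like the sketch below (the quota name, priority class name, and limits are illustrative assumptions, not part of the enhancement):

oc apply -f - <<'EOF'
apiVersion: v1
kind: ResourceQuota
metadata:
  name: task-pod-quota              # hypothetical name
  namespace: konveyor-tackle
spec:
  hard:
    limits.cpu: "8"
    limits.memory: 16Gi
  scopeSelector:                    # only pods using the task priority class count against this quota
    matchExpressions:
      - operator: In
        scopeName: PriorityClass
        values: ["konveyor-task-priority"]   # hypothetical priority class assigned to task pods
EOF

This assumes the task pods are created with a matching priorityClassName; the cluster-scoped nature of priority classes is the sticking point discussed in the replies below.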

@jortel (Contributor, Author) commented Aug 27, 2024:

The quota limits the creation of the Pod resource. So, if the hub crashes or is OOM killed, the quota would not impact restarting the container. I think the potentially concerning use case would be related to the operator needing to restart pods, such as an upgrade while tasks are in flight, or an edit of the Tackle CR causing the operator to update a deployment.

Some options (in no particular order). Though, I think option 4 is the simplest/cleanest.

Option 1

Perhaps this could be mitigated by having the operator reconcile in the following order:

  • Delete the quota.
  • Reconcile everything else.
  • Reconcile the quota.

This seems fragile and does not address all of the use cases.

Option 2

Priority classes (suggested by @shawn-hurley)
Unfortunately, priority classes are cluster scoped, and without a default priority class (which does not exist by default), pods without an assigned class default to zero (0). IIUC, this means that tackle pods would run at a higher priority than all other pods in the cluster (including controllers). Also, our operator would need permissions to create the cluster-scoped priority classes, which I highly doubt would be acceptable. All of this adds up to: we can't use/rely on priority classes.

$ oc get pc
NAME                      VALUE        GLOBAL-DEFAULT   AGE
system-cluster-critical   2000000000   false            33d
system-node-critical      2000001000   false            33d

Option 3

It is also unfortunate that quotas don't support label selectors.

The only option I have considered to address this concern is for task pods (and quota) to be created in another namespace. Example: konveyor-task. This would have the benefits of:

  • Would eliminate clutter in our main namespace, konveyor-tackle.
  • Quota would only affect task pods.

But with the following drawbacks:

  • Mainly that dealing with multiple namespaces would be kind of a pain, and perhaps confusing for users.

Option 4

Perhaps a better option would be to implement the quota in the hub. The hub would list the nodes and calculate the allocatable cpu and memory. Then, (optionally) create task pods within the requested quota, also expressed as a % and passed to the hub in an ENVAR. I had a PR for this at one time but discarded it in favor of ResourceQuota. The algorithm/calculation is straightforward. Given the limitations of ResourceQuota, this may be the better way to go.
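
For illustration, a rough shell equivalent of that calculation (a sketch only; the 50% figure is an example, and in the hub this would be done through the Kubernetes API with the percentage supplied via an environment variable):

pct=50                                    # example; the proposal is to pass this to the hub in an ENVAR
alloc_cpu=0; alloc_mem=0
for n in $(oc get node --selector='node-role.kubernetes.io/worker=' -o name); do
  cpu=$(oc get $n -o jsonpath='{.status.allocatable.cpu}')
  [[ $cpu == *m ]] && cpu=${cpu%m} || cpu=$((cpu * 1000))    # normalize whole cores to millicores
  mem=$(oc get $n -o jsonpath='{.status.allocatable.memory}')
  ((alloc_cpu+=cpu)); ((alloc_mem+=${mem%Ki}))
done
echo "task budget: cpu=$((alloc_cpu * pct / 100))m memory=$((alloc_mem * pct / 100))Ki"

The hub would then stop creating new task pods once the pods it has already created add up to that budget.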

@shawn-hurley (Contributor):

I think that if the other suggestions don't work, then as described it makes sense; sorry for the wild goose chase. I just thought there was maybe something that we could do.

With that said, the use case I am worried about is an edge case, and barring good options (1-3 I agree are out) and not wanting to re-create a scheduler (option 4), I think what we have is sufficient.

@jortel (Contributor, Author) commented Aug 27, 2024:

Well, we already have a scheduler. Option 4 only proposed to make it a little smarter.

Signed-off-by: Jeff Ortel <[email protected]>
@jortel merged commit 3084f00 into konveyor:master Aug 30, 2024
3 checks passed