
Add cpu/memory requests and limits #1819

Merged (1 commit) on Sep 25, 2024

Conversation

@hj-johannes-lee (Contributor) commented Aug 28, 2024

Operator maturity level 3 requires cpu/memory requests and limits for operands. Add them to all plugins deployed by the operator.

@mythi marked this pull request as draft August 29, 2024 05:01
@mythi (Contributor) commented Aug 29, 2024

(converted to Draft)

@hj-johannes-lee (Contributor, Author) commented Sep 4, 2024

dlb plugin:  %CPU 0.7-3.3 (avg: 0.78), RES: 12544
dsa plugin:  %CPU 1.0-4.2 (avg: 1.65), RES: 20796
fpga plugin: %CPU 0.3-7.4 (avg: 1.01), RES: 25716
gpu plugin:  %CPU 1.7-3.3 (avg: 2.73), RES: 44556
iaa plugin:  %CPU 1.0-5.2 (avg: 1.61), RES: 21260
qat plugin:  %CPU 3.3-6.2 (avg: 3.48), RES: 14336
sgx plugin:  %CPU 0.7-2.0 (avg: 0.89), RES: 12544

Based on this, I set the values in the yaml files. Please review!

  • Except for the GPU plugin, I could not see meaningfully different figures between when there is no workload and when workloads are running.
  • For CPU requests I used a minimum of 40m, or somewhat higher than the biggest figure I collected.
  • For memory I used a slightly bigger value than the collected one as the request, and doubled it for the limit (see the sketch after this list).
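
For illustration, here is roughly how such a requests/limits pair can be written with the Kubernetes API types, using the sgx figures above as a starting point. This is a minimal sketch with made-up quantities: it assumes the RES column is KiB (so ~12 MiB) and that 1 %CPU of one core is about 10m, and it is not necessarily what the yaml files ended up with.

    // Illustrative only; assumed imports:
    //   v1 "k8s.io/api/core/v1"
    //   "k8s.io/apimachinery/pkg/api/resource"
    var sgxPluginResources = v1.ResourceRequirements{
        Requests: v1.ResourceList{
            v1.ResourceCPU:    resource.MustParse("40m"),  // minimum 40m, above the ~20m peak observed
            v1.ResourceMemory: resource.MustParse("15Mi"), // slightly above the ~12 MiB RES observed
        },
        Limits: v1.ResourceList{
            v1.ResourceCPU:    resource.MustParse("100m"), // comfortably above the observed peak
            v1.ResourceMemory: resource.MustParse("30Mi"), // double the request
        },
    }

In the plugin yaml files the same values go under the container's resources: block.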

Review comments (now resolved/outdated) on:
deployments/dlb_plugin/base/intel-dlb-plugin.yaml
deployments/dsa_plugin/base/intel-dsa-plugin.yaml
deployments/gpu_plugin/base/intel-gpu-plugin.yaml
deployments/sgx_plugin/base/intel-sgx-plugin.yaml
@hj-johannes-lee (Contributor, Author) commented Sep 5, 2024

I set:
2 * the maximum figure that I saw => requests
and 2 * the requests => limits

If you think this is fine, let me update the test files!

@mythi (Contributor) commented Sep 5, 2024

I set 2 * the maximum figure that I saw => requests and 2 * the requests => limits

If you think this is fine, let me update the test files!

@eero-t had this suggestion:
"If the values are really small, I'd round them up quite a bit. E.g. ~40m at least to 100m (0.1 CPU cores)."

@hj-johannes-lee (Contributor, Author) commented Sep 5, 2024

I "rounded them up" and made the minimum 40m. Isn't that what he meant?

Do you think he meant to "add up" 40m-100m more?
Or, if it is less than 40m, make it 100m?
Sorry, that is not really clear to me.

@eero-t (Contributor) commented Sep 5, 2024

Or, if it is less than 40m, make it 100m? Sorry, that is not really clear to me.

I meant that if 2x the CPU usage is <100m, just use 100m, because it's better for the limits to be (clearly) larger than needed than for the process to be needlessly throttled.

Memory throttling means paging, which means that the performance of all processes on that node will suffer in an effort to keep the given container's memory usage within some artificial limit.

@eero-t (Contributor) commented Sep 5, 2024

Another reason for having higher limits is that the plugins' resource usage is not going to be measured, and the limits updated, for every release. Therefore there should be plenty of extra space in case plugin size / resource usage increases due to added features (the GPU plugin is the largest, so it can act as a guideline for how large they could grow).

The scheduler is not going to block additional workloads from going to a node just because it has a couple of pods requesting a smallish fraction of a single core. :-)

@mythi (Contributor) commented Sep 6, 2024

If you think this is fine, let me update the test files!

It would be good to do this anyway to get the e2e tests going. It would also be useful to come up with a simple test that tries to stress the plugins a bit.

@hj-johannes-lee (Contributor, Author) commented Sep 9, 2024

@eero-t Thanks for the explanation!
I increased the CPU limits' minimum value to 100m for all plugins.

So, in conclusion:

  • requests => the max value I saw (rounded up, min. 40m)
  • limits => 2 * the requests (min. 100m; see the sketch at the end of this comment)

I thought increasing the limits from the last commit would be enough.
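
The doubling-with-a-floor rule above could be sketched like this (illustrative only, not code from this PR; resource is k8s.io/apimachinery/pkg/api/resource):

    // limitFor applies the rule described above: limit = 2 * request,
    // raised to a given floor (e.g. 100m for CPU) if the result is smaller.
    func limitFor(request, floor resource.Quantity) resource.Quantity {
        limit := request.DeepCopy()
        limit.Add(request) // 2 * request
        if limit.Cmp(floor) < 0 {
            return floor
        }
        return limit
    }

    // e.g. limitFor(resource.MustParse("40m"), resource.MustParse("100m")) => 100m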

@eero-t (Contributor) commented Sep 9, 2024

I thought increasing the limits from the last commit would be enough.

Note that whether the requests match the limits changes the Pod QoS class: https://kubernetes.io/docs/concepts/workloads/pods/pod-qos/

@hj-johannes-lee (Contributor, Author) commented

Umm, @mythi, what do you think?
You said they do not need to be the same.

And, could anyone help with adding the test case to /pkg/controllers//controller_test.go?
I have no idea how to set i and s correctly.

@mythi (Contributor) commented Sep 9, 2024

Umm, @mythi, what do you think?
You said they do not need to be the same.

What's the question?

And, could anyone help with adding the test case to /pkg/controllers//controller_test.go?

We have examples under test/e2e:

    Resources: v1.ResourceRequirements{
        Requests: v1.ResourceList{resourceName: resource.MustParse("1")},
        Limits:   v1.ResourceList{resourceName: resource.MustParse("1")},
    },
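
For CPU/memory specifically, a rough sketch of how explicit quantities could be set up and compared in a test might look like the following. The quantities, the ds DaemonSet variable, and the surrounding test wiring are hypothetical, not taken from the repo's controller tests; v1 is k8s.io/api/core/v1, resource is k8s.io/apimachinery/pkg/api/resource, and apiequality is k8s.io/apimachinery/pkg/api/equality.

    // Hypothetical expected value for a test; quantities are placeholders.
    expected := v1.ResourceRequirements{
        Requests: v1.ResourceList{
            v1.ResourceCPU:    resource.MustParse("40m"),
            v1.ResourceMemory: resource.MustParse("25Mi"),
        },
        Limits: v1.ResourceList{
            v1.ResourceCPU:    resource.MustParse("100m"),
            v1.ResourceMemory: resource.MustParse("50Mi"),
        },
    }
    // apiequality.Semantic compares resource.Quantity values by value,
    // not by their string representation, unlike plain reflect.DeepEqual.
    if !apiequality.Semantic.DeepEqual(expected, ds.Spec.Template.Spec.Containers[0].Resources) {
        t.Errorf("unexpected resources: %+v", ds.Spec.Template.Spec.Containers[0].Resources)
    }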

@hj-johannes-lee (Contributor, Author) commented Sep 9, 2024

Isn't Eero pointing out that the limits and requests are different, and that they would need to be equal for Guaranteed QoS?
You said they do not need to be the same, and I made the (especially memory) requests lower than the limits after talking with you.

@eero-t (Contributor) commented Sep 9, 2024

My point was about which QoS is wanted for the plugin pods.

If it's "Guaranteed", then limits & requests should be equal (and CPU & memory requests should be specified for all containers, if the Pod has multiple ones).

@hj-johannes-lee (Contributor, Author) commented Sep 9, 2024

@eero-t Thanks! Mikko said they do not need to be guaranteed but that burstable would be enough, and I have the same opinion. So, I am keeping them as they are now! If you have a different perspective, let me know! :)

@hj-johannes-lee marked this pull request as ready for review September 9, 2024 13:14
@mythi (Contributor) commented Sep 10, 2024

I thought about guaranteed a bit more. We don't give users any possibility to update these values, so we need to select them carefully. Are we going to see cases where the kubelet starts evicting the plugin pods on overloaded nodes? AFAIU, it should be enough if the requests are set so that we don't exceed them in normal cases...

@hj-johannes-lee (Contributor, Author) commented Sep 12, 2024

@mythi Umm, we are adding the cpu/memory requests and limits because of the operator's maturity level 3. So, if only requests are set, it still does not fit the criteria of maturity level 3. Would we then go with guaranteed by setting the limits to the same amounts as the requests?

Or, should we give the operator's users the possibility to set these, so that the deployments/<plugin>/base/*.yaml files do not need to be changed at all?

@mythi (Contributor) commented Sep 13, 2024

So, if only requests are set,

I did not propose/mean this. I wrote that if we select these values carefully, perhaps we are safe (but we would still set limits too). I prefer not to have these in the CRD API, to keep it simple.

@hj-johannes-lee (Contributor, Author) commented

it should be enough if the requests are set so that we don't exceed them in normal cases...

I thought you meant that we do not need limits since it is enough to set requests.

I did not propose/mean this. I wrote that if we select these values carefully, perhaps we are safe (but we would still set limits too).

I think I still do not get your point. We already set requests and limits safely, and you mentioned guaranteed. Are you suggesting changing them to be guaranteed?

I prefer not to have these in the CRD API, to keep it simple.

'These' refers to the part about 'giving users the possibility', right?

@mythi (Contributor) commented Sep 17, 2024

We already set requests and limits safely,

But the plugin pods can still be evicted, right? My question is how likely a problem that is with the current settings. Also, have you tested what happens to running workloads that are using devices in case a plugin pod gets evicted?

@hj-johannes-lee (Contributor, Author) commented Sep 17, 2024

I have not tested what happens to the workloads when the pod gets evicted.
Would it work like this?

  • deploy a plugin
  • run a workload
  • delete the plugin while the workload is running (I don't know if that is the same as "getting evicted")

If not, what would be the way to get the pod evicted?


But to be honest, I do not see how the matter of pods getting evicted is related to this PR...
My point is...
Of course, it is good to avoid such situations, but before this PR we did not have any requests/limits => so the plugins have been BestEffort until now.
The current change makes them Burstable.

No one cared about pods getting evicted even though they have had the lowest priority until now...
This PR gives them a higher priority, so why does it matter? (Did I understand wrongly? Please correct me if I did!)

@eero-t (Contributor) commented Sep 17, 2024

Yes, according to the k8s Pod QoS doc, a best-effort pod (the earlier plugins) will be evicted before burstable ones (this PR). I assume this to be the case even when the latter pod exceeds its resource request.

However, if a pod exceeds its memory limit, it gets OOM-killed by the kernel & restarted by k8s (if its spec allows restarts), and eventually the restarts back off. This happens regardless of whether the node is short on resources or not. Something like that could happen if there's a condition that triggers a plugin memory leak, so it may be relevant to check.

@mythi (Contributor) commented Sep 17, 2024

but before this PR we did not have any requests/limits

they were still controllable through a LimitRange, for example. Now we are locking things down.
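
For context, a LimitRange is a namespace-level object through which an admin could previously have applied default requests/limits to containers that specify none. A minimal sketch with made-up values, not something shipped by this project (v1 is k8s.io/api/core/v1, metav1 is k8s.io/apimachinery/pkg/apis/meta/v1, resource is k8s.io/apimachinery/pkg/api/resource):

    // Hypothetical LimitRange applying container defaults in a namespace;
    // the name and quantities are illustrative only.
    lr := v1.LimitRange{
        ObjectMeta: metav1.ObjectMeta{Name: "plugin-defaults"},
        Spec: v1.LimitRangeSpec{
            Limits: []v1.LimitRangeItem{{
                Type: v1.LimitTypeContainer,
                // applied as the request when a container specifies none
                DefaultRequest: v1.ResourceList{
                    v1.ResourceCPU:    resource.MustParse("40m"),
                    v1.ResourceMemory: resource.MustParse("30Mi"),
                },
                // applied as the limit when a container specifies none
                Default: v1.ResourceList{
                    v1.ResourceCPU:    resource.MustParse("100m"),
                    v1.ResourceMemory: resource.MustParse("60Mi"),
                },
            }},
        },
    }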

@hj-johannes-lee (Contributor, Author) commented Sep 17, 2024

Thanks for the explanations!

However, if a pod exceeds its memory limit, it gets OOM-killed by the kernel & restarted by k8s (if its spec allows restarts), and eventually the restarts back off. This happens regardless of whether the node is short on resources or not. Something like that could happen if there's a condition that triggers a plugin memory leak, so it may be relevant to check.

Wouldn't this happen even when the pod's QoS is guaranteed?

If yes... what do you suggest now?

I am lost. Just let me know what I need to do next!

@mythi (Contributor) commented Sep 18, 2024

what do you suggest now?

I think we all agreed that additional testing is still needed, especially for node cpu/memory pressure cases. It could also be implemented as an e2e test, as suggested earlier.

@eero-t (Contributor) commented Sep 18, 2024

Wouldn't this happen even when the pod's QoS is guaranteed?

Yes. But not for best-effort containers, if the node has free resources. A node running out of resources is a case where many other things could also fail, so it's hard to predict what happens to best-effort containers then, which is why testing such a condition was IMHO less of a priority. Whereas with specified limits, it's clear what happens, when, and to which pod, so testing that could make more sense.

If yes... what do you suggest now?

Testing what happens when the limits are crossed.

Startup

The case where a new version crosses the limits already at startup can be tested by setting the limits too low. But startup failures are not really interesting, as then there's no chance for workloads to request devices provided by the plugin.

Run-time

I do not think the kernel allows lowering a memory cgroup limit below the container's current usage. This means the container's resource utilization would need to be increased at run time, instead of the limits being lowered.

Because the current containers include just the static plugin binary, not any tools, their allocation cannot be easily increased by kubectl exec'ing more stuff into them. This means the plugin or container content needs to be modified to support that.

This could be done either by adding some trivial tool for allocating + dirtying (writing to) a given amount of memory, or by adding such an option to the plugin:

-test-alloc-and-use-cpu  Allocate given number of MBs, and busy loop, to simulate large resource usage for testing.

(That option should do just that, and none of the other plugin functionality, to avoid messing things up otherwise.)

...and using that at a suitable moment in the test:

kubectl exec -it <plugin pod> -c <container name> -- /path/to/plugin -test-alloc-and-use-cpu 200
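
A rough sketch of what such a test-only option could look like as a standalone tool. The flag name follows the suggestion above; this is purely illustrative and not part of any plugin:

    // testalloc is a hypothetical helper that allocates and dirties the given
    // number of MiB and then busy-loops, so that a container's memory and CPU
    // limits can be exercised in a test.
    package main

    import (
        "flag"
        "fmt"
    )

    func main() {
        mib := flag.Int("test-alloc-and-use-cpu", 0,
            "Allocate given number of MiB and busy loop, for limit testing only")
        flag.Parse()
        if *mib <= 0 {
            fmt.Println("nothing to do")
            return
        }
        buf := make([]byte, *mib*1024*1024)
        for i := range buf { // write to every byte so the pages become resident
            buf[i] = byte(i)
        }
        for i := 0; ; i++ { // busy loop to keep CPU usage up and the buffer referenced
            _ = buf[i%len(buf)]
        }
    }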

@eero-t (Contributor) commented Sep 18, 2024

IMHO such changes and test(s) verifying that workloads behave as expected [1] would belong to another PR, though.

[1] Workloads that already got a device still work, but new workloads do not get devices until the plugin has successfully restarted.

@mythi (Contributor) commented Sep 18, 2024

[1] Workloads that already got a device still work

If this is the case, then my main concern is covered. I was wondering whether a guaranteed QoS class workload would be evicted if the plugin device resources go away.

@eero-t (Contributor) commented Sep 18, 2024

[1] Workloads that already got a device still work

If this is the case, then my main concern is covered. I was wondering whether a guaranteed QoS class workload would be evicted if the plugin device resources go away.

AFAIK the scheduler considers extended resources only when the pod is scheduled, and the kubelet plugin can do something when the container is created (e.g. assign a different device instead), but after that the pod is completely on its own in this regard.

@mythi (Contributor) commented Sep 20, 2024

From my side, I'm fine with merging once @hj-johannes-lee confirms that enough (manual) testing has been done.

@hj-johannes-lee (Contributor, Author) commented

@tkatila Thank you so much. He is helping me with this matter now!

@tkatila (Contributor) commented Sep 24, 2024

I verified that when a plugin is removed, a workload utilizing the resource stays running. If another workload is then scheduled, it goes into the Pending state.

@tkatila (Contributor) left a comment

With my limited testing, I'd say these current limits are good to go.

@mythi (Contributor) commented Sep 25, 2024

@tkatila thanks!

@hj-johannes-lee the two commits belong together, so please squash the commits and rebase.

Operator maturity level 3 requires cpu/memory requests and limits
for operands. Add them to all plugins deployed by operator

Signed-off-by: Hyeongju Johannes Lee <[email protected]>
@hj-johannes-lee (Contributor, Author) commented

@tkatila Thank you again!
@mythi I did!

@tkatila merged commit 4b62154 into intel:main Sep 25, 2024
75 checks passed