
Add cpu/memory requests and limits #1819

Merged (1 commit) on Sep 25, 2024

Conversation

@hj-johannes-lee (Contributor) commented Aug 28, 2024

Operator maturity level 3 requires cpu/memory requests and limits for operands. Add them to all plugins deployed by the operator.

@mythi marked this pull request as draft August 29, 2024 05:01
@mythi (Contributor) commented Aug 29, 2024

(converted to Draft)

@hj-johannes-lee (Contributor, Author) commented Sep 4, 2024

dlb plugin:  %CPU 0.7-3.3 (avg: 0.78), RES: 12544
dsa plugin:  %CPU 1.0-4.2 (avg: 1.65), RES: 20796
fpga plugin: %CPU 0.3-7.4 (avg: 1.01), RES: 25716
gpu plugin:  %CPU 1.7-3.3 (avg: 2.73), RES: 44556
iaa plugin:  %CPU 1.0-5.2 (avg: 1.61), RES: 21260
qat plugin:  %CPU 3.3-6.2 (avg: 3.48), RES: 14336
sgx plugin:  %CPU 0.7-2.0 (avg: 0.89), RES: 12544

Based on this, I set the values in the yaml files. Please review!

  • Except for the GPU plugin, I could not see meaningfully different figures between when there is no workload and when workloads are running.
  • For CPU requests I used a minimum of 40m, or somewhat higher than the biggest figure I collected.
  • For memory I used a slightly bigger value than the collected one as the request, and doubled it for the limit (see the sketch after this list).
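
For illustration, here is roughly how such a requests/limits pair can be written with the Kubernetes API types, using the sgx figures above as a starting point. This is a minimal sketch with made-up quantities: it assumes the RES column is KiB (so ~12 MiB) and that 1 %CPU of one core is about 10m, and it is not necessarily what the yaml files ended up with.

    // Illustrative only; assumed imports:
    //   v1 "k8s.io/api/core/v1"
    //   "k8s.io/apimachinery/pkg/api/resource"
    var sgxPluginResources = v1.ResourceRequirements{
        Requests: v1.ResourceList{
            v1.ResourceCPU:    resource.MustParse("40m"),  // minimum 40m, above the ~20m peak observed
            v1.ResourceMemory: resource.MustParse("15Mi"), // slightly above the ~12 MiB RES observed
        },
        Limits: v1.ResourceList{
            v1.ResourceCPU:    resource.MustParse("100m"), // comfortably above the observed peak
            v1.ResourceMemory: resource.MustParse("30Mi"), // double the request
        },
    }

In the plugin yaml files the same values go under the container's resources: block.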

Review comments (now resolved/outdated) on:
deployments/dlb_plugin/base/intel-dlb-plugin.yaml
deployments/dsa_plugin/base/intel-dsa-plugin.yaml
deployments/gpu_plugin/base/intel-gpu-plugin.yaml
deployments/sgx_plugin/base/intel-sgx-plugin.yaml
@hj-johannes-lee (Contributor, Author) commented Sep 5, 2024

I set:
2 * the maximum figure that I saw => requests
and 2 * the requests => limits

If you think this is fine, let me update the test files!

@mythi (Contributor) commented Sep 5, 2024

I set 2 * the maximum figure that I saw => requests and 2 * the requests => limits

If you think this is fine, let me update the test files!

@eero-t had this suggestion:
"If the values are really small, I'd round them up quite a bit. E.g. ~40m at least to 100m (0.1 CPU cores)."

@hj-johannes-lee (Contributor, Author) commented Sep 5, 2024

I "rounded them up" and made the minimum 40m. Isn't that what he meant?

Do you think he meant to "add up" 40m-100m more?
Or, if it is less than 40m, make it 100m?
Sorry, that is not really clear to me.

@eero-t (Contributor) commented Sep 5, 2024

Or, if it is less than 40m, make it 100m? Sorry, that is not really clear to me.

I meant that if 2x the CPU usage is <100m, just use 100m, because it's better for the limits to be (clearly) larger than needed than for the process to be needlessly throttled.

Memory throttling means paging, which means that the performance of all processes on that node will suffer in an effort to keep the given container's memory usage within some artificial limit.

@eero-t (Contributor) commented Sep 5, 2024

Another reason for having higher limits is that the plugins' resource usage is not going to be measured, and the limits updated, for every release. Therefore there should be plenty of extra space in case plugin size / resource usage increases due to added features (the GPU plugin is the largest, so it can act as a guideline for how large they could grow).

The scheduler is not going to block additional workloads from going to a node just because it has a couple of pods requesting a smallish fraction of a single core. :-)

@mythi (Contributor) commented Sep 6, 2024

If you think this is fine, let me update the test files!

It would be good to do this anyway to get the e2e tests going. It would also be useful to come up with a simple test that tries to stress the plugins a bit.

@hj-johannes-lee (Contributor, Author) commented Sep 9, 2024

@eero-t Thanks for the explanation!
I increased the CPU limits' minimum value to 100m for all plugins.

So, in conclusion:

  • requests => the max value I saw (rounded up, min. 40m)
  • limits => 2 * the requests (min. 100m; see the sketch at the end of this comment)

I thought increasing the limits from the last commit would be enough.
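
The doubling-with-a-floor rule above could be sketched like this (illustrative only, not code from this PR; resource is k8s.io/apimachinery/pkg/api/resource):

    // limitFor applies the rule described above: limit = 2 * request,
    // raised to a given floor (e.g. 100m for CPU) if the result is smaller.
    func limitFor(request, floor resource.Quantity) resource.Quantity {
        limit := request.DeepCopy()
        limit.Add(request) // 2 * request
        if limit.Cmp(floor) < 0 {
            return floor
        }
        return limit
    }

    // e.g. limitFor(resource.MustParse("40m"), resource.MustParse("100m")) => 100m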

@eero-t (Contributor) commented Sep 9, 2024

I thought increasing the limits from the last commit would be enough.

Note that whether the requests match the limits changes the Pod QoS class: https://kubernetes.io/docs/concepts/workloads/pods/pod-qos/

@hj-johannes-lee (Contributor, Author) commented

Umm, @mythi, what do you think?
You said they do not need to be the same.

And, could anyone help with adding the test case to /pkg/controllers//controller_test.go?
I have no idea how to set i and s correctly.

@mythi (Contributor) commented Sep 9, 2024

Umm, @mythi, what do you think?
You said they do not need to be the same.

What's the question?

And, could anyone help with adding the test case to /pkg/controllers//controller_test.go?

We have examples under test/e2e:

    Resources: v1.ResourceRequirements{
        Requests: v1.ResourceList{resourceName: resource.MustParse("1")},
        Limits:   v1.ResourceList{resourceName: resource.MustParse("1")},
    },
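
For CPU/memory specifically, a rough sketch of how explicit quantities could be set up and compared in a test might look like the following. The quantities, the ds DaemonSet variable, and the surrounding test wiring are hypothetical, not taken from the repo's controller tests; v1 is k8s.io/api/core/v1, resource is k8s.io/apimachinery/pkg/api/resource, and apiequality is k8s.io/apimachinery/pkg/api/equality.

    // Hypothetical expected value for a test; quantities are placeholders.
    expected := v1.ResourceRequirements{
        Requests: v1.ResourceList{
            v1.ResourceCPU:    resource.MustParse("40m"),
            v1.ResourceMemory: resource.MustParse("25Mi"),
        },
        Limits: v1.ResourceList{
            v1.ResourceCPU:    resource.MustParse("100m"),
            v1.ResourceMemory: resource.MustParse("50Mi"),
        },
    }
    // apiequality.Semantic compares resource.Quantity values by value,
    // not by their string representation, unlike plain reflect.DeepEqual.
    if !apiequality.Semantic.DeepEqual(expected, ds.Spec.Template.Spec.Containers[0].Resources) {
        t.Errorf("unexpected resources: %+v", ds.Spec.Template.Spec.Containers[0].Resources)
    }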

@hj-johannes-lee (Contributor, Author) commented Sep 9, 2024

Isn't Eero pointing out that the limits and requests are different, and that they would need to be equal for Guaranteed QoS?
You said they do not need to be the same, and I made the (especially memory) requests lower than the limits after talking with you.

@eero-t (Contributor) commented Sep 9, 2024

My point was about which QoS is wanted for the plugin pods.

If it's "Guaranteed", then limits & requests should be equal (and CPU & memory requests should be specified for all containers, if the Pod has multiple ones).

@hj-johannes-lee (Contributor, Author) commented Sep 9, 2024

@eero-t Thanks! Mikko said they do not need to be guaranteed but that burstable would be enough, and I have the same opinion. So, I am keeping them as they are now! If you have a different perspective, let me know! :)

@hj-johannes-lee marked this pull request as ready for review September 9, 2024 13:14
@mythi (Contributor) commented Sep 10, 2024

I thought about guaranteed a bit more. We don't give users any possibility to update these values, so we need to select them carefully. Are we going to see cases where the kubelet starts evicting the plugin pods on overloaded nodes? AFAIU, it should be enough if the requests are set so that we don't exceed them in normal cases...

@hj-johannes-lee (Contributor, Author) commented Sep 12, 2024

@mythi Umm, we are adding the cpu/memory requests and limits because of the operator's maturity level 3. So, if only requests are set, it still does not fit the criteria of maturity level 3. Would we then go with guaranteed by setting the limits to the same amounts as the requests?

Or, should we give the operator's users the possibility to set these, so that the deployments/<plugin>/base/*.yaml files do not need to be changed at all?

@mythi (Contributor) commented Sep 13, 2024

So, if only requests are set,

I did not propose/mean this. I wrote that if we select these values carefully, perhaps we are safe (but we would still set limits too). I prefer not to have these in the CRD API, to keep it simple.

@hj-johannes-lee (Contributor, Author) commented

it should be enough if the requests are set so that we don't exceed them in normal cases...

I thought you meant that we do not need limits since it is enough to set requests.

I did not propose/mean this. I wrote that if we select these values carefully, perhaps we are safe (but we would still set limits too).

I think I still do not get your point. We already set requests and limits safely, and you mentioned guaranteed. Are you suggesting changing them to be guaranteed?

I prefer not to have these in the CRD API, to keep it simple.

'These' refers to the part about 'giving users the possibility', right?

@mythi (Contributor) commented Sep 17, 2024

We already set requests and limits safely,

But the plugin pods can still be evicted, right? My question is how likely a problem that is with the current settings. Also, have you tested what happens to running workloads that are using devices in case a plugin pod gets evicted?

@hj-johannes-lee (Contributor, Author) commented Sep 17, 2024

I have not tested what happens to the workloads when the pod gets evicted.
Would it work like this?

  • deploy a plugin
  • run a workload
  • delete the plugin while the workload is running (I don't know if that is the same as "getting evicted")

If not, what would be the way to get the pod evicted?


But to be honest, I do not see how the matter of pods getting evicted is related to this PR...
My point is...
Of course, it is good to avoid such situations, but before this PR we did not have any requests/limits => so the plugins have been BestEffort until now.
The current change makes them Burstable.

No one cared about pods getting evicted even though they have had the lowest priority until now...
This PR gives them a higher priority, so why does it matter? (Did I understand wrongly? Please correct me if I did!)

@eero-t (Contributor) commented Sep 17, 2024

Yes, according to the k8s Pod QoS doc, a best-effort pod (the earlier plugins) will be evicted before burstable ones (this PR). I assume this to be the case even when the latter pod exceeds its resource request.

However, if a pod exceeds its memory limit, it gets OOM-killed by the kernel & restarted by k8s (if its spec allows restarts), and eventually the restarts back off. This happens regardless of whether the node is short on resources or not. Something like that could happen if there's a condition that triggers a plugin memory leak, so it may be relevant to check.

@mythi (Contributor) commented Sep 17, 2024

but before this PR we did not have any requests/limits

they were still controllable through a LimitRange, for example. Now we are locking things down.
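
For context, a LimitRange is a namespace-level object through which an admin could previously have applied default requests/limits to containers that specify none. A minimal sketch with made-up values, not something shipped by this project (v1 is k8s.io/api/core/v1, metav1 is k8s.io/apimachinery/pkg/apis/meta/v1, resource is k8s.io/apimachinery/pkg/api/resource):

    // Hypothetical LimitRange applying container defaults in a namespace;
    // the name and quantities are illustrative only.
    lr := v1.LimitRange{
        ObjectMeta: metav1.ObjectMeta{Name: "plugin-defaults"},
        Spec: v1.LimitRangeSpec{
            Limits: []v1.LimitRangeItem{{
                Type: v1.LimitTypeContainer,
                // applied as the request when a container specifies none
                DefaultRequest: v1.ResourceList{
                    v1.ResourceCPU:    resource.MustParse("40m"),
                    v1.ResourceMemory: resource.MustParse("30Mi"),
                },
                // applied as the limit when a container specifies none
                Default: v1.ResourceList{
                    v1.ResourceCPU:    resource.MustParse("100m"),
                    v1.ResourceMemory: resource.MustParse("60Mi"),
                },
            }},
        },
    }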

@hj-johannes-lee (Contributor, Author) commented Sep 17, 2024

Thanks for the explanations!

However, if a pod exceeds its memory limit, it gets OOM-killed by the kernel & restarted by k8s (if its spec allows restarts), and eventually the restarts back off. This happens regardless of whether the node is short on resources or not. Something like that could happen if there's a condition that triggers a plugin memory leak, so it may be relevant to check.

Wouldn't this happen even when the pod's QoS is guaranteed?

If yes... what do you suggest now?

I am lost. Just let me know what I need to do next!

@mythi (Contributor) commented Sep 18, 2024

what do you suggest now?

I think we all agreed that additional testing is still needed, especially for node cpu/memory pressure cases. It could also be implemented as an e2e test, as suggested earlier.

@eero-t (Contributor) commented Sep 18, 2024

Wouldn't this happen even when the pod's QoS is guaranteed?

Yes. But not for best-effort containers, if the node has free resources. A node running out of resources is a case where many other things could also fail, so it's hard to predict what happens to best-effort containers then, which is why testing such a condition was IMHO less of a priority. Whereas with specified limits, it's clear what happens, when, and to which pod, so testing that could make more sense.

If yes... what do you suggest now?

Testing what happens when the limits are crossed.

Startup

The case where a new version crosses the limits already at startup can be tested by setting the limits too low. But startup failures are not really interesting, as then there's no chance for workloads to request devices provided by the plugin.

Run-time

I do not think the kernel allows lowering a memory cgroup limit below the container's current usage. This means the container's resource utilization would need to be increased at run time, instead of the limits being lowered.

Because the current containers include just the static plugin binary, not any tools, their allocation cannot be easily increased by kubectl exec'ing more stuff into them. This means the plugin or container content needs to be modified to support that.

This could be done either by adding some trivial tool for allocating + dirtying (writing to) a given amount of memory, or by adding such an option to the plugin:

-test-alloc-and-use-cpu  Allocate given number of MBs, and busy loop, to simulate large resource usage for testing.

(That option should do just that, and none of the other plugin functionality, to avoid messing things up otherwise.)

...and using that at a suitable moment in the test:

kubectl exec -it <plugin pod> -c <container name> -- /path/to/plugin -test-alloc-and-use-cpu 200
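
A rough sketch of what such a test-only option could look like as a standalone tool. The flag name follows the suggestion above; this is purely illustrative and not part of any plugin:

    // testalloc is a hypothetical helper that allocates and dirties the given
    // number of MiB and then busy-loops, so that a container's memory and CPU
    // limits can be exercised in a test.
    package main

    import (
        "flag"
        "fmt"
    )

    func main() {
        mib := flag.Int("test-alloc-and-use-cpu", 0,
            "Allocate given number of MiB and busy loop, for limit testing only")
        flag.Parse()
        if *mib <= 0 {
            fmt.Println("nothing to do")
            return
        }
        buf := make([]byte, *mib*1024*1024)
        for i := range buf { // write to every byte so the pages become resident
            buf[i] = byte(i)
        }
        for i := 0; ; i++ { // busy loop to keep CPU usage up and the buffer referenced
            _ = buf[i%len(buf)]
        }
    }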

@eero-t (Contributor) commented Sep 18, 2024

IMHO such changes and test(s) verifying that workloads behave as expected [1] would belong to another PR, though.

[1] Workloads that already got a device still work, but new workloads do not get devices until the plugin has successfully restarted.

@mythi (Contributor) commented Sep 18, 2024

[1] Workloads that already got a device still work

If this is the case, then my main concern is covered. I was wondering whether a guaranteed QoS class workload would be evicted if the plugin device resources go away.

@eero-t (Contributor) commented Sep 18, 2024

[1] Workloads that already got a device still work

If this is the case, then my main concern is covered. I was wondering whether a guaranteed QoS class workload would be evicted if the plugin device resources go away.

AFAIK the scheduler considers extended resources only when the pod is scheduled, and the kubelet plugin can do something when the container is created (e.g. assign a different device instead), but after that the pod is completely on its own in this regard.

@mythi (Contributor) commented Sep 20, 2024

From my side, I'm fine with merging once @hj-johannes-lee confirms that enough (manual) testing has been done.

@hj-johannes-lee (Contributor, Author) commented

@tkatila Thank you so much. He is helping me with this matter now!

@tkatila (Contributor) commented Sep 24, 2024

I verified that when a plugin is removed, a workload utilizing the resource stays running. If another workload is then scheduled, it goes into the Pending state.

@tkatila (Contributor) left a comment

With my limited testing, I'd say these current limits are good to go.

@mythi (Contributor) commented Sep 25, 2024

@tkatila thanks!

@hj-johannes-lee the two commits belong together, so please squash the commits and rebase.

Operator maturity level 3 requires cpu/memory requests and limits
for operands. Add them to all plugins deployed by operator

Signed-off-by: Hyeongju Johannes Lee <[email protected]>
@hj-johannes-lee (Contributor, Author) commented

@tkatila Thank you again!
@mythi I did!

@tkatila merged commit 4b62154 into intel:main Sep 25, 2024
75 checks passed