scheduler: take all assigned cpu cores into account instead of only those part of the largest lifecycle #24304
Conversation
The branch was force-pushed: 60f051d → e598fdc → a1aecbc → 498ff4c → fe008c9 → bdffd59
@mvegter just a heads up that we see this, but it's going to take until next week to review. Scheduler changes are particularly nasty and easy to get wrong, so it's extra-careful going.
Hi @mvegter! Thanks for this PR! I had a chat with some folks internally and we think you're close in terms of the reasoning behind this, but the implementation leaves room for additional bugs (including in quotas).
First, let's look at what the `ComparableResources` type is supposed to be doing with lifecycles. Suppose I have an allocation with a main task, a post-start sidecar, and a prestart non-sidecar. To determine how much memory that allocation needs, I need to add up the memory required by all the tasks that are running at the same time. In this example, I don't need to include memory for the prestart non-sidecar, because that memory will be freed by the time the other two tasks start.
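To make the flattening concrete, here's a minimal sketch of the idea; this is not Nomad's actual code, and the task descriptor is invented for illustration. Fungible memory reduces to a single number: each execution phase sums the tasks that run concurrently, and the allocation needs the peak across phases.

```go
package main

import "fmt"

// Hypothetical task descriptor, for illustration only.
type task struct {
	name     string
	memoryMB int64
	prestart bool // prestart non-sidecar: runs and exits before main starts
}

// comparableMemory flattens per-task memory into one number: the peak
// of the prestart phase and the main phase, because memory freed by
// exited prestart tasks can be reused by the tasks that start later.
func comparableMemory(tasks []task) int64 {
	var prestartSum, mainSum int64
	for _, t := range tasks {
		if t.prestart {
			prestartSum += t.memoryMB
		} else {
			mainSum += t.memoryMB // main task + post-start sidecar run together
		}
	}
	if prestartSum > mainSum {
		return prestartSum
	}
	return mainSum
}

func main() {
	tasks := []task{
		{name: "init", memoryMB: 512, prestart: true},
		{name: "main", memoryMB: 256},
		{name: "sidecar", memoryMB: 128},
	}
	fmt.Println(comparableMemory(tasks)) // 512: the prestart peak dominates
}
```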
But this only works because memory is fungible! We don't care which specific bytes of memory are used, as they have no identity (giant asterisk on that around NUMA, of course, but not relevant at the moment).
Let's rework the example for cores. Here, we can see that we need to reserve cores 0-1 for the prestart task even though it's not going to be running (partially because Nomad doesn't coordinate between tasks of different allocations, but also because we could potentially restart the task). Cores have their own identities, so we need to ensure, as you've noted, that all the cores are reserved regardless of lifecycle.
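In contrast to the memory case, reserved cores can't be summed and max'd per phase; they have to be unioned across every lifecycle. A minimal sketch of that idea in plain Go (not Nomad's types):

```go
package main

import "fmt"

// unionCores collects the distinct core IDs reserved by any task,
// regardless of lifecycle: a core pinned by a prestart task must stay
// reserved even while that task is not running.
func unionCores(perTask [][]uint16) []uint16 {
	seen := make(map[uint16]struct{})
	var out []uint16
	for _, cores := range perTask {
		for _, c := range cores {
			if _, ok := seen[c]; !ok {
				seen[c] = struct{}{}
				out = append(out, c)
			}
		}
	}
	return out
}

func main() {
	// prestart pins 0-1, main pins 2, a sidecar pins 1 and 3
	fmt.Println(unionCores([][]uint16{{0, 1}, {2}, {1, 3}})) // [0 1 2 3]
}
```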
Where the implementation in this PR falls apart a little bit is that we're creating a new function to produce the set of cores, rather than fixing the existing logic in `(AllocatedCpuResources).Add` and its caller `(AllocatedTaskResources).Add`. So what we should do is move the new logic you wrote into those functions, so that it can be used in the plan applier and quota checker. You can see that in `(AllocatedTaskResources).Add` we have logic for handling resources with unique identities (like networks), so there's good precedent for that there.
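As a rough illustration of the suggested direction, an `Add` that treats shares as fungible but cores as identity-carrying might look something like this sketch. The struct mirrors the shape of Nomad's `AllocatedCpuResources` for illustration; treat it as an assumption, not the actual patch:

```go
package main

import "fmt"

// AllocatedCpuResources mirrors the shape of Nomad's struct for
// illustration only: CpuShares are fungible, ReservedCores are not.
type AllocatedCpuResources struct {
	CpuShares     int64
	ReservedCores []uint16
}

// Add sums the fungible shares but unions the reserved cores, so a
// core pinned by any lifecycle stays accounted for after merging.
func (a *AllocatedCpuResources) Add(delta *AllocatedCpuResources) {
	a.CpuShares += delta.CpuShares
	seen := make(map[uint16]struct{}, len(a.ReservedCores))
	for _, c := range a.ReservedCores {
		seen[c] = struct{}{}
	}
	for _, c := range delta.ReservedCores {
		if _, ok := seen[c]; !ok {
			seen[c] = struct{}{}
			a.ReservedCores = append(a.ReservedCores, c)
		}
	}
}

func main() {
	acc := AllocatedCpuResources{CpuShares: 500, ReservedCores: []uint16{2, 3}}
	prestart := AllocatedCpuResources{CpuShares: 200, ReservedCores: []uint16{0, 1}}
	acc.Add(&prestart)
	fmt.Println(acc.CpuShares, acc.ReservedCores) // 700 [2 3 0 1]
}
```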
The branch was force-pushed: bdffd59 → 2b388b4
scheduler: take all assigned cpu cores into account instead of only those part of the largest lifecycle

Fixes a bug in the AllocatedResources.Comparable method, where the scheduler would only take into account the cpusets of the tasks in the largest lifecycle. This could result in overlapping cgroup cpusets. Now we make the distinction between reserved and fungible resources throughout the lifespan of the alloc. In addition, added logging in case of future regressions, thus not requiring manual inspection of cgroup files.
The branch was force-pushed: 2b388b4 → f645eda
Hey @tgross, thank you for the extensive explanation! I have made the changes within the `Comparable()` method, as it would otherwise require quite some data structure changes to keep track of per-CPU MHz and add/subtract/max correctly. Similar to the Devices/Network handling, we always include the reserved cores.
In our production environment, where we run Nomad `v1.8.2`, we noticed overlapping cpusets and the Nomad reserve/share slices being out of sync. Specifically, the setup below, where we have various tasks in `prestart` and `poststart` alongside the `main` lifecycle. I managed to reproduce it with the job spec below on the latest main (v1.9.1) in my sandbox environment:
Spinning up two jobs with this spec resulted in the following overlap:
Full output
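For checking this kind of overlap on a node without eyeballing cgroup files by hand, a small sketch like the following can help. The glob is an assumption about a default cgroup v2 Linux client, where recent Nomad versions place pinned tasks under the `nomad.slice/reserve.slice` hierarchy mentioned above; adjust the path for your layout:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// parseCpuset expands a cpuset list such as "0-2,5" into core IDs.
func parseCpuset(s string) []int {
	var cores []int
	for _, part := range strings.Split(strings.TrimSpace(s), ",") {
		if part == "" {
			continue
		}
		if lo, hi, ok := strings.Cut(part, "-"); ok {
			a, _ := strconv.Atoi(lo)
			b, _ := strconv.Atoi(hi)
			for c := a; c <= b; c++ {
				cores = append(cores, c)
			}
		} else {
			c, _ := strconv.Atoi(part)
			cores = append(cores, c)
		}
	}
	return cores
}

func main() {
	owner := map[int]string{} // core ID -> first scope that claimed it
	paths, _ := filepath.Glob("/sys/fs/cgroup/nomad.slice/reserve.slice/*/cpuset.cpus")
	for _, p := range paths {
		raw, err := os.ReadFile(p)
		if err != nil {
			continue
		}
		for _, c := range parseCpuset(string(raw)) {
			if prev, taken := owner[c]; taken {
				fmt.Printf("overlap: core %d claimed by %s and %s\n", c, prev, p)
			} else {
				owner[c] = p
			}
		}
	}
}
```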