scheduler: take all assigned cpu cores into account instead of only those part of the largest lifecycle #24304

Open · mvegter wants to merge 1 commit into main from mvegter-fix-missing-exclusion-of-cpusets
Conversation

@mvegter (Contributor) commented Oct 27, 2024

In our production environment, where we run Nomad v1.8.2, we noticed overlapping cpusets and the Nomad reserve/share slices being out of sync. Specifically, in the setup below we have various tasks in prestart and poststart hooks alongside the main lifecycle.
[screenshot: overlapping cpusets observed in production]

I managed to reproduce it with the job spec below on the latest main (v1.9.1) in my sandbox environment:

job "redis-job-{{SOME_SED_MAGIC}}" {
  type = "service"
  group "cache" {
    count = 1
    task "redis" {
      driver = "docker"
      config {
        image = "redis:3.2"
      }
      resources {
        cores = 4
      }
    }

    task "redis-start-side" {
      lifecycle {
        hook    = "poststart"
        sidecar = true
      }
      driver = "docker"
      config {
        image = "redis:3.2"
      }
      resources {
        cores = 4
      }
    }
  }
}

Spinning up two jobs with this spec resulted in the overlap below, where containers 6e06a9ed1631 and a52a46cfa489 (tasks from different allocations) were both pinned to cores 4-7:

[sandbox@nomad-dev nomad]$ docker ps --format '{{.ID}}' | xargs -I {} bash -c 'grep -H . /sys/fs/cgroup/cpuset/docker/{}*/cpuset.effective_cpus' | column -s: -t | sort -n -k2
/sys/fs/cgroup/cpuset/docker/ec9220fbe2d0/cpuset.effective_cpus  0-3
/sys/fs/cgroup/cpuset/docker/6e06a9ed1631/cpuset.effective_cpus  4-7
/sys/fs/cgroup/cpuset/docker/a52a46cfa489/cpuset.effective_cpus  4-7
/sys/fs/cgroup/cpuset/docker/c9049b1b3f2c/cpuset.effective_cpus  8-11

Full output

[sandbox@nomad-dev nomad]$ docker ps
CONTAINER ID   IMAGE       COMMAND                  CREATED          STATUS          PORTS      NAMES
a52a46cfa489   redis:3.2   "docker-entrypoint.s…"   19 seconds ago   Up 18 seconds   6379/tcp   redis-start-side-4d6d1f92-fab2-f2bb-ca79-1f56ad3772c0
ec9220fbe2d0   redis:3.2   "docker-entrypoint.s…"   19 seconds ago   Up 18 seconds   6379/tcp   redis-4d6d1f92-fab2-f2bb-ca79-1f56ad3772c0

[sandbox@nomad-dev nomad]$ grep -H . /sys/fs/cgroup/cpuset/nomad/{reserve,share}/cpuset.effective_cpus
/sys/fs/cgroup/cpuset/nomad/reserve/cpuset.effective_cpus:0-7
/sys/fs/cgroup/cpuset/nomad/share/cpuset.effective_cpus:8-123

[sandbox@nomad-dev nomad]$ docker ps --format '{{.ID}}' | xargs -I {} bash -c 'grep -H . /sys/fs/cgroup/cpuset/docker/{}*/cpuset.effective_cpus' | column -s: -t | sort -n -k2
/sys/fs/cgroup/cpuset/docker/ec9220fbe2d0edef8bd9f67cabd7da226f32d346f65d196463bc4d6701864213/cpuset.effective_cpus  0-3
/sys/fs/cgroup/cpuset/docker/a52a46cfa489fe815fcbd11019c391d7fe771b878f77ddb3c993ab5cd98d8084/cpuset.effective_cpus  4-7

[sandbox@nomad-dev nomad]$ docker ps --format '{{.ID}}' | xargs docker inspect | egrep '(CpusetCpus|NOMAD_CPU_LIMIT|Id)'
        "Id": "a52a46cfa489fe815fcbd11019c391d7fe771b878f77ddb3c993ab5cd98d8084",
            "CpusetCpus": "4,5,6,7",
                "NOMAD_CPU_LIMIT=8980",
        "Id": "ec9220fbe2d0edef8bd9f67cabd7da226f32d346f65d196463bc4d6701864213",
            "CpusetCpus": "0,1,2,3",
                "NOMAD_CPU_LIMIT=8980",
[sandbox@nomad-dev nomad]$ docker ps
CONTAINER ID   IMAGE       COMMAND                  CREATED          STATUS          PORTS      NAMES
c9049b1b3f2c   redis:3.2   "docker-entrypoint.s…"   16 seconds ago   Up 15 seconds   6379/tcp   redis-start-side-50ef4e44-0e41-b273-7915-bfd0c2fc2ec2
6e06a9ed1631   redis:3.2   "docker-entrypoint.s…"   16 seconds ago   Up 16 seconds   6379/tcp   redis-50ef4e44-0e41-b273-7915-bfd0c2fc2ec2
a52a46cfa489   redis:3.2   "docker-entrypoint.s…"   3 minutes ago    Up 3 minutes    6379/tcp   redis-start-side-4d6d1f92-fab2-f2bb-ca79-1f56ad3772c0
ec9220fbe2d0   redis:3.2   "docker-entrypoint.s…"   3 minutes ago    Up 3 minutes    6379/tcp   redis-4d6d1f92-fab2-f2bb-ca79-1f56ad3772c0

[sandbox@nomad-dev nomad]$ grep -H . /sys/fs/cgroup/cpuset/nomad/{reserve,share}/cpuset.effective_cpus
/sys/fs/cgroup/cpuset/nomad/reserve/cpuset.effective_cpus:0-11
/sys/fs/cgroup/cpuset/nomad/share/cpuset.effective_cpus:12-123

[sandbox@nomad-dev nomad]$ docker ps --format '{{.ID}}' | xargs -I {} bash -c 'grep -H . /sys/fs/cgroup/cpuset/docker/{}*/cpuset.effective_cpus' | column -s: -t | sort -n -k2
/sys/fs/cgroup/cpuset/docker/ec9220fbe2d0edef8bd9f67cabd7da226f32d346f65d196463bc4d6701864213/cpuset.effective_cpus  0-3
/sys/fs/cgroup/cpuset/docker/6e06a9ed1631758827aa4136690818d04c050c55559fb9f74b780b6ff8d33728/cpuset.effective_cpus  4-7
/sys/fs/cgroup/cpuset/docker/a52a46cfa489fe815fcbd11019c391d7fe771b878f77ddb3c993ab5cd98d8084/cpuset.effective_cpus  4-7
/sys/fs/cgroup/cpuset/docker/c9049b1b3f2c2bbfebc6ec8e2f3aa280a9ab23b86322452a54575b1cba3ae179/cpuset.effective_cpus  8-11

[sandbox@nomad-dev nomad]$ docker ps --format '{{.ID}}' | xargs docker inspect | egrep '(CpusetCpus|NOMAD_CPU_LIMIT|Id)'
        "Id": "c9049b1b3f2c2bbfebc6ec8e2f3aa280a9ab23b86322452a54575b1cba3ae179",
            "CpusetCpus": "8,9,10,11",
                "NOMAD_CPU_LIMIT=8980",
        "Id": "6e06a9ed1631758827aa4136690818d04c050c55559fb9f74b780b6ff8d33728",
            "CpusetCpus": "4,5,6,7",
                "NOMAD_CPU_LIMIT=8980",
        "Id": "a52a46cfa489fe815fcbd11019c391d7fe771b878f77ddb3c993ab5cd98d8084",
            "CpusetCpus": "4,5,6,7",
                "NOMAD_CPU_LIMIT=8980",
        "Id": "ec9220fbe2d0edef8bd9f67cabd7da226f32d346f65d196463bc4d6701864213",
            "CpusetCpus": "0,1,2,3",
                "NOMAD_CPU_LIMIT=8980",
Fixes a bug in the BinPackIterator.Next method, where the scheduler would only
take into account the cpusets of the tasks in the largest lifecycle. This could
result in overlapping cgroup cpusets. By using Allocation.ReservedCores, the
scheduler uses the same cpuset view as Partition.Reserve. Logging was also added
so that future regressions can be caught without manually inspecting cgroup files.
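
To make the intent concrete, here is a minimal sketch of the idea in Go. The Allocation/AllocatedResources shapes loosely follow the nomad/structs package, but usedCores is a hypothetical helper for illustration, not the actual PR diff:

```go
package sketch

import "github.com/hashicorp/nomad/nomad/structs"

// usedCores unions the reserved cores of every task in every allocation,
// regardless of lifecycle hook: a prestart task's cores must stay reserved
// even while that task is not running.
func usedCores(allocs []*structs.Allocation) map[uint16]struct{} {
	used := make(map[uint16]struct{})
	for _, alloc := range allocs {
		if alloc.AllocatedResources == nil {
			continue
		}
		for _, taskRes := range alloc.AllocatedResources.Tasks {
			for _, core := range taskRes.Cpu.ReservedCores {
				used[core] = struct{}{}
			}
		}
	}
	return used
}
```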

@mvegter force-pushed the mvegter-fix-missing-exclusion-of-cpusets branch from 60f051d to e598fdc on October 27, 2024 14:19
@mvegter marked this pull request as ready for review on October 27, 2024 14:46
@mvegter force-pushed the branch from e598fdc to a1aecbc on October 27, 2024 14:56
@mvegter force-pushed the branch from a1aecbc to 498ff4c on October 28, 2024 12:04
@mvegter force-pushed the branch from 498ff4c to fe008c9 on October 28, 2024 12:47
@mvegter force-pushed the branch from fe008c9 to bdffd59 on November 8, 2024 15:54
@mvegter (Contributor, Author) commented Nov 8, 2024

Hey @tgross @jrasell, if you have some time available in the next few business days I would appreciate a review on this PR; curious to hear your feedback!

@tgross (Member) commented Nov 8, 2024

@mvegter just a heads-up that we've seen this, but it's going to take until next week to review... scheduler changes are particularly nasty and easy to get wrong, so it's extra-careful going.

@tgross (Member) left a comment:

Hi @mvegter! Thanks for this PR! I had a chat with some folks internally and we think you're close in terms of the reasoning behind this, but the implementation leaves room for additional bugs (including in quotas).

First, let's look at what the ComparableResources type is supposed to be doing with lifecycles. Suppose I have an allocation with a main task, a post-start sidecar, and a prestart non-sidecar. To determine how much memory that allocation needs, I need to add up the memory usage required by all the tasks that are running at the same time. In this example, I don't need to include memory for the prestart non-sidecar because that memory will be freed by the time the other two tasks start.

[figure: memory requirements across lifecycle phases]

But this only works because memory is fungible! We don't care which specific bytes of memory are used, as they have no identity (giant asterisk on that around NUMA, of course, but it's not relevant at the moment).

Let's rework the example for cores. Here, we can see that we need to reserve cores 0-1 for the prestart task even though it's not going to be running (partly because Nomad doesn't coordinate between tasks of different allocations, but also because we could potentially restart the task). Cores have their own identities, so we need to ensure, as you've noted, that all the cores are reserved regardless of lifecycle.

[figure: core reservations across lifecycle phases]
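
To make the distinction concrete, a tiny worked example with hypothetical numbers (Go 1.21+ for the max builtin; illustrative only): the memory requirement is the maximum across concurrently running phases, while the core requirement is the union across all tasks.

```go
package main

import "fmt"

// Hypothetical allocation: a prestart non-sidecar, a main task, and a
// poststart sidecar.
func main() {
	// Memory is fungible: phase 1 runs the prestart task alone, phase 2
	// runs main + sidecar, so the allocation needs only the larger phase.
	prestartMB, mainMB, sidecarMB := 256, 512, 128
	needMB := max(prestartMB, mainMB+sidecarMB) // 640

	// Cores have identity: every task's cores stay reserved for the whole
	// lifespan of the allocation, so we take the union, not a phase maximum.
	prestart, mainTask, sidecar := []int{0, 1}, []int{2, 3}, []int{4, 5}
	reserved := append(append(append([]int{}, prestart...), mainTask...), sidecar...)

	fmt.Printf("memory: %d MB, cores: %v\n", needMB, reserved) // 640 MB, [0 1 2 3 4 5]
}
```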

Where the implementation in this PR falls apart a little bit is that we're creating a new function to produce the set of cores, rather than fixing the existing logic in (AllocatedCpuResources).Add and its caller (AllocatedTaskResources).Add. So what we should do is move the new logic you wrote into those functions, so that it can be used in the plan applier and quota checker. You can see that in (AllocatedTaskResources).Add we have logic for handling resources with unique identities (like networks), so there's good precedent for that there.
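
For reference, a minimal sketch of what lifting the logic into (AllocatedCpuResources).Add could look like; the struct shape is a simplified copy of the one in nomad/structs, and the body is illustrative rather than the final upstream implementation:

```go
package sketch

import "sort"

// Simplified copy of the nomad/structs shape, for illustration only.
type AllocatedCpuResources struct {
	CpuShares     int64
	ReservedCores []uint16
}

// Add merges delta into a. CpuShares is fungible and is simply summed;
// ReservedCores carry identity, so they are unioned without duplicates.
func (a *AllocatedCpuResources) Add(delta *AllocatedCpuResources) {
	if delta == nil {
		return
	}
	a.CpuShares += delta.CpuShares

	seen := make(map[uint16]struct{}, len(a.ReservedCores))
	for _, c := range a.ReservedCores {
		seen[c] = struct{}{}
	}
	for _, c := range delta.ReservedCores {
		if _, ok := seen[c]; !ok {
			seen[c] = struct{}{}
			a.ReservedCores = append(a.ReservedCores, c)
		}
	}
	sort.Slice(a.ReservedCores, func(i, j int) bool {
		return a.ReservedCores[i] < a.ReservedCores[j]
	})
}
```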

@mvegter force-pushed the mvegter-fix-missing-exclusion-of-cpusets branch from bdffd59 to 2b388b4 on November 13, 2024 16:20
scheduler: take all assigned cpu cores into account instead of only those part of the largest lifecycle

Fixes a bug in the AllocatedResources.Comparable method, where the scheduler
would only take into account the cpusets of the tasks in the largest lifecycle.
This could result in overlapping cgroup cpusets. Now we make the distinction
between reserved and fungible resources throughout the lifespan of the alloc.
In addition, logging was added so that future regressions can be caught without
manual inspection of cgroup files.
@mvegter (Contributor, Author) commented Nov 13, 2024

Hey @tgross, thank you for the extensive explanation! I have made the changes within the Comparable() method, as doing this in Add would otherwise require quite some data-structure changes to keep track of per-CPU MHz and to add/subtract/max correctly. Similar to Devices/Networks, we always include the AllocatedCpuResources reserved cores as part of Flattened and only look at cpu-share based values in Add/Max. This way we correctly reflect the difference between the share and reserve slices for CPU while not touching the memory logic. Let me know what you think. The tests from the initial commit still pass; manually running the reproducer still fails on master but works correctly with these changes.
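
A sketch of that approach (simplified shapes; the real types live in nomad/structs and the actual diff differs): reserved cores from every task are always appended into the flattened view, the way Networks and Devices are, while the fungible share values keep flowing through the lifecycle-aware Add/Max path.

```go
package sketch

// Simplified shapes for illustration only.
type AllocatedCpuResources struct {
	CpuShares     int64
	ReservedCores []uint16
}

type AllocatedTaskResources struct {
	Cpu AllocatedCpuResources
}

type ComparableResources struct {
	Flattened AllocatedTaskResources
}

// flatten builds the comparable view. Reserved cores are appended for every
// task regardless of lifecycle (like Networks/Devices); the fungible
// CpuShares would still go through the lifecycle-aware Add/Max logic,
// elided here.
func flatten(tasks map[string]*AllocatedTaskResources) *ComparableResources {
	c := &ComparableResources{}
	for _, tr := range tasks {
		c.Flattened.Cpu.ReservedCores = append(
			c.Flattened.Cpu.ReservedCores, tr.Cpu.ReservedCores...)
	}
	return c
}
```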
