scheduler: take all assigned cpu cores into account instead of only those part of the largest lifecycle #24304

Open · mvegter wants to merge 1 commit into main from mvegter-fix-missing-exclusion-of-cpusets
Conversation

@mvegter (Contributor) commented Oct 27, 2024

In our production environment, where we run Nomad v1.8.2, we noticed overlapping cpusets and the Nomad reserve/share slices being out of sync. Specifically, in the setup below we have various tasks in prestart and poststart hooks alongside the main lifecycle.
[screenshot: overlapping cpusets observed in production]

I managed to reproduce it with the job spec below on the latest main (v1.9.1) in my sandbox environment:

job "redis-job-{{SOME_SED_MAGIC}}" {
  type = "service"
  group "cache" {
    count = 1
    task "redis" {
      driver = "docker"
      config {
        image = "redis:3.2"
      }
      resources {
        cores = 4
      }
    }

    task "redis-start-side" {
      lifecycle {
        hook    = "poststart"
        sidecar = true
      }
      driver = "docker"
      config {
        image = "redis:3.2"
      }
      resources {
        cores = 4
      }
    }
  }
}

Spinning up two jobs with this spec resulted in the overlap below, where containers 6e06a9ed1631 and a52a46cfa489 (tasks from different allocations) were both pinned to cores 4-7:

[sandbox@nomad-dev nomad]$ docker ps --format '{{.ID}}' | xargs -I {} bash -c 'grep -H . /sys/fs/cgroup/cpuset/docker/{}*/cpuset.effective_cpus' | column -s: -t | sort -n -k2
/sys/fs/cgroup/cpuset/docker/ec9220fbe2d0/cpuset.effective_cpus  0-3
/sys/fs/cgroup/cpuset/docker/6e06a9ed1631/cpuset.effective_cpus  4-7
/sys/fs/cgroup/cpuset/docker/a52a46cfa489/cpuset.effective_cpus  4-7
/sys/fs/cgroup/cpuset/docker/c9049b1b3f2c/cpuset.effective_cpus  8-11

Full output

[sandbox@nomad-dev nomad]$ docker ps
CONTAINER ID   IMAGE       COMMAND                  CREATED          STATUS          PORTS      NAMES
a52a46cfa489   redis:3.2   "docker-entrypoint.s…"   19 seconds ago   Up 18 seconds   6379/tcp   redis-start-side-4d6d1f92-fab2-f2bb-ca79-1f56ad3772c0
ec9220fbe2d0   redis:3.2   "docker-entrypoint.s…"   19 seconds ago   Up 18 seconds   6379/tcp   redis-4d6d1f92-fab2-f2bb-ca79-1f56ad3772c0

[sandbox@nomad-dev nomad]$ grep -H . /sys/fs/cgroup/cpuset/nomad/{reserve,share}/cpuset.effective_cpus
/sys/fs/cgroup/cpuset/nomad/reserve/cpuset.effective_cpus:0-7
/sys/fs/cgroup/cpuset/nomad/share/cpuset.effective_cpus:8-123

[sandbox@nomad-dev nomad]$ docker ps --format '{{.ID}}' | xargs -I {} bash -c 'grep -H . /sys/fs/cgroup/cpuset/docker/{}*/cpuset.effective_cpus' | column -s: -t | sort -n -k2
/sys/fs/cgroup/cpuset/docker/ec9220fbe2d0edef8bd9f67cabd7da226f32d346f65d196463bc4d6701864213/cpuset.effective_cpus  0-3
/sys/fs/cgroup/cpuset/docker/a52a46cfa489fe815fcbd11019c391d7fe771b878f77ddb3c993ab5cd98d8084/cpuset.effective_cpus  4-7

[sandbox@nomad-dev nomad]$ docker ps --format '{{.ID}}' | xargs docker inspect | egrep '(CpusetCpus|NOMAD_CPU_LIMIT|Id)'
        "Id": "a52a46cfa489fe815fcbd11019c391d7fe771b878f77ddb3c993ab5cd98d8084",
            "CpusetCpus": "4,5,6,7",
                "NOMAD_CPU_LIMIT=8980",
        "Id": "ec9220fbe2d0edef8bd9f67cabd7da226f32d346f65d196463bc4d6701864213",
            "CpusetCpus": "0,1,2,3",
                "NOMAD_CPU_LIMIT=8980",
[sandbox@nomad-dev nomad]$ docker ps
CONTAINER ID   IMAGE       COMMAND                  CREATED          STATUS          PORTS      NAMES
c9049b1b3f2c   redis:3.2   "docker-entrypoint.s…"   16 seconds ago   Up 15 seconds   6379/tcp   redis-start-side-50ef4e44-0e41-b273-7915-bfd0c2fc2ec2
6e06a9ed1631   redis:3.2   "docker-entrypoint.s…"   16 seconds ago   Up 16 seconds   6379/tcp   redis-50ef4e44-0e41-b273-7915-bfd0c2fc2ec2
a52a46cfa489   redis:3.2   "docker-entrypoint.s…"   3 minutes ago    Up 3 minutes    6379/tcp   redis-start-side-4d6d1f92-fab2-f2bb-ca79-1f56ad3772c0
ec9220fbe2d0   redis:3.2   "docker-entrypoint.s…"   3 minutes ago    Up 3 minutes    6379/tcp   redis-4d6d1f92-fab2-f2bb-ca79-1f56ad3772c0

[sandbox@nomad-dev nomad]$ grep -H . /sys/fs/cgroup/cpuset/nomad/{reserve,share}/cpuset.effective_cpus
/sys/fs/cgroup/cpuset/nomad/reserve/cpuset.effective_cpus:0-11
/sys/fs/cgroup/cpuset/nomad/share/cpuset.effective_cpus:12-123

[sandbox@nomad-dev nomad]$ docker ps --format '{{.ID}}' | xargs -I {} bash -c 'grep -H . /sys/fs/cgroup/cpuset/docker/{}*/cpuset.effective_cpus' | column -s: -t | sort -n -k2
/sys/fs/cgroup/cpuset/docker/ec9220fbe2d0edef8bd9f67cabd7da226f32d346f65d196463bc4d6701864213/cpuset.effective_cpus  0-3
/sys/fs/cgroup/cpuset/docker/6e06a9ed1631758827aa4136690818d04c050c55559fb9f74b780b6ff8d33728/cpuset.effective_cpus  4-7
/sys/fs/cgroup/cpuset/docker/a52a46cfa489fe815fcbd11019c391d7fe771b878f77ddb3c993ab5cd98d8084/cpuset.effective_cpus  4-7
/sys/fs/cgroup/cpuset/docker/c9049b1b3f2c2bbfebc6ec8e2f3aa280a9ab23b86322452a54575b1cba3ae179/cpuset.effective_cpus  8-11

[sandbox@nomad-dev nomad]$ docker ps --format '{{.ID}}' | xargs docker inspect | egrep '(CpusetCpus|NOMAD_CPU_LIMIT|Id)'
        "Id": "c9049b1b3f2c2bbfebc6ec8e2f3aa280a9ab23b86322452a54575b1cba3ae179",
            "CpusetCpus": "8,9,10,11",
                "NOMAD_CPU_LIMIT=8980",
        "Id": "6e06a9ed1631758827aa4136690818d04c050c55559fb9f74b780b6ff8d33728",
            "CpusetCpus": "4,5,6,7",
                "NOMAD_CPU_LIMIT=8980",
        "Id": "a52a46cfa489fe815fcbd11019c391d7fe771b878f77ddb3c993ab5cd98d8084",
            "CpusetCpus": "4,5,6,7",
                "NOMAD_CPU_LIMIT=8980",
        "Id": "ec9220fbe2d0edef8bd9f67cabd7da226f32d346f65d196463bc4d6701864213",
            "CpusetCpus": "0,1,2,3",
                "NOMAD_CPU_LIMIT=8980",
Fixes a bug in the BinPackIterator.Next method, where the scheduler would only
take into account the cpusets of the tasks in the largest lifecycle. This could
result in overlapping cgroup cpusets. By using Allocation.ReservedCores, the
scheduler uses the same cpuset view as Partition.Reserve. Logging was also added
so that future regressions can be caught without manually inspecting cgroup files.
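
To make the intent concrete, here is a minimal sketch of the idea in Go. The Allocation/AllocatedResources shapes loosely follow the nomad/structs package, but usedCores is a hypothetical helper for illustration, not the actual PR diff:

```go
package sketch

import "github.com/hashicorp/nomad/nomad/structs"

// usedCores unions the reserved cores of every task in every allocation,
// regardless of lifecycle hook: a prestart task's cores must stay reserved
// even while that task is not running.
func usedCores(allocs []*structs.Allocation) map[uint16]struct{} {
	used := make(map[uint16]struct{})
	for _, alloc := range allocs {
		if alloc.AllocatedResources == nil {
			continue
		}
		for _, taskRes := range alloc.AllocatedResources.Tasks {
			for _, core := range taskRes.Cpu.ReservedCores {
				used[core] = struct{}{}
			}
		}
	}
	return used
}
```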

@mvegter force-pushed the mvegter-fix-missing-exclusion-of-cpusets branch from 60f051d to e598fdc on October 27, 2024 14:19
@mvegter marked this pull request as ready for review on October 27, 2024 14:46
@mvegter force-pushed the branch from e598fdc to a1aecbc on October 27, 2024 14:56
@mvegter force-pushed the branch from a1aecbc to 498ff4c on October 28, 2024 12:04
@mvegter force-pushed the branch from 498ff4c to fe008c9 on October 28, 2024 12:47
@mvegter force-pushed the branch from fe008c9 to bdffd59 on November 8, 2024 15:54
@mvegter (Contributor, Author) commented Nov 8, 2024

Hey @tgross @jrasell, if you have some time available in the next few business days I would appreciate a review on this PR; curious to hear your feedback!

@tgross (Member) commented Nov 8, 2024

@mvegter just a heads-up that we've seen this, but it's going to take until next week to review... scheduler changes are particularly nasty and easy to get wrong, so it's extra-careful going.

@tgross (Member) left a comment:

Hi @mvegter! Thanks for this PR! I had a chat with some folks internally and we think you're close in terms of the reasoning behind this, but the implementation leaves room for additional bugs (including in quotas).

First, let's look at what the ComparableResources type is supposed to be doing with lifecycles. Suppose I have an allocation with a main task, a post-start sidecar, and a prestart non-sidecar. To determine how much memory that allocation needs, I need to add up the memory usage required by all the tasks that are running at the same time. In this example, I don't need to include memory for the prestart non-sidecar because that memory will be freed by the time the other two tasks start.

[figure: memory requirements across lifecycle phases]

But this only works because memory is fungible! We don't care which specific bytes of memory are used, as they have no identity (giant asterisk on that around NUMA, of course, but it's not relevant at the moment).

Let's rework the example for cores. Here, we can see that we need to reserve cores 0-1 for the prestart task even though it's not going to be running (partly because Nomad doesn't coordinate between tasks of different allocations, but also because we could potentially restart the task). Cores have their own identities, so we need to ensure, as you've noted, that all the cores are reserved regardless of lifecycle.

[figure: core reservations across lifecycle phases]
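
To make the distinction concrete, a tiny worked example with hypothetical numbers (Go 1.21+ for the max builtin; illustrative only): the memory requirement is the maximum across concurrently running phases, while the core requirement is the union across all tasks.

```go
package main

import "fmt"

// Hypothetical allocation: a prestart non-sidecar, a main task, and a
// poststart sidecar.
func main() {
	// Memory is fungible: phase 1 runs the prestart task alone, phase 2
	// runs main + sidecar, so the allocation needs only the larger phase.
	prestartMB, mainMB, sidecarMB := 256, 512, 128
	needMB := max(prestartMB, mainMB+sidecarMB) // 640

	// Cores have identity: every task's cores stay reserved for the whole
	// lifespan of the allocation, so we take the union, not a phase maximum.
	prestart, mainTask, sidecar := []int{0, 1}, []int{2, 3}, []int{4, 5}
	reserved := append(append(append([]int{}, prestart...), mainTask...), sidecar...)

	fmt.Printf("memory: %d MB, cores: %v\n", needMB, reserved) // 640 MB, [0 1 2 3 4 5]
}
```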

Where the implementation in this PR falls apart a little bit is that we're creating a new function to produce the set of cores, rather than fixing the existing logic in (AllocatedCpuResources).Add and its caller (AllocatedTaskResources).Add. So what we should do is move the new logic you wrote into those functions, so that it can be used in the plan applier and quota checker. You can see that in (AllocatedTaskResources).Add we have logic for handling resources with unique identities (like networks), so there's good precedent for that there.
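
For reference, a minimal sketch of what lifting the logic into (AllocatedCpuResources).Add could look like; the struct shape is a simplified copy of the one in nomad/structs, and the body is illustrative rather than the final upstream implementation:

```go
package sketch

import "sort"

// Simplified copy of the nomad/structs shape, for illustration only.
type AllocatedCpuResources struct {
	CpuShares     int64
	ReservedCores []uint16
}

// Add merges delta into a. CpuShares is fungible and is simply summed;
// ReservedCores carry identity, so they are unioned without duplicates.
func (a *AllocatedCpuResources) Add(delta *AllocatedCpuResources) {
	if delta == nil {
		return
	}
	a.CpuShares += delta.CpuShares

	seen := make(map[uint16]struct{}, len(a.ReservedCores))
	for _, c := range a.ReservedCores {
		seen[c] = struct{}{}
	}
	for _, c := range delta.ReservedCores {
		if _, ok := seen[c]; !ok {
			seen[c] = struct{}{}
			a.ReservedCores = append(a.ReservedCores, c)
		}
	}
	sort.Slice(a.ReservedCores, func(i, j int) bool {
		return a.ReservedCores[i] < a.ReservedCores[j]
	})
}
```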

@mvegter force-pushed the mvegter-fix-missing-exclusion-of-cpusets branch from bdffd59 to 2b388b4 on November 13, 2024 16:20
scheduler: take all assigned cpu cores into account instead of only those part of the largest lifecycle

Fixes a bug in the AllocatedResources.Comparable method, where the scheduler
would only take into account the cpusets of the tasks in the largest lifecycle.
This could result in overlapping cgroup cpusets. Now we make the distinction
between reserved and fungible resources throughout the lifespan of the alloc.
In addition, logging was added so that future regressions can be caught without
manual inspection of cgroup files.
@mvegter (Contributor, Author) commented Nov 13, 2024

Hey @tgross, thank you for the extensive explanation! I have made the changes within the Comparable() method, as doing this in Add would otherwise require quite some data-structure changes to keep track of per-CPU MHz and to add/subtract/max correctly. Similar to Devices/Networks, we always include the AllocatedCpuResources reserved cores as part of Flattened and only look at cpu-share based values in Add/Max. This way we correctly reflect the difference between the share and reserve slices for CPU while not touching the memory logic. Let me know what you think. The tests from the initial commit still pass; manually running the reproducer still fails on master but works correctly with these changes.
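
A sketch of that approach (simplified shapes; the real types live in nomad/structs and the actual diff differs): reserved cores from every task are always appended into the flattened view, the way Networks and Devices are, while the fungible share values keep flowing through the lifecycle-aware Add/Max path.

```go
package sketch

// Simplified shapes for illustration only.
type AllocatedCpuResources struct {
	CpuShares     int64
	ReservedCores []uint16
}

type AllocatedTaskResources struct {
	Cpu AllocatedCpuResources
}

type ComparableResources struct {
	Flattened AllocatedTaskResources
}

// flatten builds the comparable view. Reserved cores are appended for every
// task regardless of lifecycle (like Networks/Devices); the fungible
// CpuShares would still go through the lifecycle-aware Add/Max logic,
// elided here.
func flatten(tasks map[string]*AllocatedTaskResources) *ComparableResources {
	c := &ComparableResources{}
	for _, tr := range tasks {
		c.Flattened.Cpu.ReservedCores = append(
			c.Flattened.Cpu.ReservedCores, tr.Cpu.ReservedCores...)
	}
	return c
}
```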
