
driver/docker: Fix container CPU stats collection #24768

Open: jrasell wants to merge 2 commits into main

Conversation

@jrasell jrasell (Member) commented Jan 2, 2025

Description

The recent change to collection via a "one-shot" Docker API call did not update the stream boolean argument. This results in the PreCPUStats values being zero, which breaks the CPU calculations that rely on this data. The base fix is to update the passed boolean parameter to match the desired non-streaming behaviour. The non-streaming API call correctly returns the PreCPUStats data, as shown in the added unit test and the soak-testing details below.
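
For illustration, here is a minimal sketch of why the stream flag matters, written against the upstream Docker Go SDK (github.com/docker/docker/client). The stats type name varies between SDK versions, the container ID is a placeholder, and the percentage formula shown is the standard Docker-CLI-style delta calculation rather than Nomad's exact driver code:

// Sketch only: a one-shot (non-streaming) stats request. With stream=true the
// first sample arrives with PreCPUStats zeroed, which breaks the delta-based
// CPU calculation below.
package main

import (
	"context"
	"encoding/json"
	"fmt"

	"github.com/docker/docker/api/types/container" // older SDK versions expose this type as types.StatsJSON
	"github.com/docker/docker/client"
)

func main() {
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		panic(err)
	}

	// stream=false asks the daemon for a single stats sample that already
	// includes both the current (CPUStats) and previous (PreCPUStats) readings.
	resp, err := cli.ContainerStats(context.Background(), "my-container-id", false)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var stats container.StatsResponse
	if err := json.NewDecoder(resp.Body).Decode(&stats); err != nil {
		panic(err)
	}

	// CPU percentage is derived from the delta between the two samples, so a
	// zeroed PreCPUStats produces nonsense values.
	cpuDelta := float64(stats.CPUStats.CPUUsage.TotalUsage - stats.PreCPUStats.CPUUsage.TotalUsage)
	sysDelta := float64(stats.CPUStats.SystemUsage - stats.PreCPUStats.SystemUsage)
	if sysDelta > 0 {
		fmt.Printf("cpu %%: %.2f\n", (cpuDelta/sysDelta)*float64(stats.CPUStats.OnlineCPUs)*100)
	}
}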

The most recent change also modified the behaviour of the collectStats goroutine so that any error encountered results in the routine exiting. If the error was transient, the container continues to run, but no stats are collected until the task is stopped and replaced. This PR reverts that behaviour so that an error encountered during a stats collection run is logged and the collection process continues, with a backoff applied.
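
For illustration, a minimal sketch of the loop shape this describes (log the error, back off, keep collecting). The interval and backoff constants, the collect callback, and the package name are hypothetical stand-ins rather than the driver's actual statsCollectorBackoff values or collectDockerStats method:

// Sketch only: shows the "log, back off, keep collecting" behaviour described
// above, using hypothetical constants and a generic collect callback.
package docker

import (
	"context"
	"log"
	"time"
)

const (
	collectionInterval = 1 * time.Second
	backoffBaseline    = 250 * time.Millisecond
	backoffLimit       = 10 * time.Second
)

func collectStatsLoop(ctx context.Context, collect func(context.Context) error) {
	ticker := time.NewTicker(collectionInterval)
	defer ticker.Stop()

	retry := 0
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if err := collect(ctx); err != nil {
				// A transient stats failure should not kill the collector
				// while the container itself is healthy: log it, slow down,
				// and keep trying.
				log.Printf("error collecting stats from container: %v", err)
				backoff := backoffBaseline << retry
				if backoff > backoffLimit {
					backoff = backoffLimit
				} else {
					retry++
				}
				ticker.Reset(backoff)
				continue
			}
			// Success: drop back to the normal collection interval.
			if retry > 0 {
				retry = 0
				ticker.Reset(collectionInterval)
			}
		}
	}
}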

Testing & Reproduction steps

I used this lab to run a 1-server, 1-client cluster; Nomad was running the modified code from this PR. I then ran a Prometheus/Grafana job and the example Redis job, with Prometheus scraping the local Nomad client every second.

promana.nomad.hcl
job "promana" {
  group "promana" {
    network {
      mode = "bridge"
      port "prometheus" {
        to = 9090
      }
      port "grafana" {
        to = 3000
      }
    }

    service {
      name     = "prometheus-server"
      port     = "prometheus"
      provider = "nomad"
    }
    service {
      name     = "grafana-server"
      port     = "grafana"
      provider = "nomad"
    }

    task "prometheus" {
      driver = "docker"
      config {
        image = "prom/prometheus:v3.0.1"
        ports = ["prometheus"]
        args  = [
          "--config.file=${NOMAD_TASK_DIR}/config/prometheus.yml",
          "--storage.tsdb.path=/prometheus",
          "--web.listen-address=0.0.0.0:9090",
          "--web.console.libraries=/usr/share/prometheus/console_libraries",
          "--web.console.templates=/usr/share/prometheus/consoles",
        ]

        volumes = [
          "local/config:/etc/prometheus/config",
        ]
      }

      template {
        data = <<EOH
---
global:
  scrape_interval:     1s
  evaluation_interval: 1s

scrape_configs:
  - job_name: "nomad_server"
    metrics_path: "/v1/metrics"
    scheme: "http"
    params:
      format:
        - "prometheus"
    static_configs:
      - targets:
        - {{ env "attr.unique.network.ip-address" }}:4646
EOH
        change_mode   = "signal"
        change_signal = "SIGHUP"
        destination   = "local/config/prometheus.yml"
      }

      resources {
        cpu    = 500
        memory = 512
      }
    }

    task "grafana" {
      driver = "docker"

      config {
        image   = "grafana/grafana:11.4.0"
        volumes = [
          "local/datasources:/etc/grafana/provisioning/datasources",
        ]
      }

      template {
        data = <<EOH
apiVersion: 1
datasources:
- name: Prometheus
  type: prometheus
  access: proxy
  url: http://0.0.0.0:9090
  isDefault: true
  version: 1
  editable: false
EOH

        destination = "local/datasources/datasources.yaml"
      }

      resources {
        cpu    = 200
        memory = 256
      }
    }
  }
}
example.nomad.hcl
job "example" {

  group "cache" {
    network {
      port "db" {
        to = 6379
      }
    }

    task "redis" {
      driver = "docker"

      config {
        image          = "redis:7"
        ports          = ["db"]
        auth_soft_fail = true
      }

      identity {
        env  = true
        file = true
      }

      resources {
        cpu    = 500
        memory = 256
      }
    }
  }
}

The cluster and jobs were left to run for 6 hours before taking a look at the available metrics, including the previously affected CPU percentage and the client goroutine count.

Screenshots (three images, omitted here)

Links

Closes: #24740
Internal: https://hashicorp.atlassian.net/browse/NET-11922
Historical:

Contributor Checklist

  • Changelog Entry If this PR changes user-facing behavior, please generate and add a
    changelog entry using the make cl command.
  • Testing Please add tests to cover any new functionality or to demonstrate bug fixes and
    ensure regressions will be caught.
  • Documentation If the change impacts user-facing functionality such as the CLI, API, UI,
    and job configuration, please update the Nomad website documentation to reflect this. Refer to
    the website README for docs guidelines. Please also consider whether the
    change requires notes within the upgrade guide.

Reviewer Checklist

  • Backport Labels Please add the correct backport labels as described by the internal
    backporting document.
  • Commit Type Ensure the correct merge method is selected which should be "squash and merge"
    in the majority of situations. The main exceptions are long-lived feature branches or merges where
    history should be preserved.
  • Enterprise PRs If this is an enterprise only PR, please add any required changelog entry
    within the public repository.

@jrasell jrasell added the backport/1.9.x backport to 1.9.x release line label Jan 2, 2025
@jrasell jrasell marked this pull request as ready for review January 2, 2025 15:02
@jrasell jrasell requested review from a team as code owners January 2, 2025 15:02
h.logger.Debug("error collecting stats from container", "error", err)
return
stats, err := h.collectDockerStats(ctx)
switch err {
A reviewer (Member) commented:
Is there a reason to make this error check a switch instead of the normal if err != nil? Are we planning on having custom errors?

jrasell (Member, author) replied:

It's my personal preference when an if/else has more than a couple of lines, as I find it easier to read and the functionality is identical.
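
For illustration, a minimal sketch showing that the two forms behave identically for a plain nil check; the function names and comments are placeholders, not driver code:

// Illustrative only: a switch on err with a nil case is equivalent to the
// usual if err != nil check; the switch just gives each branch its own block.
package docker

func handleWithSwitch(err error) {
	switch err {
	case nil:
		// record the collected stats
	default:
		// log the error and back off
	}
}

func handleWithIf(err error) {
	if err != nil {
		// log the error and back off
		return
	}
	// record the collected stats
}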

default:
h.logger.Error("error collecting stats from container", "error", err)
ticker.Reset(helper.Backoff(statsCollectorBackoffBaseline, statsCollectorBackoffLimit, retry))
retry++
A reviewer (Member) commented:

I don't see any circuit breaker. If the error never stops, we will be logging it forever, but it won't stop the driver; is this the intended behaviour?

jrasell (Member, author) replied:

Yes, this is the intended behaviour and was the prior behaviour. If the container itself is running OK, passing checks, and behaving as expected, but the stats API is misbehaving, I think keeping the container up is the correct priority.

In the future, if we wanted to have the driver stop the container based on the stats API failure, we could plumb this through, but it would need a little work.

In practice, I don't expect the Docker stats API to consistently return errors when everything else is working. The backoff for the most part helps handle transient failures which recover immediately afterwards.

@pkazmierczak pkazmierczak (Contributor) left a comment:

LGTM!

@tgross tgross (Member) left a comment:

LGTM! Thanks for the thorough end-to-end testing on this one!

Labels
backport/1.9.x backport to 1.9.x release line

Successfully merging this pull request may close these issues.

Docker driver: ContainerStats with stream=true returns empty PreCPUStats causing incorrect CPU metrics