
driver/docker: Fix container CPU stats collection #24768

Open: jrasell wants to merge 2 commits into main

Conversation

@jrasell jrasell (Member) commented Jan 2, 2025

Description

The recent change to collection via a "one-shot" Docker API call did not update the stream boolean argument. This results in the PreCPUStats values being zero, which breaks the CPU calculations that rely on this data. The base fix is to update the passed boolean parameter to match the desired non-streaming behaviour. The non-streaming API call correctly returns the PreCPUStats data, as shown in the added unit test and the soak-testing details below.
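
For illustration, here is a minimal sketch of why the stream flag matters, written against the upstream Docker Go SDK (github.com/docker/docker/client). The stats type name varies between SDK versions, the container ID is a placeholder, and the percentage formula shown is the standard Docker-CLI-style delta calculation rather than Nomad's exact driver code:

// Sketch only: a one-shot (non-streaming) stats request. With stream=true the
// first sample arrives with PreCPUStats zeroed, which breaks the delta-based
// CPU calculation below.
package main

import (
	"context"
	"encoding/json"
	"fmt"

	"github.com/docker/docker/api/types/container" // older SDK versions expose this type as types.StatsJSON
	"github.com/docker/docker/client"
)

func main() {
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		panic(err)
	}

	// stream=false asks the daemon for a single stats sample that already
	// includes both the current (CPUStats) and previous (PreCPUStats) readings.
	resp, err := cli.ContainerStats(context.Background(), "my-container-id", false)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var stats container.StatsResponse
	if err := json.NewDecoder(resp.Body).Decode(&stats); err != nil {
		panic(err)
	}

	// CPU percentage is derived from the delta between the two samples, so a
	// zeroed PreCPUStats produces nonsense values.
	cpuDelta := float64(stats.CPUStats.CPUUsage.TotalUsage - stats.PreCPUStats.CPUUsage.TotalUsage)
	sysDelta := float64(stats.CPUStats.SystemUsage - stats.PreCPUStats.SystemUsage)
	if sysDelta > 0 {
		fmt.Printf("cpu %%: %.2f\n", (cpuDelta/sysDelta)*float64(stats.CPUStats.OnlineCPUs)*100)
	}
}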

The most recent change also modified the behaviour of the collectStats goroutine so that any error encountered results in the routine exiting. If the error was transient, the container continues to run, but no stats are collected until the task is stopped and replaced. This PR reverts that behaviour so that an error encountered during a stats collection run is logged and the collection process continues, with a backoff applied.
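
For illustration, a minimal sketch of the loop shape this describes (log the error, back off, keep collecting). The interval and backoff constants, the collect callback, and the package name are hypothetical stand-ins rather than the driver's actual statsCollectorBackoff values or collectDockerStats method:

// Sketch only: shows the "log, back off, keep collecting" behaviour described
// above, using hypothetical constants and a generic collect callback.
package docker

import (
	"context"
	"log"
	"time"
)

const (
	collectionInterval = 1 * time.Second
	backoffBaseline    = 250 * time.Millisecond
	backoffLimit       = 10 * time.Second
)

func collectStatsLoop(ctx context.Context, collect func(context.Context) error) {
	ticker := time.NewTicker(collectionInterval)
	defer ticker.Stop()

	retry := 0
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if err := collect(ctx); err != nil {
				// A transient stats failure should not kill the collector
				// while the container itself is healthy: log it, slow down,
				// and keep trying.
				log.Printf("error collecting stats from container: %v", err)
				backoff := backoffBaseline << retry
				if backoff > backoffLimit {
					backoff = backoffLimit
				} else {
					retry++
				}
				ticker.Reset(backoff)
				continue
			}
			// Success: drop back to the normal collection interval.
			if retry > 0 {
				retry = 0
				ticker.Reset(collectionInterval)
			}
		}
	}
}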

Testing & Reproduction steps

I used this lab to run a 1-server, 1-client cluster; Nomad was running the modified code from this PR. I then ran a Prometheus/Grafana job and the example Redis job, with Prometheus scraping the local Nomad client every second.

promana.nomad.hcl
job "promana" {
  group "promana" {
    network {
      mode = "bridge"
      port "prometheus" {
        to = 9090
      }
      port "grafana" {
        to = 3000
      }
    }

    service {
      name     = "prometheus-server"
      port     = "prometheus"
      provider = "nomad"
    }
    service {
      name     = "grafana-server"
      port     = "grafana"
      provider = "nomad"
    }

    task "prometheus" {
      driver = "docker"
      config {
        image = "prom/prometheus:v3.0.1"
        ports = ["prometheus"]
        args  = [
          "--config.file=${NOMAD_TASK_DIR}/config/prometheus.yml",
          "--storage.tsdb.path=/prometheus",
          "--web.listen-address=0.0.0.0:9090",
          "--web.console.libraries=/usr/share/prometheus/console_libraries",
          "--web.console.templates=/usr/share/prometheus/consoles",
        ]

        volumes = [
          "local/config:/etc/prometheus/config",
        ]
      }

      template {
        data = <<EOH
---
global:
  scrape_interval:     1s
  evaluation_interval: 1s

scrape_configs:
  - job_name: "nomad_server"
    metrics_path: "/v1/metrics"
    scheme: "http"
    params:
      format:
        - "prometheus"
    static_configs:
      - targets:
        - {{ env "attr.unique.network.ip-address" }}:4646
EOH
        change_mode   = "signal"
        change_signal = "SIGHUP"
        destination   = "local/config/prometheus.yml"
      }

      resources {
        cpu    = 500
        memory = 512
      }
    }

    task "grafana" {
      driver = "docker"

      config {
        image   = "grafana/grafana:11.4.0"
        volumes = [
          "local/datasources:/etc/grafana/provisioning/datasources",
        ]
      }

      template {
        data = <<EOH
apiVersion: 1
datasources:
- name: Prometheus
  type: prometheus
  access: proxy
  url: http://0.0.0.0:9090
  isDefault: true
  version: 1
  editable: false
EOH

        destination = "local/datasources/datasources.yaml"
      }

      resources {
        cpu    = 200
        memory = 256
      }
    }
  }
}
example.nomad.hcl
job "example" {

  group "cache" {
    network {
      port "db" {
        to = 6379
      }
    }

    task "redis" {
      driver = "docker"

      config {
        image          = "redis:7"
        ports          = ["db"]
        auth_soft_fail = true
      }

      identity {
        env  = true
        file = true
      }

      resources {
        cpu    = 500
        memory = 256
      }
    }
  }
}

The cluster and jobs were left to run for 6 hours before taking a look at the available metrics, including the previously affected CPU percentage and the client goroutine count.

Screenshots (three images, omitted here)

Links

Closes: #24740
Internal: https://hashicorp.atlassian.net/browse/NET-11922
Historical:

Contributor Checklist

  • Changelog Entry If this PR changes user-facing behavior, please generate and add a
    changelog entry using the make cl command.
  • Testing Please add tests to cover any new functionality or to demonstrate bug fixes and
    ensure regressions will be caught.
  • Documentation If the change impacts user-facing functionality such as the CLI, API, UI,
    and job configuration, please update the Nomad website documentation to reflect this. Refer to
    the website README for docs guidelines. Please also consider whether the
    change requires notes within the upgrade guide.

Reviewer Checklist

  • Backport Labels Please add the correct backport labels as described by the internal
    backporting document.
  • Commit Type Ensure the correct merge method is selected which should be "squash and merge"
    in the majority of situations. The main exceptions are long-lived feature branches or merges where
    history should be preserved.
  • Enterprise PRs If this is an enterprise only PR, please add any required changelog entry
    within the public repository.

@jrasell jrasell added the backport/1.9.x backport to 1.9.x release line label Jan 2, 2025
@jrasell jrasell marked this pull request as ready for review January 2, 2025 15:02
@jrasell jrasell requested review from a team as code owners January 2, 2025 15:02
h.logger.Debug("error collecting stats from container", "error", err)
return
stats, err := h.collectDockerStats(ctx)
switch err {
A reviewer (Member) commented:
Is there a reason to make this error check a switch instead of the normal if err != nil? Are we planning on having custom errors?

jrasell (Member, author) replied:

It's my personal preference when an if/else has more than a couple of lines, as I find it easier to read and the functionality is identical.
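
For illustration, a minimal sketch showing that the two forms behave identically for a plain nil check; the function names and comments are placeholders, not driver code:

// Illustrative only: a switch on err with a nil case is equivalent to the
// usual if err != nil check; the switch just gives each branch its own block.
package docker

func handleWithSwitch(err error) {
	switch err {
	case nil:
		// record the collected stats
	default:
		// log the error and back off
	}
}

func handleWithIf(err error) {
	if err != nil {
		// log the error and back off
		return
	}
	// record the collected stats
}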

default:
h.logger.Error("error collecting stats from container", "error", err)
ticker.Reset(helper.Backoff(statsCollectorBackoffBaseline, statsCollectorBackoffLimit, retry))
retry++
A reviewer (Member) commented:

I don't see any circuit breaker. If the error never stops, we will be logging it forever, but it won't stop the driver; is this the intended behaviour?

jrasell (Member, author) replied:

Yes, this is the intended behaviour and was the prior behaviour. If the container itself is running OK, passing checks, and behaving as expected, but the stats API is misbehaving, I think keeping the container up is the correct priority.

In the future, if we wanted to have the driver stop the container based on the stats API failure, we could plumb this through, but it would need a little work.

In practice, I don't expect the Docker stats API to consistently return errors when everything else is working. The backoff for the most part helps handle transient failures which recover immediately afterwards.

@pkazmierczak pkazmierczak (Contributor) left a comment:

LGTM!

@tgross tgross (Member) left a comment:

LGTM! Thanks for the thorough end-to-end testing on this one!

Labels
backport/1.9.x backport to 1.9.x release line

Successfully merging this pull request may close these issues.

Docker driver: ContainerStats with stream=true returns empty PreCPUStats causing incorrect CPU metrics