driver/docker: Fix container CPU stats collection #24768
base: main
Conversation
h.logger.Debug("error collecting stats from container", "error", err)
return
stats, err := h.collectDockerStats(ctx)
switch err {
Is there a reason to make this error check a switch instead of the normal if err != nil? Are we planning on having custom errors?
It's my personal preference when an if/else has more than a couple of lines, as I find it more readable and the functionality is identical.
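For illustration only, here is a minimal, runnable sketch of the two equivalent shapes being discussed; collectOnce is a hypothetical stand-in, not code from this PR.

```go
package main

import (
	"errors"
	"fmt"
)

// collectOnce is a hypothetical stand-in for a stats collection call; it only
// exists to demonstrate the error-handling shapes below.
func collectOnce() (int, error) {
	return 0, errors.New("stats endpoint unavailable")
}

func main() {
	stats, err := collectOnce()

	// Plain if/else form.
	if err != nil {
		fmt.Println("error collecting stats:", err)
	} else {
		fmt.Println("stats:", stats)
	}

	// Equivalent switch form: each outcome reads as a parallel case, which is
	// the readability argument above; the behaviour is identical.
	switch err {
	case nil:
		fmt.Println("stats:", stats)
	default:
		fmt.Println("error collecting stats:", err)
	}
}
```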
default:
	h.logger.Error("error collecting stats from container", "error", err)
	ticker.Reset(helper.Backoff(statsCollectorBackoffBaseline, statsCollectorBackoffLimit, retry))
	retry++
I don't see any circuit breaker. If the error never stops, we will be logging the error forever, but it won't stop the driver. Is this the intended behaviour?
Yes this is the intended behaviour and was the prior behaviour. If the container itself is running OK, passing checks, and behaving as expected but the stats API is misbehaving, I think the priority of keeping the container up is the correct choice.
In the future, if we wanted to have the driver stop the container based on the stats API failure, we could plumb this through, but it would need a little work.
In practice, I don't expect the Docker stats API to consistently return errors when everything else is working. The backoff for the most part handles transient failures which recover immediately after.
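To make the described retry behaviour concrete, here is a rough, self-contained sketch of a collection loop that logs and backs off rather than exiting. The backoff and collect functions are hypothetical stand-ins for helper.Backoff and collectDockerStats, and the intervals are illustrative, not the driver's actual values.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// backoff is a stand-in for helper.Backoff: grow from a baseline towards a
// limit as the retry count increases.
func backoff(base, limit time.Duration, retry uint64) time.Duration {
	d := base << retry
	if d <= 0 || d > limit {
		return limit
	}
	return d
}

// collect is a stand-in for collectDockerStats and always fails here so the
// backoff path is exercised.
func collect(ctx context.Context) error {
	return errors.New("transient stats failure")
}

func run(ctx context.Context, interval time.Duration) {
	var retry uint64
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			switch err := collect(ctx); err {
			case nil:
				// Collection succeeded: return to the normal cadence.
				ticker.Reset(interval)
				retry = 0
			default:
				// Log and keep going; the container stays up and only the
				// collection cadence backs off.
				fmt.Println("error collecting stats from container:", err)
				ticker.Reset(backoff(time.Second, 30*time.Second, retry))
				retry++
			}
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	run(ctx, time.Second)
}
```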
LGTM!
LGTM! Thanks for the thorough end-to-end testing on this one!
Description
The recent change to collection via a "one-shot" Docker API call did not update the stream boolean argument. This results in the PreCPUStats values being zero and therefore breaking the CPU calculations which rely on this data. The base fix is to update the passed boolean parameter to match the desired non-streaming behaviour. The non-streaming API call correctly returns the PreCPUStats data, which can be seen in the added unit test and the soak testing details below.

The most recent change also modified the behaviour of the collectStats goroutine, so that any error encountered results in the routine exiting. In the event this was a transient error, the container will continue to run; however, no stats will be collected until the task is stopped and replaced. This PR reverts that behaviour, so that an error encountered during a stats collection run results in the error being logged but the collection process continuing, with a backoff applied.
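For context, here is a rough sketch, outside of Nomad, of what a one-shot (stream=false) Docker stats read and the resulting CPU percentage calculation look like with the Docker Go SDK. The statsSample struct only mirrors the JSON fields needed here, the container ID is a placeholder, and this is not the driver's collectDockerStats implementation.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"

	"github.com/docker/docker/client"
)

// Minimal mirror of the Docker stats JSON fields needed for the calculation.
type cpuUsage struct {
	TotalUsage uint64 `json:"total_usage"`
}

type cpuStats struct {
	CPUUsage    cpuUsage `json:"cpu_usage"`
	SystemUsage uint64   `json:"system_cpu_usage"`
	OnlineCPUs  uint32   `json:"online_cpus"`
}

type statsSample struct {
	CPUStats    cpuStats `json:"cpu_stats"`
	PreCPUStats cpuStats `json:"precpu_stats"`
}

// cpuPercent follows the standard Docker stats formula, which only works when
// PreCPUStats is populated.
func cpuPercent(s statsSample) float64 {
	cpuDelta := float64(s.CPUStats.CPUUsage.TotalUsage) - float64(s.PreCPUStats.CPUUsage.TotalUsage)
	systemDelta := float64(s.CPUStats.SystemUsage) - float64(s.PreCPUStats.SystemUsage)
	if cpuDelta <= 0 || systemDelta <= 0 {
		return 0
	}
	return cpuDelta / systemDelta * float64(s.CPUStats.OnlineCPUs) * 100
}

func main() {
	ctx := context.Background()
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		panic(err)
	}

	containerID := "my-container" // placeholder container ID

	// stream=false asks the daemon for a single pre-sampled result, which is
	// what populates precpu_stats; reading a single frame from a stream
	// instead leaves precpu_stats zeroed, which is the bug described above.
	resp, err := cli.ContainerStats(ctx, containerID, false)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var sample statsSample
	if err := json.NewDecoder(resp.Body).Decode(&sample); err != nil {
		panic(err)
	}
	fmt.Printf("container CPU: %.2f%%\n", cpuPercent(sample))
}
```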
Testing & Reproduction steps

I used this lab to run a 1 server, 1 client cluster; Nomad was running the modified code from this PR. I then ran a Prometheus/Grafana job and the example Redis job, with Prometheus scraping the local Nomad client every 1 second.
promana.nomad.hcl
example.nomad.hcl
The cluster and jobs were left to run for 6 hours before taking a look at the available metrics, including the previously affected CPU percentage and the client goroutine count.
screenshots
Links
Closes: #24740
Internal: https://hashicorp.atlassian.net/browse/NET-11922
Historical:
Contributor Checklist
- Add a changelog entry using the make cl command.
- Add tests to ensure regressions will be caught.
- If the change impacts user-facing functionality and job configuration, please update the Nomad website documentation to reflect this. Refer to the website README for docs guidelines. Please also consider whether the change requires notes within the upgrade guide.
Reviewer Checklist
- Add the correct backport labels as described by the backporting document.
- Select the correct merge method; "squash and merge" is correct in the majority of situations. The main exceptions are long-lived feature branches or merges where history should be preserved.
- Ensure any Enterprise-only change has its required changelog entry within the public repository.