Metrics gap due to "Treat failure to collect metric as fatal" #101

Open
pracucci opened this issue Jun 30, 2020 · 4 comments

@pracucci (Contributor)

We're experiencing small gaps in the metrics exported by the stackdriver exporter and, after some investigation, it turned out to be caused by the stackdriver exporter exiting every time an API request times out:

level=error ts=2020-06-30T12:54:18.862Z caller=monitoring_collector.go:194 msg="Error while getting Google Stackdriver Monitoring metrics" err="Get \"REDACTED\": net/http: request canceled (Client.Timeout exceeded while awaiting headers)"

This behavioural change was introduced in #83 and I'm wondering how we could improve it. A few ideas/options:

  1. Revert PR #83 ("Treat failure to collect metric as fatal")
  2. Exit with a fatal error only for specific errors (in this case I would need to better understand the specific error returned by the #66 "Recover from expired/replaced credentials" use case)
  3. Add a CLI flag to enable/disable the "exit on metric collection error" behaviour (a rough sketch of what such a flag could look like is below)

Any other ideas to fix it?
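For option 3, here's a minimal sketch of what such a flag could look like; the flag name, its default, and the helper function are hypothetical illustrations, not part of the exporter today:

```go
package main

import (
	"log"

	kingpin "gopkg.in/alecthomas/kingpin.v2"
)

// Hypothetical flag: the name, default, and wiring below are assumptions
// used to illustrate option 3, not the exporter's actual CLI.
var collectErrorsFatal = kingpin.Flag(
	"monitoring.collect-errors-fatal",
	"Exit when a Stackdriver API request fails instead of logging the error and continuing.",
).Default("false").Bool()

// handleCollectError shows how the flag could gate the old Fatal behaviour.
func handleCollectError(err error) {
	if err == nil {
		return
	}
	if *collectErrorsFatal {
		log.Fatalf("Error while getting Google Stackdriver Monitoring metrics: %v", err)
	}
	log.Printf("Error while getting Google Stackdriver Monitoring metrics: %v", err)
}

func main() {
	kingpin.Parse()
	handleCollectError(nil) // in the real exporter this would sit in the collector's error path
}
```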

Example of gap:

[screenshot: Screen Shot 2020-06-30 at 17 36 52]

@SuperQ (Contributor) commented Jul 1, 2020

Yes, I had a conversation with @gouthamve about this. We should probably extend the collector to have a stackdriver_up metric. I'm thinking the best option for now is to simply revert the Fatal.
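Not the exporter's actual code, but a minimal sketch of how a stackdriver_up gauge could be exposed from a collector, so that a failed Stackdriver scrape becomes an alertable 0 instead of a crash (the collector and function names here are illustrative):

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// stackdriver_up would be 1 when the last Stackdriver API scrape succeeded and 0 otherwise.
var upDesc = prometheus.NewDesc(
	"stackdriver_up",
	"Whether the last scrape of the Stackdriver API succeeded.",
	nil, nil,
)

type monitoringCollector struct{}

func (c *monitoringCollector) Describe(ch chan<- *prometheus.Desc) { ch <- upDesc }

func (c *monitoringCollector) Collect(ch chan<- prometheus.Metric) {
	up := 1.0
	if err := c.collectMetrics(ch); err != nil {
		up = 0 // log the error and keep serving, instead of calling Fatal
	}
	ch <- prometheus.MustNewConstMetric(upDesc, prometheus.GaugeValue, up)
}

// collectMetrics stands in for the exporter's real Stackdriver API calls.
func (c *monitoringCollector) collectMetrics(ch chan<- prometheus.Metric) error { return nil }

func main() {
	prometheus.MustRegister(&monitoringCollector{})
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":9255", nil)
}
```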

@SuperQ (Contributor) commented Jul 3, 2020

Ping @cpick. It turns out the error-handling change introduces too many problems. I think you would be better off not crashing the exporter and instead having your Vault automation restart it on credential rotation.

@SuperQ (Contributor) commented Jul 3, 2020

I figured out a simple way to reproduce the issue.

It looks like the error is: googleapi: Error 403: Permission monitoring.metricDescriptors.list denied (or the resource may not exist)., forbidden.

This seems to come from google.golang.org/api/googleapi/googleapi.go.
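For what it's worth, that error is the googleapi.Error type from that package; a small sketch of detecting it, assuming the Monitoring client returns it unwrapped or wrapped in a way errors.As can follow:

```go
package main

import (
	"errors"
	"fmt"
	"net/http"

	"google.golang.org/api/googleapi"
)

func main() {
	// Stand-in for the error returned when monitoring.metricDescriptors.list is denied;
	// the real error may be wrapped, so errors.As is used to dig it out.
	var err error = &googleapi.Error{Code: http.StatusForbidden, Message: "Permission monitoring.metricDescriptors.list denied"}

	var gerr *googleapi.Error
	if errors.As(err, &gerr) {
		fmt.Println("googleapi error, HTTP status:", gerr.Code) // 403
	}
}
```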

@cpick commented Jul 9, 2020

It makes sense to me to specifically identify authentication issues and only exit when one is detected.

@pracucci:

We're experiencing small gaps in the metrics exported by the stackdriver exporter and, after some investigation, it turned out to be caused by the stackdriver exporter exiting every time an API request times out:

Just out of curiosity, if the underlying request to the Stackdriver API times out then that will still result in a gap in metrics (whether or not stackdriver-exporter restarts), right?

@SuperQ:

We should probably extend the collector to have a stackdriver_up metric. I'm thinking the best option for now is to simply revert the Fatal.

Being able to continue publishing this proposed stackdriver_up metric (and alert on it) sounds like a good reason not to exit on all failures. (My original patch in #83 did not try to distinguish between different error types, in the hope that it might also fix other transient error types. We also used any resulting alerts about stackdriver-exporter restarting frequently as a sign that the Stackdriver API may not be healthy, but I can certainly understand how others may not want that.)

@SuperQ:

It looks like the error is: googleapi: Error 403: Permission monitoring.metricDescriptors.list denied (or the resource may not exist)., forbidden.

Exactly, that's one example and here's another:

FATA[0003] Error while getting Google Stackdriver Monitoring metrics: Get https://monitoring.googleapis.com/v3/projects/example/metricDescriptors?alt=json&filter=metric.type+%3D+starts_with%28%22pubsub.googleapis.com%2Fsubscription%2Fnum_undelivered_messages%22%29: oauth2: cannot fetch token: 400 Bad Request
Response: {"error":"invalid_grant","error_description":"Invalid JWT Signature."} source="monitoring_collector.go:138"

If we exit only in those two cases, I think that'd be a great middle ground.
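A sketch of what that middle ground could look like, treating only those two failure modes as fatal; the helper name is made up, and the oauth2.RetrieveError check assumes the token-fetch error is reachable via errors.As (the exporter's real error chain may need extra unwrapping):

```go
package main

import (
	"errors"
	"fmt"
	"net/http"

	"golang.org/x/oauth2"
	"google.golang.org/api/googleapi"
)

// isFatalAuthError reports whether err looks like a permissions or credentials
// problem worth exiting for, rather than a transient timeout. The helper name
// and the exact classification are assumptions, not existing exporter code.
func isFatalAuthError(err error) bool {
	var gerr *googleapi.Error
	if errors.As(err, &gerr) && gerr.Code == http.StatusForbidden {
		return true // e.g. "Permission monitoring.metricDescriptors.list denied"
	}
	var rerr *oauth2.RetrieveError
	if errors.As(err, &rerr) {
		return true // e.g. 400 invalid_grant: "Invalid JWT Signature."
	}
	return false
}

func main() {
	err := &googleapi.Error{Code: http.StatusForbidden}
	fmt.Println(isFatalAuthError(err)) // true: exit; anything else: log and keep serving
}
```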
