prometheus.remote_write: mark component unhealthy if sending samples fails #823

Draft
wants to merge 1 commit into
base: main
from


captncraig
Contributor

This PR accomplishes this by observing the log entries from the remote storage writer. If we see "non-recoverable error" messages, we assume there is a problem (usually a bad token, or some kind of networking or configuration issue). Such a message definitely means no samples are getting through.

Detecting a recovery is a bit harder. There is no clear log message from the Prometheus code (even at debug level) to indicate that sending has resumed. If we also hooked into the sample append hooks, we could possibly find a combination of metrics that indicates recovery, but for now I simply assume the component has recovered if we don't see an error log for 2 minutes (fairly arbitrary, may need tuning). Even a flapping health status is better than the constant false positives we have now.

Contributor

This PR has not had any activity in the past 30 days, so the needs-attention label has been added to it.
If you do not have enough time to follow up on this PR or you think it's no longer relevant, consider closing it.
The needs-attention label signals to maintainers that something has fallen through the cracks. No action is needed by you; your PR will be kept open and you do not have to respond to this comment. The label will be removed the next time this job runs if there is new activity.
Thank you for your contributions!
