-
Notifications
You must be signed in to change notification settings - Fork 757
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add metric displaying whether replication is configured or not #604
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: mamiller <[email protected]>
You don't need a new metric for this, you can use any existing ones
returns 1 if slave is not configured or nothing if it is. |
As I mentioned, absent() is not ideal here because we have hundreds of mysql hosts that we need to alert on in the event of replication misconfiguration, and defining an alert for When replication is not configured at all (after a |
All you need is a regex to match all your mysql hosts |
Sorry, no, this doesn't work. in order for absent() to return one on a regex match, ALL hosts that match the regex must be missing the metric. I need to know if any single host is missing the metric. |
I just spend hours trying to create an alert if one of my replica doesn't replicate. The stale metrics render things very difficult. I can confirm what @mmiller1 is explaining to @roman-vynar, I tried group_left, unless, offset, absent_over_time, nothing works. I'm using the alerts from awesome-prometheus and was surprised that no alerts are triggered for replica that has been reset (I have a switchover script that fails sometimes). |
How about this?
This means instances named mysql* will have |
Genius @roman-vynar thanks for your help! So the correct expression for the alert that works if the metric == 0 AND is staled:
db_role : 0 => replica It works perfectly. But you have to wait 5 minutes for the metric to be staled. But that's far better than no alert at all! |
This metric would be useful in the event that after a crash (or just as a result of human error) the replication status is reset. Currently if the replication configuration is removed, all metrics related to replication disappear making it cumbersome to alert on. I considered using the absent() functions in prometheus to detect this, but that would depend on creating alert definitions for each host that you expect the metrics to exist for, which is not very practical.