Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add metric displaying whether replication is configured or not #604

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

mmiller1
Copy link
Contributor

@mmiller1 mmiller1 commented Dec 3, 2021

This metric would be useful in the event that after a crash (or just as a result of human error) the replication status is reset. Currently if the replication configuration is removed, all metrics related to replication disappear making it cumbersome to alert on. I considered using the absent() functions in prometheus to detect this, but that would depend on creating alert definitions for each host that you expect the metrics to exist for, which is not very practical.

@roman-vynar
Copy link
Contributor

You don't need a new metric for this, you can use any existing ones mysql_slave_status_*.
For example

absent(mysql_slave_status_connect_retry{instance="foobar"})

returns 1 if slave is not configured or nothing if it is.

@mmiller1
Copy link
Contributor Author

mmiller1 commented Dec 8, 2021

As I mentioned, absent() is not ideal here because we have hundreds of mysql hosts that we need to alert on in the event of replication misconfiguration, and defining an alert for instance="foo001" through instance="foo999" is not practical. With this metric all we need to do is define a single alert: mysql_slave_status_is_configured == 0.

When replication is not configured at all (after a reset slave command, for example) none of the mysql_slave_status_* metrics exist prior to this PR.

@roman-vynar
Copy link
Contributor

All you need is a regex to match all your mysql hosts {instance=~"foo\d{3}"} or some any other label you may want to tag your mysql instances to be captured.

@mmiller1
Copy link
Contributor Author

mmiller1 commented Dec 9, 2021

Sorry, no, this doesn't work. in order for absent() to return one on a regex match, ALL hosts that match the regex must be missing the metric. I need to know if any single host is missing the metric.

@laurent-indermuehle
Copy link

I just spend hours trying to create an alert if one of my replica doesn't replicate. The stale metrics render things very difficult. I can confirm what @mmiller1 is explaining to @roman-vynar, I tried group_left, unless, offset, absent_over_time, nothing works.
Isn't stale metrics a bad practice for Prometheus anyway?

I'm using the alerts from awesome-prometheus and was surprised that no alerts are triggered for replica that has been reset (I have a switchover script that fails sometimes).

@roman-vynar
Copy link
Contributor

roman-vynar commented May 23, 2024

How about this?

expr: count(node_uname_info{instance=~"mysql.+"}) by (instance) unless count(mysql_slave_status_connect_retry) by (instance)

This means instances named mysql* will have node_uname_info metric but not mysql_slave_status_connect_retry and then to alert.

@laurent-indermuehle
Copy link

Genius @roman-vynar thanks for your help!
My luck is that I created a metric called db_role with a textfile collector that read every minute a custom facts that is written by Ansible. This way, even after a switchover, db_role is always right.

So the correct expression for the alert that works if the metric == 0 AND is staled:

expr: count(db_role == 0) by (instance) unless count(mysql_slave_status_slave_io_running == 1) by (instance) == 1

db_role : 0 => replica
db_role : 1 => primary

It works perfectly. But you have to wait 5 minutes for the metric to be staled. But that's far better than no alert at all!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants