add metric displaying whether replication is configured or not #604

mmiller1 · 2021-12-03T16:39:28Z

This metric would be useful when the replication status is reset after a crash (or simply through human error). Currently, if the replication configuration is removed, all replication-related metrics disappear, which makes the condition cumbersome to alert on. I considered using the absent() function in Prometheus to detect this, but that would require an alert definition for every host where the metrics are expected to exist, which is not very practical.

Signed-off-by: mamiller <[email protected]>

roman-vynar · 2021-12-08T20:32:17Z

You don't need a new metric for this; you can use any of the existing mysql_slave_status_* metrics.
For example:

absent(mysql_slave_status_connect_retry{instance="foobar"})

returns 1 if the slave is not configured, or nothing if it is.

mmiller1 · 2021-12-08T21:00:33Z

As I mentioned, absent() is not ideal here because we have hundreds of mysql hosts that we need to alert on in the event of replication misconfiguration, and defining an alert for instance="foo001" through instance="foo999" is not practical. With this metric all we need to do is define a single alert: mysql_slave_status_is_configured == 0.

When replication is not configured at all (after a reset slave command, for example) none of the mysql_slave_status_* metrics exist prior to this PR.
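
As a rough illustration, the single alert described above might look like the following Prometheus rule file. This is only a sketch: mysql_slave_status_is_configured is the metric proposed in this PR, while the group name, alert name, `for` duration, and labels are assumptions made for the example. Hosts that are not expected to replicate (primaries, for instance) would presumably need to be excluded with a label matcher.

```yaml
# Sketch of a Prometheus alerting rule file.
# Group name, alert name, "for" duration, and labels are illustrative assumptions.
groups:
  - name: mysql-replication
    rules:
      - alert: MySQLReplicationNotConfigured
        # mysql_slave_status_is_configured is the metric proposed in this PR.
        expr: mysql_slave_status_is_configured == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Replication is not configured on {{ $labels.instance }}"
```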

roman-vynar · 2021-12-09T09:08:04Z

All you need is a regex that matches all your mysql hosts, e.g. {instance=~"foo\\d{3}"}, or any other label you use to tag the mysql instances you want captured.

mmiller1 · 2021-12-09T13:58:02Z

Sorry, no, this doesn't work. For absent() to return 1 on a regex match, ALL hosts that match the regex must be missing the metric. I need to know if any single host is missing the metric.
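
The difference between the two forms of absent() can be sketched as a pair of rules. The alert names and the host pattern are assumptions for illustration; the comments describe the behaviour discussed in this thread.

```yaml
# Sketch only: alert names and the foo001 / foo[0-9]{3} host pattern are illustrative.
groups:
  - name: mysql-replication-absent
    rules:
      # Exact match: fires when this one host has no such series.
      # One rule like this would be needed per host.
      - alert: MySQLSlaveStatusMissingFoo001
        expr: absent(mysql_slave_status_connect_retry{instance="foo001"})
      # Regex match: absent() returns 1 only when NO series matches the
      # selector at all, so one unconfigured host among many goes unnoticed.
      - alert: MySQLSlaveStatusMissingEverywhere
        expr: absent(mysql_slave_status_connect_retry{instance=~"foo[0-9]{3}"})
```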

laurent-indermuehle · 2024-05-23T14:49:00Z

I just spent hours trying to create an alert for when one of my replicas doesn't replicate. The stale metrics make things very difficult. I can confirm what @mmiller1 is explaining to @roman-vynar: I tried group_left, unless, offset, absent_over_time, and nothing works.
Aren't stale metrics a bad practice for Prometheus anyway?

I'm using the alerts from awesome-prometheus and was surprised that no alerts are triggered for a replica that has been reset (I have a switchover script that fails sometimes).

roman-vynar · 2024-05-23T18:43:24Z

How about this?

expr: count(node_uname_info{instance=~"mysql.+"}) by (instance) unless count(mysql_slave_status_connect_retry) by (instance)

This matches instances named mysql* that expose the node_uname_info metric but not mysql_slave_status_connect_retry, and those are the ones to alert on.
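
Wrapped in a rule file, the expression above might look like the sketch below. The group name, alert name, and `for` duration are assumptions. It also assumes the instance label is identical for node_exporter and mysqld_exporter series (for example, relabeled to a bare hostname); with default host:port instance labels the two sides would never match and every host would fire.

```yaml
# Sketch only: names and durations are illustrative assumptions.
groups:
  - name: mysql-replication-unless
    rules:
      - alert: MySQLReplicationMetricsMissing
        # Left side: one series per mysql* host known to node_exporter.
        # "unless" drops every host that still exposes a replication metric,
        # leaving only hosts where mysql_slave_status_* has disappeared.
        expr: |
          count by (instance) (node_uname_info{instance=~"mysql.+"})
            unless
          count by (instance) (mysql_slave_status_connect_retry)
        for: 5m
        annotations:
          summary: "No replication metrics from {{ $labels.instance }}"
```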

laurent-indermuehle · 2024-05-24T09:06:16Z

Genius @roman-vynar thanks for your help!
Luckily, I had created a metric called db_role with a textfile collector that reads, every minute, a custom fact written by Ansible. This way, even after a switchover, db_role is always right.

So the correct expression for an alert that works both when the metric == 0 and when it is stale:

expr: count(db_role == 0) by (instance) unless count(mysql_slave_status_slave_io_running == 1) by (instance) == 1

db_role : 0 => replica
db_role : 1 => primary

It works perfectly, but you have to wait 5 minutes for the metric to go stale. Still, that's far better than no alert at all!

add metrics displaying whether replication is configured or not

34ab91f

Signed-off-by: mamiller <[email protected]>

mmiller1 force-pushed the main branch from 5fea338 to 34ab91f on December 3, 2021 16:40
