check-mysql-replication-status: add lag flapping protection
Jan Kunzmann committed Dec 21, 2018
1 parent 3883967 commit b8db04e
Showing 4 changed files with 99 additions and 17 deletions.
3 changes: 3 additions & 0 deletions CHANGELOG.md
@@ -10,6 +10,9 @@ This CHANGELOG follows the format listed [here](https://github.com/sensu-plugins
### Changed
- check-mysql-replication-status: refactoring & spec tests (@DrMurx)

### Added
- check-mysql-replication-status: added protection against `SHOW SLAVE STATUS` high lag reporting bug (@DrMurx)

## [3.1.0] - 2018-12-15
### Added
- metrics-mysql-multiple-select-count script (@nagyt234)
9 changes: 9 additions & 0 deletions README.md
@@ -80,6 +80,15 @@ $ /opt/sensu/embedded/bin/check-mysql-threads.rb --host=<DBHOST> --ini=/etc/sens
$ /opt/sensu/embedded/bin/check-mysql-replication-status.rb --host=<SLAVE> --ini=/etc/sensu/my.ini
```

**check-mysql-replication-status** example with flapping protection

MariaDB/MySQL occasionally reports a spuriously high replication lag for a brief moment. Flapping protection mitigates this issue
better than raising `occurrences` in Sensu's `checks` definition because you don't lose any alerting granularity.

```bash
$ /opt/sensu/embedded/bin/check-mysql-replication-status.rb --host=<SLAVE> --ini=/etc/sensu/my.ini --flapping-retry=1 --flapping-lag=86400 --flapping-sleep=2
```
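
For reference, the behaviour behind these three flags boils down to the loop sketched below: a reading at or above `--flapping-lag` is treated as a glitch, the check sleeps `--flapping-sleep` seconds and queries again, up to `--flapping-retry` times, before the warn/crit thresholds are evaluated. This is a simplified standalone sketch with hypothetical sample readings, not the plugin itself; the real implementation lives in `bin/check-mysql-replication-status.rb`.

```ruby
# Simplified sketch of the flapping-protection loop; the constants mirror the
# example flags above, and the readings are hypothetical sample values.
flapping_lag   = 86_400 # --flapping-lag
flapping_retry = 1      # --flapping-retry
flapping_sleep = 2      # --flapping-sleep

# First query returns a bogus spike, the re-query returns a sane value.
readings = [1_000_000, 3]

retries = flapping_retry
while retries >= 0
  delay = readings.shift.to_i
  retries -= 1
  # A reading at or above the flapping threshold is treated as a glitch:
  # back off and query again as long as retries remain.
  if delay >= flapping_lag && retries >= 0
    sleep flapping_sleep
    next
  end
  puts "evaluating warn/crit thresholds against a lag of #{delay}s"
  break
end
```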

**check-mysql-msr-replication-status** example
```bash
$ /opt/sensu/embedded/bin/check-mysql-msr-replication-status.rb --host=<SLAVE> --ini=/etc/sensu/my.ini
57 changes: 44 additions & 13 deletions bin/check-mysql-replication-status.rb
@@ -94,6 +94,27 @@ class CheckMysqlReplicationStatus < Sensu::Plugin::Check::CLI
# #YELLOW
proc: lambda { |s| s.to_i } # rubocop:disable Lambda

option :flapping_lag,
short: '-l',
long: '--flapping-lag=VALUE',
description: 'Lag threshold to trigger flapping protection',
default: 100000,
proc: lambda { |s| s.to_i } # rubocop:disable Lambda

option :flapping_retry,
short: '-r',
long: '--flapping-retry=VALUE',
description: 'Number of retries when lag flapping protection is triggered',
default: 0,
proc: lambda { |s| s.to_i } # rubocop:disable Lambda

option :flapping_sleep,
long: '--flapping-sleep=VALUE',
description: 'Seconds to sleep between flapping protection retries',
default: 1,
proc: lambda { |s| s.to_i } # rubocop:disable Lambda

def detect_replication_status?(row)
%w[
Slave_IO_State
@@ -176,19 +197,29 @@ def ok_slave_message
def run
db = open_connection

row = query_slave_status(db)
ok 'show slave status was nil. This server is not a slave.' if row.nil?
warn "couldn't detect replication status" unless detect_replication_status?(row)

slave_running = slave_running?(row)
critical broken_slave_message(row) unless slave_running

replication_delay = row['Seconds_Behind_Master'].to_i
message = "replication delayed by #{replication_delay}"
# TODO (breaking change): Thresholds are exclusive which is not consistent with all other checks
critical message if replication_delay > config[:crit]
warning message if replication_delay > config[:warn]
ok "#{ok_slave_message}, #{message}"
retries = config[:flapping_retry]
while retries >= 0
row = query_slave_status(db)
ok 'show slave status was nil. This server is not a slave.' if row.nil?
warn "couldn't detect replication status" unless detect_replication_status?(row)

slave_running = slave_running?(row)
critical broken_slave_message(row) unless slave_running

replication_delay = row['Seconds_Behind_Master'].to_i
retries -= 1
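# An implausibly high reading is treated as a glitch: back off and re-query while retries remain.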
if replication_delay >= config[:flapping_lag] && retries >= 0
sleep config[:flapping_sleep]
next
end

message = "replication delayed by #{replication_delay}"
# TODO (breaking change): Thresholds are exclusive which is not consistent with all other checks
critical message if replication_delay > config[:crit]
warning message if replication_delay > config[:warn]
ok "#{ok_slave_message}, #{message}"
end
unknown "unable to retrieve slave status"
rescue Mysql::Error => e
errstr = "Error code: #{e.errno} Error message: #{e.error}"
critical "#{errstr} SQLSTATE: #{e.sqlstate}" if e.respond_to?('sqlstate')
47 changes: 43 additions & 4 deletions test/check-mysql-replication-status_spec.rb
@@ -52,10 +52,10 @@ def checker.critical(*_args)
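# Each row: [Slave_IO_Running, Slave_SQL_Running, Seconds_Behind_Master, expected exit code, expected status]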
['No', 'Yes', nil, 2, 'critical'],
['Yes', 'No', nil, 2, 'critical'],
['No', 'No', nil, 2, 'critical'],
['Yes', 'Yes', 899, 0, 'ok'],
['Yes', 'Yes', 900, 1, 'warning'],
['Yes', 'Yes', 1799, 1, 'warning'],
['Yes', 'Yes', 1800, 2, 'critical'],
['Yes', 'Yes', 900, 0, 'ok'],
['Yes', 'Yes', 901, 1, 'warning'],
['Yes', 'Yes', 1800, 1, 'warning'],
['Yes', 'Yes', 1801, 2, 'critical'],
].each do |testdata|
it "returns #{testdata[4]} for default thresholds" do
slave_status_row = {
@@ -76,4 +76,43 @@ def checker.critical(*_args)
expect(exit_code).to eq testdata[3]
end
end

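# Each row: [Seconds_Behind_Master on the re-query, expected exit code, expected status]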
[
[ 0, 0, 'ok'],
[99999, 2, 'critical'],
].each do |testdata|
it "sleeps with flapping protection and returns #{testdata[2]} for default thresholds" do
checker.config[:flapping_retry] = 1
checker.config[:flapping_sleep] = 10

slave_status_row = [
{
"Slave_IO_State" => '',
"Slave_IO_Running" => 'Yes',
"Slave_SQL_Running" => 'Yes',
"Last_IO_Error" => '',
"Last_SQL_Error" => '',
"Seconds_Behind_Master" => 100000
},
{
"Slave_IO_State" => '',
"Slave_IO_Running" => 'Yes',
"Slave_SQL_Running" => 'Yes',
"Last_IO_Error" => '',
"Last_SQL_Error" => '',
"Seconds_Behind_Master" => testdata[0]
}
]

begin
allow(checker).to receive(:open_connection) # do nothing
allow(checker).to receive(:query_slave_status).and_return slave_status_row[0], slave_status_row[1]
expect(checker).to receive(:sleep).with(10)
checker.run
rescue SystemExit => e
exit_code = e.status
end
expect(exit_code).to eq testdata[1]
end
end
end
