check-mysql-replication-status: Lag flapping protection #92
base: master
@majormoses The
ya we kinda abandoned that as we had pretty low unit test coverage and I have decided to focus more on integration testing which can't really be measured in the same way. We can drop it and bring it back later if we get enough coverage to see the value in it.
Thanks for your contribution to Sensu plugins! Without people like you submitting PRs we couldn't run the project. I will review it shortly.
I will review the code in a minute but I did want to point out that sensu natively has its own flap detection algorithm borrowed from nagios. The Documentation is here look for
Great work! I have a couple of questions, comments, and recommendations.
# check-mysql-replication-status_spec
#
# DESCRIPTION:
# rspec tests for check-mysql-replication-status
Thank you for writing tests, honestly I keep maintaining all these plugins and it's a lot of work, it really helps when there are tests because I won't pretend to know every service in as much detail as all the awesome people (like yourself) that know them. The biggest bang for buck testing IMHO is integration testing. I have written a blog post on writing integration tests for infrastructure: https://blog.sensuapp.org/writing-sensu-plugin-tests-with-test-kitchen-and-serverspec-b646d2eeee51 if you want any ideas or help please feel free to hit me up in slack or tag me in an issue/pr.
I see an issue with my choice of unit test: since the check script itself is required by the test, the check will be executed once after the rspec tests have finished and fail with unknown due to missing MySQL credentials.
I didn't come up with this pattern for a check script unit test myself, I just borrowed it from another plugin's test. Any suggestions on how to solve this?
I think this might be what you are looking for: https://github.com/sensu-plugins/sensu-plugins-rabbitmq/blob/7.0.0/test/spec_helper.rb#L7-L27
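To make the problem above concrete, here is a minimal sketch of how a check that runs itself at exit can be switched off from a spec helper. ToyCheck and its autorun flag are hypothetical stand-ins written for this illustration; they are not the sensu-plugin implementation, and the linked spec_helper.rb remains the reference for the real approach.

```ruby
# Hypothetical stand-in for a check base class that runs itself at exit.
class ToyCheck
  @autorun = true

  class << self
    # Class-level switch that the exit hook consults.
    attr_accessor :autorun
  end

  def run
    # A real check would query MySQL here and fail without credentials.
    'unknown: missing credentials'
  end
end

# Loading the script registers this hook, which is why a plain `require`
# in a spec ends up executing the check after the tests have finished.
at_exit do
  puts ToyCheck.new.run if ToyCheck.autorun
end

# In the spec helper: switch the hook off before any example runs.
ToyCheck.autorun = false
```

With the flag cleared, the spec can still instantiate the check and call its methods directly, but nothing runs on its own when the test process exits.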
Sensu's flapping protection is meant to prevent an unnecessary storm of alerting events in case a check is flapping. This PR does a completely different thing: it works around a condition of MySQL/MariaDB during which a replication slave incorrectly reports very high lag times (in my case it's >10 days!) for a short moment. Of course I want any real lag to trigger a prompt alert, but I want the check to requery Maybe "flapping protection" isn't the best name; maybe you have a different suggestion?
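The requery idea described above could look roughly like the sketch below. It is an illustration under stated assumptions, not the code from this PR: settled_lag, threshold, retries, and pause are hypothetical names, and the block stands in for whatever query returns the replication lag in seconds.

```ruby
# Hypothetical helper: re-query the lag while it still looks like an outlier.
# `threshold`, `retries` and `pause` are illustrative names, not plugin options.
def settled_lag(threshold:, retries: 3, pause: 1)
  lag = yield
  retries.times do
    # Stop as soon as a reading is back within the threshold.
    break if lag <= threshold

    sleep(pause)
    lag = yield
  end
  lag
end

# Simulated readings: one bogus spike (>10 days), then the real value.
readings = [900_000, 2]
lag = settled_lag(threshold: 900, retries: 3, pause: 0) { readings.shift }
puts lag
```

A lag that stays above the threshold on every retry is returned unchanged, so a genuine delay still raises an alert within the same check run instead of waiting for several check intervals.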
787474c to b8db04e
@majormoses I'd like to go on with this one. With your experience in the plugin ecosystem, can you guide me to a different way of testing without including the check script? And maybe you have a better proposal for naming the feature, because "flapping" is already a standard term in monitoring solutions? As I wrote earlier, we're dealing with a bug/side effect of MySQL/MariaDB replication, which occasionally reports outliers in the replication lag.
long: '--flapping-sleep=VALUE',
description: 'Sleep between flapping protection retries',
default: 1,
proc: lambda { |s| s.to_i } # rubocop:disable Lambda
Any particular reason to use a lambda over a proc that I am not seeing? This would remove the need to disable the cop.
proc: lambda { |s| s.to_i } # rubocop:disable Lambda
proc: proc { |s| s.to_i }
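For context on the lambda-versus-proc question, the short sketch below shows that both forms convert an option string identically and differ only in how strictly they check arguments. The variable names are illustrative and do not come from the plugin.

```ruby
as_lambda = lambda { |s| s.to_i }
as_proc   = proc { |s| s.to_i }

# Both convert the option string to an Integer.
puts as_lambda.call('5')
puts as_proc.call('5')

# The difference: a lambda checks its argument count, a proc does not.
lambda_strict =
  begin
    as_lambda.call('5', 'extra')
    false
  rescue ArgumentError
    true
  end

# A proc silently drops the extra argument.
proc_lenient = as_proc.call('5', 'extra') == 5

puts lambda_strict
puts proc_lenient
```

For a single-argument option converter the two behave the same, which is why swapping in proc is enough to drop the RuboCop override.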
Addressed in latest rework
I am having a hard time coming up with a good name as it's pretty much trying to work around a reporting bug. Maybe something like
Properly detect when server is not a slave
b8db04e to 98c20a9
@majormoses I've picked up work on this PR again, and I renamed the feature to Tests are now running properly. I removed the CodeClimate from
d46418e to bd31e24
…execution of check
bd31e24 to 67c3823
Pull Request Checklist
General
Update Changelog following the conventions laid out here
Update README with any necessary configuration snippets
RuboCop passes
Existing tests pass
First test written 😄
Purpose
MariaDB/MySQL sometimes wrongly reports a very high replication lag for a short moment. Flapping protection mitigates this issue better than setting occurrences in Sensu's checks definition because you don't lose any alerting granularity.
I had to do major refactoring on check-mysql-replication-status to allow testing.
While doing so, I discovered two minor flaws. I've fixed one of them in a dedicated commit up front so that it can be cherry-picked independently if needed.
The other flaw affects the warning/critical thresholds, which are exclusive, contrary to the common practice in other checks. Fixing this would be a breaking change, though, so I left a comment in the code for a future release.
Known Compatibility Issues