-
Notifications
You must be signed in to change notification settings - Fork 77
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Strip invalid lines #137
base: master
Are you sure you want to change the base?
Strip invalid lines #137
Conversation
Hmm the change makes sense but in some of the cases I am wondering if we should do some extra manipulation for those cases so that we can try to include those. For example:
It seems like we can in that case probably just drop the Maybe in these cases:
We can probably walk the additionally nested key values and return that. I am certainly OK with having some fallback to discard but I would like to handle and return as much data as possible. WDYT? |
@boutetnico checking back and seeing what your thoughts were on my feedback. |
@majormoses I agree and will try to implement your suggestions. |
@majormoses I have made some progress, here is a diff of the keys from the ouput compared before and after this PR. I have not included keys that contain a path because it could lead to very long metrics keys. What do you think? 1d0
< elk01.elasticsearch.os.load_average
91d89
< elk01.elasticsearch.indices.segments.file_sizes
222d219
< elk01.elasticsearch.fs.least_usage_estimate.path
226d222
< elk01.elasticsearch.fs.most_usage_estimate.path
230,231c226,235
< elk01.elasticsearch.fs.io_stats.devices
< elk01.elasticsearch.fs.io_stats.total
---
> elk01.elasticsearch.fs.io_stats.devices.nvme1n1p1.operations
> elk01.elasticsearch.fs.io_stats.devices.nvme1n1p1.read_operations
> elk01.elasticsearch.fs.io_stats.devices.nvme1n1p1.write_operations
> elk01.elasticsearch.fs.io_stats.devices.nvme1n1p1.read_kilobytes
> elk01.elasticsearch.fs.io_stats.devices.nvme1n1p1.write_kilobytes
> elk01.elasticsearch.fs.io_stats.total.operations
> elk01.elasticsearch.fs.io_stats.total.read_operations
> elk01.elasticsearch.fs.io_stats.total.write_operations
> elk01.elasticsearch.fs.io_stats.total.read_kilobytes
> elk01.elasticsearch.fs.io_stats.total.write_kilobytes |
bin/metrics-es-node-graphite.rb
Outdated
@@ -264,7 +264,9 @@ def run | |||
node['indices'].each do |type, index| | |||
index.each do |k, v| | |||
# #YELLOW | |||
unless k =~ /(_time$)/ || v =~ /\d+/ | |||
if k.end_with? 'is_throttled' | |||
metrics["indices.#{type}.#{k}"] = v ? 1 : 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Im somewhat confused by this line v
is not really a condition unless since its a truthy value. I actually thought it would be a string and therefore not work. Am I understanding that correctly? If that's the case we probably want to either create a helper method to convert string representation to a boolean or comment the above so that I can recall a year later why this works.
Here is a really simple example of doing this with a helper function
# define our function
irb(main):004:0> def true?(obj)
irb(main):005:1> obj.to_s == 'true'
irb(main):006:1> end
# leverage our function
irb(main):008:0> true?('true')
=> true
irb(main):009:0> true?('false')
=> false
irb(main):010:0> true?('foo')
=> false
One thing you will notice in this approach is that it will treat all non true values as false
which could be a good or bad thing. The good thing is that this means that if a non string was passed to this it would not break but would be returning false data. The bad thing is that it would be returning false data which could be argued is as bad. We could also flip that around from a true to a false just as easily which while it would return false data it would be returning that it is being throttled which may prompt engineers to look at it and report a bug.
If we wanted to be even safer we could define it like so:
irb(main):011:0> def true?(obj)
irb(main):012:1> if obj.to_s == "true"
irb(main):013:2> true
irb(main):014:2> elsif obj.to_s == "false"
irb(main):015:2> false
irb(main):016:2> else
irb(main):017:2* "#{obj.to_s} is not a truthy value, please open an issue with this output so we can fix it"
irb(main):018:2> end
irb(main):019:1> end
=> :true?
To leverage it we could do something like this:
irb(main):020:0> true?('true')
=> true
irb(main):021:0> true?(true)
=> true
irb(main):022:0> true?('false')
=> false
irb(main):023:0> true?('foo')
=> "foo is not a truthy value, please open an issue with this output so we can fix it"
Here is an example of how we can then leverage the building blocks and decide what to do with it:
irb(main):025:0> if true?('foo').class == String
irb(main):026:1> raise(true?('foo'))
irb(main):027:1> else
irb(main):028:1* true?('true')
irb(main):029:1> end
RuntimeError: foo is not a truthy value, please open an issue with this output so we can fix it
from (irb):26
from /home/babrams/.rbenv/versions/2.4.1/bin/irb:11:in `<main>'
I need to think about it more and am open to feedback on what's the "best" thing to do in the face of such an issue. I think we have a couple options:
- not attempt to handle at all and bail with an ugly stack trace
- catch the error and return incorrect data if so do we return true or false?
- catch the error and abort with a meaningful message
- catch the error and write to STDERR rather than STDOUT, not sure if handlers take both stdout and and stderr input for metric input, I have honestly spent more time writing state checks than metric gathering scripts. I will either hunt through the code on this or reach out and get back to you on this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this. I have followed your suggestions, can you please have a look at it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will do I am also gonna see if someone else can review this particular portion and give me some feedback.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My 2 cents:
not attempt to handle at all and bail with an ugly stack trace
Ugh, I've seen this approach taken, strongly prefer to not doing this as if we know of a failure mode.
catch the error and return incorrect data if so do we return true or false?
My gut says it's better to not return any data versus returning incorrect data. I'd rather someone debug why data isn't there versus using data that's not correct to make a design.
catch the error and abort with a meaningful message
❤️
catch the error and write to STDERR rather than STDOUT...
Hmmm, this is an interesting idea https://github.com/sensu/sensu/blob/master/lib/sensu/client/process.rb#L167 and seem to imply that STDERR would be captured, https://github.com/sensu/sensu/blob/master/lib/sensu/client/process.rb#L198-L200 so that's not crazy in my opinion. If we do do that, I'd like to have someone quickly test what STDERR ends up looking like in the sensu event just to make sure nothing weird happens there (since I've never used that approach).
Overall, I agree the testing for the truthiness of the value and returning it and bailing cleanly if we don't get one is probably the best move here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I am leaning towards the printing to STDERR
unless someone has a better idea.
Co-Authored-By: boutetnico <[email protected]>
Co-Authored-By: boutetnico <[email protected]>
Co-Authored-By: boutetnico <[email protected]>
@@ -264,7 +264,9 @@ def run | |||
node['indices'].each do |type, index| | |||
index.each do |k, v| | |||
# #YELLOW | |||
unless k =~ /(_time$)/ || v =~ /\d+/ | |||
if k.end_with? 'is_throttled' | |||
metrics["indices.#{type}.#{k}"] = true?(v) ? 1 : 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only thing I am not sure on is what is the best thing to do here. I am leaning towards:
case true?(v)
when true
1
when false
0
else
# not sure if we should raise, ok, critical, or return a `(0|1)`
raise(true?(v))
end
I would not worry about implementing it until I have had someone else take a look.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm does it make sense to have true?
raise in the:
else
"#{obj} is not a truthy value, please open an issue with this output so we can fix it"
bit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought about that initially but was not sure what we wanted to do so I illustrated what could be done and this allowed us to always return
something and then the handling of that could be implemented independently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
based on the feedback and what I was already leaning towards I think the correct solution would be to do something like:
else
STDERR.puts "#{obj} is not a truthy value, please open an issue with this output so we can fix it"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I see, I see. So off hand, I'm not aware of a good way for a metric script to actually log to sensu (although the STDERR bit you mentioned might work, just worried that'll also get sent along).
Unless there's something I'm not aware of, I'd lean towards marking the current metric as unknown if possible otherwise, let's catch and raise
here. I hope this condition is rare enough that this code path never has to happen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, is_throttled
also feels like something that might work better as a dedicated Sensu check outside of this metric script.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assuming the change to have it print to STDERR
and our assumptions and understanding of the code are correct I think the logic here would look something like this:
if k.end_with? 'is_throttled'
state = true?(v)
unless state.class == 'String'
metrics["indices.#{type}.#{k}"] = true?(v) ? 1 : 0
end
end
that would return the right value when it can and when it can't it should return nothing but print a useful message to STDERR
that sensu-client will pick up. I would think that any handler would not be affected as I would be surprised if sensu client sent it as STDIN
which is typically what a pipe handler will by default work with. I would love some clarification from someone at sensu.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW,
is_throttled
also feels like something that might work better as a dedicated Sensu check outside of this metric script.
I can see that but IMHO they are separate concerns, I tend to try to conceptualize all sensu "checks" into the following broad categories:
- state checking: this can be something against a local host, a remote host, etc and can either be looking at the current state of a system or some kind of representation of previous states. To further clarify you could search against a TSDB to get the average, min, max, standard deviation, etc of the last
n
number of state results. This is what I internally refer to as metric checks but avoid this terminology externally because it can confuse people. - data collection: these are what are commonly referred to the "metric" scripts in the sensu community. I think they should attempt to report data, bail with an error, or potentially a combination.
I think that yes it makes a lot of sensu to create a "state check" for that but the question is do you want to check against the current state or against something else.
Anyways those are just some thoughts/opinions and I doubt there is a "right" or "wrong" answer and its a matter of perspective. Perhaps we make this a CLI arg to control this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually I thought about it and I think we should return a String
or a (True|False)Class
object and do the handling in the other function:
if k.end_with? 'is_throttled'
state = true?(v)
if state.class == 'String'
STDERR.puts state
# optionally we could additionally set the value to `nil` here but that relies on the appropriate system from handling this properly which seems less than ideal
# to play devils advocate though we could have something at the end of doing all the metric massinging drop any metrics with a `nil` value which I think does have some real value as it means that we can do a final cleanup/sanitization of the output before we ship it.
else
metrics["indices.#{type}.#{k}"] = true?(v) ? 1 : 0
end
end
I'm still not confident that this is the correct approach either but its something to consider.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a little concerned about throwing anything to stderr, as it will garble the resulting metrics output.
Sensu combines both strerr and stdout into the same stream when running a check command
See sensu-spawn for reference on that.
Also it looks to me like this check logic already silently ignores any metric key/value pair where value is not Numeric type...including booleans... Ruby true is not a Numeric. The logic skips all non-Numeric from final output. Again.. probably to make sure the output is sane metrics format.
at the end of the run() statement... metrics key/value are iterated and checked for Numeric type
metrics.each do |k, v|
if v.is_a? Numeric
output([config[:scheme], k].join('.'), v, timestamp)
end
end
This is a full scrubbing of metrics for sanity.
Here's what I would do.
- if known truthy have true() return true => store Numeric 1 into metric key value
- if known falsey have true() return false => store Numeric 0 into metric key value
- unknown have true() return explainable String => store String into metric key value
this will be scrubbed from metrics. - if there are any scrubbed non-numeric metrics in that final output scrub set output status to 1 == WARNING. This tells us the metric logic is still sub-optimal but allows the check to operate.
- Optionally create new command line switch to raise an error msg listing all the scrubbed metrics if any metric key value is not Numeric for debugging purposes. Allows us to understand why status is WARNING so we can drive a fix into the logic.
@majormoses going back to the initial idea behind this PR, can you advise on what I should change so that it can be merged? |
@boutetnico sounds like we are supposed to silently swallow errors which does not make me feel good. IMHO when its a check type of |
Pull Request Checklist
Is this in reference to an existing issue? No
General
Update Changelog following the conventions laid out here
Update README with any necessary configuration snippets
Binstubs are created if needed
RuboCop passes
Existing tests pass
Purpose
This PR removes the following invalid lines from the output of the plugin:
These lines are ignored by carbon and cause warning. This PR allows to remove these warnings.
Known Compatibility Issues
None