Strip invalid lines #137

boutetnico · 2019-02-01T16:01:52Z

Pull Request Checklist

Is this in reference to an existing issue? No

General

Update Changelog following the conventions laid out here
Update README with any necessary configuration snippets
Binstubs are created if needed
RuboCop passes
Existing tests pass

Purpose

This PR removes the following invalid lines from the output of the plugin:

elk01.elasticsearch.os.load_average
elk01.elasticsearch.indices.indexing.is_throttled false 1548954513
elk01.elasticsearch.indices.segments.file_sizes {} 1548954513
elk01.elasticsearch.fs.least_usage_estimate.path /data/elasticsearch/data/elk01-elk01/nodes/0 1548954513
elk01.elasticsearch.fs.most_usage_estimate.path /data/elasticsearch/data/elk01-elk01/nodes/0 1548954513
elk01.elasticsearch.fs.io_stats.devices {"device_name"=>"nvme1n1p1", "operations"=>157872407, "read_operations"=>691872, "write_operations"=>157180535, "read_kilobytes"=>50042456, "write_kilobytes"=>2478783444} 1548954513
elk01.elasticsearch.fs.io_stats.total {"operations"=>157872407, "read_operations"=>691872, "write_operations"=>157180535, "read_kilobytes"=>50042456, "write_kilobytes"=>2478783444} 1548954513

These lines are ignored by carbon and cause warning. This PR allows to remove these warnings.

Feb  1 15:59:49 sensu01 carbon-cache.py[480]: 01/02/2019 15:59:49 :: [listener] invalid line received from client 127.0.0.1:42532, ignoring [elk01.elasticsearch.indices.indexing.is_throttled false 1549036789]
Feb  1 15:59:49 sensu01 carbon-cache.py[480]: 01/02/2019 15:59:49 :: [listener] invalid line received from client 127.0.0.1:42532, ignoring [elk01.elasticsearch.indices.segments.file_sizes {} 1549036789]
Feb  1 15:59:49 sensu01 carbon-cache.py[480]: 01/02/2019 15:59:49 :: [listener] invalid line received from client 127.0.0.1:42532, ignoring [elk01.elasticsearch.fs.least_usage_estimate.path /data/elasticsearch/data/elk01-elk01/nodes/0 1549036789]
Feb  1 15:59:49 sensu01 carbon-cache.py[480]: 01/02/2019 15:59:49 :: [listener] invalid line received from client 127.0.0.1:42532, ignoring [elk01.elasticsearch.fs.most_usage_estimate.path /data/elasticsearch/data/elk01-elk01/nodes/0 1549036789]
Feb  1 15:59:49 sensu01 carbon-cache.py[480]: 01/02/2019 15:59:49 :: [listener] invalid line received from client 127.0.0.1:42532, ignoring [elk01.elasticsearch.fs.io_stats.devices {"device_name"=>"nvme1n1p1", "operations"=>160317981, "read_operations"=>705794, "write_operations"=>159612187, "read_kilobytes"=>51039824, "write_kilobytes"=>2539726412} 1549036789]
Feb  1 15:59:49 sensu01 carbon-cache.py[480]: 01/02/2019 15:59:49 :: [listener] invalid line received from client 127.0.0.1:42532, ignoring [elk01.elasticsearch.fs.io_stats.total {"operations"=>160317981, "read_operations"=>705794, "write_operations"=>159612187, "read_kilobytes"=>51039824, "write_kilobytes"=>2539726412} 1549036789]

Known Compatibility Issues

None

majormoses · 2019-02-17T22:41:22Z

Hmm the change makes sense but in some of the cases I am wondering if we should do some extra manipulation for those cases so that we can try to include those. For example:

elk01.elasticsearch.indices.indexing.is_throttled false 1548954513

It seems like we can in that case probably just drop the false and send it.

Maybe in these cases:

elk01.elasticsearch.fs.least_usage_estimate.path /data/elasticsearch/data/elk01-elk01/nodes/0 1548954513
elk01.elasticsearch.fs.most_usage_estimate.path /data/elasticsearch/data/elk01-elk01/nodes/0 1548954513
elk01.elasticsearch.fs.io_stats.devices {"device_name"=>"nvme1n1p1", "operations"=>157872407, "read_operations"=>691872, "write_operations"=>157180535, "read_kilobytes"=>50042456, "write_kilobytes"=>2478783444} 1548954513

We can probably walk the additionally nested key values and return that. I am certainly OK with having some fallback to discard but I would like to handle and return as much data as possible. WDYT?

majormoses · 2019-03-05T02:02:52Z

@boutetnico checking back and seeing what your thoughts were on my feedback.

boutetnico · 2019-03-05T15:56:12Z

@majormoses I agree and will try to implement your suggestions.

boutetnico · 2019-03-05T18:11:05Z

@majormoses I have made some progress, here is a diff of the keys from the ouput compared before and after this PR. I have not included keys that contain a path because it could lead to very long metrics keys. What do you think?

1d0
< elk01.elasticsearch.os.load_average
91d89
< elk01.elasticsearch.indices.segments.file_sizes
222d219
< elk01.elasticsearch.fs.least_usage_estimate.path
226d222
< elk01.elasticsearch.fs.most_usage_estimate.path
230,231c226,235
< elk01.elasticsearch.fs.io_stats.devices
< elk01.elasticsearch.fs.io_stats.total
---
> elk01.elasticsearch.fs.io_stats.devices.nvme1n1p1.operations
> elk01.elasticsearch.fs.io_stats.devices.nvme1n1p1.read_operations
> elk01.elasticsearch.fs.io_stats.devices.nvme1n1p1.write_operations
> elk01.elasticsearch.fs.io_stats.devices.nvme1n1p1.read_kilobytes
> elk01.elasticsearch.fs.io_stats.devices.nvme1n1p1.write_kilobytes
> elk01.elasticsearch.fs.io_stats.total.operations
> elk01.elasticsearch.fs.io_stats.total.read_operations
> elk01.elasticsearch.fs.io_stats.total.write_operations
> elk01.elasticsearch.fs.io_stats.total.read_kilobytes
> elk01.elasticsearch.fs.io_stats.total.write_kilobytes

bin/metrics-es-node-graphite.rb

majormoses · 2019-03-06T03:14:24Z

bin/metrics-es-node-graphite.rb

@@ -264,7 +264,9 @@ def run
    node['indices'].each do |type, index|
      index.each do |k, v|
        # #YELLOW
-        unless k =~ /(_time$)/ || v =~ /\d+/
+        if k.end_with? 'is_throttled'
+          metrics["indices.#{type}.#{k}"] = v ? 1 : 0


Im somewhat confused by this line v is not really a condition unless since its a truthy value. I actually thought it would be a string and therefore not work. Am I understanding that correctly? If that's the case we probably want to either create a helper method to convert string representation to a boolean or comment the above so that I can recall a year later why this works.

Here is a really simple example of doing this with a helper function

# define our function irb(main):004:0> def true?(obj) irb(main):005:1> obj.to_s == 'true' irb(main):006:1> end # leverage our function irb(main):008:0> true?('true') => true irb(main):009:0> true?('false') => false irb(main):010:0> true?('foo') => false

One thing you will notice in this approach is that it will treat all non true values as false which could be a good or bad thing. The good thing is that this means that if a non string was passed to this it would not break but would be returning false data. The bad thing is that it would be returning false data which could be argued is as bad. We could also flip that around from a true to a false just as easily which while it would return false data it would be returning that it is being throttled which may prompt engineers to look at it and report a bug.

If we wanted to be even safer we could define it like so:

irb(main):011:0> def true?(obj) irb(main):012:1> if obj.to_s == "true" irb(main):013:2> true irb(main):014:2> elsif obj.to_s == "false" irb(main):015:2> false irb(main):016:2> else irb(main):017:2* "#{obj.to_s} is not a truthy value, please open an issue with this output so we can fix it" irb(main):018:2> end irb(main):019:1> end => :true?

To leverage it we could do something like this:

irb(main):020:0> true?('true') => true irb(main):021:0> true?(true) => true irb(main):022:0> true?('false') => false irb(main):023:0> true?('foo') => "foo is not a truthy value, please open an issue with this output so we can fix it"

Here is an example of how we can then leverage the building blocks and decide what to do with it:

irb(main):025:0> if true?('foo').class == String irb(main):026:1> raise(true?('foo')) irb(main):027:1> else irb(main):028:1* true?('true') irb(main):029:1> end RuntimeError: foo is not a truthy value, please open an issue with this output so we can fix it from (irb):26 from /home/babrams/.rbenv/versions/2.4.1/bin/irb:11:in `<main>'

I need to think about it more and am open to feedback on what's the "best" thing to do in the face of such an issue. I think we have a couple options:

not attempt to handle at all and bail with an ugly stack trace

catch the error and return incorrect data if so do we return true or false?

catch the error and abort with a meaningful message

catch the error and write to STDERR rather than STDOUT, not sure if handlers take both stdout and and stderr input for metric input, I have honestly spent more time writing state checks than metric gathering scripts. I will either hunt through the code on this or reach out and get back to you on this.

Thanks for this. I have followed your suggestions, can you please have a look at it?

will do I am also gonna see if someone else can review this particular portion and give me some feedback.

My 2 cents:

not attempt to handle at all and bail with an ugly stack trace

Ugh, I've seen this approach taken, strongly prefer to not doing this as if we know of a failure mode.

catch the error and return incorrect data if so do we return true or false?

My gut says it's better to not return any data versus returning incorrect data. I'd rather someone debug why data isn't there versus using data that's not correct to make a design.

catch the error and abort with a meaningful message

❤️

catch the error and write to STDERR rather than STDOUT...

Hmmm, this is an interesting idea https://github.com/sensu/sensu/blob/master/lib/sensu/client/process.rb#L167 and seem to imply that STDERR would be captured, https://github.com/sensu/sensu/blob/master/lib/sensu/client/process.rb#L198-L200 so that's not crazy in my opinion. If we do do that, I'd like to have someone quickly test what STDERR ends up looking like in the sensu event just to make sure nothing weird happens there (since I've never used that approach).

Overall, I agree the testing for the truthiness of the value and returning it and bailing cleanly if we don't get one is probably the best move here.

I think I am leaning towards the printing to STDERR unless someone has a better idea.

Co-Authored-By: boutetnico <[email protected]>

majormoses · 2019-03-12T03:37:08Z

bin/metrics-es-node-graphite.rb

@@ -264,7 +264,9 @@ def run
    node['indices'].each do |type, index|
      index.each do |k, v|
        # #YELLOW
-        unless k =~ /(_time$)/ || v =~ /\d+/
+        if k.end_with? 'is_throttled'
+          metrics["indices.#{type}.#{k}"] = true?(v) ? 1 : 0


The only thing I am not sure on is what is the best thing to do here. I am leaning towards:

case true?(v) when true 1 when false 0 else # not sure if we should raise, ok, critical, or return a `(0|1)` raise(true?(v)) end

I would not worry about implementing it until I have had someone else take a look.

Hmm does it make sense to have true? raise in the:

else "#{obj} is not a truthy value, please open an issue with this output so we can fix it"

bit?

I thought about that initially but was not sure what we wanted to do so I illustrated what could be done and this allowed us to always return something and then the handling of that could be implemented independently.

based on the feedback and what I was already leaning towards I think the correct solution would be to do something like:

else STDERR.puts "#{obj} is not a truthy value, please open an issue with this output so we can fix it"

Ah I see, I see. So off hand, I'm not aware of a good way for a metric script to actually log to sensu (although the STDERR bit you mentioned might work, just worried that'll also get sent along).

Unless there's something I'm not aware of, I'd lean towards marking the current metric as unknown if possible otherwise, let's catch and raise here. I hope this condition is rare enough that this code path never has to happen.

FWIW, is_throttled also feels like something that might work better as a dedicated Sensu check outside of this metric script.

Assuming the change to have it print to STDERR and our assumptions and understanding of the code are correct I think the logic here would look something like this:

if k.end_with? 'is_throttled' state = true?(v) unless state.class == 'String' metrics["indices.#{type}.#{k}"] = true?(v) ? 1 : 0 end end

that would return the right value when it can and when it can't it should return nothing but print a useful message to STDERR that sensu-client will pick up. I would think that any handler would not be affected as I would be surprised if sensu client sent it as STDIN which is typically what a pipe handler will by default work with. I would love some clarification from someone at sensu.

FWIW, is_throttled also feels like something that might work better as a dedicated Sensu check outside of this metric script.

I can see that but IMHO they are separate concerns, I tend to try to conceptualize all sensu "checks" into the following broad categories:

state checking: this can be something against a local host, a remote host, etc and can either be looking at the current state of a system or some kind of representation of previous states. To further clarify you could search against a TSDB to get the average, min, max, standard deviation, etc of the last n number of state results. This is what I internally refer to as metric checks but avoid this terminology externally because it can confuse people.

data collection: these are what are commonly referred to the "metric" scripts in the sensu community. I think they should attempt to report data, bail with an error, or potentially a combination.

I think that yes it makes a lot of sensu to create a "state check" for that but the question is do you want to check against the current state or against something else.

Anyways those are just some thoughts/opinions and I doubt there is a "right" or "wrong" answer and its a matter of perspective. Perhaps we make this a CLI arg to control this?

Actually I thought about it and I think we should return a String or a (True|False)Class object and do the handling in the other function:

if k.end_with? 'is_throttled' state = true?(v) if state.class == 'String' STDERR.puts state # optionally we could additionally set the value to `nil` here but that relies on the appropriate system from handling this properly which seems less than ideal # to play devils advocate though we could have something at the end of doing all the metric massinging drop any metrics with a `nil` value which I think does have some real value as it means that we can do a final cleanup/sanitization of the output before we ship it. else metrics["indices.#{type}.#{k}"] = true?(v) ? 1 : 0 end end

I'm still not confident that this is the correct approach either but its something to consider.

I'm a little concerned about throwing anything to stderr, as it will garble the resulting metrics output.

Sensu combines both strerr and stdout into the same stream when running a check command
See sensu-spawn for reference on that.

Also it looks to me like this check logic already silently ignores any metric key/value pair where value is not Numeric type...including booleans... Ruby true is not a Numeric. The logic skips all non-Numeric from final output. Again.. probably to make sure the output is sane metrics format.

at the end of the run() statement... metrics key/value are iterated and checked for Numeric type

metrics.each do |k, v| if v.is_a? Numeric output([config[:scheme], k].join('.'), v, timestamp) end end

This is a full scrubbing of metrics for sanity.

Here's what I would do.

if known truthy have true() return true => store Numeric 1 into metric key value

if known falsey have true() return false => store Numeric 0 into metric key value

unknown have true() return explainable String => store String into metric key value
this will be scrubbed from metrics.

if there are any scrubbed non-numeric metrics in that final output scrub set output status to 1 == WARNING. This tells us the metric logic is still sub-optimal but allows the check to operate.

Optionally create new command line switch to raise an error msg listing all the scrubbed metrics if any metric key value is not Numeric for debugging purposes. Allows us to understand why status is WARNING so we can drive a fix into the logic.

boutetnico · 2019-05-10T12:30:30Z

@majormoses going back to the initial idea behind this PR, can you advise on what I should change so that it can be merged?

majormoses · 2019-05-15T01:07:46Z

@boutetnico sounds like we are supposed to silently swallow errors which does not make me feel good. IMHO when its a check type of metric while sensu should capture the stdout and stderr I think the stderr should be reserved for reporting an error back to the operator and the handlers when shipping data should ignore stderr. @jspaleta I think this warrants more discussion but for now that seems like the appropriate path.

boutetnico added 2 commits January 31, 2019 18:21

Strip invalid lines

1df0f17

Update changelog

17393f8

majormoses self-assigned this Feb 17, 2019

majormoses self-requested a review February 17, 2019 22:41

Explore deeper into values instead of dropping them

6ea82a5

majormoses reviewed Mar 6, 2019

View reviewed changes

majormoses added the Status: Awaiting Response label Mar 6, 2019

majormoses and others added 5 commits March 6, 2019 16:34

Update bin/metrics-es-node-graphite.rb

8739f82

Co-Authored-By: boutetnico <[email protected]>

Update bin/metrics-es-node-graphite.rb

ecb3318

Co-Authored-By: boutetnico <[email protected]>

Update bin/metrics-es-node-graphite.rb

6ebb1ad

Co-Authored-By: boutetnico <[email protected]>

Update according to feedbacks

f472b38

Fix RuboCop warnings

06c0a55

majormoses removed the Status: Awaiting Response label Mar 12, 2019

majormoses reviewed Mar 12, 2019

View reviewed changes

boutetnico mentioned this pull request Jun 4, 2020

metrics-es-node-graphite.rb: drop invalid lines boutetnico/sensu-plugins-elasticsearch#3

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Strip invalid lines #137

Strip invalid lines #137

boutetnico commented Feb 1, 2019 •

edited

Loading

majormoses commented Feb 17, 2019

majormoses commented Mar 5, 2019

boutetnico commented Mar 5, 2019

boutetnico commented Mar 5, 2019

majormoses Mar 6, 2019 •

edited

Loading

boutetnico Mar 6, 2019

majormoses Mar 12, 2019

jaredledvina Mar 12, 2019 •

edited by majormoses

Loading

majormoses Mar 12, 2019

majormoses Mar 12, 2019

jaredledvina Mar 12, 2019

majormoses Mar 12, 2019

majormoses Mar 12, 2019

jaredledvina Mar 12, 2019

jaredledvina Mar 12, 2019

majormoses Mar 12, 2019

majormoses Mar 12, 2019

majormoses Mar 12, 2019

jspaleta Mar 12, 2019

boutetnico commented May 10, 2019

majormoses commented May 15, 2019

Strip invalid lines #137

Are you sure you want to change the base?

Strip invalid lines #137

Conversation

boutetnico commented Feb 1, 2019 • edited Loading

Pull Request Checklist

General

Purpose

Known Compatibility Issues

majormoses commented Feb 17, 2019

majormoses commented Mar 5, 2019

boutetnico commented Mar 5, 2019

boutetnico commented Mar 5, 2019

majormoses Mar 6, 2019 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jaredledvina Mar 12, 2019 • edited by majormoses Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

boutetnico commented May 10, 2019

majormoses commented May 15, 2019

boutetnico commented Feb 1, 2019 •

edited

Loading

majormoses Mar 6, 2019 •

edited

Loading

jaredledvina Mar 12, 2019 •

edited by majormoses

Loading