Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Strip invalid lines #137

Open
wants to merge 8 commits into
base: master
Choose a base branch
from

Conversation

boutetnico
Copy link

@boutetnico boutetnico commented Feb 1, 2019

Pull Request Checklist

Is this in reference to an existing issue? No

General

  • Update Changelog following the conventions laid out here

  • Update README with any necessary configuration snippets

  • Binstubs are created if needed

  • RuboCop passes

  • Existing tests pass

Purpose

This PR removes the following invalid lines from the output of the plugin:

elk01.elasticsearch.os.load_average
elk01.elasticsearch.indices.indexing.is_throttled false 1548954513
elk01.elasticsearch.indices.segments.file_sizes {} 1548954513
elk01.elasticsearch.fs.least_usage_estimate.path /data/elasticsearch/data/elk01-elk01/nodes/0 1548954513
elk01.elasticsearch.fs.most_usage_estimate.path /data/elasticsearch/data/elk01-elk01/nodes/0 1548954513
elk01.elasticsearch.fs.io_stats.devices {"device_name"=>"nvme1n1p1", "operations"=>157872407, "read_operations"=>691872, "write_operations"=>157180535, "read_kilobytes"=>50042456, "write_kilobytes"=>2478783444} 1548954513
elk01.elasticsearch.fs.io_stats.total {"operations"=>157872407, "read_operations"=>691872, "write_operations"=>157180535, "read_kilobytes"=>50042456, "write_kilobytes"=>2478783444} 1548954513

These lines are ignored by carbon and cause warning. This PR allows to remove these warnings.

Feb  1 15:59:49 sensu01 carbon-cache.py[480]: 01/02/2019 15:59:49 :: [listener] invalid line received from client 127.0.0.1:42532, ignoring [elk01.elasticsearch.indices.indexing.is_throttled false 1549036789]
Feb  1 15:59:49 sensu01 carbon-cache.py[480]: 01/02/2019 15:59:49 :: [listener] invalid line received from client 127.0.0.1:42532, ignoring [elk01.elasticsearch.indices.segments.file_sizes {} 1549036789]
Feb  1 15:59:49 sensu01 carbon-cache.py[480]: 01/02/2019 15:59:49 :: [listener] invalid line received from client 127.0.0.1:42532, ignoring [elk01.elasticsearch.fs.least_usage_estimate.path /data/elasticsearch/data/elk01-elk01/nodes/0 1549036789]
Feb  1 15:59:49 sensu01 carbon-cache.py[480]: 01/02/2019 15:59:49 :: [listener] invalid line received from client 127.0.0.1:42532, ignoring [elk01.elasticsearch.fs.most_usage_estimate.path /data/elasticsearch/data/elk01-elk01/nodes/0 1549036789]
Feb  1 15:59:49 sensu01 carbon-cache.py[480]: 01/02/2019 15:59:49 :: [listener] invalid line received from client 127.0.0.1:42532, ignoring [elk01.elasticsearch.fs.io_stats.devices {"device_name"=>"nvme1n1p1", "operations"=>160317981, "read_operations"=>705794, "write_operations"=>159612187, "read_kilobytes"=>51039824, "write_kilobytes"=>2539726412} 1549036789]
Feb  1 15:59:49 sensu01 carbon-cache.py[480]: 01/02/2019 15:59:49 :: [listener] invalid line received from client 127.0.0.1:42532, ignoring [elk01.elasticsearch.fs.io_stats.total {"operations"=>160317981, "read_operations"=>705794, "write_operations"=>159612187, "read_kilobytes"=>51039824, "write_kilobytes"=>2539726412} 1549036789]

Known Compatibility Issues

None

@majormoses
Copy link
Member

Hmm the change makes sense but in some of the cases I am wondering if we should do some extra manipulation for those cases so that we can try to include those. For example:

elk01.elasticsearch.indices.indexing.is_throttled false 1548954513

It seems like we can in that case probably just drop the false and send it.

Maybe in these cases:

elk01.elasticsearch.fs.least_usage_estimate.path /data/elasticsearch/data/elk01-elk01/nodes/0 1548954513
elk01.elasticsearch.fs.most_usage_estimate.path /data/elasticsearch/data/elk01-elk01/nodes/0 1548954513
elk01.elasticsearch.fs.io_stats.devices {"device_name"=>"nvme1n1p1", "operations"=>157872407, "read_operations"=>691872, "write_operations"=>157180535, "read_kilobytes"=>50042456, "write_kilobytes"=>2478783444} 1548954513

We can probably walk the additionally nested key values and return that. I am certainly OK with having some fallback to discard but I would like to handle and return as much data as possible. WDYT?

@majormoses majormoses self-assigned this Feb 17, 2019
@majormoses majormoses self-requested a review February 17, 2019 22:41
@majormoses
Copy link
Member

@boutetnico checking back and seeing what your thoughts were on my feedback.

@boutetnico
Copy link
Author

@majormoses I agree and will try to implement your suggestions.

@boutetnico
Copy link
Author

@majormoses I have made some progress, here is a diff of the keys from the ouput compared before and after this PR. I have not included keys that contain a path because it could lead to very long metrics keys. What do you think?

1d0
< elk01.elasticsearch.os.load_average
91d89
< elk01.elasticsearch.indices.segments.file_sizes
222d219
< elk01.elasticsearch.fs.least_usage_estimate.path
226d222
< elk01.elasticsearch.fs.most_usage_estimate.path
230,231c226,235
< elk01.elasticsearch.fs.io_stats.devices
< elk01.elasticsearch.fs.io_stats.total
---
> elk01.elasticsearch.fs.io_stats.devices.nvme1n1p1.operations
> elk01.elasticsearch.fs.io_stats.devices.nvme1n1p1.read_operations
> elk01.elasticsearch.fs.io_stats.devices.nvme1n1p1.write_operations
> elk01.elasticsearch.fs.io_stats.devices.nvme1n1p1.read_kilobytes
> elk01.elasticsearch.fs.io_stats.devices.nvme1n1p1.write_kilobytes
> elk01.elasticsearch.fs.io_stats.total.operations
> elk01.elasticsearch.fs.io_stats.total.read_operations
> elk01.elasticsearch.fs.io_stats.total.write_operations
> elk01.elasticsearch.fs.io_stats.total.read_kilobytes
> elk01.elasticsearch.fs.io_stats.total.write_kilobytes

bin/metrics-es-node-graphite.rb Outdated Show resolved Hide resolved
bin/metrics-es-node-graphite.rb Outdated Show resolved Hide resolved
bin/metrics-es-node-graphite.rb Outdated Show resolved Hide resolved
bin/metrics-es-node-graphite.rb Outdated Show resolved Hide resolved
@@ -264,7 +264,9 @@ def run
node['indices'].each do |type, index|
index.each do |k, v|
# #YELLOW
unless k =~ /(_time$)/ || v =~ /\d+/
if k.end_with? 'is_throttled'
metrics["indices.#{type}.#{k}"] = v ? 1 : 0
Copy link
Member

@majormoses majormoses Mar 6, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Im somewhat confused by this line v is not really a condition unless since its a truthy value. I actually thought it would be a string and therefore not work. Am I understanding that correctly? If that's the case we probably want to either create a helper method to convert string representation to a boolean or comment the above so that I can recall a year later why this works.

Here is a really simple example of doing this with a helper function

# define our function
irb(main):004:0> def true?(obj)
irb(main):005:1> obj.to_s == 'true'
irb(main):006:1> end

# leverage our function
irb(main):008:0> true?('true')
=> true
irb(main):009:0> true?('false')
=> false
irb(main):010:0> true?('foo')
=> false

One thing you will notice in this approach is that it will treat all non true values as false which could be a good or bad thing. The good thing is that this means that if a non string was passed to this it would not break but would be returning false data. The bad thing is that it would be returning false data which could be argued is as bad. We could also flip that around from a true to a false just as easily which while it would return false data it would be returning that it is being throttled which may prompt engineers to look at it and report a bug.

If we wanted to be even safer we could define it like so:

irb(main):011:0> def true?(obj)
irb(main):012:1> if obj.to_s == "true"
irb(main):013:2> true
irb(main):014:2> elsif obj.to_s == "false"
irb(main):015:2> false
irb(main):016:2> else
irb(main):017:2* "#{obj.to_s} is not a truthy value, please open an issue with this output so we can fix it"
irb(main):018:2> end
irb(main):019:1> end
=> :true?

To leverage it we could do something like this:

irb(main):020:0> true?('true')
=> true
irb(main):021:0> true?(true)
=> true
irb(main):022:0> true?('false')
=> false
irb(main):023:0> true?('foo')
=> "foo is not a truthy value, please open an issue with this output so we can fix it"

Here is an example of how we can then leverage the building blocks and decide what to do with it:

irb(main):025:0> if true?('foo').class == String
irb(main):026:1> raise(true?('foo'))
irb(main):027:1> else
irb(main):028:1* true?('true')
irb(main):029:1> end
RuntimeError: foo is not a truthy value, please open an issue with this output so we can fix it
	from (irb):26
	from /home/babrams/.rbenv/versions/2.4.1/bin/irb:11:in `<main>'

I need to think about it more and am open to feedback on what's the "best" thing to do in the face of such an issue. I think we have a couple options:

  • not attempt to handle at all and bail with an ugly stack trace
  • catch the error and return incorrect data if so do we return true or false?
  • catch the error and abort with a meaningful message
  • catch the error and write to STDERR rather than STDOUT, not sure if handlers take both stdout and and stderr input for metric input, I have honestly spent more time writing state checks than metric gathering scripts. I will either hunt through the code on this or reach out and get back to you on this.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this. I have followed your suggestions, can you please have a look at it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will do I am also gonna see if someone else can review this particular portion and give me some feedback.

Copy link
Member

@jaredledvina jaredledvina Mar 12, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My 2 cents:

not attempt to handle at all and bail with an ugly stack trace

Ugh, I've seen this approach taken, strongly prefer to not doing this as if we know of a failure mode.

catch the error and return incorrect data if so do we return true or false?

My gut says it's better to not return any data versus returning incorrect data. I'd rather someone debug why data isn't there versus using data that's not correct to make a design.

catch the error and abort with a meaningful message

❤️

catch the error and write to STDERR rather than STDOUT...

Hmmm, this is an interesting idea https://github.com/sensu/sensu/blob/master/lib/sensu/client/process.rb#L167 and seem to imply that STDERR would be captured, https://github.com/sensu/sensu/blob/master/lib/sensu/client/process.rb#L198-L200 so that's not crazy in my opinion. If we do do that, I'd like to have someone quickly test what STDERR ends up looking like in the sensu event just to make sure nothing weird happens there (since I've never used that approach).

Overall, I agree the testing for the truthiness of the value and returning it and bailing cleanly if we don't get one is probably the best move here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I am leaning towards the printing to STDERR unless someone has a better idea.

@@ -264,7 +264,9 @@ def run
node['indices'].each do |type, index|
index.each do |k, v|
# #YELLOW
unless k =~ /(_time$)/ || v =~ /\d+/
if k.end_with? 'is_throttled'
metrics["indices.#{type}.#{k}"] = true?(v) ? 1 : 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only thing I am not sure on is what is the best thing to do here. I am leaning towards:

case true?(v)
when true
  1
when false
  0
else
  # not sure if we should raise, ok, critical, or return a `(0|1)`
  raise(true?(v))
end

I would not worry about implementing it until I have had someone else take a look.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm does it make sense to have true? raise in the:

  else
    "#{obj} is not a truthy value, please open an issue with this output so we can fix it"

bit?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about that initially but was not sure what we wanted to do so I illustrated what could be done and this allowed us to always return something and then the handling of that could be implemented independently.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

based on the feedback and what I was already leaning towards I think the correct solution would be to do something like:

else
    STDERR.puts "#{obj} is not a truthy value, please open an issue with this output so we can fix it"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see, I see. So off hand, I'm not aware of a good way for a metric script to actually log to sensu (although the STDERR bit you mentioned might work, just worried that'll also get sent along).

Unless there's something I'm not aware of, I'd lean towards marking the current metric as unknown if possible otherwise, let's catch and raise here. I hope this condition is rare enough that this code path never has to happen.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, is_throttled also feels like something that might work better as a dedicated Sensu check outside of this metric script.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming the change to have it print to STDERR and our assumptions and understanding of the code are correct I think the logic here would look something like this:

if k.end_with? 'is_throttled'
  state = true?(v)
  unless state.class == 'String'
    metrics["indices.#{type}.#{k}"] = true?(v) ? 1 : 0
  end
end

that would return the right value when it can and when it can't it should return nothing but print a useful message to STDERR that sensu-client will pick up. I would think that any handler would not be affected as I would be surprised if sensu client sent it as STDIN which is typically what a pipe handler will by default work with. I would love some clarification from someone at sensu.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, is_throttled also feels like something that might work better as a dedicated Sensu check outside of this metric script.

I can see that but IMHO they are separate concerns, I tend to try to conceptualize all sensu "checks" into the following broad categories:

  • state checking: this can be something against a local host, a remote host, etc and can either be looking at the current state of a system or some kind of representation of previous states. To further clarify you could search against a TSDB to get the average, min, max, standard deviation, etc of the last n number of state results. This is what I internally refer to as metric checks but avoid this terminology externally because it can confuse people.
  • data collection: these are what are commonly referred to the "metric" scripts in the sensu community. I think they should attempt to report data, bail with an error, or potentially a combination.

I think that yes it makes a lot of sensu to create a "state check" for that but the question is do you want to check against the current state or against something else.

Anyways those are just some thoughts/opinions and I doubt there is a "right" or "wrong" answer and its a matter of perspective. Perhaps we make this a CLI arg to control this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I thought about it and I think we should return a String or a (True|False)Class object and do the handling in the other function:

if k.end_with? 'is_throttled'
  state = true?(v)
  if state.class == 'String'
    STDERR.puts state
   # optionally we could additionally set the value to `nil` here but that relies on the appropriate system from handling this properly which seems less than ideal
  # to play devils advocate though we could have something at the end of doing all the metric massinging drop any metrics with a `nil` value which I think does have some real value as it means that we can do a final cleanup/sanitization of the output before we ship it.
  else
    metrics["indices.#{type}.#{k}"] = true?(v) ? 1 : 0
  end
end

I'm still not confident that this is the correct approach either but its something to consider.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little concerned about throwing anything to stderr, as it will garble the resulting metrics output.

Sensu combines both strerr and stdout into the same stream when running a check command
See sensu-spawn for reference on that.

Also it looks to me like this check logic already silently ignores any metric key/value pair where value is not Numeric type...including booleans... Ruby true is not a Numeric. The logic skips all non-Numeric from final output. Again.. probably to make sure the output is sane metrics format.

at the end of the run() statement... metrics key/value are iterated and checked for Numeric type

metrics.each do |k, v|
      if v.is_a? Numeric
        output([config[:scheme], k].join('.'), v, timestamp)
      end
end

This is a full scrubbing of metrics for sanity.

Here's what I would do.

  1. if known truthy have true() return true => store Numeric 1 into metric key value
  2. if known falsey have true() return false => store Numeric 0 into metric key value
  3. unknown have true() return explainable String => store String into metric key value
    this will be scrubbed from metrics.
  4. if there are any scrubbed non-numeric metrics in that final output scrub set output status to 1 == WARNING. This tells us the metric logic is still sub-optimal but allows the check to operate.
  5. Optionally create new command line switch to raise an error msg listing all the scrubbed metrics if any metric key value is not Numeric for debugging purposes. Allows us to understand why status is WARNING so we can drive a fix into the logic.

@boutetnico
Copy link
Author

@majormoses going back to the initial idea behind this PR, can you advise on what I should change so that it can be merged?

@majormoses
Copy link
Member

@boutetnico sounds like we are supposed to silently swallow errors which does not make me feel good. IMHO when its a check type of metric while sensu should capture the stdout and stderr I think the stderr should be reserved for reporting an error back to the operator and the handlers when shipping data should ignore stderr. @jspaleta I think this warrants more discussion but for now that seems like the appropriate path.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants