
Expose job latency metric via ActiveSupport Notifications job middleware #366

Draft · wants to merge 4 commits into master
Conversation

mnbbrown
Contributor

@mnbbrown mnbbrown commented Jul 8, 2022

...picking up from @stephenbinns's good work in #362 from a fork in the gocardless GH org.

This adds ActiveSupport notifications as job middleware. There is also a change to expose the latency between a job becoming runnable and being picked up, which is useful for working out how close to capacity the workers are.
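
To make the idea concrete, here's a minimal, self-contained sketch of the job-middleware pattern and the pickup-latency calculation described above. `MiniQue` and the field names are stand-ins for illustration, not Que's actual API:

```ruby
# Hypothetical sketch of job middleware measuring pickup latency.
# MiniQue is a stand-in, not Que's real interface.
module MiniQue
  @job_middleware = []

  class << self
    attr_reader :job_middleware

    # Wrap the unit of work in every registered middleware, innermost-first.
    def run_job(job, &work)
      chain = @job_middleware.reverse.reduce(work) do |next_step, middleware|
        -> { middleware.call(job, &next_step) }
      end
      chain.call
    end
  end
end

events = []

# Middleware that records pickup latency: the gap between when the job
# was due to run and when a worker actually started it.
MiniQue.job_middleware << lambda do |job, &block|
  latency = job[:started_at] - job[:run_at]
  events << { name: "que_job.worked", job_class: job[:job_class], latency: latency }
  block.call
end

job = {
  job_class: "SampleJob",
  run_at: Time.utc(2022, 7, 8, 12, 0, 0),
  started_at: Time.utc(2022, 7, 8, 12, 0, 3),
}
result = MiniQue.run_job(job) { :worked }
```

In the real middleware, the wrapped block would be the job execution itself, and the event would presumably be emitted via `ActiveSupport::Notifications.instrument` rather than appended to an array.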

@mnbbrown mnbbrown marked this pull request as draft July 8, 2022 15:12
@ZimbiX
Member

ZimbiX commented Jul 11, 2022

@mnbbrown I don't understand - this PR has the exact same content as #362

@mnbbrown
Contributor Author

@mnbbrown I don't understand - this PR has the exact same content as #362

That PR's upstream is @stephenbinns's personal github. We've forked onto the @gocardless org so we can all contribute.
You should consider this the canonical PR.

Sorry for the confusion! I knew this would happen :(

@mnbbrown
Contributor Author

As mentioned previously, our fork implements "first-class" support for Prometheus. We thought that instead of upstreaming all of that, we'd just make a generic way for users to implement their own metric collection, irrespective of whether they're using Prometheus or not.

We've been testing this internally, and for it to work we need to run que wrapped in an HTTP server that implements /metrics. The current structure of the CLI doesn't really lend itself to running that way. We're left with a couple of options:

  1. Update upstream to support running the CLI from our code. This would involve moving bin/command_line_interface.rb into lib and wrapping some of the code so it could be called like CLI.run from another entrypoint.
  2. Upstream the Prometheus support instead of using ActiveSupport Notifications.
  3. Implement the HTTP server in some form of worker "middleware"

What would be your preferred approach @ZimbiX?

@ZimbiX
Member

ZimbiX commented Aug 26, 2022

Ah ha. Alright, I'll close #362 then, but the feedback left there is still valuable.

As mentioned previously our fork implements "first-class" support for prometheus. We thought instead of upstreaming all that, we'd just make a generic way for users to implement their own metric collection - irrespective of if they're using prometheus or not.

Hmm. It's been a while since I skimmed your Prometheus work, so I don't recall what you had. But yeah, given that Que's aim is to be as minimal/focused as possible, maybe Prometheus support should be its own gem, e.g. que-prometheus, and Que itself should just have whatever mechanisms are required to facilitate this kind of monitoring.

We've been testing this internally - and for it to work we need to run que wrapped in a HTTP server that implements /metrics.

That makes sense.

  1. Update upstream to support running the CLI from our code. This would involve moving bin/command_line_interface.rb into lib and wrapping some of the code so it could be called like CLI.run from another entrypoint.

Ah yeah, moving Que::CommandLineInterface from bin to lib would be a good idea regardless of which way you go - I'd thought that too a while ago. I'd happily merge a PR that's doing only that. What are you thinking with CLI.run though?

  2. Upstream the Prometheus support instead of using ActiveSupport Notifications.

Even if Prometheus support is not added to Que itself, I would prefer using something other than ActiveSupport for the hooks.

Another alternative could be to make use of the que_state_notify trigger. That would also allow it to run as a separate process!

  3. Implement the HTTP server in some form of worker "middleware"

Would you even need middleware to be able to run an HTTP server from the Que process, though? I imagine the Ruby file you provide to Que could spawn a thread to run the HTTP server?
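
For what it's worth, a bare-bones version of that thread-spawning idea can be sketched with just the Ruby standard library (no Rack or WEBrick); the metric text below is a placeholder, not real que-prometheus output:

```ruby
require "socket"

# Sketch: serve a /metrics endpoint from a background thread, so the main
# thread stays free for Que. Port 0 lets the OS pick a free port.
server = TCPServer.new("127.0.0.1", 0)
port = server.addr[1]

metrics_thread = Thread.new do
  loop do
    client = server.accept
    client.gets # consume the request line; headers ignored for brevity
    body = "que_jobs_worked_total 42\n" # placeholder metric text
    client.print "HTTP/1.1 200 OK\r\n" \
                 "Content-Type: text/plain\r\n" \
                 "Content-Length: #{body.bytesize}\r\n\r\n#{body}"
    client.close
  end
end

# Simulate a Prometheus scrape against the endpoint:
sock = TCPSocket.new("127.0.0.1", port)
sock.print "GET /metrics HTTP/1.1\r\nHost: localhost\r\n\r\n"
response = sock.read
sock.close
metrics_thread.kill
server.close
```

A real setup would plug an actual metrics registry in where the placeholder body is, but the threading shape is the same.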


Lots of options! =P I'm not yet sure what would be my preferred approach. Something that others can benefit the most from would be best.

@mnbbrown
Contributor Author

Ah yeah, moving Que::CommandLineInterface from bin to lib would be a good idea regardless of which way you go - I'd thought that too a while ago. I'd happily merge a PR that's doing only that.

As requested :) #378

@mnbbrown
Contributor Author

Bit of a deep dive into options....

Listen on que_state

Another alternative could be to make use of the que_state_notify trigger (#372). That would also allow it to run as a separate process!

I see you've marked this as deprecated in the latest release. I also don't think it's quite fit for purpose here. The events contain only basic information about the job:

{
  "message_type" => "job_change",
  "id" => 47,
  "queue" => "sample-queue",
  "tags" => [],
  "run_at" => "2022-08-27T11:08:12.175831Z",
  "time" => "2022-08-27T11:08:15.225358Z",
  "job_class" => "SampleJob",
  "previous_state" => "ready",
  "current_state" => "nonexistent"
}

Although this is sufficient to measure things like latency, it's missing the information needed to measure the time a job takes to be worked, as well as other important metadata like job priority.

Listener.subscribe do |event|
  next if event["current_state"] != "nonexistent"

  latency = Time.parse(event["time"]).to_i - Time.parse(event["run_at"]).to_i
  labels = {
    job_class: event["job_class"],
    queue: event["queue"],
    latency: latency,
    priority: NOT_AVAILABLE,
  }
  # set prometheus counters, etc
end

Start a thread in the provided Ruby file

I created a basic implementation here: https://github.com/mnbbrown/que-prometheus

On require, it boots the web server on a thread and registers some job middleware. Although it feels a bit clunky, it works well for job-level metrics and is supported by the current que version with no changes. I also think it should easily support queue-level metrics via a materialised view, like we do in our fork.

It has a couple of limitations though:

  1. Because there's no clean way to introspect the current state of the Locker (that I've thought of, anyway), it currently doesn't support some important metrics like running worker thread count and expected worker thread count.
  2. Latency: I can't seem to get latency passed into the job middleware call method - but this might just require some further debugging on my side.

@ZimbiX
Member

ZimbiX commented Aug 28, 2022

Wow, great work putting that gem together - that was fast! I really like that approach.

I see you've marked this as deprecated in the latest release.

Yeah, that was deprecated perhaps prematurely.

I also don't think it's quite fit for purpose here. The events contain only basic information about the job

Fair enough!

Because there's no clean way to introspect the current state of the Locker (that I've thought of, anyway), it currently doesn't support some important metrics like running worker thread count and expected worker thread count.

Hmm, I can't see a clean way to access the locker either. This could be a good reason to add an accessor for it, like Que.locker. Until that's done, you could do something like:

module Que
  class << self
    attr_accessor :locker
  end

  # Prepending preserves the original Locker#initialize; redefining
  # initialize in a reopened class and calling super would skip it.
  module LockerAccessor
    def initialize(**options)
      Que.locker = self
      super(**options)
    end
  end

  Locker.prepend(LockerAccessor)
end

Then, as you've probably seen, you can access the worker threads with Que.locker.workers.map(&:thread) (and locker thread with Que.locker.thread).
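
For example (using stand-in structs in place of Que's real Locker and Worker objects), a worker-thread-count gauge could then be derived like this:

```ruby
# Stand-ins for Que's Locker and Worker, purely for illustration.
WorkerStub = Struct.new(:thread)
LockerStub = Struct.new(:workers)

threads = Array.new(3) { Thread.new { sleep } }
locker = LockerStub.new(threads.map { |t| WorkerStub.new(t) })

# Mirrors the Que.locker.workers.map(&:thread) access described above,
# e.g. to feed a running-worker-thread-count gauge:
worker_threads = locker.workers.map(&:thread)
running = worker_threads.count(&:alive?)

threads.each(&:kill)
```

The same shape would work against a real `Que.locker` accessor once one exists.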

@mnbbrown
Contributor Author

Hmm, I can't see a clean way to access the locker either. This could be a good reason to add an accessor for it, like Que.locker. Until that's done, you could do something like:

This works really well: #380

See the latest main on https://github.com/mnbbrown/que-prometheus to see it in action.

@mnbbrown
Contributor Author

mnbbrown commented Sep 8, 2022

Slowly making progress towards being able to close this PR..

#382 is another PR we need to support que-prometheus properly.
