
Expose job latency metric via ActiveSupport Notifications job middleware #366

Draft · wants to merge 4 commits into master
Conversation

mnbbrown
Contributor

@mnbbrown mnbbrown commented Jul 8, 2022

...picking up from @stephenbinns's good work in #362 from a fork in the gocardless GH org.

This adds ActiveSupport notifications as job middleware. There is also a change to expose the latency between a job becoming runnable and being picked up, which is useful for working out how close to capacity the workers are.
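
To make the idea concrete, here's a minimal, self-contained sketch of the job-middleware pattern and the pickup-latency calculation described above. `MiniQue` and the field names are stand-ins for illustration, not Que's actual API:

```ruby
# Hypothetical sketch of job middleware measuring pickup latency.
# MiniQue is a stand-in, not Que's real interface.
module MiniQue
  @job_middleware = []

  class << self
    attr_reader :job_middleware

    # Wrap the unit of work in every registered middleware, innermost-first.
    def run_job(job, &work)
      chain = @job_middleware.reverse.reduce(work) do |next_step, middleware|
        -> { middleware.call(job, &next_step) }
      end
      chain.call
    end
  end
end

events = []

# Middleware that records pickup latency: the gap between when the job
# was due to run and when a worker actually started it.
MiniQue.job_middleware << lambda do |job, &block|
  latency = job[:started_at] - job[:run_at]
  events << { name: "que_job.worked", job_class: job[:job_class], latency: latency }
  block.call
end

job = {
  job_class: "SampleJob",
  run_at: Time.utc(2022, 7, 8, 12, 0, 0),
  started_at: Time.utc(2022, 7, 8, 12, 0, 3),
}
result = MiniQue.run_job(job) { :worked }
```

In the real middleware, the wrapped block would be the job execution itself, and the event would presumably be emitted via `ActiveSupport::Notifications.instrument` rather than appended to an array.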

@mnbbrown mnbbrown marked this pull request as draft July 8, 2022 15:12
@ZimbiX
Member

ZimbiX commented Jul 11, 2022

@mnbbrown I don't understand - this PR has the exact same content as #362

@mnbbrown
Contributor Author

@mnbbrown I don't understand - this PR has the exact same content as #362

That PR's upstream is @stephenbinns's personal github. We've forked onto the @gocardless org so we can all contribute.
You should consider this the canonical PR.

Sorry for the confusion! I knew this would happen :(

@mnbbrown
Contributor Author

As mentioned previously, our fork implements "first-class" support for Prometheus. We thought that instead of upstreaming all of that, we'd just make a generic way for users to implement their own metric collection, irrespective of whether they're using Prometheus or not.

We've been testing this internally, and for it to work we need to run que wrapped in an HTTP server that implements /metrics. The current structure of the CLI doesn't really lend itself to running that way. We're left with a couple of options:

  1. Update upstream to support running the CLI from our code. This would involve moving bin/command_line_interface.rb into lib and wrapping some of the code so it could be called like CLI.run from another entrypoint.
  2. Upstream the Prometheus support instead of using ActiveSupport Notifications.
  3. Implement the HTTP server in some form of worker "middleware"

What would be your preferred approach @ZimbiX?

@ZimbiX
Member

ZimbiX commented Aug 26, 2022

Ah ha. Alright, I'll close #362 then, but the feedback left there is still valuable.

As mentioned previously our fork implements "first-class" support for prometheus. We thought instead of upstreaming all that, we'd just make a generic way for users to implement their own metric collection - irrespective of if they're using prometheus or not.

Hmm. It's been a while since I skimmed your Prometheus work, so I don't recall what you had. But yeah, given that Que's aim is to be as minimal/focused as possible, maybe Prometheus support should be its own gem, e.g. que-prometheus, and Que itself should just have whatever mechanisms are required to facilitate this kind of monitoring.

We've been testing this internally - and for it to work we need to run que wrapped in a HTTP server that implements /metrics.

That makes sense.

  1. Update upstream to support running the CLI from our code. This would involve moving bin/command_line_interface.rb into lib and wrapping some of the code so it could be called like CLI.run from another entrypoint.

Ah yeah, moving Que::CommandLineInterface from bin to lib would be a good idea regardless of which way you go - I'd thought that too a while ago. I'd happily merge a PR that's doing only that. What are you thinking with CLI.run though?

  2. Upstream the Prometheus support instead of using ActiveSupport Notifications.

Even if Prometheus support is not added to Que itself, I would prefer using something other than ActiveSupport for the hooks.

Another alternative could be to make use of the que_state_notify trigger. That would also allow it to run as a separate process!

  3. Implement the HTTP server in some form of worker "middleware"

Would you even need middleware to be able to run an HTTP server from the Que process, though? I imagine the Ruby file you provide to Que could spawn a thread to run the HTTP server?
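
For what it's worth, a bare-bones version of that thread-spawning idea can be sketched with just the Ruby standard library (no Rack or WEBrick); the metric text below is a placeholder, not real que-prometheus output:

```ruby
require "socket"

# Sketch: serve a /metrics endpoint from a background thread, so the main
# thread stays free for Que. Port 0 lets the OS pick a free port.
server = TCPServer.new("127.0.0.1", 0)
port = server.addr[1]

metrics_thread = Thread.new do
  loop do
    client = server.accept
    client.gets # consume the request line; headers ignored for brevity
    body = "que_jobs_worked_total 42\n" # placeholder metric text
    client.print "HTTP/1.1 200 OK\r\n" \
                 "Content-Type: text/plain\r\n" \
                 "Content-Length: #{body.bytesize}\r\n\r\n#{body}"
    client.close
  end
end

# Simulate a Prometheus scrape against the endpoint:
sock = TCPSocket.new("127.0.0.1", port)
sock.print "GET /metrics HTTP/1.1\r\nHost: localhost\r\n\r\n"
response = sock.read
sock.close
metrics_thread.kill
server.close
```

A real setup would plug an actual metrics registry in where the placeholder body is, but the threading shape is the same.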


Lots of options! =P I'm not yet sure what would be my preferred approach. Something that others can benefit the most from would be best.

@mnbbrown
Contributor Author

Ah yeah, moving Que::CommandLineInterface from bin to lib would be a good idea regardless of which way you go - I'd thought that too a while ago. I'd happily merge a PR that's doing only that.

As requested :) #378

@mnbbrown
Contributor Author

Bit of a deep dive into options....

Listen on que_state

Another alternative could be to make use of the que_state_notify trigger (#372). That would also allow it to run as a separate process!

I see you've marked this as deprecated in the latest release. I also don't think it's quite fit for purpose here. The events contain only basic information about the job:

{
  "message_type" => "job_change",
  "id" => 47,
  "queue" => "sample-queue",
  "tags" => [],
  "run_at" => "2022-08-27T11:08:12.175831Z",
  "time" => "2022-08-27T11:08:15.225358Z",
  "job_class" => "SampleJob",
  "previous_state" => "ready",
  "current_state" => "nonexistent"
}

Although this is sufficient to measure things like latency, it's missing the information needed to measure the time a job takes to be worked, as well as other important metadata like job priority.

Listener.subscribe do |event|
  next if event["current_state"] != "nonexistent"

  latency = Time.parse(event["time"]).to_i - Time.parse(event["run_at"]).to_i
  labels = {
    job_class: event["job_class"],
    queue: event["queue"],
    latency: latency,
    priority: NOT_AVAILABLE,
  }
  # set prometheus counters, etc
end

Start a thread in the provided Ruby file

I created a basic implementation here: https://github.com/mnbbrown/que-prometheus

On require, it boots the web server on a thread and registers some job middleware. Although it feels a bit clunky, it works well for job-level metrics and is supported by the current que version with no changes. I also think it should easily support queue-level metrics via a materialised view, like we do in our fork.

It has a couple of limitations though:

  1. Because there's no clean way to introspect the current state of the Locker (that I've thought of, anyway), it currently doesn't support some important metrics like running worker thread count and expected worker thread count.
  2. Latency: I can't seem to get latency passed into the job middleware call method - but this might just require some further debugging on my side.

@ZimbiX
Member

ZimbiX commented Aug 28, 2022

Wow, great work putting that gem together - that was fast! I really like that approach.

I see you've marked this as deprecated in the latest release.

Yeah, that was deprecated perhaps prematurely.

I also don't think it's quite fit for purpose here. The events contain only basic information about the job

Fair enough!

Because there's no clean way to introspect the current state of the Locker (that I've thought of, anyway), it currently doesn't support some important metrics like running worker thread count and expected worker thread count.

Hmm, I can't see a clean way to access the locker either. This could be a good reason to add an accessor for it, like Que.locker. Until that's done, you could do something like:

module Que
  class << self
    attr_accessor :locker
  end

  # Prepending preserves the original Locker#initialize; redefining
  # initialize in a reopened class and calling super would skip it.
  module LockerAccessor
    def initialize(**options)
      Que.locker = self
      super(**options)
    end
  end

  Locker.prepend(LockerAccessor)
end

Then, as you've probably seen, you can access the worker threads with Que.locker.workers.map(&:thread) (and locker thread with Que.locker.thread).
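
For example (using stand-in structs in place of Que's real Locker and Worker objects), a worker-thread-count gauge could then be derived like this:

```ruby
# Stand-ins for Que's Locker and Worker, purely for illustration.
WorkerStub = Struct.new(:thread)
LockerStub = Struct.new(:workers)

threads = Array.new(3) { Thread.new { sleep } }
locker = LockerStub.new(threads.map { |t| WorkerStub.new(t) })

# Mirrors the Que.locker.workers.map(&:thread) access described above,
# e.g. to feed a running-worker-thread-count gauge:
worker_threads = locker.workers.map(&:thread)
running = worker_threads.count(&:alive?)

threads.each(&:kill)
```

The same shape would work against a real `Que.locker` accessor once one exists.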

@mnbbrown
Contributor Author

Hmm, I can't see a clean way to access the locker either. This could be a good reason to add an accessor for it, like Que.locker. Until that's done, you could do something like:

This works really well: #380

See the latest main on https://github.com/mnbbrown/que-prometheus to see it in action.

@mnbbrown
Contributor Author

mnbbrown commented Sep 8, 2022

Slowly making progress towards being able to close this PR..

#382 is another PR we need to support que-prometheus properly.
