
Expose job latency metric via ActiveSupport Notifications job middleware #362

Closed
stephenbinns wants to merge 3 commits

Conversation

stephenbinns

@stephenbinns commented May 17, 2022

This adds ActiveSupport notifications as job middleware. There is also a change to expose the latency before a job is picked up, which is useful for working out how close to capacity the workers are.
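
(For context, a rough sketch of the overall shape of the proposed middleware, pieced together from the diff hunks quoted further down in the review; the method signature, the finish timestamp, and the require line are assumptions rather than a verbatim copy of the PR's code:)

```ruby
require "active_support/notifications"

module Que
  module ActiveSupport
    module JobMiddleware
      # Sketch only: wraps job execution and publishes a notification with
      # queue latency plus monotonic start/finish timestamps.
      def self.call(job)
        started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
        yield
      ensure
        ::ActiveSupport::Notifications.publish(
          "que_job.worked",
          started,
          Process.clock_gettime(Process::CLOCK_MONOTONIC), # finish timestamp (assumed)
          job_class: job.que_attrs[:job_class],
          priority: job.que_attrs[:priority],
          queue: job.que_attrs[:queue],
          latency: job.que_attrs[:latency], # seconds past run_at, from the poller change below
        )
      end
    end
  end
end
```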

@hlascelles
Contributor

Good stuff... Related: #325

NB, using `now() - (j).run_at` as latency/delay can be misleading. If a job errors and retries with a backoff then the `run_at` changes. It would be best to add an `initial_run_at` column to the main jobs table, but that would be a bigger change...

Member

@ZimbiX left a comment


Interesting. I haven't used ActiveSupport notifications before, but knowing the job latency does sound like a useful proxy to understand the level of worker contention.

Would job latency be covered by your Prometheus metrics implementation? (Related issue: #267)

I see this is still in draft. I like where this is headed. What else would be left?

Thanks for the contribution! =)

spec/spec_helper.rb (outdated review thread, resolved)
lib/que/active_support/job_middleware.rb (outdated review thread, resolved)

module Que
  module ActiveSupport
    module JobMiddleware
Member


Some usage documentation would be in order, I reckon.

Author


Have added it to the readme - do you think it's better here?

Member


Ah, nah, I just wanted to start a thread so it'd be able to be marked resolved 😆

Cheers for adding that; looks good

@@ -63,6 +63,7 @@ class Poller
  SELECT
    (j).*,
    l.locked,
    extract(epoch from (now() - (j).run_at)) as latency,
Member


> NB, using `now() - (j).run_at` as latency/delay can be misleading. If a job errors and retries with a backoff then the `run_at` changes. It would be best to add an `initial_run_at` column to the main jobs table, but that would be a bigger change...

@hlascelles This is a good point. I think the intention here is to permit ascertaining worker contention - by allowing you to analyse the delay between when a job is desired to be run (or re-run), and when it's actually picked up to be run. In terms of that desire, I think this metric is more useful. But I'm sure there's also merit in #325.

lib/que/active_support/job_middleware.spec.rb (outdated review thread, resolved)

started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
yield
ensure
Member


I'd like a test for the exception case. But is publishing a que_job.worked notification even upon an exception really what you want? I'd have thought that'd mean the job is complete 🤔 There may be an example of this elsewhere..

Author


Sounds good, will add one shortly. I think so - in terms of the metric, we're more concerned with how Que is performing than with whether the jobs executed correctly.

@ZimbiX changed the title from "Job metrics" to "Expose job latency metric via ActiveSupport Notifications job middleware" on May 17, 2022
job_class: job.que_attrs[:job_class],
priority: job.que_attrs[:priority],
queue: job.que_attrs[:queue],
latency: job.que_attrs[:latency],


I think it would also be useful to have a duration published, so we know both how long the job was waiting in the queue and how long it took to execute. WDYT?

Author


Execution duration can currently be calculated from the start and end timestamps; we could make this more intuitive by doing that calculation for consumers.
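
(For illustration, a minimal sketch of what that calculation looks like for consumers today, mirroring the subscriber shape from the README snippet added in this PR; the reporting step is left as a comment since no metrics client is prescribed:)

```ruby
::ActiveSupport::Notifications.subscribe("que_job.worked") do |_name, started, finished, labels|
  duration = finished - started # seconds spent executing the job (monotonic timestamps from the middleware)
  latency  = labels[:latency]   # seconds the job waited past its run_at before being picked up
  # e.g. report `duration` and `latency` to your metrics system here
end
```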

@stephenbinns
Author

> Interesting. I haven't used ActiveSupport notifications before, but knowing the job latency does sound like a useful proxy to understand the level of worker contention.
>
> Would job latency be covered by your Prometheus metrics implementation? (Related issue: #267)
>
> I see this is still in draft. I like where this is headed. What else would be left?
>
> Thanks for the contribution! =)

Indeed, it is part of what we're monitoring. The remaining ones are a little trickier to figure out; as I mentioned previously, we'd ideally like to get off our fork, and as part of that we need to reach some sort of feature parity here.

yield
ensure
::ActiveSupport::Notifications.publish(
"que_job.worked",

@ameykusurkar May 17, 2022


Suggested change:
- "que_job.worked",
+ "que_job.job_worked",

Nitpick: what about prefixing the event with job_? This allows us to have different types of events

Member


que.job_worked? =P

@@ -798,6 +798,32 @@ Que.job_middleware.push(
)
```

#### Existing Middleware

Que ships with middleware to expose job metrics using ActiveSupport notifications to subscribe to it you can implelent the following
Member


*implement ;)

Que ships with middleware to expose job metrics using ActiveSupport notifications to subscribe to it you can implelent the following

```ruby
::ActiveSupport::Notifications.subscribe("que_job.worked") do |message, started, finished, labels|
Member


I just looked this up, and it seems that .subscribe doesn't have monotonic timestamps; .monotonic_subscribe does. Whichever you want to use here is fine, but let's document it correctly.

Member

@ZimbiX May 18, 2022


Sorry, you are passing monotonic times from the middleware... I guess I don't understand how this works then 😅
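
(For what it's worth: with a raw `::ActiveSupport::Notifications.publish`, the subscriber block receives exactly the arguments the publisher passed - no clock is involved on the subscriber side - so the middleware's monotonic timestamps come through unchanged. `.subscribe` vs `.monotonic_subscribe` only changes the timestamps that `instrument` generates. A minimal standalone sketch, with made-up event name and values:)

```ruby
require "active_support/notifications"

::ActiveSupport::Notifications.subscribe("demo.event") do |name, started, finished, payload|
  # These are exactly the values passed to publish below; nothing is measured here.
  puts [name, started, finished, payload].inspect
end

t0 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
::ActiveSupport::Notifications.publish("demo.event", t0, t0 + 1.5, { latency: 0.2 })
```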

called = true
end

Que::ActiveSupport::JobMiddleware.call(job) { job.que_error = "error" }
Member


Ah, I was thinking you'd need to raise within the block here, but you're right - an exception from a job does not bubble up to middleware.

Testing job failure is good, but what I meant was testing that it's in an ensure - i.e. that the notification still fires when an exception is encountered, if not one caused by the job then perhaps one raised by a subsequent middleware.

For realistic data, I think job.que_error should also be set to an instance of an exception, not that it would have any effect here.

Although.. I've just looked up the term job_worked in the code, and found that logging makes a distinction: job_worked vs job_errored. So maybe this middleware should do the same? Or find a more generic term - something like job_attempted?
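
(For illustration, a rough sketch of the kind of spec described above - the exception coming from the block, e.g. a subsequent middleware, rather than from the job; the `job` variable and the minitest-style `it`/`assert` setup are assumed from the surrounding spec file:)

```ruby
it "still publishes que_job.worked when the block raises" do
  called = false
  subscriber = ::ActiveSupport::Notifications.subscribe("que_job.worked") { called = true }

  begin
    assert_raises(RuntimeError) do
      Que::ActiveSupport::JobMiddleware.call(job) { raise "boom from a later middleware" }
    end

    assert called, "expected the notification to be published despite the exception"
  ensure
    ::ActiveSupport::Notifications.unsubscribe(subscriber)
  end
end
```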

@ZimbiX
Member

ZimbiX commented Aug 26, 2022

Closing in favour of #366

@ZimbiX closed this Aug 26, 2022