Expose job latency metric via ActiveSupport Notifications job middleware #362
Conversation
Good stuff... Related: #325
Interesting. I haven't used ActiveSupport notifications before, but knowing the job latency does sound like a useful proxy to understand the level of worker contention.
Would job latency be covered by your Prometheus metrics implementation? (Related issue: #267)
I see this is still in draft. I like where this is headed. What else would be left?
Thanks for the contribution! =)
module Que
  module ActiveSupport
    module JobMiddleware
Some usage documentation would be in order, I reckon.
Have added it to the readme; do you think it's better here?
Ah, nah, I just wanted to start a thread so it'd be able to be marked resolved 😆
Cheers for adding that; looks good
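For readers without the full diff in front of them, here is roughly what the middleware under review looks like, pieced together from the fragments quoted in this thread (a sketch only, not the exact code from the PR):

```ruby
# Sketch assembled from the snippets quoted in this review; details may differ
# from the actual PR.
module Que
  module ActiveSupport
    module JobMiddleware
      def self.call(job)
        started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
        yield
      ensure
        # Published even when the block raises, so failed attempts are measured too.
        ::ActiveSupport::Notifications.publish(
          "que_job.worked",
          started,
          Process.clock_gettime(Process::CLOCK_MONOTONIC),
          {
            job_class: job.que_attrs[:job_class],
            priority:  job.que_attrs[:priority],
            queue:     job.que_attrs[:queue],
            latency:   job.que_attrs[:latency],
          }
        )
      end
    end
  end
end
```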
@@ -63,6 +63,7 @@ class Poller
      SELECT
        (j).*,
        l.locked,
        extract(epoch from (now() - (j).run_at)) as latency,
NB, using `now() - (j).run_at` as latency/delay can be misleading. If a job errors and retries with a backoff, then the `run_at` changes. It would be best to add an "initial_run_at" column to the main jobs table, but that would be a bigger change....
@hlascelles This is a good point. I think the intention here is to permit ascertaining worker contention, by allowing you to analyse the delay between when a job is due to run (or re-run) and when it's actually picked up. In terms of that desire, I think this metric is more useful. But I'm sure there's also merit in #325.
        started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
        yield
      ensure
I'd like a test for the exception case. But is publishing a `que_job.worked` notification even upon an exception really what you want? I'd have thought that'd mean the job is complete 🤔 There may be an example of this elsewhere..
Sounds good, will add one shortly. I think so: in terms of the metric, we're more concerned with how Que is performing rather than whether the jobs executed correctly.
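A rough sketch of what such a test could look like, in RSpec style (the PR's actual suite and job double may differ; the point is that the `ensure` path still publishes when the wrapped block raises, e.g. because a subsequent middleware fails):

```ruby
# Sketch only: verifies the notification is published even when the block raises.
it "publishes que_job.worked even when the block raises" do
  events = []
  subscriber = ::ActiveSupport::Notifications.subscribe("que_job.worked") { |*args| events << args }

  expect {
    Que::ActiveSupport::JobMiddleware.call(job) { raise "raised by a later middleware" }
  }.to raise_error("raised by a later middleware")

  expect(events.length).to eq(1)

  ::ActiveSupport::Notifications.unsubscribe(subscriber)
end
```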
          job_class: job.que_attrs[:job_class],
          priority: job.que_attrs[:priority],
          queue: job.que_attrs[:queue],
          latency: job.que_attrs[:latency],
I think it would also be useful to have a `duration` published, so we know both how long the job was waiting in the queue, and how long it took to execute. WDYT?
Execution duration can currently be calculated from the start and end timestamps; we could make this more intuitive by doing that calculation for consumers.
Indeed, it is part of what we're monitoring. The remaining ones are a little trickier to figure out; as I mentioned previously, we'd ideally like to get off our fork, and as part of that we need to reach some sort of feature parity here.
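For example, a subscriber can derive the execution duration from the two timestamps the middleware already publishes (a sketch; the event name and label keys are taken from the quoted diff, and the `puts` reporting is just a placeholder):

```ruby
::ActiveSupport::Notifications.subscribe("que_job.worked") do |_message, started, finished, labels|
  duration = finished - started   # execution time, from the published monotonic clock readings
  latency  = labels[:latency]     # time spent waiting to be picked up, from the poller's SQL
  puts "#{labels[:job_class]} ran in #{duration.round(3)}s after waiting #{latency.to_f.round(3)}s"
end
```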
        yield
      ensure
        ::ActiveSupport::Notifications.publish(
          "que_job.worked",
"que_job.worked", | |
"que_job.job_worked", |
Nitpick: what about prefixing the event with `job_`? This allows us to have different types of events.
`que.job_worked`? =P
@@ -798,6 +798,32 @@ Que.job_middleware.push(
)
```

#### Existing Middleware

Que ships with middleware to expose job metrics using ActiveSupport notifications to subscribe to it you can implelent the following
*implement ;)
```ruby
::ActiveSupport::Notifications.subscribe("que_job.worked") do |message, started, finished, labels|
I just looked this up, and it seems that `.subscribe` doesn't have monotonic timestamps; `.monotonic_subscribe` does. Whichever you want to use here is fine, but let's document it correctly.
Sorry, you are passing monotonic times from the middleware... I guess I don't understand how this works then 😅
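To spell out the distinction for anyone else who trips over it: as far as I can tell, `.monotonic_subscribe` only changes the timestamps that `instrument` generates, whereas with `publish` subscribers receive exactly the arguments that were published, so this middleware supplies its own monotonic readings. A rough illustration (assumes Rails 6.1+, where `monotonic_subscribe` was added):

```ruby
# With instrument, a plain .subscribe yields wall-clock Time objects, while
# .monotonic_subscribe yields monotonic floats that are safe to subtract.
::ActiveSupport::Notifications.monotonic_subscribe("some.event") do |_name, start, finish, _id, _payload|
  puts "took #{(finish - start).round(3)}s"
end

::ActiveSupport::Notifications.instrument("some.event") { sleep 0.1 }

# With publish (what this middleware uses), subscribers just receive whatever
# arguments were published - here, the monotonic readings taken in the middleware.
```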
  called = true
end

Que::ActiveSupport::JobMiddleware.call(job) { job.que_error = "error" }
Ah, I was thinking you'd need to raise within the block here, but you're right - an exception from a job does not bubble up to middleware.

Testing job failure is good, but what I meant was testing that it's in an `ensure` - that it still publishes when it encounters an exception, if not one caused by the job then perhaps one raised by a subsequent middleware.

For realistic data, I think `job.que_error` should also be set to an instance of an exception, not that it would have any effect here.

Although.. I've just looked up the term `job_worked` in the code, and found that logging makes a distinction: `job_worked` vs `job_errored`. So maybe this middleware should do the same? Or find a more generic term - something like `job_attempted`?
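A sketch of the more granular variant being suggested, reusing Que's logging terms (the structure mirrors the middleware under review; branching on `job.que_error` like this is an assumption, not the PR's code):

```ruby
def self.call(job)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
ensure
  # Mirror Que's logging distinction between worked and errored jobs.
  event = job.que_error ? "que_job.job_errored" : "que_job.job_worked"
  ::ActiveSupport::Notifications.publish(
    event,
    started,
    Process.clock_gettime(Process::CLOCK_MONOTONIC),
    { job_class: job.que_attrs[:job_class], queue: job.que_attrs[:queue] }
  )
end
```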
Closing in favour of #366
This adds ActiveSupport notifications as job middleware. There is also a change to expose the latency before a job is picked up, which is useful for working out how close to capacity the workers are.