
Expose job latency metric via ActiveSupport Notifications job middleware #362

Closed
stephenbinns wants to merge 3 commits

Conversation

stephenbinns

@stephenbinns commented May 17, 2022

This adds ActiveSupport notifications as job middleware. There is also a change to expose the latency before a job is picked up, which is useful for working out how close to capacity the workers are.
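
(For context, a rough sketch of the overall shape of the proposed middleware, pieced together from the diff hunks quoted further down in the review; the method signature, the finish timestamp, and the require line are assumptions rather than a verbatim copy of the PR's code:)

```ruby
require "active_support/notifications"

module Que
  module ActiveSupport
    module JobMiddleware
      # Sketch only: wraps job execution and publishes a notification with
      # queue latency plus monotonic start/finish timestamps.
      def self.call(job)
        started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
        yield
      ensure
        ::ActiveSupport::Notifications.publish(
          "que_job.worked",
          started,
          Process.clock_gettime(Process::CLOCK_MONOTONIC), # finish timestamp (assumed)
          job_class: job.que_attrs[:job_class],
          priority: job.que_attrs[:priority],
          queue: job.que_attrs[:queue],
          latency: job.que_attrs[:latency], # seconds past run_at, from the poller change below
        )
      end
    end
  end
end
```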

@hlascelles
Contributor

Good stuff... Related: #325

NB, using `now() - (j).run_at` as latency/delay can be misleading. If a job errors and retries with a backoff then the `run_at` changes. It would be best to add an `initial_run_at` column to the main jobs table, but that would be a bigger change...

Member

@ZimbiX left a comment


Interesting. I haven't used ActiveSupport notifications before, but knowing the job latency does sound like a useful proxy to understand the level of worker contention.

Would job latency be covered by your Prometheus metrics implementation? (Related issue: #267)

I see this is still in draft. I like where this is headed. What else would be left?

Thanks for the contribution! =)

spec/spec_helper.rb (outdated review thread, resolved)
lib/que/active_support/job_middleware.rb (outdated review thread, resolved)

module Que
  module ActiveSupport
    module JobMiddleware
Member


Some usage documentation would be in order, I reckon.

Author


Have added it to the readme - do you think it's better here?

Member


Ah, nah, I just wanted to start a thread so it'd be able to be marked resolved 😆

Cheers for adding that; looks good

@@ -63,6 +63,7 @@ class Poller
  SELECT
    (j).*,
    l.locked,
    extract(epoch from (now() - (j).run_at)) as latency,
Member


> NB, using `now() - (j).run_at` as latency/delay can be misleading. If a job errors and retries with a backoff then the `run_at` changes. It would be best to add an `initial_run_at` column to the main jobs table, but that would be a bigger change...

@hlascelles This is a good point. I think the intention here is to permit ascertaining worker contention - by allowing you to analyse the delay between when a job is desired to be run (or re-run), and when it's actually picked up to be run. In terms of that desire, I think this metric is more useful. But I'm sure there's also merit in #325.

lib/que/active_support/job_middleware.spec.rb (outdated review thread, resolved)

started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
yield
ensure
Member


I'd like a test for the exception case. But is publishing a que_job.worked notification even upon an exception really what you want? I'd have thought that'd mean the job is complete 🤔 There may be an example of this elsewhere..

Author


Sounds good, will add one shortly. I think so - in terms of the metric, we're more concerned with how Que is performing than with whether the jobs executed correctly.

@ZimbiX changed the title from "Job metrics" to "Expose job latency metric via ActiveSupport Notifications job middleware" on May 17, 2022
job_class: job.que_attrs[:job_class],
priority: job.que_attrs[:priority],
queue: job.que_attrs[:queue],
latency: job.que_attrs[:latency],


I think it would also be useful to have a duration published, so we know both how long the job was waiting in the queue and how long it took to execute. WDYT?

Author


Execution duration can currently be calculated from the start and end timestamps; we could make this more intuitive by doing that calculation for consumers.
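
(For illustration, a minimal sketch of what that calculation looks like for consumers today, mirroring the subscriber shape from the README snippet added in this PR; the reporting step is left as a comment since no metrics client is prescribed:)

```ruby
::ActiveSupport::Notifications.subscribe("que_job.worked") do |_name, started, finished, labels|
  duration = finished - started # seconds spent executing the job (monotonic timestamps from the middleware)
  latency  = labels[:latency]   # seconds the job waited past its run_at before being picked up
  # e.g. report `duration` and `latency` to your metrics system here
end
```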

@stephenbinns
Author

> Interesting. I haven't used ActiveSupport notifications before, but knowing the job latency does sound like a useful proxy to understand the level of worker contention.
>
> Would job latency be covered by your Prometheus metrics implementation? (Related issue: #267)
>
> I see this is still in draft. I like where this is headed. What else would be left?
>
> Thanks for the contribution! =)

Indeed, it is part of what we're monitoring. The remaining ones are a little trickier to figure out; as I mentioned previously, we'd ideally like to get off our fork, and as part of that we need to reach some sort of feature parity here.

yield
ensure
::ActiveSupport::Notifications.publish(
"que_job.worked",

@ameykusurkar May 17, 2022


Suggested change:
- "que_job.worked",
+ "que_job.job_worked",

Nitpick: what about prefixing the event with job_? This allows us to have different types of events

Member


que.job_worked? =P

@@ -798,6 +798,32 @@ Que.job_middleware.push(
)
```

#### Existing Middleware

Que ships with middleware to expose job metrics using ActiveSupport notifications to subscribe to it you can implelent the following
Member


*implement ;)

Que ships with middleware to expose job metrics using ActiveSupport notifications to subscribe to it you can implelent the following

```ruby
::ActiveSupport::Notifications.subscribe("que_job.worked") do |message, started, finished, labels|
Member


I just looked this up, and it seems that .subscribe doesn't have monotonic timestamps; .monotonic_subscribe does. Whichever you want to use here is fine, but let's document it correctly.

Member

@ZimbiX May 18, 2022


Sorry, you are passing monotonic times from the middleware... I guess I don't understand how this works then 😅
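
(For what it's worth: with a raw `::ActiveSupport::Notifications.publish`, the subscriber block receives exactly the arguments the publisher passed - no clock is involved on the subscriber side - so the middleware's monotonic timestamps come through unchanged. `.subscribe` vs `.monotonic_subscribe` only changes the timestamps that `instrument` generates. A minimal standalone sketch, with made-up event name and values:)

```ruby
require "active_support/notifications"

::ActiveSupport::Notifications.subscribe("demo.event") do |name, started, finished, payload|
  # These are exactly the values passed to publish below; nothing is measured here.
  puts [name, started, finished, payload].inspect
end

t0 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
::ActiveSupport::Notifications.publish("demo.event", t0, t0 + 1.5, { latency: 0.2 })
```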

called = true
end

Que::ActiveSupport::JobMiddleware.call(job) { job.que_error = "error" }
Member


Ah, I was thinking you'd need to raise within the block here, but you're right - an exception from a job does not bubble up to middleware.

Testing job failure is good, but what I meant was testing that it's in an ensure - i.e. that the notification still fires when an exception is encountered, if not one caused by the job then perhaps one raised by a subsequent middleware.

For realistic data, I think job.que_error should also be set to an instance of an exception, not that it would have any effect here.

Although.. I've just looked up the term job_worked in the code, and found that logging makes a distinction: job_worked vs job_errored. So maybe this middleware should do the same? Or find a more generic term - something like job_attempted?
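
(For illustration, a rough sketch of the kind of spec described above - the exception coming from the block, e.g. a subsequent middleware, rather than from the job; the `job` variable and the minitest-style `it`/`assert` setup are assumed from the surrounding spec file:)

```ruby
it "still publishes que_job.worked when the block raises" do
  called = false
  subscriber = ::ActiveSupport::Notifications.subscribe("que_job.worked") { called = true }

  begin
    assert_raises(RuntimeError) do
      Que::ActiveSupport::JobMiddleware.call(job) { raise "boom from a later middleware" }
    end

    assert called, "expected the notification to be published despite the exception"
  ensure
    ::ActiveSupport::Notifications.unsubscribe(subscriber)
  end
end
```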

@ZimbiX
Member

ZimbiX commented Aug 26, 2022

Closing in favour of #366

@ZimbiX closed this Aug 26, 2022