"Broken jobs were found in the job queue" error spam #299

Open
sminnee opened this issue Jun 2, 2020 · 7 comments

Comments

@sminnee

sminnee commented Jun 2, 2020

I have queuedjobs set up on a site with Raygun error logging.

If a job breaks (which reports an error via Raygun), then roughly once an hour I get a follow-up message: "Broken jobs were found in the job queue".

Because each of these messages triggers a Raygun notification, this gets quite spammy, especially on a weekend. It is doubly annoying here because the site in question recreates jobs periodically anyway and the broken job is benign.

A few thoughts about how to address this; one or more of these might be useful.

  • Add a config option to decide on whether "Broken jobs were found in the job queue" errors should be thrown
  • Add a facility where broken jobs can be automatically retried
  • Lower the frequency of such alerts – a daily alert to go and clean up jobs might be more useful.
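
As a rough illustration of the third idea (lower-frequency alerts), here is a minimal sketch, written in Python rather than the module's PHP. `AlertThrottle` and `should_send` are hypothetical names, not part of queuedjobs, and a real cron-driven setup would need to persist the last-sent timestamp between runs, which this in-memory version does not do.

```python
from datetime import datetime, timedelta
from typing import Optional


class AlertThrottle:
    """Let an alert through at most once per interval (hypothetical helper)."""

    def __init__(self, interval: timedelta) -> None:
        self.interval = interval
        self.last_sent: Optional[datetime] = None

    def should_send(self, now: datetime) -> bool:
        # Send if nothing has been sent yet, or the full interval has elapsed.
        if self.last_sent is None or now - self.last_sent >= self.interval:
            self.last_sent = now
            return True
        return False


# Simulate an hourly health check over two days with a daily alert limit.
throttle = AlertThrottle(timedelta(days=1))
start = datetime(2020, 6, 2, 0, 0)
sent = []
for hour in range(48):
    check_time = start + timedelta(hours=hour)
    if throttle.should_send(check_time):
        sent.append(check_time)
```

With an hourly check, only the checks that are a full day apart would produce an alert; the interval could be exposed as a config value.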

It would be interesting to hear whether other deployments of queuedjobs have this issue.

If it turns out these facilities already exist then I would suggest that we address this ticket by updating docs, as I couldn't see mention of this in the docs.

@micschk
Contributor

micschk commented Jun 2, 2020

These 'broken jobs' messages once used up around 1000 euros of SMS budget overnight on a critical system for which I had temporarily set up an SMS error handler... :-)

I think every cron run currently checks for and outputs these alerts, so if you're running one or even multiple threads each minute, this can result in a lot of alerts.

Instead of outputting these alerts periodically or with a lower (configurable) frequency, wouldn't it make sense to just output an alert only once (per broken job)?
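
To make the "once per broken job" idea concrete, here is a minimal sketch in Python (not the module's PHP). The function name `new_broken_jobs`, the integer job IDs, and the in-memory set are all assumptions for illustration; a real implementation would have to store the "already notified" state somewhere persistent, for example on the job record.

```python
from typing import Iterable, List, Set


def new_broken_jobs(broken_ids: Iterable[int], already_notified: Set[int]) -> List[int]:
    """Return broken job IDs that have not been alerted on yet, marking them as notified."""
    fresh: List[int] = []
    for job_id in broken_ids:
        if job_id not in already_notified:
            already_notified.add(job_id)
            fresh.append(job_id)
    return fresh


# Two consecutive health checks: jobs 101 and 102 stay broken, 103 breaks later.
notified: Set[int] = set()
first_run = new_broken_jobs([101, 102], notified)
second_run = new_broken_jobs([101, 102, 103], notified)
```

Under these assumptions, the first check would alert on both broken jobs and the second only on the newly broken one, no matter how often the check runs.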

@sminnee
Author

sminnee commented Jun 3, 2020

Generally speaking a job will have broken because of an error, and that error will have been passed to whatever system you have in place for error handling. So I don't think "notify once" is needed; if you disabled it entirely you would end up with the functionality you seek.

@micschk
Contributor

micschk commented Jun 3, 2020

Which would ideally be the case indeed. But job failure is often caused by running out of memory, or by otherwise getting stuck on something and being restarted/stopped at some point by the runner; in those cases error handling tends not to (always) get executed. I think that's the reason for the job-health checking being in place(?).

So for me it is important to get notified of 'failed' jobs (via e-mail/SMS), just not every minute.
Also, we don't set up Raygun/Sentry on every system, so relying on a third party for notifications would be less desirable.

@michalkleiner
Collaborator

An example for us is the check for potential composer package updates within CWP, where it's part of the default recipe. In some circumstances the task fails due to insufficient memory, possibly because of a bug in the checker, who knows. Unscheduling/deleting the job is not a solution, as it always gets recreated by dev/build.

@chillu
Contributor

chillu commented Jun 9, 2020

Duplicate of #24?

@sminnee
Author

sminnee commented Jun 9, 2020

Closely related, but I believe “broken jobs” and “stalled jobs” have different messages.

@mfendeksilverstripe
Contributor

mfendeksilverstripe commented Jun 14, 2020

My general feedback (based on multiple projects):

  • email notifications are not that useful (for both stalled and broken jobs)
  • instead we rely on Raygun reporting
  • checking queue health is really useful as it applies automatic resume attempts for stalled jobs
  • to further reduce the number of broken jobs we have to deal with, we use an automatic retry system for broken jobs; I added this system to the feature review PR
  • this is very useful for jobs that may break with an error that can be safely ignored, e.g. jobs that trigger third-party requests (request failure) or embargo publish of multiple localisations of the same page (DB deadlock)
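
For readers unfamiliar with the pattern, an automatic retry with a bounded number of attempts could look roughly like the sketch below. This is Python for illustration only and is not the code from the feature review PR; `run_with_retries`, `max_attempts` and the flaky job are hypothetical, and a real retry system would likely also wait between attempts.

```python
from typing import Callable, Tuple


def run_with_retries(job: Callable[[], None], max_attempts: int = 3) -> Tuple[bool, int]:
    """Run a job up to max_attempts times; return (succeeded, attempts_used)."""
    for attempt in range(1, max_attempts + 1):
        try:
            job()
            return True, attempt
        except Exception:
            # Treat the error as transient and try again until attempts run out.
            continue
    return False, max_attempts


calls = {"count": 0}


def flaky_request() -> None:
    # Simulates a third-party request that fails twice before succeeding.
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("third-party request failed")


def always_fails() -> None:
    raise RuntimeError("permanent failure")


flaky_result = run_with_retries(flaky_request)
broken_result = run_with_retries(always_fails)
```

Only a job that exhausts all attempts would then need to be reported as broken, which is where the alerting discussed above would come in.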

6 participants