[WJ-1176] Job queue #1671

Merged: 33 commits from WJ-1176-job-queue into develop on Oct 31, 2023
Conversation

@emmiegit (Member) commented on Oct 30, 2023

Ground work for a persistent job queue was set up in #1668.

In this PR, I use the newly added rsmq_async crate to persist jobs to the queue. Jobs are received from the queue by job workers (configurable in quantity) and deleted only after they complete successfully, which enables retries on failure. The previous job queue implementation, an in-memory channel, was removed, which simplifies ServerState setup a bit.
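To make the receive/process/delete flow concrete, here is a minimal sketch of one worker's loop built on rsmq_async's RsmqConnection trait. The queue name, payload type, process_job, and the tokio sleep are illustrative assumptions rather than deepwell's actual code, and the exact receive/send signatures differ slightly between rsmq_async versions:

```rust
use std::time::Duration;

use rsmq_async::{Rsmq, RsmqConnection, RsmqError};

// Illustrative placeholder; not the actual deepwell queue name.
const JOB_QUEUE_NAME: &str = "job";

async fn run_worker(mut rsmq: Rsmq) -> Result<(), RsmqError> {
    loop {
        // receive_message returns None when the queue is empty, so back off
        // briefly instead of spinning.
        let Some(message) = rsmq
            .receive_message::<String>(JOB_QUEUE_NAME, None)
            .await?
        else {
            tokio::time::sleep(Duration::from_secs(1)).await;
            continue;
        };

        match process_job(&message.message).await {
            Ok(()) => {
                // Delete only after success; an unfinished job becomes visible
                // again once its timeout expires, which is what enables retries.
                rsmq.delete_message(JOB_QUEUE_NAME, &message.id).await?;
            }
            Err(error) => {
                eprintln!("job failed, leaving it on the queue for retry: {error}");
            }
        }
    }
}

// Placeholder: deserialize the payload and dispatch to the matching handler.
async fn process_job(payload: &str) -> Result<(), Box<dyn std::error::Error>> {
    let _ = payload;
    Ok(())
}
```

The important property is that delete_message runs only after the job succeeds, so a job that fails (or a worker that dies mid-job) reappears after its visibility timeout and is retried.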

Additionally, to implement recurring jobs such as the periodic pruning work, this system permits jobs to submit a "follow-up job", which is added to the queue as part of the job's work. Combined with a delay, this allows things such as "run this operation every six hours".
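As an illustration of that pattern, here is a sketch of a recurring job re-arming itself by submitting its own follow-up with a delay. The queue name, payload string, and the six-hour constant are made-up placeholders, and the type of the delay argument (seconds versus Duration) depends on the rsmq_async version:

```rust
use std::time::Duration;

use rsmq_async::{Rsmq, RsmqConnection, RsmqError};

const JOB_QUEUE_NAME: &str = "job"; // placeholder
const PRUNE_INTERVAL: Duration = Duration::from_secs(6 * 60 * 60); // "every six hours"

async fn prune_job(rsmq: &mut Rsmq) -> Result<(), RsmqError> {
    // ... perform the periodic pruning work itself ...

    // Submit the follow-up: the same job again, delayed by six hours. Because
    // this only happens once the job reaches this point successfully, a
    // failing job cannot keep adding copies of itself to the queue.
    rsmq.send_message(JOB_QUEUE_NAME, "prune", Some(PRUNE_INTERVAL))
        .await?;

    Ok(())
}
```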

I've added configuration fields for all these various durations and values.

I also change "directly fetching pages" so that it no longer requires a site ID, and add an explicit option controlling whether deleted pages are fetched. This also updates the OutdateService, which had a bug where it used an incorrect site_id on pages.
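For illustration only, the shape of that change might look like the sketch below: a lookup keyed by page ID alone, with an explicit choice about deleted pages. None of these names are taken from deepwell; they are stand-ins for the idea.

```rust
/// Whether a direct fetch by page ID may return a deleted page.
enum DeletedFilter {
    ExtantOnly,
    IncludeDeleted,
}

struct PageRecord {
    page_id: i64,
    deleted_at: Option<i64>, // set once the page has been deleted
}

/// Apply the deleted-page policy to a row fetched directly by ID.
fn accept_direct_fetch(page: PageRecord, filter: DeletedFilter) -> Option<PageRecord> {
    match (page.deleted_at, filter) {
        // A deleted page is only surfaced when the caller asks for it.
        (Some(_), DeletedFilter::ExtantOnly) => None,
        _ => Some(page),
    }
}
```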

NOTE: There is a known issue where the value for "one day" (86400 seconds) gets multiplied by 1000 and overflows rsmq's limit. I will investigate further and file a bug upstream, but for now this PR is ready.

@emmiegit self-assigned this on Oct 30, 2023
codecov bot commented on Oct 30, 2023

Codecov Report

Merging #1671 (d6a2f08) into develop (c487a33) will decrease coverage by 0.20%.
The diff coverage is 0.00%.


@@             Coverage Diff             @@
##           develop    #1671      +/-   ##
===========================================
- Coverage    40.45%   40.25%   -0.20%     
===========================================
  Files          341      342       +1     
  Lines        10744    10798      +54     
===========================================
+ Hits          4346     4347       +1     
- Misses        6398     6451      +53     
Flag       Coverage Δ                *Carryforward flag
deepwell   2.10% <0.00%> (-0.01%) ⬇️
ftml       76.83% <ø> (ø)            Carriedforward from 6d801ee

*This pull request uses carry forward flags.

Files                                             Coverage Δ
deepwell/src/config/object.rs                     0.00% <ø> (ø)
deepwell/src/services/context.rs                  0.00% <ø> (ø)
deepwell/src/services/page/structs.rs             0.00% <ø> (ø)
deepwell/src/api.rs                               0.00% <0.00%> (ø)
deepwell/src/services/error.rs                    0.00% <0.00%> (ø)
deepwell/src/services/page_revision/service.rs    0.00% <0.00%> (ø)
deepwell/src/endpoints/page.rs                    0.00% <0.00%> (ø)
deepwell/src/services/page/service.rs             0.00% <0.00%> (ø)
deepwell/src/services/file/service.rs             0.00% <0.00%> (ø)
deepwell/src/services/job/service.rs              0.00% <0.00%> (ø)
... and 4 more

... and 1 file with indirect coverage changes

Commit notes from the 33 merged commits include:

- This is necessary because a worker may reboot, or multiple workers may start and then all but the first fail to create the queue.
- Now using the actual queue!
- In some circumstances we really do not have the site ID and should not require it as a filter there (that's what the regular get() methods are for); instead we should pass in only the ID to be fetched, with a check for deleted entities in case we only want extant ones.
- This has an option for getting deleted pages, and separates out the notion of fetching live pages by ID for internal processes.
- With the change to PageService::get_direct(), we do not need the site_id to do outdating, and relying on it wasn't correct anyway, since in some cases it assumes any page connections are on the same site, which is wrong.
- This way we don't get into loops where a job fails, but not before it adds another job to the queue, leading to a build-up.
- rsmq enforces a maximum which we're apparently surpassing; we should catch this at Config parsing/creation time (see the sketch after this list).
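A minimal sketch of such a check at Config creation time, assuming the limit is enforced in seconds: RSMQ_MAX_DELAY_SECONDS is a placeholder that should be taken from rsmq's documentation, and validate_queue_delay is not a real deepwell function.

```rust
use std::time::Duration;

// Placeholder bound; the actual maximum comes from rsmq's documentation.
const RSMQ_MAX_DELAY_SECONDS: u64 = 9_999_999;

/// Panic at startup if a configured queue delay exceeds what rsmq accepts,
/// instead of failing later when the job is enqueued.
fn validate_queue_delay(name: &str, delay: Duration) {
    assert!(
        delay.as_secs() <= RSMQ_MAX_DELAY_SECONDS,
        "configured delay '{name}' is {} seconds, which exceeds rsmq's maximum",
        delay.as_secs(),
    );
}
```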
@emmiegit marked this pull request as ready for review on October 31, 2023 06:06
@emmiegit (Member, Author) commented:

thanks @Zokhoi @Yossipossi1

@emmiegit merged commit e8b7547 into develop on Oct 31, 2023
9 checks passed
@emmiegit deleted the WJ-1176-job-queue branch on October 31, 2023 22:20