Implement Abandoned Job Detection and Recovery #30367

jgambarios · 2024-10-16T17:39:02Z

Parent Issue

Task

We need to enhance our job queue system to handle abandoned jobs. These are jobs that may have been interrupted due to server crashes, network failures, or other unexpected issues, leaving them in an inconsistent state.

Objective:

Implement mechanisms to detect abandoned jobs and provide recovery strategies to ensure system reliability and data consistency.

Proposed Strategies:

Job Heartbeats:
- Implement periodic heartbeat updates for running jobs
- Create a background process to identify jobs with stale heartbeats
Timeout Mechanisms:
- Add a max_execution_time field to job configurations
- Implement a background process to check for jobs exceeding their maximum execution time
Recovery Procedures:
- Develop a recovery process
- Identify jobs in inconsistent states and apply appropriate recovery actions
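
The heartbeat and timeout checks proposed above could be sketched roughly as follows. This is a minimal illustration, not the dotCMS implementation: `RunningJob`, `AbandonmentCheck`, and all field names are hypothetical, and the real job model is not shown in this issue.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Hypothetical, minimal view of a running job (assumption: not the real Job class).
final class RunningJob {
    final String id;
    final Instant startedAt;
    final Instant lastHeartbeat;
    final Duration maxExecutionTime;

    RunningJob(String id, Instant startedAt, Instant lastHeartbeat, Duration maxExecutionTime) {
        this.id = id;
        this.startedAt = startedAt;
        this.lastHeartbeat = lastHeartbeat;
        this.maxExecutionTime = maxExecutionTime;
    }
}

// Hypothetical helper that a background process could call periodically.
final class AbandonmentCheck {

    // True when the last heartbeat is older than the allowed threshold.
    static boolean isHeartbeatStale(RunningJob job, Instant now, Duration threshold) {
        return Duration.between(job.lastHeartbeat, now).compareTo(threshold) > 0;
    }

    // True when the job has been running longer than its max_execution_time.
    static boolean isTimedOut(RunningJob job, Instant now) {
        return Duration.between(job.startedAt, now).compareTo(job.maxExecutionTime) > 0;
    }

    // Returns the ids of jobs that fail either check, in input order.
    static List<String> findAbandoned(List<RunningJob> jobs, Instant now, Duration threshold) {
        List<String> ids = new ArrayList<>();
        for (RunningJob job : jobs) {
            if (isHeartbeatStale(job, now, threshold) || isTimedOut(job, now)) {
                ids.add(job.id);
            }
        }
        return ids;
    }
}
```

Passing `now` in explicitly, instead of calling `Instant.now()` inside the checks, keeps the logic deterministic and easy to cover with tests.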

Additional Considerations:

- Ensure that abandoned job recovery doesn't conflict with distributed locking mechanisms
- Consider the impact on job queue performance and optimize where necessary
- Evaluate and document any changes to the system's fault tolerance and high availability characteristics

Proposed Objective

Core Features

Proposed Priority

Priority 2 - Important

Acceptance Criteria

- System can detect jobs that have been abandoned due to server crashes or other issues
- Abandoned jobs are automatically handled according to configured recovery strategies
- All new functionality is covered by appropriate tests
- System performance is not significantly impacted by new abandoned job handling processes


nollymar · 2024-11-13T14:37:24Z

A job will be considered abandoned after 30 min (configurable)

…dded `JobEvent` interface to various job event classes and implemented validation logic in `ImportContentletsProcessor`. Introduced the `AbandonedJobDetector` class for detecting and handling abandoned jobs.

…s in transaction handling on the JobQueueManagerAPIImpl

…ionality.

github-actions · 2024-11-19T20:49:17Z

PRs:

This change updates the detectAndMarkAbandoned method to return an Optional<Job> instead of null. This helps to avoid potential NullPointerExceptions and improves code readability. Corresponding updates were made to the affected classes and integration tests to handle the Optional return type appropriately.
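
To illustrate the `Optional` return style described above, here is a small sketch. It is only an assumption of how such a method might look: the class `OptionalDetectionSketch`, the `registerStale` helper, and the use of a plain job id in place of a `Job` are all hypothetical stand-ins.

```java
import java.util.ArrayDeque;
import java.util.Optional;
import java.util.Queue;

// Hypothetical stand-in for a detector; job ids replace the real Job type.
final class OptionalDetectionSketch {

    // Ids of jobs already judged stale (hypothetical in-memory store).
    private final Queue<String> staleJobIds = new ArrayDeque<>();

    void registerStale(String jobId) {
        staleJobIds.add(jobId);
    }

    // Returns the next abandoned job id, or Optional.empty() when there is none.
    // Queue.poll() yields null on an empty queue, which ofNullable maps to empty.
    Optional<String> detectAndMarkAbandoned() {
        return Optional.ofNullable(staleJobIds.poll());
    }
}
```

A caller can then write `detector.detectAndMarkAbandoned().ifPresent(id -> ...)` instead of guarding against `null`.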

…onfig A protected no-arg constructor was added to AbandonedJobDetectorConfig to comply with CDI requirements. This ensures the class can be properly proxied and managed by the CDI container.

jgambarios · 2024-12-04T03:11:04Z

Changes applied as part of the feedback:

- Streamlined job state management by introducing more precise states such as FAILED_PERMANENTLY and ABANDONED_PERMANENTLY.
- The JobQueueResource now allows listing active, completed, successful, canceled, failed and abandoned jobs.
- Improved SSE handling.

Removed obsolete job events, streamlined job state management by introducing more precise states such as `FAILED_PERMANENTLY` and `ABANDONED_PERMANENTLY`. Replaced job completion terminology and refined method signatures and naming conventions to reinforce consistency. Enhanced Server-Sent Events (SSE) monitoring with a dedicated utility class for improved performance and error handling.

fabrizzio-dotCMS · 2024-12-05T22:59:10Z

It looks great now!
The job monitor moves smoothly, with no connection leaks and no stuck jobs.

Some potential improvements

But people need to know what they're doing when configuring this functionality:

JOB_ABANDONMENT_THRESHOLD_MINUTES should always be greater than JOB_ABANDONMENT_DETECTION_INTERVAL_MINUTES, so that the system can verify jobs consistently.
As an improvement, we could validate these values, or apply defaults, if the minimum window is violated.
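
One way the suggested validation could look is sketched below. Everything here is an assumption for illustration: the class `AbandonmentSettings`, the factory method `of`, and the fallback rule (raising the threshold to twice the detection interval) are not taken from the dotCMS code base.

```java
// Hypothetical holder for the two settings discussed above.
final class AbandonmentSettings {
    final long thresholdMinutes;
    final long detectionIntervalMinutes;

    private AbandonmentSettings(long thresholdMinutes, long detectionIntervalMinutes) {
        this.thresholdMinutes = thresholdMinutes;
        this.detectionIntervalMinutes = detectionIntervalMinutes;
    }

    // Builds settings, correcting a threshold that is not greater than the interval.
    static AbandonmentSettings of(long thresholdMinutes, long detectionIntervalMinutes) {
        if (detectionIntervalMinutes <= 0) {
            throw new IllegalArgumentException("detection interval must be positive");
        }
        long effectiveThreshold = thresholdMinutes;
        if (thresholdMinutes <= detectionIntervalMinutes) {
            // Assumed fallback policy: use twice the detection interval.
            effectiveThreshold = Math.multiplyExact(detectionIntervalMinutes, 2L);
        }
        return new AbandonmentSettings(effectiveThreshold, detectionIntervalMinutes);
    }
}
```

Whether to silently correct the value, as here, or to fail fast at startup is a design choice; logging a warning when the fallback applies would make the correction visible to operators.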

When an abandoned job lacks a retry policy, it is automatically marked as permanently abandoned, which makes complete sense. Perhaps this policy should be shown when we list the available queues.
Perhaps we could merge these two annotations into a single one:

@Queue("importContentlets")
@NoRetryPolicy
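
The rule mentioned above, that an abandoned job without a retry policy becomes permanently abandoned, could be expressed along these lines. This is a hedged sketch: `ABANDONED_PERMANENTLY` appears in this thread, while the `ABANDONED` and `RUNNING` constants, the `JobState` enum shape, and the `AbandonedJobPolicy` class are assumptions.

```java
// Assumed subset of job states; only ABANDONED_PERMANENTLY is named in this thread.
enum JobState { RUNNING, ABANDONED, ABANDONED_PERMANENTLY }

// Hypothetical decision helper for a job that was just detected as abandoned.
final class AbandonedJobPolicy {

    // Jobs with a retry policy stay retryable; the rest end in a terminal state.
    static JobState resolve(boolean hasRetryPolicy) {
        return hasRetryPolicy ? JobState.ABANDONED : JobState.ABANDONED_PERMANENTLY;
    }
}
```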

And finally, we cannot forget that we need to start the thread pool automatically. Currently, it is not started until the endpoint is consumed, so abandoned jobs will never be detected until someone kicks off the thread pool. I know we're taking care of that in another ticket.

jgambarios added Team : Scout Triage Type : Task labels Oct 16, 2024

jgambarios added this to dotCMS - Product Planning Oct 16, 2024

github-project-automation bot moved this to New in dotCMS - Product Planning Oct 16, 2024

nollymar moved this from New to Next 1-3 Sprints in dotCMS - Product Planning Oct 22, 2024

nollymar removed the Triage label Oct 22, 2024

jgambarios self-assigned this Nov 11, 2024

jgambarios moved this from Next 1-3 Sprints to In Progress in dotCMS - Product Planning Nov 11, 2024

jgambarios added a commit that referenced this issue Nov 18, 2024

#30367 Improved events handling in RealTimeJobMonitor and improvement…

db8fcb6

…s in transaction handling on the JobQueueManagerAPIImpl

jgambarios added a commit that referenced this issue Nov 18, 2024

#30367 Improve error handling when creating a Job

f495846

jgambarios added a commit that referenced this issue Nov 19, 2024

#30367 Fixing IT

3491712

jgambarios added a commit that referenced this issue Nov 19, 2024

#30367 Integration test for testing the abandoned job detection funct…

43f8cd3

…ionality.

jgambarios moved this from In Progress to In Review in dotCMS - Product Planning Nov 19, 2024

jgambarios linked a pull request Nov 19, 2024 that will close this issue

Feat (Core): Implement abandoned job detection and recovery #30710

Merged

jgambarios added a commit that referenced this issue Nov 20, 2024

#30367 Fixing unit tests

730d5f1

jgambarios closed this as completed in #30710 Nov 21, 2024

github-project-automation bot moved this from In Review to Internal QA in dotCMS - Product Planning Nov 21, 2024

github-actions bot mentioned this issue Nov 21, 2024

Feat (Core): Implement abandoned job detection and recovery #30710

Merged

jgambarios reopened this Nov 21, 2024

github-project-automation bot moved this from Internal QA to Current Sprint Backlog in dotCMS - Product Planning Nov 21, 2024

jgambarios moved this from Current Sprint Backlog to Internal QA in dotCMS - Product Planning Nov 21, 2024

jgambarios added the QA : Needs Internal label Nov 21, 2024

jgambarios removed their assignment Nov 21, 2024

fabrizzio-dotCMS self-assigned this Nov 21, 2024

jgambarios added a commit that referenced this issue Dec 4, 2024

#30367 Simplified SSE handling

cff25f2

jgambarios added a commit that referenced this issue Dec 4, 2024

#30367 Improvements on the cancel process

fdb6134

jgambarios moved this from In Progress to In Review in dotCMS - Product Planning Dec 4, 2024

nollymar added NW Removed QA : Failed Internal and removed Needs Work labels Dec 4, 2024

jgambarios added a commit that referenced this issue Dec 4, 2024

#30367 Applying code review feedback.

81e1c04

jgambarios added a commit that referenced this issue Dec 4, 2024

#30367 Applying code review feedback.

432e64d

jgambarios added a commit that referenced this issue Dec 4, 2024

#30367 Applying code review feedback.

a185478

jgambarios added a commit that referenced this issue Dec 4, 2024

#30367 Applying code review feedback.

6aa22b3

fabrizzio-dotCMS closed this as completed in #30816 Dec 4, 2024

github-project-automation bot moved this from In Review to Internal QA in dotCMS - Product Planning Dec 4, 2024

github-actions bot mentioned this issue Dec 4, 2024

#30367 Refactor job system and enhance SSE monitoring. #30816

Merged

nollymar added QA : Needs Internal and removed QA : Failed Internal labels Dec 5, 2024

nollymar reopened this Dec 5, 2024

github-project-automation bot moved this from Internal QA to Current Sprint Backlog in dotCMS - Product Planning Dec 5, 2024

nollymar assigned fabrizzio-dotCMS and unassigned jgambarios Dec 5, 2024

nollymar moved this from Current Sprint Backlog to Internal QA in dotCMS - Product Planning Dec 5, 2024

fabrizzio-dotCMS added QA : Passed Internal and removed QA : Needs Internal labels Dec 5, 2024

fabrizzio-dotCMS removed their assignment Dec 5, 2024

nollymar closed this as completed Dec 6, 2024

nollymar moved this from Internal QA to Done in dotCMS - Product Planning Dec 6, 2024

nollymar added the Release : 24.12.10 label Dec 6, 2024

jgambarios mentioned this issue Dec 10, 2024

Issue: Job Monitoring Stops in All Tabs When One Tab is Closed #30665

Closed
