Implement Abandoned Job Detection and Recovery #30367

Closed
jgambarios opened this issue Oct 16, 2024 · 5 comments · Fixed by #30710 or #30816

Comments

@jgambarios
Contributor

jgambarios commented Oct 16, 2024

Parent Issue

#29474

Task

We need to enhance our job queue system to handle abandoned jobs. These are jobs that may have been interrupted due to server crashes, network failures, or other unexpected issues, leaving them in an inconsistent state.

Objective:

Implement mechanisms to detect abandoned jobs and provide recovery strategies to ensure system reliability and data consistency.

Proposed Strategies:

  1. Job Heartbeats:

    • Implement periodic heartbeat updates for running jobs
    • Create a background process to identify jobs with stale heartbeats
  2. Timeout Mechanisms:

    • Add a max_execution_time field to job configurations
    • Implement a background process to check for jobs exceeding their maximum execution time
  3. Recovery Procedures:

    • Develop a recovery process
    • Identify jobs in inconsistent states and apply appropriate recovery actions
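The heartbeat strategy above could be sketched roughly as follows. This is a minimal illustration, not the dotCMS implementation: `HeartbeatJob`, `StaleHeartbeatScan`, and `findStale` are hypothetical names, and the 30-minute threshold in `main` is only an example value.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical view of a running job; not the dotCMS Job class.
final class HeartbeatJob {
    final String id;
    final Instant lastHeartbeat;

    HeartbeatJob(String id, Instant lastHeartbeat) {
        this.id = id;
        this.lastHeartbeat = lastHeartbeat;
    }

    @Override
    public String toString() {
        return id;
    }
}

final class StaleHeartbeatScan {

    // Returns the jobs whose last heartbeat is older than (now - threshold).
    static List<HeartbeatJob> findStale(List<HeartbeatJob> running, Instant now, Duration threshold) {
        Instant cutoff = now.minus(threshold);
        return running.stream()
                .filter(job -> job.lastHeartbeat.isBefore(cutoff))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2024-11-16T12:00:00Z");
        List<HeartbeatJob> running = List.of(
                new HeartbeatJob("fresh", now.minus(Duration.ofMinutes(5))),
                new HeartbeatJob("stale", now.minus(Duration.ofMinutes(45))));
        // Only the job that has been silent for 45 minutes exceeds the example threshold.
        System.out.println(findStale(running, now, Duration.ofMinutes(30)));
    }
}
```

A background process would call something like `findStale` on a fixed schedule and hand the result to whatever recovery procedure is configured.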

Additional Considerations:

  • Ensure that abandoned job recovery doesn't conflict with distributed locking mechanisms
  • Consider the impact on job queue performance and optimize where necessary
  • Evaluate and document any changes to the system's fault tolerance and high availability characteristics

Proposed Objective

Core Features

Proposed Priority

Priority 2 - Important

Acceptance Criteria

  1. System can detect jobs that have been abandoned due to server crashes or other issues
  2. Abandoned jobs are automatically handled according to configured recovery strategies
  3. All new functionality is covered by appropriate tests
  4. System performance is not significantly impacted by new abandoned job handling processes
@nollymar
Contributor

A job will be considered abandoned after 30 min (configurable)

jgambarios added a commit that referenced this issue Nov 16, 2024
…dded `JobEvent` interface to various job event classes and implemented validation logic in `ImportContentletsProcessor`. Introduced the `AbandonedJobDetector` class for detecting and handling abandoned jobs.
jgambarios added a commit that referenced this issue Nov 18, 2024
…s in transaction handling on the JobQueueManagerAPIImpl
jgambarios added a commit that referenced this issue Nov 19, 2024
@jgambarios jgambarios moved this from In Progress to In Review in dotCMS - Product Planning Nov 19, 2024
jgambarios added a commit that referenced this issue Nov 20, 2024
This change updates the detectAndMarkAbandoned method to return an Optional<Job> instead of null. This helps to avoid potential NullPointerExceptions and improves code readability. Corresponding updates were made to the affected classes and integration tests to handle the Optional return type appropriately.
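As a rough illustration of the Optional-returning style described in this commit message, the sketch below uses a hypothetical `detectAbandoned` helper over a plain map. It is not the actual `detectAndMarkAbandoned` signature, which returns `Optional<Job>`.

```java
import java.util.Map;
import java.util.Optional;

final class OptionalDetectionSketch {

    // Hypothetical stand-in: the map holds job id -> minutes since the last update.
    // Returns the id of one job past the threshold, or Optional.empty() if there is none.
    static Optional<String> detectAbandoned(Map<String, Long> minutesSinceUpdate, long thresholdMinutes) {
        return minutesSinceUpdate.entrySet().stream()
                .filter(entry -> entry.getValue() > thresholdMinutes)
                .map(Map.Entry::getKey)
                .findFirst();
    }

    public static void main(String[] args) {
        // Callers handle the "nothing found" case explicitly instead of null-checking.
        detectAbandoned(Map.of("job-1", 45L), 30)
                .ifPresent(id -> System.out.println("Abandoned: " + id));
    }
}
```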
jgambarios added a commit that referenced this issue Nov 20, 2024
…onfig

A protected no-arg constructor was added to AbandonedJobDetectorConfig to comply with CDI requirements. This ensures the class can be properly proxied and managed by the CDI container.
jgambarios added a commit that referenced this issue Nov 20, 2024
@github-project-automation github-project-automation bot moved this from In Review to Internal QA in dotCMS - Product Planning Nov 21, 2024
@jgambarios jgambarios reopened this Nov 21, 2024
@github-project-automation github-project-automation bot moved this from Internal QA to Current Sprint Backlog in dotCMS - Product Planning Nov 21, 2024
@jgambarios jgambarios moved this from Current Sprint Backlog to Internal QA in dotCMS - Product Planning Nov 21, 2024
@jgambarios jgambarios removed their assignment Nov 21, 2024
@fabrizzio-dotCMS fabrizzio-dotCMS self-assigned this Nov 21, 2024
jgambarios added a commit that referenced this issue Dec 4, 2024
@jgambarios
Contributor Author

Changes applied as part of the feedback:

  1. Streamlined job state management by introducing more precise states such as FAILED_PERMANENTLY and ABANDONED_PERMANENTLY.
  2. The JobQueueResource now supports listing active, completed, successful, canceled, failed, and abandoned jobs.
  3. Improved SSE handling.

@jgambarios jgambarios moved this from In Progress to In Review in dotCMS - Product Planning Dec 4, 2024
jgambarios added a commit that referenced this issue Dec 4, 2024
jgambarios added a commit that referenced this issue Dec 4, 2024
jgambarios added a commit that referenced this issue Dec 4, 2024
jgambarios added a commit that referenced this issue Dec 4, 2024
github-merge-queue bot pushed a commit that referenced this issue Dec 4, 2024
Removed obsolete job events, streamlined job state management by
introducing more precise states such as `FAILED_PERMANENTLY` and
`ABANDONED_PERMANENTLY`. Replaced job completion terminology and refined
method signatures and naming conventions to reinforce consistency.
Enhanced Server-Sent Events (SSE) monitoring with a dedicated utility
class for improved performance and error handling.
@github-project-automation github-project-automation bot moved this from In Review to Internal QA in dotCMS - Product Planning Dec 4, 2024
@nollymar nollymar reopened this Dec 5, 2024
@github-project-automation github-project-automation bot moved this from Internal QA to Current Sprint Backlog in dotCMS - Product Planning Dec 5, 2024
@nollymar nollymar moved this from Current Sprint Backlog to Internal QA in dotCMS - Product Planning Dec 5, 2024
@fabrizzio-dotCMS fabrizzio-dotCMS removed their assignment Dec 5, 2024
@fabrizzio-dotCMS
Contributor

fabrizzio-dotCMS commented Dec 5, 2024

It looks great now!
The job monitor moves smoothly, with no connection leaks and no stuck jobs.

Some potential improvements

  • Anyone configuring this functionality needs to understand how the following two settings interact.

JOB_ABANDONMENT_THRESHOLD_MINUTES should always be greater than
JOB_ABANDONMENT_DETECTION_INTERVAL_MINUTES so that the system can check for abandoned jobs consistently.
As an improvement, we could validate the values or apply defaults if a minimum window is violated.
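One way the suggested validation might look is sketched below. `AbandonmentSettings` and its factory method are hypothetical. The 30-minute default threshold comes from the comment earlier in this thread, while the 5-minute default detection interval is purely an assumption for illustration.

```java
final class AbandonmentSettings {

    static final long DEFAULT_THRESHOLD_MINUTES = 30; // default mentioned earlier in this thread
    static final long DEFAULT_INTERVAL_MINUTES = 5;   // assumed value, not taken from dotCMS

    final long thresholdMinutes;
    final long intervalMinutes;

    private AbandonmentSettings(long thresholdMinutes, long intervalMinutes) {
        this.thresholdMinutes = thresholdMinutes;
        this.intervalMinutes = intervalMinutes;
    }

    // Falls back to the defaults when either value is non-positive or the
    // threshold is not strictly greater than the detection interval.
    static AbandonmentSettings of(long thresholdMinutes, long intervalMinutes) {
        boolean invalid = thresholdMinutes <= 0
                || intervalMinutes <= 0
                || thresholdMinutes <= intervalMinutes;
        if (invalid) {
            return new AbandonmentSettings(DEFAULT_THRESHOLD_MINUTES, DEFAULT_INTERVAL_MINUTES);
        }
        return new AbandonmentSettings(thresholdMinutes, intervalMinutes);
    }

    public static void main(String[] args) {
        // A threshold shorter than the interval violates the window, so defaults apply.
        AbandonmentSettings settings = AbandonmentSettings.of(5, 10);
        System.out.println(settings.thresholdMinutes + " / " + settings.intervalMinutes);
    }
}
```

Falling back silently is only one option. Logging a warning or failing fast at startup would make a misconfiguration more visible.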

  • When an abandoned job lacks a retry policy, it is automatically marked as permanently abandoned, which makes complete sense. Perhaps this policy should be shown when we list the available queues.
  • Perhaps we could merge these two annotations into a single one:
@Queue("importContentlets")
@NoRetryPolicy
  • Finally, we must not forget that the thread pool needs to start automatically. Currently it does not start until the endpoint is consumed, so abandoned jobs will never be detected until someone kicks off the thread pool. I know we're taking care of that in another ticket.
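The suggestion to merge the two annotations could be sketched as follows. `QueueSpec` and its `retry` attribute are hypothetical names chosen for this example. The real `@Queue` and `@NoRetryPolicy` annotations in dotCMS may be defined differently.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Hypothetical combined annotation: the queue name plus a retry flag.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface QueueSpec {
    String value();

    // Hypothetical attribute standing in for a separate @NoRetryPolicy marker.
    boolean retry() default true;
}

// Placeholder processor class used only to carry the annotation.
@QueueSpec(value = "importContentlets", retry = false)
final class ImportContentletsProcessorSketch {
}

final class QueueSpecDemo {

    // Reads the annotation reflectively, as a queue registry might when listing queues.
    static String describe(Class<?> processor) {
        QueueSpec spec = processor.getAnnotation(QueueSpec.class);
        return spec.value() + " retry=" + spec.retry();
    }

    public static void main(String[] args) {
        System.out.println(describe(ImportContentletsProcessorSketch.class));
    }
}
```

Keeping the retry flag on the same annotation would also make it straightforward to show the policy when listing the available queues, as suggested above.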
