[Response Ops][Alerting] Add serial/synchronized mode to backfills #181072

Open
ymao1 opened this issue Apr 17, 2024 · 1 comment
Labels
Feature:Alerting, Team:ResponseOps

Comments

@ymao1
Contributor

ymao1 commented Apr 17, 2024

The current backfill implementation runs backfill jobs as quickly as worker availability allows, so when a batch of rules is bulk scheduled, each rule backfills completely independently of the others. We would like to add a serial or synchronized mode in which scheduling a batch of rules introduces a light dependency, so that the rules run roughly in schedule order.

Current implementation

Schedule backfill for two rules with 12h schedule intervals

[
  {
    "rule_id": "1",
    "start": "2023-03-13T00:00:00.000Z",
    "end": "2023-03-15T00:00:00.000Z"
  },
  {
    "rule_id": "2",
    "start": "2023-03-13T00:00:00.000Z",
    "end": "2023-03-15T00:00:00.000Z"
  }
]

This request creates 2 backfill jobs. For the purposes of this example, both use the same backfill time range, so both backfill jobs have the same schedule:

[
  {
    "run_at": "2023-03-13T12:00:00.000Z",
    "status": "pending",
    "interval": "12h"
  },
  {
    "run_at": "2023-03-14T00:00:00.000Z",
    "status": "pending",
    "interval": "12h"
  },
  {
    "run_at": "2023-03-14T12:00:00.000Z",
    "status": "pending",
    "interval": "12h"
  },
  {
    "run_at": "2023-03-15T00:00:00.000Z",
    "status": "pending",
    "interval": "12h"
  }
]
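
The issue does not spell out how the schedule is derived from the request. A minimal sketch that reproduces the four entries above, assuming the first entry falls at start + interval and the last entry is at or before end, could look like this. The names `ScheduleEntry`, `intervalToMs`, and `buildSchedule` are hypothetical and not taken from the Kibana codebase.

```typescript
// Hypothetical shape of one schedule entry, mirroring the JSON above.
interface ScheduleEntry {
  run_at: string;
  status: "pending";
  interval: string;
}

// Convert an interval such as "12h" or "5m" to milliseconds.
// Only the s/m/h/d units are handled in this sketch.
function intervalToMs(interval: string): number {
  const match = /^(\d+)([smhd])$/.exec(interval);
  if (match === null) {
    throw new Error(`Unsupported interval: ${interval}`);
  }
  const unitMs: Record<string, number> = {
    s: 1000,
    m: 60 * 1000,
    h: 60 * 60 * 1000,
    d: 24 * 60 * 60 * 1000,
  };
  return Number(match[1]) * unitMs[match[2]];
}

// Assumption: one pending entry per interval step, starting at
// start + interval and ending at or before end.
function buildSchedule(start: string, end: string, interval: string): ScheduleEntry[] {
  const stepMs = intervalToMs(interval);
  if (stepMs <= 0) {
    throw new Error("Interval must be greater than zero");
  }
  const endMs = Date.parse(end);
  const entries: ScheduleEntry[] = [];
  for (let t = Date.parse(start) + stepMs; t <= endMs; t += stepMs) {
    entries.push({ run_at: new Date(t).toISOString(), status: "pending", interval });
  }
  return entries;
}

// Same time range and interval as the example request.
const schedule = buildSchedule(
  "2023-03-13T00:00:00.000Z",
  "2023-03-15T00:00:00.000Z",
  "12h"
);
```

With the example's two-day range and 12h interval, `schedule` holds four pending entries, from 2023-03-13T12:00:00.000Z through 2023-03-15T00:00:00.000Z.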

Task manager claims these tasks based on availability and processes the next pending schedule entry. Once complete, the task is put back in the queue if there are any pending schedule entries remaining. The backfill jobs run completely independently.

Proposed enhancement

  • Add the ability to specify a synchronized mode when scheduling. When this mode is used, a groupId (uuid) is generated that links together all the backfill requests in a single bulk request.
  • In this mode, before calling the rule type executor, the ad hoc task runner queries for other backfill jobs with the same groupId whose next scheduled run_at is earlier than (<) the current task's run_at. If such a job exists, the task runner reschedules the current task for some delayed time in the future. The comparison has to be strict: jobs in a group can share the same next run_at (as in the example above), and with <= they would each defer to the other indefinitely.
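
The check in the second bullet can be sketched as a pure function. This is only an illustration under stated assumptions: `BackfillJob`, `Decision`, `decide`, and `RESCHEDULE_DELAY_MS` are hypothetical names, the list of jobs stands in for whatever query the ad hoc task runner would issue, and the one-minute delay is taken from the walkthrough further down. The comparison is strict (`<`), matching the walkthrough, because two jobs that share the same next run_at would otherwise keep deferring to each other.

```typescript
// Hypothetical in-memory view of a backfill job; not the Kibana saved-object shape.
interface BackfillJob {
  ruleId: string;
  groupId?: string; // present only when synchronized mode was requested
  nextRunAt: string | null; // run_at of the next pending entry, null when finished
}

// Assumed delay before a deferred task is retried (the walkthrough uses now + 1m).
const RESCHEDULE_DELAY_MS = 60 * 1000;

type Decision =
  | { action: "run" }
  | { action: "reschedule"; runAt: Date };

// Decide whether the current task may call the rule executor now.
// `jobs` stands in for the result of querying backfill jobs.
function decide(current: BackfillJob, jobs: BackfillJob[], now: Date): Decision {
  // Ungrouped jobs keep today's behavior and run immediately.
  if (current.groupId === undefined || current.nextRunAt === null) {
    return { action: "run" };
  }
  const currentRunAt = Date.parse(current.nextRunAt);
  // Is any other job in the same group still behind this one?
  const anotherJobIsBehind = jobs.some(
    (job) =>
      job.ruleId !== current.ruleId &&
      job.groupId === current.groupId &&
      job.nextRunAt !== null &&
      Date.parse(job.nextRunAt) < currentRunAt
  );
  return anotherJobIsBehind
    ? { action: "reschedule", runAt: new Date(now.getTime() + RESCHEDULE_DELAY_MS) }
    : { action: "run" };
}
```

Keeping the decision separate from the rescheduling side effect makes the ordering rule easy to unit test without a running task manager.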

As a result, task manager still claims the tasks based on availability, but jobs in the same group run in schedule order. Using the schedule above as an example, assume both backfills are assigned groupId: "ABC".

  • Task runner picks up the backfill for rule "1", "run_at": "2023-03-13T12:00:00.000Z", completes the run, and puts the task back in the queue.
  • Task runner picks up the backfill for rule "1" again, "run_at": "2023-03-14T00:00:00.000Z", before the backfill for rule "2" has run its first schedule entry. Because there is a backfill in the same group whose next run_at is < "2023-03-14T00:00:00.000Z", it puts the task back in the queue with a runAt of now + 1m.
  • Task runner picks up and runs the backfill for rule "2", "run_at": "2023-03-13T12:00:00.000Z", completes the run, and puts the task back in the queue.

1 minute later

  • Task runner picks up the backfill for rule "1", "run_at": "2023-03-14T00:00:00.000Z", completes the run, and puts the task back in the queue.
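
The walkthrough can be replayed with a small in-memory simulation. This sketch assumes the deferred task is the backfill for rule "1", claimed a second time before rule "2" has run its first entry. The `Job` shape and the `claim` helper are hypothetical, and real claiming order depends on task manager availability rather than a fixed list.

```typescript
// Hypothetical in-memory model of a backfill job.
interface Job {
  ruleId: string;
  groupId: string;
  pending: string[]; // remaining run_at values, ascending
}

// Simulate the task runner claiming the task for `ruleId`:
// run its next entry, or defer it if a group member is behind.
function claim(jobs: Job[], ruleId: string): string {
  const job = jobs.find((j) => j.ruleId === ruleId);
  if (job === undefined || job.pending.length === 0) {
    return `rule ${ruleId}: nothing to run`;
  }
  const next = job.pending[0];
  const blocked = jobs.some(
    (other) =>
      other !== job &&
      other.groupId === job.groupId &&
      other.pending.length > 0 &&
      Date.parse(other.pending[0]) < Date.parse(next)
  );
  if (blocked) {
    // In the proposal this is where the task is requeued with runAt = now + 1m.
    return `rule ${ruleId}: deferred ${next}`;
  }
  job.pending.shift(); // mark the entry as complete
  return `rule ${ruleId}: ran ${next}`;
}

const runAts = [
  "2023-03-13T12:00:00.000Z",
  "2023-03-14T00:00:00.000Z",
  "2023-03-14T12:00:00.000Z",
  "2023-03-15T00:00:00.000Z",
];
const jobs: Job[] = [
  { ruleId: "1", groupId: "ABC", pending: [...runAts] },
  { ruleId: "2", groupId: "ABC", pending: [...runAts] },
];

// Claim order from the walkthrough: rule 1, rule 1 again, rule 2,
// then rule 1 once more after the one-minute delay.
const log = ["1", "1", "2", "1"].map((ruleId) => claim(jobs, ruleId));
```

Under these assumptions, `log` records a run, a deferral, a run, and a run, in that order, so no job executes an entry while another job in its group still has an earlier one pending.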
ymao1 added the Feature:Alerting and Team:ResponseOps labels Apr 17, 2024
@elasticmachine
Contributor

Pinging @elastic/response-ops (Team:ResponseOps)
