[Response Ops][Alerting] Add serial/synchronized mode to backfills
#181072
Labels: Feature:Alerting, Team:ResponseOps
The current implementation of backfill runs backfill jobs as quickly as possible, depending on worker availability. As a result, when a batch of rules is bulk scheduled, the rules all backfill completely independently of each other. We would like to add a `serial` or `synchronized` mode where scheduling a batch of rules introduces a light dependency so that rules run roughly in order of schedule.

Current implementation
Scheduling a backfill for two rules with `12h` schedule intervals creates 2 backfill jobs. For the purposes of this example, we're using the same backfill time range, so both backfill jobs will have the same schedule.
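To make "the same schedule" concrete, here is a minimal sketch of how two identical schedules could be derived from one time range and a `12h` interval. The `ScheduleEntry` shape, the `buildSchedule` helper, and the time range itself are assumptions for illustration, not the actual backfill implementation; the range was picked so the entries line up with the `run_at` values used later in this issue.

```typescript
// Hypothetical shape of one schedule entry; the field names are assumptions
// loosely based on the `run_at` and `pending` terms used in this issue.
interface ScheduleEntry {
  run_at: string; // ISO timestamp of the run this entry represents
  status: "pending" | "complete";
}

const TWELVE_HOURS_MS = 12 * 60 * 60 * 1000;

// Build one pending entry per interval, stamped with the end of that interval.
function buildSchedule(start: string, end: string, intervalMs: number): ScheduleEntry[] {
  const entries: ScheduleEntry[] = [];
  const endMs = Date.parse(end);
  for (let t = Date.parse(start) + intervalMs; t <= endMs; t += intervalMs) {
    entries.push({ run_at: new Date(t).toISOString(), status: "pending" });
  }
  return entries;
}

// Assumed backfill time range (not stated in the issue).
const range = { start: "2023-03-13T00:00:00.000Z", end: "2023-03-14T00:00:00.000Z" };

// Same range and same interval for both rules, so the schedules are identical.
const rule1Schedule = buildSchedule(range.start, range.end, TWELVE_HOURS_MS);
const rule2Schedule = buildSchedule(range.start, range.end, TWELVE_HOURS_MS);
```

Under these assumptions each job gets two pending entries, and nothing links the two jobs to each other.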
Task manager claims these tasks based on availability and processes the next `pending` schedule entry. Once complete, the task is put back in the queue if there are any pending schedule entries remaining. The backfill jobs run completely independently.

Proposed enhancement
Add a `synchronized` mode when scheduling. When this mode is used, a `groupId` (uuid) is generated that links all the backfill requests in a single bulk request together.

When a backfill task runs, the task runner checks for another backfill job with the same `groupId` where the next scheduled `run_at` is < the current task's `run_at`. If such a job exists, the task runner reschedules the current task for some delayed time in the future.

The outcome of this is that task manager will claim the tasks based on availability, but jobs in the same group run in a specific order. Using the above schedule as an example, we would assign `groupId: "ABC"` to the backfills.

1. Backfill for rule "1" runs `"run_at": "2023-03-13T12:00:00.000Z"`, completes the run, and puts the task back in the queue.
2. Backfill for rule "1" is claimed for `"run_at": "2023-03-14T00:00:00.000Z"` before backfill for rule "2" runs its first schedule entry. Because there's a backfill with the same group where the next `run_at` is < `"2023-03-14T00:00:00.000Z"`, it puts the task back in the queue with a `runAt` of `now + 1m`.
3. Backfill for rule "2" runs `"run_at": "2023-03-13T12:00:00.000Z"`, completes the run, and puts the task back in the queue.
4. 1 minute later, backfill for rule "1" runs `"run_at": "2023-03-14T00:00:00.000Z"`, completes the run, and puts the task back in the queue.
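The group check described above can be sketched roughly as follows. This is only an illustration of the idea, not the task runner's code: the `Backfill` shape, `shouldDefer`, `deferredRunAt`, and the in-memory list of jobs are all hypothetical. The comparison is strict, as in the worked example; with a non-strict comparison, two backfills whose next `run_at` values are equal would each defer to the other.

```typescript
// Hypothetical in-memory view of a backfill job (names are assumptions).
interface Backfill {
  ruleId: string;
  groupId?: string; // only set when synchronized mode is requested
  pendingRunAts: string[]; // pending schedule entries, earliest first
}

const DEFER_DELAY_MS = 60 * 1000; // the "now + 1m" delay from the example

// Timestamp (ms) of the next pending entry, or undefined if none remain.
function nextRunAt(b: Backfill): number | undefined {
  const next = b.pendingRunAts[0];
  return next === undefined ? undefined : Date.parse(next);
}

// True if another backfill in the same group still has an earlier entry pending.
function shouldDefer(current: Backfill, all: Backfill[]): boolean {
  const mine = nextRunAt(current);
  if (current.groupId === undefined || mine === undefined) return false;
  return all.some((other) => {
    const theirs = nextRunAt(other);
    return (
      other !== current &&
      other.groupId === current.groupId &&
      theirs !== undefined &&
      theirs < mine
    );
  });
}

// New runAt for a deferred task.
function deferredRunAt(now: Date): Date {
  return new Date(now.getTime() + DEFER_DELAY_MS);
}

const entries = ["2023-03-13T12:00:00.000Z", "2023-03-14T00:00:00.000Z"];
const rule1: Backfill = { ruleId: "1", groupId: "ABC", pendingRunAts: [...entries] };
const rule2: Backfill = { ruleId: "2", groupId: "ABC", pendingRunAts: [...entries] };
const group = [rule1, rule2];

const step1 = shouldDefer(rule1, group); // rule "1" may run its first entry
rule1.pendingRunAts.shift(); // first entry for rule "1" completes
const step2 = shouldDefer(rule1, group); // rule "2" is still behind, so defer
const step3 = shouldDefer(rule2, group); // rule "2" may run its first entry
rule2.pendingRunAts.shift(); // first entry for rule "2" completes
const step4 = shouldDefer(rule1, group); // rule "1" may now run its second entry
```

Here `step2` is the only check that results in a deferral, which matches the ordering in the example: rule "1" waits until rule "2" has caught up.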