Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consistent scheduling when tasks run within the poll interval of their original time #190093

Merged

Conversation

mikecote
Copy link
Contributor

@mikecote mikecote commented Aug 7, 2024

Resolves #189114

In this PR, I'm changing the logic to calculate the task's next run at. Whenever the gap between the task's runAt and when it was picked up is less than the poll interval, we'll use the runAt to schedule the next. This way we don't continuously add time to the task's next run (ex: running every 1m turns into every 1m 3s).

I've had to modify a few tests to have a more increased interval because this made tasks run more frequently (on time), which introduced flakiness.

To verify

  1. Create an alerting rule that runs every 10s
  2. Apply the following diff to your code
diff --git a/x-pack/plugins/task_manager/server/lib/get_next_run_at.ts b/x-pack/plugins/task_manager/server/lib/get_next_run_at.ts
index 55d5f85e5d3..4342dcdd845 100644
--- a/x-pack/plugins/task_manager/server/lib/get_next_run_at.ts
+++ b/x-pack/plugins/task_manager/server/lib/get_next_run_at.ts
@@ -31,5 +31,7 @@ export function getNextRunAt(
     Date.now()
   );

+  console.log(`*** Next run at: ${new Date(nextCalculatedRunAt).toISOString()}, interval=${newSchedule?.interval ?? schedule.interval}, originalRunAt=${originalRunAt.toISOString()}, startedAt=${startedAt.toISOString()}`);
+
   return new Date(nextCalculatedRunAt);
 }
  1. Observe the logs, the gap between runAt and startedAt should be less than the poll interval, so the next run at is based on runAt instead of startedAt.
  2. Stop Kibana for 15 seconds then start it again
  3. Observe the first logs when the rule runs again and notice now that the gap between runAt and startedAt is larger than the poll interval, the next run at is based on startedAt instead of runAt to spread the tasks out evenly.

@mikecote mikecote added Feature:Task Manager Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) labels Aug 7, 2024
@mikecote mikecote closed this Aug 14, 2024
@mikecote mikecote reopened this Aug 29, 2024
@mikecote
Copy link
Contributor Author

mikecote commented Sep 6, 2024

/ci

@mikecote mikecote changed the title Consistent scheduling when tasks run within 10s of their original time Consistent scheduling when tasks run within the poll interval of their original time Sep 6, 2024
@mikecote
Copy link
Contributor Author

/ci

@mikecote
Copy link
Contributor Author

@elasticmachine merge upstream

@mikecote
Copy link
Contributor Author

/ci

…thub.com:mikecote/kibana into task-manager/consistent-scheduling-gap-threshold
@mikecote
Copy link
Contributor Author

@elasticmachine merge upstream

@mikecote
Copy link
Contributor Author

/ci

@kibanamachine
Copy link
Contributor

Flaky Test Runner Stats

🟠 Some tests failed. - kibana-flaky-test-suite-runner#6917

[❌] x-pack/test/alerting_api_integration/spaces_only/tests/alerting/group3/config.ts: 0/100 tests passed.

see run history

…thub.com:mikecote/kibana into task-manager/consistent-scheduling-gap-threshold
@mikecote
Copy link
Contributor Author

/ci

@mikecote
Copy link
Contributor Author

/ci

@mikecote mikecote requested a review from a team as a code owner September 12, 2024 18:19
@elasticmachine
Copy link
Contributor

Pinging @elastic/response-ops (Team:ResponseOps)

@mikecote mikecote added v9.0.0 backport:prev-minor Backport to (8.x) the previous minor version (i.e. one version back from main) labels Sep 13, 2024
Copy link
Contributor

@ymao1 ymao1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Left some nits

}

const taskSchedule = newSchedule?.interval ?? schedule?.interval;
const taskDelay = Date.now() - originalRunAt.getTime();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this use startedAt instead of Date.now()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, otherwise it would only work if the task finished running within the poll interval 🙈 fixed in 3d5eb39.

return newRunAt;
}

const taskSchedule = newSchedule?.interval ?? schedule?.interval;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit. since this is also calculated in the task runner, we could set const scheduleToUse = newSchedule ?? schedule in the task runner and pass it into the function

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ymao1 Something like this may work, lmk what you think! a396408

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -162,6 +165,7 @@ export class TaskManagerRunner implements TaskRunner {
private eventLoopDelayConfig: EventLoopDelayConfig;
private readonly taskValidator: TaskValidator;
private readonly claimStrategy: string;
private currentPollInterval?: number;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we set a default in case something goes wrong with the observable?

Copy link
Contributor Author

@mikecote mikecote Sep 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, I defaulted it to the config.poll_interval in this commit: d30bbfc

@mikecote
Copy link
Contributor Author

@elasticmachine merge upstream

@mikecote
Copy link
Contributor Author

@elasticmachine merge upstream

@mikecote
Copy link
Contributor Author

/ci

@kibana-ci
Copy link
Collaborator

💛 Build succeeded, but was flaky

Failed CI Steps

Metrics [docs]

✅ unchanged

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@mikecote mikecote merged commit 1f673dc into elastic:main Sep 16, 2024
40 checks passed
kibanamachine pushed a commit to kibanamachine/kibana that referenced this pull request Sep 16, 2024
…r original time (elastic#190093)

Resolves elastic#189114

In this PR, I'm changing the logic to calculate the task's next run at.
Whenever the gap between the task's runAt and when it was picked up is
less than the poll interval, we'll use the `runAt` to schedule the next.
This way we don't continuously add time to the task's next run (ex:
running every 1m turns into every 1m 3s).

I've had to modify a few tests to have a more increased interval because
this made tasks run more frequently (on time), which introduced
flakiness.

## To verify
1. Create an alerting rule that runs every 10s
2. Apply the following diff to your code
```
diff --git a/x-pack/plugins/task_manager/server/lib/get_next_run_at.ts b/x-pack/plugins/task_manager/server/lib/get_next_run_at.ts
index 55d5f85e5d3..4342dcdd845 100644
--- a/x-pack/plugins/task_manager/server/lib/get_next_run_at.ts
+++ b/x-pack/plugins/task_manager/server/lib/get_next_run_at.ts
@@ -31,5 +31,7 @@ export function getNextRunAt(
     Date.now()
   );

+  console.log(`*** Next run at: ${new Date(nextCalculatedRunAt).toISOString()}, interval=${newSchedule?.interval ?? schedule.interval}, originalRunAt=${originalRunAt.toISOString()}, startedAt=${startedAt.toISOString()}`);
+
   return new Date(nextCalculatedRunAt);
 }
```
3. Observe the logs, the gap between runAt and startedAt should be less
than the poll interval, so the next run at is based on `runAt` instead
of `startedAt`.
4. Stop Kibana for 15 seconds then start it again
5. Observe the first logs when the rule runs again and notice now that
the gap between runAt and startedAt is larger than the poll interval,
the next run at is based on `startedAt` instead of `runAt` to spread the
tasks out evenly.

---------

Co-authored-by: Elastic Machine <[email protected]>
(cherry picked from commit 1f673dc)
@kibanamachine
Copy link
Contributor

💚 All backports created successfully

Status Branch Result
8.x

Note: Successful backport PRs will be merged automatically after passing CI.

Questions ?

Please refer to the Backport tool documentation

kibanamachine added a commit that referenced this pull request Sep 16, 2024
…f their original time (#190093) (#193022)

# Backport

This will backport the following commits from `main` to `8.x`:
- [Consistent scheduling when tasks run within the poll interval of
their original time
(#190093)](#190093)

<!--- Backport version: 9.4.3 -->

### Questions ?
Please refer to the [Backport tool
documentation](https://github.com/sqren/backport)

<!--BACKPORT [{"author":{"name":"Mike
Côté","email":"[email protected]"},"sourceCommit":{"committedDate":"2024-09-16T14:10:36Z","message":"Consistent
scheduling when tasks run within the poll interval of their original
time (#190093)\n\nResolves
https://github.com/elastic/kibana/issues/189114\r\n\r\nIn this PR, I'm
changing the logic to calculate the task's next run at.\r\nWhenever the
gap between the task's runAt and when it was picked up is\r\nless than
the poll interval, we'll use the `runAt` to schedule the next.\r\nThis
way we don't continuously add time to the task's next run
(ex:\r\nrunning every 1m turns into every 1m 3s).\r\n\r\nI've had to
modify a few tests to have a more increased interval because\r\nthis
made tasks run more frequently (on time), which
introduced\r\nflakiness.\r\n\r\n## To verify\r\n1. Create an alerting
rule that runs every 10s\r\n2. Apply the following diff to your
code\r\n```\r\ndiff --git
a/x-pack/plugins/task_manager/server/lib/get_next_run_at.ts
b/x-pack/plugins/task_manager/server/lib/get_next_run_at.ts\r\nindex
55d5f85e5d3..4342dcdd845 100644\r\n---
a/x-pack/plugins/task_manager/server/lib/get_next_run_at.ts\r\n+++
b/x-pack/plugins/task_manager/server/lib/get_next_run_at.ts\r\n@@ -31,5
+31,7 @@ export function getNextRunAt(\r\n Date.now()\r\n );\r\n\r\n+
console.log(`*** Next run at: ${new
Date(nextCalculatedRunAt).toISOString()},
interval=${newSchedule?.interval ?? schedule.interval},
originalRunAt=${originalRunAt.toISOString()},
startedAt=${startedAt.toISOString()}`);\r\n+\r\n return new
Date(nextCalculatedRunAt);\r\n }\r\n```\r\n3. Observe the logs, the gap
between runAt and startedAt should be less\r\nthan the poll interval, so
the next run at is based on `runAt` instead\r\nof `startedAt`.\r\n4.
Stop Kibana for 15 seconds then start it again\r\n5. Observe the first
logs when the rule runs again and notice now that\r\nthe gap between
runAt and startedAt is larger than the poll interval,\r\nthe next run at
is based on `startedAt` instead of `runAt` to spread the\r\ntasks out
evenly.\r\n\r\n---------\r\n\r\nCo-authored-by: Elastic Machine
<[email protected]>","sha":"1f673dc9f12e90a6aa41a903fee8b0adafcdcaf9","branchLabelMapping":{"^v9.0.0$":"main","^v8.16.0$":"8.x","^v(\\d+).(\\d+).\\d+$":"$1.$2"}},"sourcePullRequest":{"labels":["release_note:skip","Feature:Task
Manager","Team:ResponseOps","v9.0.0","backport:prev-minor","v8.16.0"],"title":"Consistent
scheduling when tasks run within the poll interval of their original
time","number":190093,"url":"https://github.com/elastic/kibana/pull/190093","mergeCommit":{"message":"Consistent
scheduling when tasks run within the poll interval of their original
time (#190093)\n\nResolves
https://github.com/elastic/kibana/issues/189114\r\n\r\nIn this PR, I'm
changing the logic to calculate the task's next run at.\r\nWhenever the
gap between the task's runAt and when it was picked up is\r\nless than
the poll interval, we'll use the `runAt` to schedule the next.\r\nThis
way we don't continuously add time to the task's next run
(ex:\r\nrunning every 1m turns into every 1m 3s).\r\n\r\nI've had to
modify a few tests to have a more increased interval because\r\nthis
made tasks run more frequently (on time), which
introduced\r\nflakiness.\r\n\r\n## To verify\r\n1. Create an alerting
rule that runs every 10s\r\n2. Apply the following diff to your
code\r\n```\r\ndiff --git
a/x-pack/plugins/task_manager/server/lib/get_next_run_at.ts
b/x-pack/plugins/task_manager/server/lib/get_next_run_at.ts\r\nindex
55d5f85e5d3..4342dcdd845 100644\r\n---
a/x-pack/plugins/task_manager/server/lib/get_next_run_at.ts\r\n+++
b/x-pack/plugins/task_manager/server/lib/get_next_run_at.ts\r\n@@ -31,5
+31,7 @@ export function getNextRunAt(\r\n Date.now()\r\n );\r\n\r\n+
console.log(`*** Next run at: ${new
Date(nextCalculatedRunAt).toISOString()},
interval=${newSchedule?.interval ?? schedule.interval},
originalRunAt=${originalRunAt.toISOString()},
startedAt=${startedAt.toISOString()}`);\r\n+\r\n return new
Date(nextCalculatedRunAt);\r\n }\r\n```\r\n3. Observe the logs, the gap
between runAt and startedAt should be less\r\nthan the poll interval, so
the next run at is based on `runAt` instead\r\nof `startedAt`.\r\n4.
Stop Kibana for 15 seconds then start it again\r\n5. Observe the first
logs when the rule runs again and notice now that\r\nthe gap between
runAt and startedAt is larger than the poll interval,\r\nthe next run at
is based on `startedAt` instead of `runAt` to spread the\r\ntasks out
evenly.\r\n\r\n---------\r\n\r\nCo-authored-by: Elastic Machine
<[email protected]>","sha":"1f673dc9f12e90a6aa41a903fee8b0adafcdcaf9"}},"sourceBranch":"main","suggestedTargetBranches":["8.x"],"targetPullRequestStates":[{"branch":"main","label":"v9.0.0","branchLabelMappingKey":"^v9.0.0$","isSourceBranch":true,"state":"MERGED","url":"https://github.com/elastic/kibana/pull/190093","number":190093,"mergeCommit":{"message":"Consistent
scheduling when tasks run within the poll interval of their original
time (#190093)\n\nResolves
https://github.com/elastic/kibana/issues/189114\r\n\r\nIn this PR, I'm
changing the logic to calculate the task's next run at.\r\nWhenever the
gap between the task's runAt and when it was picked up is\r\nless than
the poll interval, we'll use the `runAt` to schedule the next.\r\nThis
way we don't continuously add time to the task's next run
(ex:\r\nrunning every 1m turns into every 1m 3s).\r\n\r\nI've had to
modify a few tests to have a more increased interval because\r\nthis
made tasks run more frequently (on time), which
introduced\r\nflakiness.\r\n\r\n## To verify\r\n1. Create an alerting
rule that runs every 10s\r\n2. Apply the following diff to your
code\r\n```\r\ndiff --git
a/x-pack/plugins/task_manager/server/lib/get_next_run_at.ts
b/x-pack/plugins/task_manager/server/lib/get_next_run_at.ts\r\nindex
55d5f85e5d3..4342dcdd845 100644\r\n---
a/x-pack/plugins/task_manager/server/lib/get_next_run_at.ts\r\n+++
b/x-pack/plugins/task_manager/server/lib/get_next_run_at.ts\r\n@@ -31,5
+31,7 @@ export function getNextRunAt(\r\n Date.now()\r\n );\r\n\r\n+
console.log(`*** Next run at: ${new
Date(nextCalculatedRunAt).toISOString()},
interval=${newSchedule?.interval ?? schedule.interval},
originalRunAt=${originalRunAt.toISOString()},
startedAt=${startedAt.toISOString()}`);\r\n+\r\n return new
Date(nextCalculatedRunAt);\r\n }\r\n```\r\n3. Observe the logs, the gap
between runAt and startedAt should be less\r\nthan the poll interval, so
the next run at is based on `runAt` instead\r\nof `startedAt`.\r\n4.
Stop Kibana for 15 seconds then start it again\r\n5. Observe the first
logs when the rule runs again and notice now that\r\nthe gap between
runAt and startedAt is larger than the poll interval,\r\nthe next run at
is based on `startedAt` instead of `runAt` to spread the\r\ntasks out
evenly.\r\n\r\n---------\r\n\r\nCo-authored-by: Elastic Machine
<[email protected]>","sha":"1f673dc9f12e90a6aa41a903fee8b0adafcdcaf9"}},{"branch":"8.x","label":"v8.16.0","branchLabelMappingKey":"^v8.16.0$","isSourceBranch":false,"state":"NOT_CREATED"}]}]
BACKPORT-->

Co-authored-by: Mike Côté <[email protected]>
mikecote added a commit to mikecote/kibana that referenced this pull request Sep 20, 2024
mikecote added a commit to mikecote/kibana that referenced this pull request Sep 20, 2024
… poll interval of their original time (elastic#190093) (elastic#193022)""

This reverts commit 6296a6e.
mikecote added a commit that referenced this pull request Sep 20, 2024
In this PR, I'm fixing a memory leak that was introduced in
#190093 where every task runner
class object wouldn't free up in memory because it subscribed to the
`pollIntervalConfiguration$` observable. To fix this, I moved the
observable up a class into `TaskPollingLifecycle` which only gets
created once on plugin start and then pass down the pollInterval value
via a function call the task runner class can call.
kibanamachine pushed a commit to kibanamachine/kibana that referenced this pull request Sep 20, 2024
In this PR, I'm fixing a memory leak that was introduced in
elastic#190093 where every task runner
class object wouldn't free up in memory because it subscribed to the
`pollIntervalConfiguration$` observable. To fix this, I moved the
observable up a class into `TaskPollingLifecycle` which only gets
created once on plugin start and then pass down the pollInterval value
via a function call the task runner class can call.

(cherry picked from commit cf6e8b5)
kibanamachine added a commit that referenced this pull request Sep 20, 2024
# Backport

This will backport the following commits from `main` to `8.x`:
- [Fix memory leak in task manager task runner
(#193612)](#193612)

<!--- Backport version: 9.4.3 -->

### Questions ?
Please refer to the [Backport tool
documentation](https://github.com/sqren/backport)

<!--BACKPORT [{"author":{"name":"Mike
Côté","email":"[email protected]"},"sourceCommit":{"committedDate":"2024-09-20T17:52:26Z","message":"Fix
memory leak in task manager task runner (#193612)\n\nIn this PR, I'm
fixing a memory leak that was introduced
in\r\nhttps://github.com//pull/190093 where every task
runner\r\nclass object wouldn't free up in memory because it subscribed
to the\r\n`pollIntervalConfiguration# Backport

This will backport the following commits from `main` to `8.x`:
- [Fix memory leak in task manager task runner
(#193612)](#193612)

<!--- Backport version: 9.4.3 -->

### Questions ?
Please refer to the [Backport tool
documentation](https://github.com/sqren/backport)

<!--BACKPORT observable. To fix this, I moved the\r\nobservable up a
class into `TaskPollingLifecycle` which only gets\r\ncreated once on
plugin start and then pass down the pollInterval value\r\nvia a function
call the task runner class can
call.","sha":"cf6e8b5ba971fffe2a57e1a7c573e60cc2fbe280","branchLabelMapping":{"^v9.0.0$":"main","^v8.16.0$":"8.x","^v(\\d+).(\\d+).\\d+$":"$1.$2"}},"sourcePullRequest":{"labels":["release_note:skip","Feature:Task
Manager","Team:ResponseOps","v9.0.0","backport:prev-minor","v8.16.0"],"title":"Fix
memory leak in task manager task
runner","number":193612,"url":"https://github.com/elastic/kibana/pull/193612","mergeCommit":{"message":"Fix
memory leak in task manager task runner (#193612)\n\nIn this PR, I'm
fixing a memory leak that was introduced
in\r\nhttps://github.com//pull/190093 where every task
runner\r\nclass object wouldn't free up in memory because it subscribed
to the\r\n`pollIntervalConfiguration# Backport

This will backport the following commits from `main` to `8.x`:
- [Fix memory leak in task manager task runner
(#193612)](#193612)

<!--- Backport version: 9.4.3 -->

### Questions ?
Please refer to the [Backport tool
documentation](https://github.com/sqren/backport)

<!--BACKPORT observable. To fix this, I moved the\r\nobservable up a
class into `TaskPollingLifecycle` which only gets\r\ncreated once on
plugin start and then pass down the pollInterval value\r\nvia a function
call the task runner class can
call.","sha":"cf6e8b5ba971fffe2a57e1a7c573e60cc2fbe280"}},"sourceBranch":"main","suggestedTargetBranches":["8.x"],"targetPullRequestStates":[{"branch":"main","label":"v9.0.0","branchLabelMappingKey":"^v9.0.0$","isSourceBranch":true,"state":"MERGED","url":"https://github.com/elastic/kibana/pull/193612","number":193612,"mergeCommit":{"message":"Fix
memory leak in task manager task runner (#193612)\n\nIn this PR, I'm
fixing a memory leak that was introduced
in\r\nhttps://github.com//pull/190093 where every task
runner\r\nclass object wouldn't free up in memory because it subscribed
to the\r\n`pollIntervalConfiguration# Backport

This will backport the following commits from `main` to `8.x`:
- [Fix memory leak in task manager task runner
(#193612)](#193612)

<!--- Backport version: 9.4.3 -->

### Questions ?
Please refer to the [Backport tool
documentation](https://github.com/sqren/backport)

<!--BACKPORT observable. To fix this, I moved the\r\nobservable up a
class into `TaskPollingLifecycle` which only gets\r\ncreated once on
plugin start and then pass down the pollInterval value\r\nvia a function
call the task runner class can
call.","sha":"cf6e8b5ba971fffe2a57e1a7c573e60cc2fbe280"}},{"branch":"8.x","label":"v8.16.0","branchLabelMappingKey":"^v8.16.0$","isSourceBranch":false,"state":"NOT_CREATED"}]}]
BACKPORT-->

Co-authored-by: Mike Côté <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport:prev-minor Backport to (8.x) the previous minor version (i.e. one version back from main) Feature:Task Manager release_note:skip Skip the PR/issue when compiling release notes Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v8.16.0 v9.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Task Manager] Improve scheduling of recurring tasks so they re-run more on-time
5 participants