[ResponseOps][TaskManager] fix limited concurrency starvation in mget task claimer #187809

pmuellr · 2024-07-08T23:55:17Z

Summary

Fixes a problem where limited concurrency tasks could starve unlimited concurrency tasks. The fix uses _msearch to search for limited concurrency tasks separately from unlimited concurrency tasks.
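A minimal sketch of the idea, assuming one inner search for all unlimited concurrency types plus one inner search per limited concurrency type. `SearchSpec` and `buildSearches` are hypothetical names used for illustration; they are not the actual Kibana types or functions.

```typescript
// Hypothetical shape of one inner search; not the real SearchOpts type.
interface SearchSpec {
  types: string[]; // task types this inner search covers
  size: number; // maximum number of tasks to return
}

// Build one search for all unlimited types and one search per limited type,
// so that a large backlog of one kind cannot crowd out the other.
function buildSearches(
  limitedTypes: Map<string, number>, // type -> number of tasks it may still claim
  unlimitedTypes: string[],
  unlimitedSize: number
): SearchSpec[] {
  const searches: SearchSpec[] = [];
  if (unlimitedTypes.length > 0 && unlimitedSize > 0) {
    searches.push({ types: unlimitedTypes, size: unlimitedSize });
  }
  limitedTypes.forEach((size, type) => {
    searches.push({ types: [type], size });
  });
  return searches;
}

// Example: two unlimited types share one search; the limited type gets its own.
const exampleLimited = new Map<string, number>();
exampleLimited.set('report', 1);
const exampleSearches = buildSearches(exampleLimited, ['alerting', 'actions'], 10);
```

Each entry would then become one inner search of a single _msearch request, each with its own size.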

Checklist

Unit or functional tests were updated or added to match the most common scenarios
Flaky Test Runner was used on any tests changed

pmuellr · 2024-07-08T23:55:26Z

/ci

pmuellr · 2024-07-09T23:57:52Z

/ci

elasticmachine · 2024-07-10T03:02:14Z

💛 Build succeeded, but was flaky

Buildkite Build
Commit: 0d00243

Failed CI Steps

Test Failures

[job] [logs] FTR Configs #30 / Rule execution logic API Detection Engine - Execution logic @ess @serverless @skipInServerlessMKI Machine Learning Detection Rule - Alert Suppression with an active ML Job with interval suppression duration performs no suppression if a single alert is generated

Metrics [docs]

✅ unchanged

History

💔 Build #220244 failed 2b0c750

pmuellr · 2024-07-24T03:28:52Z

/ci

pmuellr · 2024-07-26T04:36:52Z

/ci

elasticmachine · 2024-07-26T15:46:04Z

Pinging @elastic/response-ops (Team:ResponseOps)

ymao1

Left a round of comments after code review. Will verify it next!

ymao1 · 2024-07-26T17:17:18Z

x-pack/plugins/task_manager/server/task_store.ts

+
+    for (const response of responses) {
+      if (response.status !== 200) {
+        throw new Error(`Unexpected status code: ${response.status}`);


should we pass this error to this.errors$ as well?

Ya, though it's making me wonder, with the weird partial result error stuff from CCS calls, should we just skip over these? If just one of the queries is bad for some reason, but the other ones were ok, and that was consistent, we'd never pull any tasks. Vs pulling tasks for everything but one of the "inner searches" failing.

I guess we'll figure that out ... :-)

added in 842d402
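The trade-off raised above (fail the whole claim cycle versus skip only the failed inner search) can be sketched as follows. This illustrates the skip-and-report alternative and is not necessarily what the PR does. `InnerResponse`, `collectHits`, and `reportError` are hypothetical names, with `reportError` standing in for pushing to this.errors$.

```typescript
// Hypothetical, simplified shape of one inner _msearch response.
interface InnerResponse {
  status: number;
  hits: string[]; // simplified to task ids
}

// Keep results from successful inner searches and report the failed ones,
// instead of throwing on the first non-200 status.
function collectHits(
  responses: InnerResponse[],
  reportError: (err: Error) => void
): string[] {
  const ids: string[] = [];
  for (const response of responses) {
    if (response.status !== 200) {
      reportError(new Error(`Unexpected status code: ${response.status}`));
      continue; // skip this inner search only
    }
    ids.push(...response.hits);
  }
  return ids;
}
```

With this policy, one consistently failing inner search no longer prevents tasks from the other searches from being claimed, at the cost of silently under-claiming unless the reported errors are monitored.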

ymao1 · 2024-07-26T17:18:32Z

x-pack/plugins/task_manager/server/task_store.ts

@@ -504,6 +505,36 @@ export class TaskStore {
    }
  }

+  async msearch(opts: SearchOpts[] = []): Promise<FetchResult> {


can we add a unit test for this?

Just a note: this is NOT in 842d402, since I wanted to get the functional changes in first. I think this test will likely be pretty hairy and could probably be deferred (but taking a look right now!)

Added in 4b357c1.

ymao1 · 2024-07-26T17:23:26Z

x-pack/plugins/task_manager/server/task_claimers/strategy_mget.ts

    }
+
+    const capacity = getCapacity(definition.type);
+    result.limitedTypes.set(definition.type, capacity);


should we check for capacity=0 and not add to this map to avoid issuing a query with size 0?

added in 842d402
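A minimal sketch of the zero-capacity check suggested above. `collectLimitedTypes` is a hypothetical helper, and `getCapacity` is assumed to return the remaining capacity for a type.

```typescript
// Record only the limited types that still have room, so that no inner
// search is issued with size 0.
function collectLimitedTypes(
  types: string[],
  getCapacity: (type: string) => number
): Map<string, number> {
  const limitedTypes = new Map<string, number>();
  for (const type of types) {
    const capacity = getCapacity(type);
    if (capacity <= 0) continue; // nothing can be claimed for this type
    limitedTypes.set(type, capacity);
  }
  return limitedTypes;
}
```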

ymao1 · 2024-07-26T17:24:12Z

x-pack/plugins/task_manager/server/task_claimers/strategy_mget.ts

+      RecognizedTask
+    );
+
+    const query = matchesClauses(queryForLimitedTasks, filterDownBy(InactiveTasks));


do we need to add tasksWithPartitions to this clause?

added in 842d402

ymao1 · 2024-07-29T18:31:12Z

x-pack/plugins/task_manager/server/task_claimers/strategy_mget.ts

    }
+
+    const capacity = getCapacity(definition.type);


The capacity returned here is now expressed in cost units (for the mget claim strategy), so for a normal-cost task with maxConcurrency=1, it'll return 2. To convert capacity to a "number of tasks we can search for", I would divide this by the cost of the task:

const capacity = getCapacity(definition.type) / definition.cost

added in 842d402
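The conversion can be sketched as below. `capacityInTasks` is a hypothetical helper, and rounding down with `Math.floor` is an assumption added here so that the result is a whole number of tasks; the review comment only specifies the division.

```typescript
// Convert a capacity expressed in cost units into a number of tasks.
function capacityInTasks(capacityInCost: number, costPerTask: number): number {
  if (costPerTask <= 0) {
    throw new Error('costPerTask must be positive');
  }
  // Assumption: round down, since a fraction of a task cannot be claimed.
  return Math.floor(capacityInCost / costPerTask);
}

// The case from the comment: capacity 2 (in cost) for a task of cost 2.
const tasksToSearchFor = capacityInTasks(2, 2);
```

For the example in the comment, a capacity of 2 divided by a cost of 2 gives a search size of one task, which matches maxConcurrency=1.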

ymao1 · 2024-07-29T18:40:00Z

x-pack/plugins/task_manager/server/task_claimers/strategy_mget.ts

+
+  const { types, excludedTypes, removedTypes, getCapacity, definitions } = opts;
+  for (const type of types) {
+    if (excludedTypes.has(type)) continue;


I noticed while adding an integration test #189431 that this uses slightly different logic than the default task claimer and doesn't respect wildcards. I think we should use the same function used for the default task claimer. Updated in my integration test PR so one of us will have a conflict!

I ended up fixing this in the last main merge, since that was part of what the merge conflicted with.
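To illustrate why an exact `excludedTypes.has(type)` check misses wildcard patterns, here is a small sketch of wildcard-aware matching. `matchesPattern` and `isExcluded` are hypothetical helpers written for this example; they are not the function the default task claimer uses.

```typescript
// Treat '*' in a pattern as "any sequence of characters"; every other
// character is matched literally.
function matchesPattern(type: string, pattern: string): boolean {
  const escaped = pattern
    .split('*')
    .map((part) => part.replace(/[.+?^${}()|[\]\\]/g, '\\$&'))
    .join('.*');
  return new RegExp(`^${escaped}$`).test(type);
}

// A type is excluded if it matches any of the configured patterns.
function isExcluded(type: string, excludedPatterns: string[]): boolean {
  return excludedPatterns.some((pattern) => matchesPattern(type, pattern));
}
```

With a pattern such as `actions:*`, a Set lookup on the literal type name would return false for `actions:email`, while the wildcard-aware check returns true.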

pmuellr · 2024-07-29T20:06:50Z

x-pack/plugins/task_manager/server/queries/mark_available_tasks_as_claimed.ts

@@ -15,23 +15,6 @@ import {
  MustNotCondition,
 } from './query_clauses';

-export function taskWithLessThanMaxAttempts(type: string, maxAttempts: number): MustCondition {


I noticed a few lingering references to search-related things regarding tasks running too many attempts. I believe this got resolved in #152841; though not sure if that applies to recurring tasks. @mikecote @ymao1 ??? In any case, this function was no longer being used, so figured I might as well delete it.

Yea I don't think we enforced anything with max attempts for recurring task types.

+1 shouldn't be used for recurring tasks, only ad-hoc (one time) tasks

pmuellr · 2024-08-19T19:18:45Z

@elasticmachine merge upstream

pmuellr · 2024-08-23T17:51:46Z

@elasticmachine merge upstream

mikecote · 2024-08-26T11:14:49Z

@elasticmachine merge upstream

ymao1 · 2024-08-26T12:46:33Z

x-pack/plugins/task_manager/server/task_claimers/strategy_mget.ts

      query,
      sort,
-      size,
+      size: capacity,


wonder if we should add a size multiplier here to account for possible conflicts?

Yeah, we should, as the same concept for mget applies here. I'll add that in the code.

Added in 48806ca.
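A sketch of the size multiplier idea: fetch more candidates than there is capacity for, so that some can be lost to conflicts with other Kibana instances without leaving capacity unused. The constant name SIZE_MULTIPLIER_FOR_TASK_FETCH comes from this thread; its value here and the `searchSize` helper are assumptions for illustration.

```typescript
// Assumed value for illustration only; the real value is not shown in this PR thread.
const SIZE_MULTIPLIER_FOR_TASK_FETCH = 4;

// Over-fetch candidates to leave headroom for claim conflicts.
function searchSize(capacity: number): number {
  return capacity * SIZE_MULTIPLIER_FOR_TASK_FETCH;
}
```

Because more candidates come back than can actually run, the results still have to be trimmed to the real capacity before they are claimed.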

ymao1 · 2024-08-26T13:15:28Z

x-pack/plugins/task_manager/server/task_claimers/strategy_mget.ts

@@ -167,7 +166,7 @@ async function claimAvailableTasks(opts: TaskClaimerOpts): Promise<ClaimOwnershi
  }

  // apply limited concurrency limits (TODO: can currently starve other tasks)
-  const candidateTasks = applyLimitedConcurrency(currentTasks, batches);
+  const candidateTasks = selectTasksByCapacity(currentTasks, batches);


wonder if we still need this since we're searching directly using the msearch?

I think because we will now apply the SIZE_MULTIPLIER_FOR_TASK_FETCH multiplier, we'll need to replicate the concurrency limitations in Kibana. I think this function will still be necessary but looking at the code, it should still consider the available capacity (tasks currently running).

We discussed offline: given that the code works with a concurrency of 1, we can follow up in #191301 to fix the code for concurrencies > 1.
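A minimal sketch of trimming over-fetched candidates back to the available capacity per type. `Task` and `trimToCapacity` are hypothetical names, not the signature of selectTasksByCapacity, and `remaining` is assumed to already account for tasks currently running.

```typescript
// Hypothetical, simplified task shape.
interface Task {
  id: string;
  type: string;
}

// Keep candidates in order, but never more per limited type than its
// remaining capacity. Types absent from `remaining` are treated as unlimited.
function trimToCapacity(candidates: Task[], remaining: Map<string, number>): Task[] {
  const used = new Map<string, number>();
  const selected: Task[] = [];
  for (const task of candidates) {
    const limit = remaining.get(task.type);
    if (limit === undefined) {
      selected.push(task); // unlimited concurrency type
      continue;
    }
    const count = used.get(task.type) ?? 0;
    if (count >= limit) continue; // this type is already at its limit
    used.set(task.type, count + 1);
    selected.push(task);
  }
  return selected;
}
```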

ymao1

LGTM

kibana-ci · 2024-08-26T18:51:09Z

💛 Build succeeded, but was flaky

Buildkite Build
Commit: 426a306

Failed CI Steps

FTR Configs #62

Test Failures

[job] [logs] FTR Configs #62 / Cloud Security Posture Test adding Cloud Security Posture Integrations CSPM AWS CIS_AWS Single Manual Assume Role CIS_AWS Single Manual Assume Role Workflow

Metrics [docs]

✅ unchanged

History

💛 Build #229738 was flaky b2a00c5
💛 Build #229554 was flaky f13d557
💔 Build #229430 failed 4b357c1
💚 Build #229307 succeeded 842d402
💛 Build #228389 was flaky fa7f630
💚 Build #226417 succeeded ae2a46c

some starting changes

2b0c750

pmuellr added the Feature:Task Manager, Team:ResponseOps, and v8.16.0 labels Jul 8, 2024

pmuellr added 2 commits July 9, 2024 01:04

close, I think

e7d505f

fix jest test

0d00243

merge main and fix conflicts

1bde8b7

merge main and fix conflicts

1331959

pmuellr marked this pull request as ready for review July 26, 2024 15:46

pmuellr requested a review from a team as a code owner July 26, 2024 15:46

pmuellr added the release_note:skip label Jul 26, 2024

ymao1 reviewed Jul 26, 2024

ymao1 reviewed Jul 29, 2024

ymao1 reviewed Jul 29, 2024

pmuellr commented Jul 29, 2024

pmuellr added 3 commits July 29, 2024 23:02

add test for claimSort()

450acb3

merge main and fix conflicts

7a0d574

merge main and fix conflicts

ae2a46c

mikecote mentioned this pull request Aug 19, 2024

Scaling the alerting throughput ceiling from 3,200 to 32,000+ rules per minute #188194

Open

48 tasks

elasticmachine and others added 3 commits August 20, 2024 05:18

Merge branch 'main' into 184937-mget-fix-concurrency

fa7f630

merge main and fix conflicts

88a4f29

changes from PR review

842d402

Copy search jest tests

4b357c1

Merge branch 'main' into 184937-mget-fix-concurrency

f13d557

Merge branch 'main' into 184937-mget-fix-concurrency

b2a00c5

ymao1 reviewed Aug 26, 2024

Add size multiplier for limited concurrency tasks

48806ca

mikecote mentioned this pull request Aug 26, 2024

Fix Task Manager mget to support various concurrency settings #191301

Open

Merge with main

426a306

ymao1 approved these changes Aug 26, 2024

mikecote merged commit d3fdb7d into elastic:main Aug 26, 2024
38 checks passed

kibanamachine added the backport:skip label Aug 26, 2024
