Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ResponseOps] implement task claiming strategy mget #180485

Merged
merged 30 commits into from
Jun 13, 2024

Conversation

pmuellr
Copy link
Member

@pmuellr pmuellr commented Apr 10, 2024

resolves: #181325

Summary

Adds a new task claiming strategy mget, which can be used instead of the default one default. Add the following to your kibana.yml to enable it:

xpack.task_manager.claim_strategy: 'unsafe_mget'

Work still to be done

To Verify

Add xpack.task_manager.claim_strategy: 'unsafe_mget' to your kibana.dev.yml and you'll be using the new claimer. The most obvious problem is the task manager health report has null/0 values for the task claim section (not sure why, yet).

You can set TM debug logging on, and should see a message every claim cycle:

task claimer claimed: N; stale: N; conflicts: N; missing: N; updateErrors: N; removed: N;

Here's the logging stanza:

logging:
  loggers:
    - name:  plugins.taskManager
      level: debug

When running multiple Kibana instances, you'll see cases where all the potential tasks are stale, or the updates were in conflict, and nothing actually gets claimed (noted in things to fix, above).

If you want to dig a little further, a command-line tool is available in x-pack/plugins/task_manager/server/manual_tests/get_rule_run_event_logs.js. It is used to pull rule run execution documents from multiple clusters at once, and provide some augmented info in them, regarding workers, idle time, etc. The idea is to do an A/B test of using the new task claimer vs default, then see how the runs compare.

@pmuellr pmuellr force-pushed the 155770-tm-poll-mget branch from 746af9d to 80f8a9a Compare April 15, 2024 13:26
@pmuellr pmuellr changed the title [ResponseOps] implement task claiming stragegy mget [ResponseOps] implement task claiming strategy mget Apr 29, 2024
@pmuellr pmuellr added ci:build-cloud-image ci:cloud-deploy Create or update a Cloud deployment ci:cloud-persist-deployment Persist cloud deployment indefinitely ci:build-serverless-image ci:project-deploy-elasticsearch Create an Elasticsearch Serverless project ci:project-persist-deployment Persist project deployment indefinitely ci:collect-apm labels May 3, 2024
@pmuellr
Copy link
Member Author

pmuellr commented May 3, 2024

/ci

@pmuellr pmuellr force-pushed the 155770-tm-poll-mget branch from a855be0 to f589608 Compare May 3, 2024 15:36
@pmuellr
Copy link
Member Author

pmuellr commented May 6, 2024

Taking qaf for a spin ...

$ qaf rac alert-load \
--rule-count     200 \
--rule-interval  1m \
--run-minutes    10 \
--percent-firing  0 \
--es-url         https://keepkibana-pr-180485-elasticsearch-ea83fb.es.eu-west-1.aws.qa.elastic.cloud \
--kibana-url     https://keepkibana-pr-180485-elasticsearch-ea83fb.kb.eu-west-1.aws.qa.elastic.cloud \
--username       testing-internal \
--password       [secret-here]

That command will create 200 rules for 10m, and then produce some Dashboards showing some stats. Nice! The TM stats don't seem to be there, guessing that's because the TM health report is not really available for serverless. I'm going to take a look at the event log directly instead ...

resolves: elastic#155770

Adds a new task claiming strategy `mget`, which can be used instead of
the default one `default`.  Add the following to your `kibana.yml` to
enable it:

    xpack.task_manager.claim_strategy: 'mget'
@pmuellr pmuellr force-pushed the 155770-tm-poll-mget branch from f589608 to 61c616f Compare May 7, 2024 01:13
@pmuellr
Copy link
Member Author

pmuellr commented May 7, 2024

/ci

@pmuellr pmuellr removed ci:collect-apm ci:cloud-deploy Create or update a Cloud deployment ci:cloud-persist-deployment Persist cloud deployment indefinitely ci:project-deploy-elasticsearch Create an Elasticsearch Serverless project ci:project-persist-deployment Persist project deployment indefinitely labels May 7, 2024
@pmuellr
Copy link
Member Author

pmuellr commented May 7, 2024

I removed the cluster / project auto-deployments - they're hard to control, I figure using the custom images will be good enough.

@pmuellr
Copy link
Member Author

pmuellr commented May 8, 2024

/ci

@pmuellr
Copy link
Member Author

pmuellr commented May 8, 2024

/ci

@pmuellr
Copy link
Member Author

pmuellr commented May 8, 2024

I changed the default task claimer to the one implemented here, to make it easy to test in cloud without overrides.

Interesting to see the FT failures - I thought there would be more!

@pmuellr
Copy link
Member Author

pmuellr commented May 8, 2024

image

First time trying on serverless. This is the count of "stale" messages the claimer found from it's search, after the mget, with three background instances. Curious only two report stale entries, seems like a timing thing. I would guess over time, since we randomize the interval a tiny bit, that these "stale" finds will migrate across the kibanas. Will be fun to see over a longer time frame.

The graph below is per second. Super interesting that the number of stales is usually around 10. Which also makes sense. The second Kibana found the same 10 the previous Kibana found, but the previous one claimed them so the mget marked them stale. I guess as we scale up, we may see 20 as a number, or maybe chaos kinda takes over at that point.

Screenshot 2024-05-08 at 10 08 24 AM

@pmuellr
Copy link
Member Author

pmuellr commented May 10, 2024

/ci

@pmuellr
Copy link
Member Author

pmuellr commented Jun 6, 2024

My last commit before my last comment-based merge upstream had some functional test errors. Hoping they're fixed with the revert of the metrics PR 🤞🏻 .

Copy link
Contributor

@mikecote mikecote left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got all the feedback I needed out from my side, thanks @pmuellr for getting this through ❤️ Once the PR is merged, we should create the follow up issues as we'll most likely prioritize them right after.

.github/CODEOWNERS Show resolved Hide resolved
x-pack/plugins/task_manager/server/task_store.ts Outdated Show resolved Hide resolved
Copy link
Contributor

@ymao1 ymao1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for addressing comments

@pmuellr
Copy link
Member Author

pmuellr commented Jun 6, 2024

@elasticmachine merge upstream

@pmuellr
Copy link
Member Author

pmuellr commented Jun 10, 2024

@elasticmachine merge upstream

mikecote pushed a commit that referenced this pull request Jun 10, 2024
…only claimable task types (#185894)

## Summary

This PR updates the overdue metrics collector to filter to only
claimable task types.
I borrowed the `OneOfTaskTypes` clause from
#180485
 ```
// a task type that's not excluded (may be removed or not)
    OneOfTaskTypes('task.taskType', searchedTypes),
```


### Checklist

- [ ] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios
doakalexi added a commit that referenced this pull request Jun 12, 2024
…o only claimable task types (#185892)

## Summary

This PR updates the overdue metrics collector to filter to only
claimable task types.
I borrowed the `OneOfTaskTypes` clause from
#180485
 ```
// a task type that's not excluded (may be removed or not)
    OneOfTaskTypes('task.taskType', searchedTypes),
```


### Checklist

- [ ] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios
@pmuellr
Copy link
Member Author

pmuellr commented Jun 13, 2024

@elasticmachine merge upstream

@kibana-ci
Copy link
Collaborator

kibana-ci commented Jun 13, 2024

💛 Build succeeded, but was flaky

  • Buildkite Build
  • Commit: dcf0c2f
  • Kibana Serverless Image: docker.elastic.co/kibana-ci/kibana-serverless:pr-180485-dcf0c2f4fbfa

Failed CI Steps

Test Failures

  • [job] [logs] x-pack/test/cases_api_integration/security_and_spaces/config_trial.ts / cases security and spaces enabled: trial Common migrations migrations 7.13.2 "before all" hook in "7.13.2"

Metrics [docs]

Unknown metric groups

ESLint disabled line counts

id before after diff
taskManager 26 28 +2

Total ESLint disabled count

id before after diff
taskManager 26 28 +2

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@pmuellr pmuellr merged commit f016398 into elastic:main Jun 13, 2024
38 checks passed
@kibanamachine kibanamachine added v8.15.0 backport:skip This commit does not require backporting labels Jun 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport:skip This commit does not require backporting ci:build-cloud-image ci:build-serverless-image release_note:skip Skip the PR/issue when compiling release notes Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v8.15.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Create alternative task claimer that performs client-side updates w/ mget to prune stale docs
7 participants