
Kibana Task Manager capacity estimation doesn't observe the right number of Kibana nodes anymore #192568

Closed
mikecote opened this issue Sep 11, 2024 · 2 comments · Fixed by #194113
Assignees
Labels
bug Fixes for quality problems that affect the customer experience Feature:Task Manager Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams)

Comments

@mikecote
Contributor

Yesterday, while running a large number of alerting rules, I observed logs claiming Kibana was unhealthy when it wasn't. After looking into it, I noticed the health report never observes more than a single Kibana node at any given time. It turns out this is caused by the ownerId aggregation on the tasks index, which filters for startedAt to be within a given time range, but that field is no longer mapped. We should fix this so we don't continue to generate false reports for customers.

```
ownerIds: {
  filter: { range: { 'task.startedAt': { gte: 'now-1w/w' } } },
  aggs: { ownerIds: { cardinality: { field: 'task.ownerId' } } },
},
```
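A minimal sketch of why the unmapped field produces a false report (hypothetical function name and clamp, not the actual Task Manager code): with `task.startedAt` unmapped, the range filter matches no documents, the `ownerId` cardinality comes back as 0, and a defensive lower bound of 1 then makes any cluster look like a single node.

```typescript
// Hypothetical sketch of the observed-instances input to capacity estimation.
// When `task.startedAt` is unmapped, the range filter matches nothing and the
// cardinality aggregation over `task.ownerId` returns 0.
function estimateObservedKibanaNodes(ownerIdCardinality: number): number {
  // Clamping to at least 1 hides the failure: a cardinality of 0 becomes
  // indistinguishable from a genuine single-node deployment.
  return Math.max(1, ownerIdCardinality);
}

// With the unmapped field, even a multi-node cluster reports a single node:
console.assert(estimateObservedKibanaNodes(0) === 1); // false single-node report
console.assert(estimateObservedKibanaNodes(3) === 3); // healthy, mapped case
```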

@mikecote mikecote added bug Fixes for quality problems that affect the customer experience Feature:Task Manager Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) labels Sep 11, 2024
@elasticmachine
Contributor

Pinging @elastic/response-ops (Team:ResponseOps)

@seang-es

This is showing up at some user sites with high rule or alerting volume. It leads to false-positive "Task Manager unhealthy" errors because the total required capacity is conflated with the per-Kibana-server capacity estimate, so we see messages like "Task Manager is unhealthy, the assumedAverageRecurringRequiredThroughputPerMinutePerKibana (1833) > capacityPerMinutePerKibana (200)" even when multiple Kibana nodes cover the expected throughput and no delays are observed.
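Illustrative arithmetic for the conflation described above (the function name and formula are assumptions for the example; the field names mirror the health report): dividing the total required throughput by an observed node count that is stuck at 1 attributes the whole cluster's load to one Kibana.

```typescript
// Hypothetical sketch of the per-node throughput check in the health report.
function requiredThroughputPerKibana(
  totalRequiredPerMinute: number,
  observedKibanaInstances: number
): number {
  return totalRequiredPerMinute / observedKibanaInstances;
}

const capacityPerMinutePerKibana = 200;

// Ten real nodes comfortably cover 1833 tasks/min (183.3 per node)...
console.assert(requiredThroughputPerKibana(1833, 10) <= capacityPerMinutePerKibana);
// ...but with the observed count stuck at 1, the same load looks unhealthy:
console.assert(requiredThroughputPerKibana(1833, 1) > capacityPerMinutePerKibana);
```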

kibanamachine pushed a commit to kibanamachine/kibana that referenced this issue Oct 2, 2024
Resolves elastic#192568

In this PR, I'm solving the issue where the task manager health API is
unable to determine how many Kibana nodes are running. I'm doing so by
leveraging the Kibana discovery service to get a count, instead of
calculating it via an aggregation on the `.kibana_task_manager` index
that counts unique `ownerId` values, an approach that requires tasks to
be running and sufficiently distributed across the Kibana nodes to
produce the right number.

Note: This will only work when `mget` is the task claim strategy.
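The discovery-based counting can be sketched roughly as follows (names, shapes, and the 30-second "active" threshold are illustrative assumptions, not the actual `KibanaDiscoveryService` API): each node periodically upserts its own `{ id, lastSeen }` document, and the health report counts the nodes whose heartbeat is recent.

```typescript
// Illustrative sketch: count Kibana nodes from discovery heartbeats.
interface DiscoveredNode {
  id: string;
  lastSeen: string; // ISO timestamp written by each node's periodic upsert
}

// A node is considered active if it heartbeated within `activeMs` (assumed 30s).
function countActiveNodes(nodes: DiscoveredNode[], now: Date, activeMs = 30_000): number {
  return nodes.filter(
    (n) => now.getTime() - new Date(n.lastSeen).getTime() <= activeMs
  ).length;
}

const now = new Date('2024-10-02T12:00:30Z');
const nodes: DiscoveredNode[] = [
  { id: 'kibana-1', lastSeen: '2024-10-02T12:00:25Z' }, // seen 5s ago: active
  { id: 'kibana-2', lastSeen: '2024-10-02T12:00:10Z' }, // seen 20s ago: active
  { id: 'kibana-3', lastSeen: '2024-10-02T11:55:00Z' }, // stale: excluded
];
console.assert(countActiveNodes(nodes, now) === 2);
```

Unlike the `ownerId` cardinality approach, this count does not depend on tasks currently running or being evenly claimed across nodes.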

## To verify
1. Set `xpack.task_manager.claim_strategy: mget` in kibana.yml
2. Start up the PR branch locally with Elasticsearch and Kibana running
3. Navigate to the `/api/task_manager/_health` route and confirm
`observed_kibana_instances` is `1`
4. Apply the following code and restart Kibana
```diff
diff --git a/x-pack/plugins/task_manager/server/kibana_discovery_service/kibana_discovery_service.ts b/x-pack/plugins/task_manager/server/kibana_discovery_service/kibana_discovery_service.ts
index 090847032bf..69dfb6d1b36 100644
--- a/x-pack/plugins/task_manager/server/kibana_discovery_service/kibana_discovery_service.ts
+++ b/x-pack/plugins/task_manager/server/kibana_discovery_service/kibana_discovery_service.ts
@@ -59,6 +59,7 @@ export class KibanaDiscoveryService {
     const lastSeen = lastSeenDate.toISOString();
     try {
       await this.upsertCurrentNode({ id: this.currentNode, lastSeen });
+      await this.upsertCurrentNode({ id: `${this.currentNode}-2`, lastSeen });
       if (!this.started) {
         this.logger.info('Kibana Discovery Service has been started');
         this.started = true;
```
5. Navigate to the `/api/task_manager/_health` route and confirm
`observed_kibana_instances` is `2`

---------

Co-authored-by: Elastic Machine <[email protected]>
(cherry picked from commit d0d2032)