
Kibana Task Manager capacity estimation doesn't observe the right number of Kibana nodes anymore #192568

Closed
mikecote opened this issue Sep 11, 2024 · 2 comments · Fixed by #194113
Assignees
Labels
bug Fixes for quality problems that affect the customer experience Feature:Task Manager Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams)

Comments

@mikecote
Contributor

Yesterday, while running a large number of alerting rules, I observed logs claiming Kibana was unhealthy when it wasn't. After looking into it, I noticed the health report never observes more than a single Kibana node at any given time. It turns out this is caused by the ownerId aggregation on the tasks index, which filters for startedAt to be within a given time range, but that field is no longer mapped. We should fix this so we don't continue to generate false reports for customers.

```
ownerIds: {
  filter: { range: { 'task.startedAt': { gte: 'now-1w/w' } } },
  aggs: { ownerIds: { cardinality: { field: 'task.ownerId' } } },
},
```
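A minimal sketch of why the unmapped field produces a false report (hypothetical function name and clamp, not the actual Task Manager code): with `task.startedAt` unmapped, the range filter matches no documents, the `ownerId` cardinality comes back as 0, and a defensive lower bound of 1 then makes any cluster look like a single node.

```typescript
// Hypothetical sketch of the observed-instances input to capacity estimation.
// When `task.startedAt` is unmapped, the range filter matches nothing and the
// cardinality aggregation over `task.ownerId` returns 0.
function estimateObservedKibanaNodes(ownerIdCardinality: number): number {
  // Clamping to at least 1 hides the failure: a cardinality of 0 becomes
  // indistinguishable from a genuine single-node deployment.
  return Math.max(1, ownerIdCardinality);
}

// With the unmapped field, even a multi-node cluster reports a single node:
console.assert(estimateObservedKibanaNodes(0) === 1); // false single-node report
console.assert(estimateObservedKibanaNodes(3) === 3); // healthy, mapped case
```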

@mikecote mikecote added bug Fixes for quality problems that affect the customer experience Feature:Task Manager Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) labels Sep 11, 2024
@elasticmachine
Contributor

Pinging @elastic/response-ops (Team:ResponseOps)

@seang-es

This is showing up at some user sites with high rule or alerting volume. It leads to false-positive "Task Manager unhealthy" errors because the total required capacity is conflated with the per-Kibana-server capacity estimate, so we see messages like "Task Manager is unhealthy, the assumedAverageRecurringRequiredThroughputPerMinutePerKibana (1833) > capacityPerMinutePerKibana (200)" even when multiple Kibana nodes cover the expected throughput and no delays are observed.
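Illustrative arithmetic for the conflation described above (the function name and formula are assumptions for the example; the field names mirror the health report): dividing the total required throughput by an observed node count that is stuck at 1 attributes the whole cluster's load to one Kibana.

```typescript
// Hypothetical sketch of the per-node throughput check in the health report.
function requiredThroughputPerKibana(
  totalRequiredPerMinute: number,
  observedKibanaInstances: number
): number {
  return totalRequiredPerMinute / observedKibanaInstances;
}

const capacityPerMinutePerKibana = 200;

// Ten real nodes comfortably cover 1833 tasks/min (183.3 per node)...
console.assert(requiredThroughputPerKibana(1833, 10) <= capacityPerMinutePerKibana);
// ...but with the observed count stuck at 1, the same load looks unhealthy:
console.assert(requiredThroughputPerKibana(1833, 1) > capacityPerMinutePerKibana);
```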

kibanamachine pushed a commit to kibanamachine/kibana that referenced this issue Oct 2, 2024
Resolves elastic#192568

In this PR, I'm solving the issue where the task manager health API is
unable to determine how many Kibana nodes are running. I'm doing so by
leveraging the Kibana discovery service to get a count, instead of
calculating it via an aggregation on the `.kibana_task_manager` index
that counts unique `ownerId` values, an approach that requires tasks to
be running and sufficiently distributed across the Kibana nodes to
produce the right number.

Note: This will only work when `mget` is the task claim strategy.
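The discovery-based counting can be sketched roughly as follows (names, shapes, and the 30-second "active" threshold are illustrative assumptions, not the actual `KibanaDiscoveryService` API): each node periodically upserts its own `{ id, lastSeen }` document, and the health report counts the nodes whose heartbeat is recent.

```typescript
// Illustrative sketch: count Kibana nodes from discovery heartbeats.
interface DiscoveredNode {
  id: string;
  lastSeen: string; // ISO timestamp written by each node's periodic upsert
}

// A node is considered active if it heartbeated within `activeMs` (assumed 30s).
function countActiveNodes(nodes: DiscoveredNode[], now: Date, activeMs = 30_000): number {
  return nodes.filter(
    (n) => now.getTime() - new Date(n.lastSeen).getTime() <= activeMs
  ).length;
}

const now = new Date('2024-10-02T12:00:30Z');
const nodes: DiscoveredNode[] = [
  { id: 'kibana-1', lastSeen: '2024-10-02T12:00:25Z' }, // seen 5s ago: active
  { id: 'kibana-2', lastSeen: '2024-10-02T12:00:10Z' }, // seen 20s ago: active
  { id: 'kibana-3', lastSeen: '2024-10-02T11:55:00Z' }, // stale: excluded
];
console.assert(countActiveNodes(nodes, now) === 2);
```

Unlike the `ownerId` cardinality approach, this count does not depend on tasks currently running or being evenly claimed across nodes.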

## To verify
1. Set `xpack.task_manager.claim_strategy: mget` in kibana.yml
2. Start up the PR branch locally with Elasticsearch and Kibana running
3. Navigate to the `/api/task_manager/_health` route and confirm
`observed_kibana_instances` is `1`
4. Apply the following code and restart Kibana
```diff
diff --git a/x-pack/plugins/task_manager/server/kibana_discovery_service/kibana_discovery_service.ts b/x-pack/plugins/task_manager/server/kibana_discovery_service/kibana_discovery_service.ts
index 090847032bf..69dfb6d1b36 100644
--- a/x-pack/plugins/task_manager/server/kibana_discovery_service/kibana_discovery_service.ts
+++ b/x-pack/plugins/task_manager/server/kibana_discovery_service/kibana_discovery_service.ts
@@ -59,6 +59,7 @@ export class KibanaDiscoveryService {
     const lastSeen = lastSeenDate.toISOString();
     try {
       await this.upsertCurrentNode({ id: this.currentNode, lastSeen });
+      await this.upsertCurrentNode({ id: `${this.currentNode}-2`, lastSeen });
       if (!this.started) {
         this.logger.info('Kibana Discovery Service has been started');
         this.started = true;
```
5. Navigate to the `/api/task_manager/_health` route and confirm
`observed_kibana_instances` is `2`

---------

Co-authored-by: Elastic Machine <[email protected]>
(cherry picked from commit d0d2032)