[Task Manager][Health] Warn Runtime Status for high Drift #166006

Open
stefnestor opened this issue Sep 7, 2023 · 2 comments
Labels
bug (fixes for quality problems that affect the customer experience), enhancement (new value added to drive a business result), Feature:Task Manager, Team:ResponseOps (label for the ResponseOps team, formerly the Cases and Alerting teams)

Comments

@stefnestor
Contributor

Summary

👋🏼 howdy, team!

I've noticed across a couple of clusters that Kibana can end up in a degraded status because of capacity_estimation, when the underlying cause is really high runtime > drift, usually in the drift_by_type entries for alerting:* (aka Expensive Rules).

The bug (as I really feel it is, though it could instead be labelled a feature request) is that even when drift p50 is backed up by 3 minutes, usually with load.p50: 100, runtime still reports status: OK. Can we add some logic to flip this to warn/error at some point?

Example

I've dealt with this situation with a couple of users. The most egregious cases were air-gapped, so I can't share those examples. However, here is a low-to-medium example output in full:

[A]

I wrote an automation to root-cause the problematic plugin; it reports:

# undocumented, internal API ( https://github.com/elastic/support-diagnostics/blob/main/src/main/resources/kibana-rest.yml ) 
> GET KIBANA_URL/status # ui
> GET kbn:/api/status # api

# overall
$ cat kibana_status.json | jq '{ overall: .status.overall.level }'
{
    overall: "degraded"
}

# core
$ cat kibana_status.json | jq -r '.status.core|{ elasticsearch: .elasticsearch.level, savedObjects: .savedObjects.level }'
{
    elasticsearch: "available",
    savedObjects:  "available"
}

# main cascading plugins
$ cat kibana_status.json | jq -r '.status.plugins|{ taskManager:.taskManager.level, savedObject:.savedObjects.level, security:.security.level, ruleRegistry:.ruleRegistry.level, reporting:.reporting.level }'
{
    taskManager:  "degraded",
    savedObject:  "degraded",
    security:     "degraded",
    ruleRegistry: "degraded",
    reporting:    "degraded"
}

Root plugins: ['taskManager']

$ cat kibana_status.json | jq -r '.status.plugins.taskManager'
{
    "level": "degraded",
    "summary": "Task Manager is unhealthy"
}
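For readers who would rather script this step than chain jq calls, here is a minimal Python sketch that lists every plugin whose level is not "available". It is not the automation mentioned above; the function name and the trimmed sample payload are assumptions, and only the .status.plugins.<name>.level path is taken from the jq queries. It does not try to work out which plugin is the root, since that needs dependency information not shown in this issue.

```python
def non_available_plugins(status_doc):
    """Return {plugin_name: level} for every plugin whose level is not "available"."""
    plugins = status_doc.get("status", {}).get("plugins", {})
    return {
        name: info.get("level")
        for name, info in sorted(plugins.items())
        if info.get("level") != "available"
    }


# Hypothetical, trimmed-down payload shaped like the jq paths used above.
sample = {
    "status": {
        "overall": {"level": "degraded"},
        "plugins": {
            "taskManager": {"level": "degraded", "summary": "Task Manager is unhealthy"},
            "reporting": {"level": "degraded"},
            "data": {"level": "available"},
        },
    }
}

for name, level in non_available_plugins(sample).items():
    print(f"{name}: {level}")
```

Against a real capture, the sample dict would be replaced by the parsed contents of kibana_status.json.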

Task Manager: Health, Troubleshooting

# https://www.elastic.co/guide/en/kibana/current/task-manager-api-health.html
> GET kbn:/api/task_manager/_health
$ cat kibana_task_manager_health.json | jq -r '{ overall:.status }'
{
    overall: "warn"
}

$ cat kibana_task_manager_health.json | jq -r '{ capacity:.stats.capacity_estimation.status, config:.stats.configuration.status, runtime:.stats.runtime.status, workload:.stats.workload.status }'
{
    capacity:  "warn",
    config:    "OK",
    runtime:   "OK",
    workload:  "OK"
}

Troubleshoot specific areas: capacity, config, runtime, workload

...

My report automation goes on, but to pivot towards what applies to this GitHub issue: the doc section Evaluate the Runtime says, for example:

Theory: Kibana is polling as frequently as it should, but that isn’t often enough to keep up with the workload
...
For details on achieving higher throughput by adjusting your scaling strategy, see Scaling guidance.

Compared with that doc section, the load in our example(s) is actually p50: 100 with drift of more than 1 minute, as shown below. (In a recent air-gapped example, not shown here, drift was more than 3 minutes.)

$ cat kibana_task_manager_health.json | jq '.stats.runtime.value|{drift, load}'
{
  "drift": {
    "p50": 74544,
    "p90": 80640,
    "p95": 80640,
    "p99": 80640
  },
  "load": {
    "p50": 100,
    "p90": 100,
    "p95": 100,
    "p99": 100
  }
}

So overall, it makes sense that this drift+load cascades into capacity_estimation messages, since that's where the docs point. However, for API response interpretation/usability and for diagnostic automations, it doesn't make sense that runtime was not flagged as status: warn or something more severe, since the root cause was something inside runtime that cascaded into capacity_estimation.
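Until runtime flags this itself, a diagnostic script can at least surface which task types carry the drift. The following is a rough Python sketch under stated assumptions: the function name is hypothetical, the sample dict holds three p50 values copied from the full health output in this issue, and drift is treated as milliseconds (which matches reading 74544 as ">1min").

```python
def top_drift_by_type(health_doc, limit=5):
    """Return (task_type, p50_seconds) pairs, highest p50 drift first."""
    by_type = health_doc["stats"]["runtime"]["value"]["drift_by_type"]
    ranked = sorted(by_type.items(), key=lambda kv: kv[1]["p50"], reverse=True)
    # Drift values are assumed to be milliseconds; convert to seconds for reading.
    return [(task_type, round(p["p50"] / 1000, 1)) for task_type, p in ranked[:limit]]


# Three entries copied from the health output in this issue; the rest is omitted.
sample = {
    "stats": {
        "runtime": {
            "value": {
                "drift_by_type": {
                    "actions:.webhook": {"p50": 74544},
                    "alerting:siem.queryRule": {"p50": 56424},
                    "session_cleanup": {"p50": 372.5},
                }
            }
        }
    }
}

for task_type, seconds in top_drift_by_type(sample, limit=2):
    print(f"{task_type}: {seconds}s")
```

Sorting on p50 is a design choice for this sketch; sorting on p90 or p99 instead would highlight task types that are only occasionally delayed.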

Request

I don't know the right literal values, but some logic like:

  • IF runtime.drift.p50 > 60000 THEN runtime.status: warn
  • IF runtime.load.p50 == 100 THEN runtime.status: error
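The two rules above can be sketched in Python as follows. This only illustrates the request and is not how Kibana computes runtime.status today: the function name and constants are hypothetical, the thresholds are the placeholder values from the bullets, and letting error take precedence over warn is an assumption.

```python
# Placeholder thresholds taken from the request above; they are not Kibana settings.
DRIFT_P50_WARN_MS = 60_000
LOAD_P50_ERROR_PCT = 100


def proposed_runtime_status(runtime_value):
    """Map the runtime 'value' object to the status proposed in this issue."""
    drift_p50 = runtime_value["drift"]["p50"]
    load_p50 = runtime_value["load"]["p50"]
    # Assumption: error wins over warn when both rules match.
    # Load is a percentage, so >= 100 is equivalent to the "== 100" rule.
    if load_p50 >= LOAD_P50_ERROR_PCT:
        return "error"
    if drift_p50 > DRIFT_P50_WARN_MS:
        return "warn"
    return "OK"


# Values copied from the runtime output shown earlier in this issue.
example = {
    "drift": {"p50": 74544, "p90": 80640, "p95": 80640, "p99": 80640},
    "load": {"p50": 100, "p90": 100, "p95": 100, "p99": 100},
}

print(proposed_runtime_status(example))
```

With the example from this issue the sketch reports error rather than OK, because load.p50 is 100; with the same drift and a lower load it would report warn.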

🙏🏼

@stefnestor added the bug, enhancement, and Feature:Task Manager labels on Sep 7, 2023
@botelastic (bot) added the needs-team label on Sep 7, 2023
@stefnestor
Contributor Author

Task Manager Health status:warn:

[A]
{
    "id": "3fd07b39-edb5-46a8-8875-1b55c9e1a32d",
    "last_update": "2023-06-15T20:51:11.047Z",
    "stats":
    {
        "capacity_estimation":
        {
            "status": "warn",
            "timestamp": "2023-06-15T20:51:11.330Z",
            "value":
            {
                "observed":
                {
                    "avg_recurring_required_throughput_per_minute": 129,
                    "avg_recurring_required_throughput_per_minute_per_kibana": 129,
                    "avg_required_throughput_per_minute": 329,
                    "avg_required_throughput_per_minute_per_kibana": 329,
                    "max_throughput_per_minute": 200,
                    "max_throughput_per_minute_per_kibana": 200,
                    "minutes_to_drain_overdue": 291,
                    "observed_kibana_instances": 1
                },
                "proposed":
                {
                    "avg_recurring_required_throughput_per_minute_per_kibana": 65,
                    "avg_required_throughput_per_minute_per_kibana": 165,
                    "min_required_kibana": 1,
                    "provisioned_kibana": 2
                }
            }
        },
        "configuration":
        {
            "status": "OK",
            "timestamp": "2023-06-07T00:09:58.697Z",
            "value":
            {
                "max_poll_inactivity_cycles": 10,
                "max_workers": 10,
                "monitored_aggregated_stats_refresh_rate": 60000,
                "monitored_stats_running_average_window": 50,
                "monitored_task_execution_thresholds":
                {
                    "custom":
                    {},
                    "default":
                    {
                        "error_threshold": 90,
                        "warn_threshold": 80
                    }
                },
                "poll_interval": 3000,
                "request_capacity": 1000
            }
        },
        "runtime":
        {
            "status": "OK",
            "timestamp": "2023-06-15T20:51:11.046Z",
            "value":
            {
                "drift":
                {
                    "p50": 74544,
                    "p90": 80640,
                    "p95": 80640,
                    "p99": 80640
                },
                "drift_by_type":
                {
                    "Fleet-Usage-Logger":
                    {
                        "p50": 165,
                        "p90": 74228,
                        "p95": 106823,
                        "p99": 108317
                    },
                    "Fleet-Usage-Sender":
                    {
                        "p50": 383,
                        "p90": 46934,
                        "p95": 62560,
                        "p99": 66442
                    },
                    "ML:saved-objects-sync":
                    {
                        "p50": 366.5,
                        "p90": 1963.5,
                        "p95": 52798,
                        "p99": 74200
                    },
                    "actions:.server-log":
                    {
                        "p50": 2281,
                        "p90": 22912.4,
                        "p95": 23499,
                        "p99": 23499
                    },
                    "actions:.webhook":
                    {
                        "p50": 74544,
                        "p90": 80640,
                        "p95": 80640,
                        "p99": 80640
                    },
                    "actions_telemetry":
                    {
                        "p50": 868.5,
                        "p90": 50222.10000000001,
                        "p95": 70689,
                        "p99": 70689
                    },
                    "alerting:.es-query":
                    {
                        "p50": 53568,
                        "p90": 63144.5,
                        "p95": 65387,
                        "p99": 66127
                    },
                    "alerting:logs.alert.document.count":
                    {
                        "p50": 52146,
                        "p90": 62983.5,
                        "p95": 63115,
                        "p99": 66035
                    },
                    "alerting:metrics.alert.inventory.threshold":
                    {
                        "p50": 51205.5,
                        "p90": 63087.5,
                        "p95": 63294,
                        "p99": 66127
                    },
                    "alerting:monitoring_alert_cluster_health":
                    {
                        "p50": 53449.5,
                        "p90": 64685.5,
                        "p95": 66029,
                        "p99": 68195
                    },
                    "alerting:monitoring_alert_cpu_usage":
                    {
                        "p50": 53450,
                        "p90": 64135.5,
                        "p95": 66008,
                        "p99": 66042
                    },
                    "alerting:monitoring_alert_disk_usage":
                    {
                        "p50": 51926,
                        "p90": 63807.5,
                        "p95": 66009,
                        "p99": 68195
                    },
                    "alerting:monitoring_alert_elasticsearch_version_mismatch":
                    {
                        "p50": 51785,
                        "p90": 63015.5,
                        "p95": 63294,
                        "p99": 66036
                    },
                    "alerting:monitoring_alert_jvm_memory_usage":
                    {
                        "p50": 53044.5,
                        "p90": 63109,
                        "p95": 65401,
                        "p99": 66028
                    },
                    "alerting:monitoring_alert_kibana_version_mismatch":
                    {
                        "p50": 53744,
                        "p90": 63478,
                        "p95": 65946,
                        "p99": 68196
                    },
                    "alerting:monitoring_alert_license_expiration":
                    {
                        "p50": 51230,
                        "p90": 63092,
                        "p95": 63294,
                        "p99": 66102
                    },
                    "alerting:monitoring_alert_logstash_version_mismatch":
                    {
                        "p50": 53560,
                        "p90": 63186.5,
                        "p95": 65946,
                        "p99": 67528
                    },
                    "alerting:monitoring_alert_missing_monitoring_data":
                    {
                        "p50": 53591,
                        "p90": 63190.5,
                        "p95": 63362,
                        "p99": 65645
                    },
                    "alerting:monitoring_alert_nodes_changed":
                    {
                        "p50": 53864.5,
                        "p90": 64535,
                        "p95": 66028,
                        "p99": 66036
                    },
                    "alerting:monitoring_alert_thread_pool_search_rejections":
                    {
                        "p50": 53641,
                        "p90": 63117.5,
                        "p95": 66029,
                        "p99": 68520
                    },
                    "alerting:monitoring_alert_thread_pool_write_rejections":
                    {
                        "p50": 52097.5,
                        "p90": 63092,
                        "p95": 65645,
                        "p99": 66029
                    },
                    "alerting:monitoring_ccr_read_exceptions":
                    {
                        "p50": 53164,
                        "p90": 63153.5,
                        "p95": 64429,
                        "p99": 68334
                    },
                    "alerting:monitoring_shard_size":
                    {
                        "p50": 53646,
                        "p90": 64965.5,
                        "p95": 65387,
                        "p99": 66126
                    },
                    "alerting:siem.eqlRule":
                    {
                        "p50": 56349,
                        "p90": 56757,
                        "p95": 57101,
                        "p99": 57118
                    },
                    "alerting:siem.mlRule":
                    {
                        "p50": 35189,
                        "p90": 105096,
                        "p95": 105096,
                        "p99": 108616
                    },
                    "alerting:siem.newTermsRule":
                    {
                        "p50": 42077.5,
                        "p90": 47693.5,
                        "p95": 48268,
                        "p99": 56368
                    },
                    "alerting:siem.queryRule":
                    {
                        "p50": 56424,
                        "p90": 56586,
                        "p95": 56587,
                        "p99": 107950
                    },
                    "alerting:siem.thresholdRule":
                    {
                        "p50": 42399.5,
                        "p90": 56292,
                        "p95": 56587,
                        "p99": 64218
                    },
                    "alerting:xpack.uptime.alerts.monitorStatus":
                    {
                        "p50": 55484.5,
                        "p90": 66126,
                        "p95": 66157,
                        "p99": 68507
                    },
                    "alerting_health_check":
                    {
                        "p50": 375,
                        "p90": 25610,
                        "p95": 56234,
                        "p99": 72323
                    },
                    "alerting_telemetry":
                    {
                        "p50": 868,
                        "p90": 50222.500000000015,
                        "p95": 70690,
                        "p99": 70690
                    },
                    "alerts_invalidate_api_keys":
                    {
                        "p50": 42106.5,
                        "p90": 46104.5,
                        "p95": 48026,
                        "p99": 56335
                    },
                    "apm-source-map-migration-task":
                    {
                        "p50": 17770,
                        "p90": 17770,
                        "p95": 17770,
                        "p99": 17770
                    },
                    "apm-telemetry-task":
                    {
                        "p50": 2239.5,
                        "p90": 23865,
                        "p95": 23865,
                        "p99": 23865
                    },
                    "cases-telemetry-task":
                    {
                        "p50": 32153,
                        "p90": 32153,
                        "p95": 32153,
                        "p99": 32153
                    },
                    "cleanup_failed_action_executions":
                    {
                        "p50": 374.5,
                        "p90": 1693.5,
                        "p95": 56390,
                        "p99": 62546
                    },
                    "dashboard_telemetry":
                    {
                        "p50": 868,
                        "p90": 50221.80000000001,
                        "p95": 70689,
                        "p99": 70689
                    },
                    "endpoint:metadata-check-transforms-task":
                    {
                        "p50": 681.5,
                        "p90": 1860,
                        "p95": 23852,
                        "p99": 76208
                    },
                    "endpoint:user-artifact-packager":
                    {
                        "p50": 51402.5,
                        "p90": 63686.5,
                        "p95": 65354,
                        "p99": 67615
                    },
                    "fleet:check-deleted-files-task":
                    {
                        "p50": 4262,
                        "p90": 42038,
                        "p95": 42038,
                        "p99": 42038
                    },
                    "osquery:telemetry-configs":
                    {
                        "p50": 1489,
                        "p90": 2750,
                        "p95": 2750,
                        "p99": 2750
                    },
                    "osquery:telemetry-packs":
                    {
                        "p50": 655,
                        "p90": 2745,
                        "p95": 2745,
                        "p99": 2745
                    },
                    "osquery:telemetry-saved-queries":
                    {
                        "p50": 656,
                        "p90": 1758.3999999999999,
                        "p95": 1832,
                        "p99": 1832
                    },
                    "reports:monitor":
                    {
                        "p50": 26993,
                        "p90": 90171,
                        "p95": 92536,
                        "p99": 92879
                    },
                    "security:endpoint-diagnostics":
                    {
                        "p50": 40907,
                        "p90": 47764.5,
                        "p95": 48099,
                        "p99": 56335
                    },
                    "security:endpoint-meta-telemetry":
                    {
                        "p50": 652,
                        "p90": 2729,
                        "p95": 2729,
                        "p99": 2729
                    },
                    "security:telemetry-configuration":
                    {
                        "p50": 378.5,
                        "p90": 1462.5,
                        "p95": 41152,
                        "p99": 60387
                    },
                    "security:telemetry-detection-rules":
                    {
                        "p50": 653,
                        "p90": 2730,
                        "p95": 2730,
                        "p99": 2730
                    },
                    "security:telemetry-filterlist-artifact":
                    {
                        "p50": 301.5,
                        "p90": 34767,
                        "p95": 39446,
                        "p99": 44473
                    },
                    "security:telemetry-lists":
                    {
                        "p50": 5725,
                        "p90": 42038,
                        "p95": 42038,
                        "p99": 42038
                    },
                    "security:telemetry-prebuilt-rule-alerts":
                    {
                        "p50": 377.5,
                        "p90": 2548,
                        "p95": 56390,
                        "p99": 62546
                    },
                    "security:telemetry-timelines":
                    {
                        "p50": 999,
                        "p90": 1570.7999999999997,
                        "p95": 28771.899999999838,
                        "p99": 78584
                    },
                    "session_cleanup":
                    {
                        "p50": 372.5,
                        "p90": 22218,
                        "p95": 56235,
                        "p99": 68049
                    }
                },
                "execution":
                {
                    "duration":
                    {
                        "Fleet-Usage-Logger":
                        {
                            "p50": 78,
                            "p90": 163,
                            "p95": 172,
                            "p99": 186
                        },
                        "Fleet-Usage-Sender":
                        {
                            "p50": 240,
                            "p90": 748.5,
                            "p95": 1091,
                            "p99": 1300
                        },
                        "ML:saved-objects-sync":
                        {
                            "p50": 31.5,
                            "p90": 70.5,
                            "p95": 80,
                            "p99": 108
                        },
                        "actions:.server-log":
                        {
                            "p50": 113,
                            "p90": 180.8,
                            "p95": 182,
                            "p99": 182
                        },
                        "actions:.webhook":
                        {
                            "p50": 217,
                            "p90": 248,
                            "p95": 250,
                            "p99": 1222
                        },
                        "actions_telemetry":
                        {
                            "p50": 2267,
                            "p90": 5196.6,
                            "p95": 5199,
                            "p99": 5199
                        },
                        "alerting:.es-query":
                        {
                            "p50": 712,
                            "p90": 1564,
                            "p95": 1598,
                            "p99": 1812
                        },
                        "alerting:logs.alert.document.count":
                        {
                            "p50": 2022.5,
                            "p90": 2781,
                            "p95": 2918,
                            "p99": 3464
                        },
                        "alerting:metrics.alert.inventory.threshold":
                        {
                            "p50": 2186,
                            "p90": 3115,
                            "p95": 3499,
                            "p99": 3680
                        },
                        "alerting:monitoring_alert_cluster_health":
                        {
                            "p50": 397,
                            "p90": 523,
                            "p95": 644,
                            "p99": 854
                        },
                        "alerting:monitoring_alert_cpu_usage":
                        {
                            "p50": 387.5,
                            "p90": 569.5,
                            "p95": 599,
                            "p99": 621
                        },
                        "alerting:monitoring_alert_disk_usage":
                        {
                            "p50": 380.5,
                            "p90": 588.5,
                            "p95": 627,
                            "p99": 675
                        },
                        "alerting:monitoring_alert_elasticsearch_version_mismatch":
                        {
                            "p50": 417.5,
                            "p90": 980,
                            "p95": 1258,
                            "p99": 1345
                        },
                        "alerting:monitoring_alert_jvm_memory_usage":
                        {
                            "p50": 384,
                            "p90": 512,
                            "p95": 593,
                            "p99": 1353
                        },
                        "alerting:monitoring_alert_kibana_version_mismatch":
                        {
                            "p50": 790.5,
                            "p90": 1358,
                            "p95": 1383,
                            "p99": 1529
                        },
                        "alerting:monitoring_alert_license_expiration":
                        {
                            "p50": 425.5,
                            "p90": 987,
                            "p95": 1188,
                            "p99": 1269
                        },
                        "alerting:monitoring_alert_logstash_version_mismatch":
                        {
                            "p50": 938,
                            "p90": 1478,
                            "p95": 1850,
                            "p99": 2010
                        },
                        "alerting:monitoring_alert_missing_monitoring_data":
                        {
                            "p50": 456,
                            "p90": 681,
                            "p95": 905,
                            "p99": 1369
                        },
                        "alerting:monitoring_alert_nodes_changed":
                        {
                            "p50": 405,
                            "p90": 486.5,
                            "p95": 606,
                            "p99": 859
                        },
                        "alerting:monitoring_alert_thread_pool_search_rejections":
                        {
                            "p50": 398.5,
                            "p90": 1129,
                            "p95": 1334,
                            "p99": 1851
                        },
                        "alerting:monitoring_alert_thread_pool_write_rejections":
                        {
                            "p50": 399.5,
                            "p90": 598,
                            "p95": 1263,
                            "p99": 1918
                        },
                        "alerting:monitoring_ccr_read_exceptions":
                        {
                            "p50": 391,
                            "p90": 623,
                            "p95": 855,
                            "p99": 1286
                        },
                        "alerting:monitoring_shard_size":
                        {
                            "p50": 583,
                            "p90": 1434,
                            "p95": 1458,
                            "p99": 1847
                        },
                        "alerting:siem.eqlRule":
                        {
                            "p50": 729.5,
                            "p90": 864.5,
                            "p95": 882,
                            "p99": 2668
                        },
                        "alerting:siem.mlRule":
                        {
                            "p50": 525,
                            "p90": 759.5,
                            "p95": 772,
                            "p99": 776
                        },
                        "alerting:siem.newTermsRule":
                        {
                            "p50": 513,
                            "p90": 1294.5,
                            "p95": 1432,
                            "p99": 2526
                        },
                        "alerting:siem.queryRule":
                        {
                            "p50": 541.5,
                            "p90": 1876,
                            "p95": 2367,
                            "p99": 2372
                        },
                        "alerting:siem.thresholdRule":
                        {
                            "p50": 708.5,
                            "p90": 5687.5,
                            "p95": 5737,
                            "p99": 5948
                        },
                        "alerting:xpack.uptime.alerts.monitorStatus":
                        {
                            "p50": 1002,
                            "p90": 4170.5,
                            "p95": 4253,
                            "p99": 4434
                        },
                        "alerting_health_check":
                        {
                            "p50": 38,
                            "p90": 93.5,
                            "p95": 102,
                            "p99": 125
                        },
                        "alerting_telemetry":
                        {
                            "p50": 6376.5,
                            "p90": 8378.1,
                            "p95": 8670,
                            "p99": 8670
                        },
                        "alerts_invalidate_api_keys":
                        {
                            "p50": 19.5,
                            "p90": 31,
                            "p95": 44,
                            "p99": 2117
                        },
                        "apm-source-map-migration-task":
                        {
                            "p50": 26,
                            "p90": 26,
                            "p95": 26,
                            "p99": 26
                        },
                        "apm-telemetry-task":
                        {
                            "p50": 1402,
                            "p90": 1525,
                            "p95": 1525,
                            "p99": 1525
                        },
                        "cases-telemetry-task":
                        {
                            "p50": 973,
                            "p90": 973,
                            "p95": 973,
                            "p99": 973
                        },
                        "cleanup_failed_action_executions":
                        {
                            "p50": 10,
                            "p90": 19,
                            "p95": 22,
                            "p99": 24
                        },
                        "dashboard_telemetry":
                        {
                            "p50": 215.5,
                            "p90": 365.2,
                            "p95": 379,
                            "p99": 379
                        },
                        "endpoint:metadata-check-transforms-task":
                        {
                            "p50": 37.5,
                            "p90": 87.5,
                            "p95": 106,
                            "p99": 120
                        },
                        "endpoint:user-artifact-packager":
                        {
                            "p50": 16.5,
                            "p90": 23,
                            "p95": 23,
                            "p99": 63
                        },
                        "fleet:check-deleted-files-task":
                        {
                            "p50": 15,
                            "p90": 16,
                            "p95": 16,
                            "p99": 16
                        },
                        "osquery:telemetry-configs":
                        {
                            "p50": 12,
                            "p90": 14,
                            "p95": 14,
                            "p99": 14
                        },
                        "osquery:telemetry-packs":
                        {
                            "p50": 6,
                            "p90": 10,
                            "p95": 10,
                            "p99": 10
                        },
                        "osquery:telemetry-saved-queries":
                        {
                            "p50": 7,
                            "p90": 13.6,
                            "p95": 14,
                            "p99": 14
                        },
                        "reports:monitor":
                        {
                            "p50": 19,
                            "p90": 27.5,
                            "p95": 36,
                            "p99": 1392
                        },
                        "security:endpoint-diagnostics":
                        {
                            "p50": 10,
                            "p90": 13,
                            "p95": 14,
                            "p99": 17
                        },
                        "security:endpoint-meta-telemetry":
                        {
                            "p50": 3,
                            "p90": 4,
                            "p95": 4,
                            "p99": 4
                        },
                        "security:telemetry-configuration":
                        {
                            "p50": 3,
                            "p90": 8,
                            "p95": 10,
                            "p99": 14
                        },
                        "security:telemetry-detection-rules":
                        {
                            "p50": 2,
                            "p90": 3,
                            "p95": 3,
                            "p99": 3
                        },
                        "security:telemetry-filterlist-artifact":
                        {
                            "p50": 4,
                            "p90": 9,
                            "p95": 10,
                            "p99": 16
                        },
                        "security:telemetry-lists":
                        {
                            "p50": 8,
                            "p90": 9,
                            "p95": 9,
                            "p99": 9
                        },
                        "security:telemetry-prebuilt-rule-alerts":
                        {
                            "p50": 3,
                            "p90": 9,
                            "p95": 11,
                            "p99": 14
                        },
                        "security:telemetry-timelines":
                        {
                            "p50": 3,
                            "p90": 7.9999999999999964,
                            "p95": 12.699999999999996,
                            "p99": 14
                        },
                        "session_cleanup":
                        {
                            "p50": 13.5,
                            "p90": 40,
                            "p95": 56,
                            "p99": 71
                        }
                    },
                    "duration_by_persistence":
                    {
                        "non_recurring":
                        {
                            "p50": 217,
                            "p90": 248,
                            "p95": 250,
                            "p99": 1222
                        },
                        "recurring":
                        {
                            "p50": 696.5,
                            "p90": 1118,
                            "p95": 1519,
                            "p99": 3643
                        }
                    },
                    "persistence":
                    {
                        "ephemeral": 0,
                        "non_recurring": 100,
                        "recurring": 0
                    },
                    "result_frequency_percent_as_number":
                    {
                        "Fleet-Usage-Logger":
                        {
                            "Failed": 0,
                            "RetryScheduled": 0,
                            "Success": 100,
                            "status": "OK"
                        },
                        "Fleet-Usage-Sender":
                        {
                            "Failed": 0,
                            "RetryScheduled": 0,
                            "Success": 100,
                            "status": "OK"
                        },
                        "ML:saved-objects-sync":
                        {
                            "Failed": 0,
                            "RetryScheduled": 0,
                            "Success": 100,
                            "status": "OK"
                        },
                        "actions:.server-log":
                        {
                            "Failed": 0,
                            "RetryScheduled": 0,
                            "Success": 100,
                            "status": "OK"
                        },
                        "actions:.webhook":
                        {
                            "Failed": 0,
                            "RetryScheduled": 0,
                            "Success": 100,
                            "status": "OK"
                        },
                        "actions_telemetry":
                        {
                            "Failed": 0,
                            "RetryScheduled": 0,
                            "Success": 100,
                            "status": "OK"
                        },
                        "alerting:.es-query":
                        {
                            "Failed": 0,
                            "RetryScheduled": 0,
                            "Success": 100,
                            "status": "OK"
                        },
                        "alerting:logs.alert.document.count":
                        {
                            "Failed": 0,
                            "RetryScheduled": 0,
                            "Success": 100,
                            "status": "OK"
                        },
                        "alerting:metrics.alert.inventory.threshold":
                        {
                            "Failed": 0,
                            "RetryScheduled": 0,
                            "Success": 100,
                            "status": "OK"
                        },
                        "alerting:monitoring_alert_cluster_health":
                        {
                            "Failed": 0,
                            "RetryScheduled": 0,
                            "Success": 100,
                            "status": "OK"
                        },
                        "alerting:monitoring_alert_cpu_usage":
                        {
                            "Failed": 0,
                            "RetryScheduled": 0,
                            "Success": 100,
                            "status": "OK"
                        },
                        "alerting:monitoring_alert_disk_usage":
                        {
                            "Failed": 0,
                            "RetryScheduled": 0,
                            "Success": 100,
                            "status": "OK"
                        },
                        "alerting:monitoring_alert_elasticsearch_version_mismatch":
                        {
                            "Failed": 0,
                            "RetryScheduled": 0,
                            "Success": 100,
                            "status": "OK"
                        },
                        "alerting:monitoring_alert_jvm_memory_usage":
                        {
                            "Failed": 0,
                            "RetryScheduled": 0,
                            "Success": 100,
                            "status": "OK"
                        },
                        "alerting:monitoring_alert_kibana_version_mismatch":
                        {
                            "Failed": 0,
                            "RetryScheduled": 0,
                            "Success": 100,
                            "status": "OK"
                        },
                        "alerting:monitoring_alert_license_expiration":
                        {
                            "Failed": 0,
                            "RetryScheduled": 0,
                            "Success": 100,
                            "status": "OK"
                        },
                        "alerting:monitoring_alert_logstash_version_mismatch":
                        {
                            "Failed": 0,
                            "RetryScheduled": 0,
                            "Success": 100,
                            "status": "OK"
                        },
                        "alerting:monitoring_alert_missing_monitoring_data":
                        {
                            "Failed": 0,
                            "RetryScheduled": 0,
                            "Success": 100,
                            "status": "OK"
                        },
                        "alerting:monitoring_alert_nodes_changed":
                        {
                            "Failed": 0,
                            "RetryScheduled": 0,
                            "Success": 100,
                            "status": "OK"
                        },
                        "alerting:monitoring_alert_thread_pool_search_rejections":
                        {
                            "Failed": 0,
                            "RetryScheduled": 0,
                            "Success": 100,
                            "status": "OK"
                        },
                        "alerting:monitoring_alert_thread_pool_write_rejections":
                        {
                            "Failed": 0,
                            "RetryScheduled": 0,
                            "Success": 100,
                            "status": "OK"
                        },
                        "alerting:monitoring_ccr_read_exceptions":
                        {
                            "Failed": 0,
                            "RetryScheduled": 0,
                            "Success": 100,
                            "status": "OK"
                        },
                        "alerting:monitoring_shard_size":
                        {
                            "Failed": 0,
                            "RetryScheduled": 0,
                            "Success": 100,
                            "status": "OK"
                        },
                        "alerting:siem.eqlRule":
                        {
                            "Failed": 0,
                            "RetryScheduled": 0,
                            "Success": 100,
                            "status": "OK"
                        },
                        "alerting:siem.mlRule":
                        {
                            "Failed": 0,
                            "RetryScheduled": 0,
                            "Success": 100,
                            "status": "OK"
                        },
                        "alerting:siem.newTermsRule":
                        {
                            "Failed": 0,
                            "RetryScheduled": 0,
                            "Success": 100,
                            "status": "OK"
                        },
                        "alerting:siem.queryRule":
                        {
                            "Failed": 0,
                            "RetryScheduled": 0,
                            "Success": 100,
                            "status": "OK"
                        },
                        "alerting:siem.thresholdRule":
                        {
                            "Failed": 0,
                            "RetryScheduled": 0,
                            "Success": 100,
                            "status": "OK"
                        },
                        "alerting:xpack.uptime.alerts.monitorStatus":
                        {
                            "Failed": 0,
                            "RetryScheduled": 0,
                            "Success": 100,
                            "status": "OK"
                        },
                        "alerting_health_check":
                        {
                            "Failed": 0,
                            "RetryScheduled": 0,
                            "Success": 100,
                            "status": "OK"
                        },
                        "alerting_telemetry":
                        {
                            "Failed": 0,
                            "RetryScheduled": 0,
                            "Success": 100,
                            "status": "OK"
                        },
                        "alerts_invalidate_api_keys":
                        {
                            "Failed": 0,
                            "RetryScheduled": 0,
                            "Success": 100,
                            "status": "OK"
                        },
                        "apm-source-map-migration-task":
                        {
                            "Failed": 0,
                            "RetryScheduled": 0,
                            "Success": 100,
                            "status": "OK"
                        },
                        "apm-telemetry-task":
                        {
                            "Failed": 0,
                            "RetryScheduled": 0,
                            "Success": 100,
                            "status": "OK"
                        },
                        "cases-telemetry-task":
                        {
                            "Failed": 0,
                            "RetryScheduled": 0,
                            "Success": 100,
                            "status": "OK"
                        },
                        "cleanup_failed_action_executions":
                        {
                            "Failed": 0,
                            "RetryScheduled": 0,
                            "Success": 100,
                            "status": "OK"
                        },
                        "dashboard_telemetry":
                        {
                            "Failed": 0,
                            "RetryScheduled": 0,
                            "Success": 100,
                            "status": "OK"
                        },
                        "endpoint:metadata-check-transforms-task":
                        {
                            "Failed": 0,
                            "RetryScheduled": 0,
                            "Success": 100,
                            "status": "OK"
                        },
                        "endpoint:user-artifact-packager":
                        {
                            "Failed": 0,
                            "RetryScheduled": 0,
                            "Success": 100,
                            "status": "OK"
                        },
                        "fleet:check-deleted-files-task":
                        {
                            "Failed": 0,
                            "RetryScheduled": 0,
                            "Success": 100,
                            "status": "OK"
                        },
                        "osquery:telemetry-configs":
                        {
                            "Failed": 0,
                            "RetryScheduled": 0,
                            "Success": 100,
                            "status": "OK"
                        },
                        "osquery:telemetry-packs":
                        {
                            "Failed": 0,
                            "RetryScheduled": 0,
                            "Success": 100,
                            "status": "OK"
                        },
                        "osquery:telemetry-saved-queries":
                        {
                            "Failed": 0,
                            "RetryScheduled": 0,
                            "Success": 100,
                            "status": "OK"
                        },
                        "reports:monitor":
                        {
                            "Failed": 0,
                            "RetryScheduled": 0,
                            "Success": 100,
                            "status": "OK"
                        },
                        "security:endpoint-diagnostics":
                        {
                            "Failed": 0,
                            "RetryScheduled": 0,
                            "Success": 100,
                            "status": "OK"
                        },
                        "security:endpoint-meta-telemetry":
                        {
                            "Failed": 0,
                            "RetryScheduled": 0,
                            "Success": 100,
                            "status": "OK"
                        },
                        "security:telemetry-configuration":
                        {
                            "Failed": 0,
                            "RetryScheduled": 0,
                            "Success": 100,
                            "status": "OK"
                        },
                        "security:telemetry-detection-rules":
                        {
                            "Failed": 0,
                            "RetryScheduled": 0,
                            "Success": 100,
                            "status": "OK"
                        },
                        "security:telemetry-filterlist-artifact":
                        {
                            "Failed": 0,
                            "RetryScheduled": 0,
                            "Success": 100,
                            "status": "OK"
                        },
                        "security:telemetry-lists":
                        {
                            "Failed": 0,
                            "RetryScheduled": 0,
                            "Success": 100,
                            "status": "OK"
                        },
                        "security:telemetry-prebuilt-rule-alerts":
                        {
                            "Failed": 0,
                            "RetryScheduled": 0,
                            "Success": 100,
                            "status": "OK"
                        },
                        "security:telemetry-timelines":
                        {
                            "Failed": 0,
                            "RetryScheduled": 0,
                            "Success": 100,
                            "status": "OK"
                        },
                        "session_cleanup":
                        {
                            "Failed": 0,
                            "RetryScheduled": 0,
                            "Success": 100,
                            "status": "OK"
                        }
                    }
                },
                "load":
                {
                    "p50": 100,
                    "p90": 100,
                    "p95": 100,
                    "p99": 100
                },
                "polling":
                {
                    "claim_conflicts":
                    {
                        "p50": 0,
                        "p90": 0,
                        "p95": 0,
                        "p99": 0
                    },
                    "claim_duration":
                    {
                        "p50": 99,
                        "p90": 115.5,
                        "p95": 145,
                        "p99": 145
                    },
                    "claim_mismatches":
                    {
                        "p50": 0,
                        "p90": 0,
                        "p95": 0,
                        "p99": 0
                    },
                    "duration":
                    {
                        "p50": 207.5,
                        "p90": 297.5,
                        "p95": 307,
                        "p99": 458
                    },
                    "last_polling_delay": "2023-06-07T00:10:01.343Z",
                    "last_successful_poll": "2023-06-15T20:51:10.336Z",
                    "persistence":
                    {
                        "non_recurring": 100,
                        "recurring": 0
                    },
                    "result_frequency_percent_as_number":
                    {
                        "Failed": 0,
                        "NoAvailableWorkers": 0,
                        "NoTasksClaimed": 0,
                        "PoolFilled": 0,
                        "RanOutOfCapacity": 66,
                        "RunningAtCapacity": 34
                    }
                }
            }
        },
        "workload":
        {
            "status": "OK",
            "timestamp": "2023-06-15T20:51:10.414Z",
            "value":
            {
                "capacity_requirements":
                {
                    "per_day": 41,
                    "per_hour": 3913,
                    "per_minute": 63
                },
                "count": 909,
                "estimated_schedule_density":
                [
                    7,
                    0,
                    0,
                    0,
                    0,
                    0,
                    0,
                    0,
                    0,
                    0,
                    0,
                    0,
                    0,
                    0,
                    0,
                    0,
                    0,
                    0,
                    0,
                    0,
                    0,
                    0,
                    0,
                    0,
                    0,
                    0,
                    0,
                    0,
                    5,
                    8,
                    15,
                    21,
                    20,
                    20,
                    20,
                    20,
                    20,
                    9,
                    1,
                    10
                ],
                "non_recurring": 909,
                "overdue": 291,
                "overdue_non_recurring": 291,
                "owner_ids": 1,
                "schedule":
                [
                    [
                        "3s",
                        1
                    ],
                    [
                        "10s",
                        1
                    ],
                    [
                        "1m",
                        35
                    ],
                    [
                        "60s",
                        2
                    ],
                    [
                        "3m",
                        1
                    ],
                    [
                        "5m",
                        316
                    ],
                    [
                        "10m",
                        1
                    ],
                    [
                        "15m",
                        20
                    ],
                    [
                        "30m",
                        1
                    ],
                    [
                        "45m",
                        1
                    ],
                    [
                        "1h",
                        6
                    ],
                    [
                        "60m",
                        4
                    ],
                    [
                        "3600s",
                        2
                    ],
                    [
                        "2h",
                        1
                    ],
                    [
                        "3h",
                        1
                    ],
                    [
                        "720m",
                        2
                    ],
                    [
                        "1d",
                        9
                    ],
                    [
                        "24h",
                        8
                    ]
                ],
                "task_types":
                {
                    "Fleet-Usage-Logger":
                    {
                        "count": 1,
                        "status":
                        {
                            "idle": 1
                        }
                    },
                    "Fleet-Usage-Sender":
                    {
                        "count": 1,
                        "status":
                        {
                            "idle": 1
                        }
                    },
                    "ML:saved-objects-sync":
                    {
                        "count": 1,
                        "status":
                        {
                            "idle": 1
                        }
                    },
                    "actions:.webhook":
                    {
                        "count": 492,
                        "status":
                        {
                            "claiming": 10,
                            "idle": 179,
                            "running": 303
                        }
                    },
                    "actions_telemetry":
                    {
                        "count": 1,
                        "status":
                        {
                            "idle": 1
                        }
                    },
                    "alerting:.es-query":
                    {
                        "count": 2,
                        "status":
                        {
                            "idle": 2
                        }
                    },
                    "alerting:logs.alert.document.count":
                    {
                        "count": 2,
                        "status":
                        {
                            "idle": 2
                        }
                    },
                    "alerting:metrics.alert.inventory.threshold":
                    {
                        "count": 1,
                        "status":
                        {
                            "idle": 1
                        }
                    },
                    "alerting:monitoring_alert_cluster_health":
                    {
                        "count": 3,
                        "status":
                        {
                            "idle": 3
                        }
                    },
                    "alerting:monitoring_alert_cpu_usage":
                    {
                        "count": 3,
                        "status":
                        {
                            "idle": 3
                        }
                    },
                    "alerting:monitoring_alert_disk_usage":
                    {
                        "count": 2,
                        "status":
                        {
                            "idle": 2
                        }
                    },
                    "alerting:monitoring_alert_elasticsearch_version_mismatch":
                    {
                        "count": 3,
                        "status":
                        {
                            "idle": 3
                        }
                    },
                    "alerting:monitoring_alert_jvm_memory_usage":
                    {
                        "count": 2,
                        "status":
                        {
                            "idle": 2
                        }
                    },
                    "alerting:monitoring_alert_kibana_version_mismatch":
                    {
                        "count": 3,
                        "status":
                        {
                            "idle": 3
                        }
                    },
                    "alerting:monitoring_alert_license_expiration":
                    {
                        "count": 3,
                        "status":
                        {
                            "idle": 3
                        }
                    },
                    "alerting:monitoring_alert_logstash_version_mismatch":
                    {
                        "count": 3,
                        "status":
                        {
                            "idle": 3
                        }
                    },
                    "alerting:monitoring_alert_missing_monitoring_data":
                    {
                        "count": 2,
                        "status":
                        {
                            "idle": 2
                        }
                    },
                    "alerting:monitoring_alert_nodes_changed":
                    {
                        "count": 3,
                        "status":
                        {
                            "idle": 3
                        }
                    },
                    "alerting:monitoring_alert_thread_pool_search_rejections":
                    {
                        "count": 2,
                        "status":
                        {
                            "idle": 2
                        }
                    },
                    "alerting:monitoring_alert_thread_pool_write_rejections":
                    {
                        "count": 2,
                        "status":
                        {
                            "idle": 2
                        }
                    },
                    "alerting:monitoring_ccr_read_exceptions":
                    {
                        "count": 2,
                        "status":
                        {
                            "idle": 2
                        }
                    },
                    "alerting:monitoring_shard_size":
                    {
                        "count": 2,
                        "status":
                        {
                            "idle": 2
                        }
                    },
                    "alerting:siem.eqlRule":
                    {
                        "count": 266,
                        "status":
                        {
                            "idle": 266
                        }
                    },
                    "alerting:siem.mlRule":
                    {
                        "count": 19,
                        "status":
                        {
                            "idle": 19
                        }
                    },
                    "alerting:siem.newTermsRule":
                    {
                        "count": 2,
                        "status":
                        {
                            "idle": 2
                        }
                    },
                    "alerting:siem.queryRule":
                    {
                        "count": 47,
                        "status":
                        {
                            "idle": 47
                        }
                    },
                    "alerting:siem.thresholdRule":
                    {
                        "count": 5,
                        "status":
                        {
                            "idle": 5
                        }
                    },
                    "alerting:xpack.uptime.alerts.monitorStatus":
                    {
                        "count": 4,
                        "status":
                        {
                            "idle": 4
                        }
                    },
                    "alerting_health_check":
                    {
                        "count": 1,
                        "status":
                        {
                            "idle": 1
                        }
                    },
                    "alerting_telemetry":
                    {
                        "count": 1,
                        "status":
                        {
                            "idle": 1
                        }
                    },
                    "alerts_invalidate_api_keys":
                    {
                        "count": 1,
                        "status":
                        {
                            "idle": 1
                        }
                    },
                    "apm-telemetry-task":
                    {
                        "count": 1,
                        "status":
                        {
                            "idle": 1
                        }
                    },
                    "cases-telemetry-task":
                    {
                        "count": 1,
                        "status":
                        {
                            "idle": 1
                        }
                    },
                    "cleanup_failed_action_executions":
                    {
                        "count": 1,
                        "status":
                        {
                            "idle": 1
                        }
                    },
                    "dashboard_telemetry":
                    {
                        "count": 1,
                        "status":
                        {
                            "idle": 1
                        }
                    },
                    "endpoint:metadata-check-transforms-task":
                    {
                        "count": 1,
                        "status":
                        {
                            "idle": 1
                        }
                    },
                    "endpoint:user-artifact-packager":
                    {
                        "count": 1,
                        "status":
                        {
                            "idle": 1
                        }
                    },
                    "fleet:check-deleted-files-task":
                    {
                        "count": 1,
                        "status":
                        {
                            "idle": 1
                        }
                    },
                    "lens_telemetry":
                    {
                        "count": 1,
                        "status":
                        {
                            "idle": 1
                        }
                    },
                    "osquery:telemetry-configs":
                    {
                        "count": 1,
                        "status":
                        {
                            "idle": 1
                        }
                    },
                    "osquery:telemetry-packs":
                    {
                        "count": 2,
                        "status":
                        {
                            "idle": 2
                        }
                    },
                    "osquery:telemetry-saved-queries":
                    {
                        "count": 2,
                        "status":
                        {
                            "idle": 2
                        }
                    },
                    "reports:monitor":
                    {
                        "count": 1,
                        "status":
                        {
                            "idle": 1
                        }
                    },
                    "search_sessions_cleanup":
                    {
                        "count": 1,
                        "status":
                        {
                            "unrecognized": 1
                        }
                    },
                    "search_sessions_expire":
                    {
                        "count": 1,
                        "status":
                        {
                            "unrecognized": 1
                        }
                    },
                    "search_sessions_monitor":
                    {
                        "count": 1,
                        "status":
                        {
                            "unrecognized": 1
                        }
                    },
                    "security:endpoint-diagnostics":
                    {
                        "count": 1,
                        "status":
                        {
                            "idle": 1
                        }
                    },
                    "security:endpoint-meta-telemetry":
                    {
                        "count": 1,
                        "status":
                        {
                            "idle": 1
                        }
                    },
                    "security:telemetry-configuration":
                    {
                        "count": 1,
                        "status":
                        {
                            "idle": 1
                        }
                    },
                    "security:telemetry-detection-rules":
                    {
                        "count": 1,
                        "status":
                        {
                            "idle": 1
                        }
                    },
                    "security:telemetry-filterlist-artifact":
                    {
                        "count": 1,
                        "status":
                        {
                            "idle": 1
                        }
                    },
                    "security:telemetry-lists":
                    {
                        "count": 1,
                        "status":
                        {
                            "idle": 1
                        }
                    },
                    "security:telemetry-prebuilt-rule-alerts":
                    {
                        "count": 1,
                        "status":
                        {
                            "idle": 1
                        }
                    },
                    "security:telemetry-timelines":
                    {
                        "count": 1,
                        "status":
                        {
                            "idle": 1
                        }
                    },
                    "session_cleanup":
                    {
                        "count": 1,
                        "status":
                        {
                            "idle": 1
                        }
                    },
                    "vis_telemetry":
                    {
                        "count": 1,
                        "status":
                        {
                            "failed": 1
                        }
                    }
                }
            }
        }
    },
    "status": "warn",
    "timestamp": "2023-06-15T20:51:11.331Z"
}
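
For anyone skimming the dump above, a quick way to spot the task types that are not simply `idle` (here the three `search_sessions_*` tasks report `unrecognized` and `vis_telemetry` reports `failed`) is to filter the per-task-type map. Below is a minimal sketch, not part of Kibana or of the automation mentioned earlier: the helper name `non_idle_task_types` is hypothetical, and `sample` is a hand-copied three-entry excerpt of the object keyed by task type names such as `alerting:siem.eqlRule`.

```python
import json


def non_idle_task_types(task_types):
    """Return {task_type: {status: count}} for every status other than "idle".

    `task_types` is assumed to be the object keyed by task type name, where
    each value carries a "count" and a "status" map as in the dump above.
    """
    flagged = {}
    for name, info in task_types.items():
        non_idle = {
            status: count
            for status, count in info.get("status", {}).items()
            if status != "idle" and count > 0
        }
        if non_idle:
            flagged[name] = non_idle
    return flagged


# Hand-copied excerpt of the output above (illustrative subset only).
sample = json.loads("""
{
  "alerting:siem.eqlRule": {"count": 266, "status": {"idle": 266}},
  "search_sessions_cleanup": {"count": 1, "status": {"unrecognized": 1}},
  "vis_telemetry": {"count": 1, "status": {"failed": 1}}
}
""")

for name, statuses in sorted(non_idle_task_types(sample).items()):
    print(name, statuses)
```

Run against the full map, the same filter would list only the entries worth a second look, leaving out the many `idle` rule types.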

@ppisljar ppisljar added the Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) label Sep 11, 2023
@elasticmachine
Contributor

Pinging @elastic/response-ops (Team:ResponseOps)
