[Data Views] `has_es_data` request hangs when remote clusters are unresponsive #200280

davismcphee · 2024-11-14T23:18:12Z

In #191566 we switched from using the resolve/index API to the resolve/cluster API for checking if any user data exists in Kibana. This was done for performance reasons, since in most cases resolve/cluster should respond significantly faster than resolve/index and return a smaller payload. However, this created an issue when any of the remote clusters are unresponsive, causing the resolve/cluster request to hang until it eventually times out, which can take upward of a minute. In these cases, the Kibana user is left waiting in a loading state in the UI (e.g. in Discover and Dashboard) until the request timeout.

We confirmed resolve/cluster was the cause by executing the underlying request sent by has_es_data directly in dev tools in an affected environment:

GET /_resolve/cluster/*%2C-.*%2C-logs-enterprise_search.api-default%2C-logs-enterprise_search.audit-default%2C*%3A*%2C*%3A-.*%2C*%3A-logs-enterprise_search.api-default%2C*%3A-logs-enterprise_search.audit-default?allow_no_indices=true&ignore_unavailable=true

We then executed a request against just the local indices, confirming it was fast:

GET /_resolve/cluster/%2A%2C-.%2A%2C-logs-enterprise_search.api-default%2C-logs-enterprise_search.audit-default?allow_no_indices=true&ignore_unavailable=true

And another against just the remote indices, confirming it was slow:

GET /_resolve/cluster/%2A%3A%2A%2C%2A%3A-.%2A%2C%2A%3A-logs-enterprise_search.api-default%2C%2A%3A-logs-enterprise_search.audit-default?allow_no_indices=true&ignore_unavailable=true
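The three requests above differ only in the index expression placed in the path. As a rough illustration, the sketch below rebuilds those paths from the pattern list visible in the requests. The helper and constant names are hypothetical and are not taken from the Kibana source.

```typescript
// Patterns copied from the requests above. All names here are hypothetical.
const LOCAL_PATTERNS: string[] = [
  '*',
  '-.*',
  '-logs-enterprise_search.api-default',
  '-logs-enterprise_search.audit-default',
];

// Prefixing a pattern with "*:" targets every configured remote cluster.
const REMOTE_PATTERNS: string[] = LOCAL_PATTERNS.map((pattern) => `*:${pattern}`);

function buildResolveClusterPath(patterns: string[]): string {
  // encodeURIComponent escapes "," and ":" but leaves "*" untouched.
  const expression = encodeURIComponent(patterns.join(','));
  return `/_resolve/cluster/${expression}?allow_no_indices=true&ignore_unavailable=true`;
}

// Single combined request (local + remote), local only, and remote only.
const combinedPath = buildResolveClusterPath([...LOCAL_PATTERNS, ...REMOTE_PATTERNS]);
const localPath = buildResolveClusterPath(LOCAL_PATTERNS);
const remotePath = buildResolveClusterPath(REMOTE_PATTERNS);
```

The local-only and remote-only requests above encode `*` as `%2A`, whereas this sketch leaves it unescaped. Both forms decode to the same expression.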

Notes:

- We currently send a single resolve/cluster request covering both local and remote clusters. The previous implementation, which relied on resolve/index, sent two requests -- one to check the local cluster first, then a second for remote clusters only if the first came back empty. Applying the same approach to resolve/cluster would likely mitigate the issue for the majority of use cases.
- Passing a custom timeout from Kibana would help mitigate the issue, but resolve/cluster doesn't currently support one, although there's a request for it here: [Resolve Clusters API] Add option to configure cluster timeout elasticsearch#114020. In the meantime, we could implement a timeout on the Kibana endpoint so the UI doesn't hang.
- Maybe relying on resolve/cluster isn't the right approach, and we should use something else instead. ~~One alternative, suggested in "Need performant method of determining whether there are indices" elasticsearch#112307 (comment), was to use the Exists API instead.~~ It seems the Exists API doesn't behave the way we'd need when wildcards are used. See "Indices Exists API should return 404 for empty wildcards" elasticsearch#34499.
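The first note (check local data first, and query remote clusters only if nothing is found locally) could be sketched roughly as follows. This is a minimal illustration, not the Kibana implementation: the `ResolveCluster` function type is a hypothetical stand-in for the real client call, and the response type is an assumed, simplified subset of what resolve/cluster returns.

```typescript
// Assumed, simplified shape of a resolve/cluster response: cluster alias -> info.
type ResolveClusterResponse = Record<
  string,
  { connected: boolean; matching_indices?: boolean }
>;

// Hypothetical injected function that performs the actual request.
type ResolveCluster = (patterns: string[]) => Promise<ResolveClusterResponse>;

function hasMatchingIndices(response: ResolveClusterResponse): boolean {
  // True if at least one cluster reports indices matching the expression.
  return Object.keys(response).some(
    (alias) => response[alias].matching_indices === true
  );
}

async function hasEsData(
  resolveCluster: ResolveCluster,
  localPatterns: string[]
): Promise<boolean> {
  // Step 1: local cluster only. This never waits on a remote cluster.
  if (hasMatchingIndices(await resolveCluster(localPatterns))) {
    return true;
  }
  // Step 2: only when the local cluster is empty, ask the remote clusters.
  const remotePatterns = localPatterns.map((pattern) => `*:${pattern}`);
  return hasMatchingIndices(await resolveCluster(remotePatterns));
}
```

With this ordering, an unresponsive remote cluster can only slow the check down when the local cluster has no matching indices.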


elasticmachine · 2024-11-14T23:18:14Z

Pinging @elastic/kibana-data-discovery (Team:DataDiscovery)

…a to hang (elastic#200476)

## Summary

This PR mitigates an issue where the `has_es_data` check can hang when some remote clusters are unresponsive, leaving users stuck in a loading state in some apps (e.g. Discover and Dashboard) until the request times out.

There are two main changes that help mitigate this issue:

- The `resolve/cluster` request in the `has_es_data` endpoint has been split into two requests -- one for local data first, then another for remote data second. In cases where remote clusters are unresponsive but there is data available in the local cluster, the remote check is never performed and the check completes quickly. This likely resolves the majority of cases and is also likely faster in general than checking both local and remote clusters in a single request.
- In cases where there is no local data and the remote `resolve/cluster` request hangs, a new `data_views.hasEsDataTimeout` config has been added to `kibana.yml` (defaults to 5 seconds) to abort the request after a short delay. This scenario is handled in the front end by displaying an error toast to the user informing them of the issue, and assuming there is data available to avoid blocking them. When this occurs, a warning is also logged to the Kibana server logs.

![CleanShot 2024-11-18 at 23 47 34@2x](https://github.com/user-attachments/assets/6ea14869-b6b6-4d89-a90c-8150d6e6b043)

Fixes elastic#200280.

### Notes

- Modifying the existing version of the `has_es_data` endpoint in this way should be backward compatible since the behaviour should remain unchanged from before when the client and server versions don't match (please validate if this seems accurate during review).
- For a long-term fix, the ES team is investigating the issue with `resolve/cluster` and will aim to have it behave like `resolve/index`, which fails quickly when remote clusters are unresponsive. They may also implement other mitigations like a configurable timeout in ES: elastic/elasticsearch#114020. The purpose of this PR is to provide an immediate solution in Kibana that mitigates the issue as much as possible.
- If ES ends up providing another performant method for checking if indices exist instead of `resolve/cluster`, Kibana should migrate to that. More details in elastic/elasticsearch#112307.

### Testing notes

To reproduce the issue locally, follow these steps:

- Follow [these instructions](https://gist.github.com/lukasolson/d0861aa3e6ee476ac8dd7189ed476756) to set up a local CCS environment.
- Stop the remote cluster process.
- Use Netcat on the remote cluster port to listen to requests but not respond (e.g. on macOS: `nc -l 9600`), simulating an unresponsive cluster. See elastic/elasticsearch#32678 for more context.
- Navigate to Discover and observe that the `has_es_data` request hangs. When testing in this PR branch, the request will only wait for 5 seconds before assuming data exists and displaying a toast.

### Checklist

- [x] Any text added follows [EUI's writing guidelines](https://elastic.github.io/eui/#/guidelines/writing), uses sentence case text and includes [i18n support](https://github.com/elastic/kibana/blob/main/packages/kbn-i18n/README.md)
- [ ] [Documentation](https://www.elastic.co/guide/en/kibana/master/development-documentation.html) was added for features that require explanation or tutorials
- [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios
- [ ] If a plugin configuration key changed, check if it needs to be allowlisted in the cloud and added to the [docker list](https://github.com/elastic/kibana/blob/main/src/dev/build/tasks/os_packages/docker_generator/resources/base/bin/kibana-docker)
- [x] This was checked for breaking HTTP API changes, and any breaking changes have been approved by the breaking-change committee. The `release_note:breaking` label should be applied in these situations.
- [ ] [Flaky Test Runner](https://ci-stats.kibana.dev/trigger_flaky_test_runner/1) was used on any tests changed
- [x] The PR description includes the appropriate Release Notes section, and the correct `release_note:*` label is applied per the [guidelines](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)

---------

Co-authored-by: kibanamachine <[email protected]>

(cherry picked from commit 96fd4b6)
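The timeout behaviour described in the summary (abort the hanging request after a short delay, then assume data exists so the user is not blocked) might look roughly like the sketch below. It is an assumption-laden illustration: `checkWithTimeout` and `HasDataResult` are hypothetical names, and the real change splits this logic between the server endpoint and the front end rather than using a single function.

```typescript
// Hypothetical result type: `timedOut` lets the caller show a toast or log a warning.
interface HasDataResult {
  hasData: boolean;
  timedOut: boolean;
}

async function checkWithTimeout(
  check: (signal: AbortSignal) => Promise<boolean>,
  timeoutMs: number
): Promise<HasDataResult> {
  const controller = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;

  const timeout = new Promise<HasDataResult>((resolve) => {
    timer = setTimeout(() => {
      // Signal the in-flight request to stop, then fall back to
      // "data exists" so the caller is not left waiting.
      controller.abort();
      resolve({ hasData: true, timedOut: true });
    }, timeoutMs);
  });

  const result = check(controller.signal).then(
    (hasData): HasDataResult => ({ hasData, timedOut: false })
  );

  try {
    // Whichever settles first wins: the real answer or the timeout fallback.
    return await Promise.race([result, timeout]);
  } finally {
    clearTimeout(timer);
  }
}
```

A caller would pass a timeout matching the configured value (5000 ms by default, per the summary) and treat `timedOut: true` as the cue to warn the user.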

… can cause Kibana to hang (#200476) (#201025)

# Backport

This will backport the following commits from `main` to `8.x`:

- [[Data Views] Mitigate issue where `has_es_data` check can cause Kibana to hang (#200476)](#200476)

### Questions ?

Please refer to the [Backport tool documentation](https://github.com/sqren/backport)

Co-authored-by: Davis McPhee <[email protected]>

…k can cause Kibana to hang (#200476) (#201024)

# Backport

This will backport the following commits from `main` to `8.16`:

- [[Data Views] Mitigate issue where `has_es_data` check can cause Kibana to hang (#200476)](#200476)

### Questions ?

Please refer to the [Backport tool documentation](https://github.com/sqren/backport)

Co-authored-by: Davis McPhee <[email protected]>

…k can cause Kibana to hang (#200476) (#201023)

# Backport

This will backport the following commits from `main` to `8.15`:

- [[Data Views] Mitigate issue where `has_es_data` check can cause Kibana to hang (#200476)](#200476)

### Questions ?

Please refer to the [Backport tool documentation](https://github.com/sqren/backport)

---------

Co-authored-by: Davis McPhee <[email protected]>

davismcphee self-assigned this Nov 14, 2024

davismcphee mentioned this issue Nov 18, 2024

[Data Views] Mitigate issue where has_es_data check can cause Kibana to hang #200476

Merged

7 tasks

davismcphee closed this as completed in #200476 Nov 20, 2024

davismcphee closed this as completed in 96fd4b6 Nov 20, 2024
