[8.x] [Data Views] Mitigate issue where `has_es_data` check can cause Kibana to hang (#200476) (#201025)

# Backport

This will backport the following commits from `main` to `8.x`:
- [[Data Views] Mitigate issue where `has_es_data` check can cause Kibana to hang (#200476)](https://github.com/elastic/kibana/pull/200476)

<!--- Backport version: 9.4.3 -->

### Questions ?
Please refer to the [Backport tool documentation](https://github.com/sqren/backport)

### Source commit

- Commit: `96fd4b682b77f6c1d6d1c6ab0742462d9e9d2589`, committed 2024-11-20T18:52:47Z, author Davis McPhee
- Source pull request: https://github.com/elastic/kibana/pull/200476 (merged into `main`)
- Labels: `release_note:fix`, `v9.0.0`, `Team:DataDiscovery`, `backport:prev-major`
- Branch label mapping: `v9.0.0` maps to `main`, `v8.17.0` maps to `8.x`

The source commit message follows.

## Summary

This PR mitigates an issue where the `has_es_data` check can hang when some remote clusters are unresponsive, leaving users stuck in a loading state in some apps (e.g. Discover and Dashboard) until the request times out. There are two main changes that help mitigate this issue:
- The `resolve/cluster` request in the `has_es_data` endpoint has been split into two requests: one for local data first, then another for remote data. In cases where remote clusters are unresponsive but there is data available in the local cluster, the remote check is never performed and the check completes quickly. This likely resolves the majority of cases and is also likely faster in general than checking both local and remote clusters in a single request.
- In cases where there is no local data and the remote `resolve/cluster` request hangs, a new `data_views.hasEsDataTimeout` config has been added to `kibana.yml` (defaults to 5 seconds) to abort the request after a short delay. The front end handles this scenario by displaying an error toast informing the user of the issue and assuming there is data available, so the user is not blocked. When this occurs, a warning is also logged to the Kibana server logs.

![CleanShot 2024-11-18 at 23 47 34@2x](https://github.com/user-attachments/assets/6ea14869-b6b6-4d89-a90c-8150d6e6b043)

Fixes #200280.

### Notes
- Modifying the existing version of the `has_es_data` endpoint in this way should be backward compatible, since the behaviour should remain unchanged from before when the client and server versions don't match (please validate if this seems accurate during review).
- For a long term fix, the ES team is investigating the issue with `resolve/cluster` and will aim to have it behave like `resolve/index`, which fails quickly when remote clusters are unresponsive. They may also implement other mitigations like a configurable timeout in ES: https://github.com/elastic/elasticsearch/issues/114020. The purpose of this PR is to provide an immediate solution in Kibana that mitigates the issue as much as possible.
- If ES ends up providing another performant method for checking if indices exist instead of `resolve/cluster`, Kibana should migrate to that. More details in https://github.com/elastic/elasticsearch/issues/112307.

### Testing notes

To reproduce the issue locally, follow these steps:
- Follow [these instructions](https://gist.github.com/lukasolson/d0861aa3e6ee476ac8dd7189ed476756) to set up a local CCS environment.
- Stop the remote cluster process.
- Use Netcat on the remote cluster port to listen to requests but not respond (e.g. on macOS: `nc -l 9600`), simulating an unresponsive cluster. See elastic/elasticsearch#32678 for more context.
- Navigate to Discover and observe that the `has_es_data` request hangs. When testing in this PR branch, the request will only wait for 5 seconds before assuming data exists and displaying a toast.

### Checklist

- [x] Any text added follows [EUI's writing guidelines](https://elastic.github.io/eui/#/guidelines/writing), uses sentence case text and includes [i18n support](https://github.com/elastic/kibana/blob/main/packages/kbn-i18n/README.md)
- [ ] [Documentation](https://www.elastic.co/guide/en/kibana/master/development-documentation.html) was added for features that require explanation or tutorials
- [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios
- [ ] If a plugin configuration key changed, check if it needs to be allowlisted in the cloud and added to the [docker list](https://github.com/elastic/kibana/blob/main/src/dev/build/tasks/os_packages/docker_generator/resources/base/bin/kibana-docker)
- [x] This was checked for breaking HTTP API changes, and any breaking changes have been approved by the breaking-change committee. The `release_note:breaking` label should be applied in these situations.
- [ ] [Flaky Test Runner](https://ci-stats.kibana.dev/trigger_flaky_test_runner/1) was used on any tests changed
- [x] The PR description includes the appropriate Release Notes section, and the correct `release_note:*` label is applied per the [guidelines](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)

---------

Co-authored-by: kibanamachine <[email protected]>

Co-authored-by: Davis McPhee <[email protected]>
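
For readers who want the control flow of the mitigation at a glance, here is a minimal TypeScript sketch of the "local first, then remote with a timeout" pattern the summary describes. It is not the code from this PR: `hasEsData`, `ResolveFn`, `HasEsDataResult`, and the `timedOut` flag are hypothetical names chosen for illustration, and the sketch folds the server-side abort and the front-end "assume data exists" fallback into one function for brevity.

```typescript
// Hypothetical sketch, not Kibana's implementation. `ResolveFn` stands in for
// whatever performs the `resolve/cluster` lookup for a given scope.
type ResolveFn = (scope: 'local' | 'remote', signal: AbortSignal) => Promise<boolean>;

interface HasEsDataResult {
  hasEsData: boolean;
  // True when the remote check was aborted; a caller would log a warning
  // and show a toast in that case.
  timedOut: boolean;
}

async function hasEsData(resolve: ResolveFn, timeoutMs: number): Promise<HasEsDataResult> {
  // Step 1: check local data only. If it exists, the remote clusters are
  // never contacted, so an unresponsive remote cannot stall the check.
  const localController = new AbortController();
  if (await resolve('local', localController.signal)) {
    return { hasEsData: true, timedOut: false };
  }

  // Step 2: check remote data, aborting the request after `timeoutMs`.
  const remoteController = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<'timeout'>((done) => {
    timer = setTimeout(() => {
      remoteController.abort();
      done('timeout');
    }, timeoutMs);
  });

  try {
    const outcome = await Promise.race([resolve('remote', remoteController.signal), timeout]);
    if (outcome === 'timeout') {
      // Assume data exists so the user is not blocked.
      return { hasEsData: true, timedOut: true };
    }
    return { hasEsData: outcome, timedOut: false };
  } finally {
    clearTimeout(timer);
  }
}
```

Checking local data first keeps the common case fast, and the timeout bounds the worst case to roughly `timeoutMs` instead of the full request timeout.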