Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[8.16] [Data Views] Mitigate issue where `has_es_data` check can cause Kibana to hang (#200476) #201024

Merged
merged 1 commit into from
Nov 20, 2024

Conversation

kibanamachine
Copy link
Contributor

Backport

This will backport the following commits from main to 8.16:

Questions ?

Please refer to the Backport tool documentation

…a to hang (elastic#200476)

## Summary

This PR mitigates an issue where the `has_es_data` check can hang when
some remote clusters are unresponsive, leaving users stuck in a loading
state in some apps (e.g. Discover and Dashboard) until the request times
out. There are two main changes that help mitigate this issue:
- The `resolve/cluster` request in the `has_es_data` endpoint has been
split into two requests -- one for local data first, then another for
remote data second. In cases where remote clusters are unresponsive but
there is data available in the local cluster, the remote check is never
performed and the check completes quickly. This likely resolves the
majority of cases and is also likely faster in general than checking
both local and remote clusters in a single request.
- In cases where there is no local data and the remote `resolve/cluster`
request hangs, a new `data_views.hasEsDataTimeout` config has been added
to `kibana.yml` (defaults to 5 seconds) to abort the request after a
short delay. This scenario is handled in the front end by displaying an
error toast to the user informing them of the issue, and assuming there
is data available to avoid blocking them. When this occurs, a warning is
also logged to the Kibana server logs.

![CleanShot 2024-11-18 at 23 47
34@2x](https://github.com/user-attachments/assets/6ea14869-b6b6-4d89-a90c-8150d6e6b043)

Fixes elastic#200280.

### Notes
- Modifying the existing version of the `has_es_data` endpoint in this
way should be backward compatible since the behaviour should remain
unchanged from before when the client and server versions don't match
(please validate if this seems accurate during review).
- For a long term fix, the ES team is investigating the issue with
`resolve/cluster` and will aim to have it behave like `resolve/index`,
which fails quickly when remote clusters are unresponsive. They may also
implement other mitigations like a configurable timeout in ES:
elastic/elasticsearch#114020. The purpose of
this PR is to provide an immediate solution in Kibana that mitigates the
issue as much as possible.
- If ES ends up providing another performant method for checking if
indices exist instead of `resolve/cluster`, Kibana should migrate to
that. More details in
elastic/elasticsearch#112307.

### Testing notes

To reproduce the issue locally, follow these steps:
- Follow [these
instructions](https://gist.github.com/lukasolson/d0861aa3e6ee476ac8dd7189ed476756)
to set up a local CCS environment.
- Stop the remote cluster process.
- Use Netcat on the remote cluster port to listen to requests but not
respond (e.g. on macOS: `nc -l 9600`), simulating an unresponsive
cluster. See elastic/elasticsearch#32678 for
more context.
- Navigate to Discover and observe that the `has_es_data` request hangs.
When testing in this PR branch, the request will only wait for 5 seconds
before assuming data exists and displaying a toast.

### Checklist

- [x] Any text added follows [EUI's writing
guidelines](https://elastic.github.io/eui/#/guidelines/writing), uses
sentence case text and includes [i18n
support](https://github.com/elastic/kibana/blob/main/packages/kbn-i18n/README.md)
- [ ]
[Documentation](https://www.elastic.co/guide/en/kibana/master/development-documentation.html)
was added for features that require explanation or tutorials
- [x] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios
- [ ] If a plugin configuration key changed, check if it needs to be
allowlisted in the cloud and added to the [docker
list](https://github.com/elastic/kibana/blob/main/src/dev/build/tasks/os_packages/docker_generator/resources/base/bin/kibana-docker)
- [x] This was checked for breaking HTTP API changes, and any breaking
changes have been approved by the breaking-change committee. The
`release_note:breaking` label should be applied in these situations.
- [ ] [Flaky Test
Runner](https://ci-stats.kibana.dev/trigger_flaky_test_runner/1) was
used on any tests changed
- [x] The PR description includes the appropriate Release Notes section,
and the correct `release_node:*` label is applied per the
[guidelines](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)

---------

Co-authored-by: kibanamachine <[email protected]>
(cherry picked from commit 96fd4b6)
@kibanamachine kibanamachine merged commit 44f4203 into elastic:8.16 Nov 20, 2024
26 checks passed
@elasticmachine
Copy link
Contributor

💚 Build Succeeded

Metrics [docs]

Module Count

Fewer modules leads to a faster build time

id before after diff
dataViews 53 55 +2

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

id before after diff
dataViews 1.9KB 1.9KB -1.0B

Page load bundle

Size of the bundles that are downloaded on every page load. Target size is below 100kb

id before after diff
dataViews 61.6KB 62.6KB +999.0B
Unknown metric groups

API count

id before after diff
dataViews 1224 1225 +1

ESLint disabled line counts

id before after diff
dataViews 11 12 +1

Total ESLint disabled count

id before after diff
dataViews 13 14 +1

cc @davismcphee

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants