[Data Views] Mitigate issue where has_es_data check can cause Kibana to hang #200476

Merged on Nov 20, 2024 (10 commits)

@davismcphee (Contributor) commented Nov 18, 2024

Summary

This PR mitigates an issue where the has_es_data check can hang when some remote clusters are unresponsive, leaving users stuck in a loading state in some apps (e.g. Discover and Dashboard) until the request times out. There are two main changes that help mitigate this issue:

  • The resolve/cluster request in the has_es_data endpoint has been split into two requests -- one for local data first, then another for remote data second. In cases where remote clusters are unresponsive but there is data available in the local cluster, the remote check is never performed and the check completes quickly. This likely resolves the majority of cases and is also likely faster in general than checking both local and remote clusters in a single request.
  • In cases where there is no local data and the remote resolve/cluster request hangs, a new data_views.hasEsDataTimeout config has been added to kibana.yml (defaults to 5 seconds) to abort the request after a short delay. This scenario is handled in the front end by displaying an error toast to the user informing them of the issue, and assuming there is data available to avoid blocking them. When this occurs, a warning is also logged to the Kibana server logs.
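
The two changes above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `hasEsData`, `raceWithTimeout`, the `ClusterCheck` callbacks, and the result shape are all hypothetical names. The sketch also only stops waiting on the remote check, whereas the endpoint is described as aborting the request.

```typescript
// Hypothetical sketch: none of these names come from the Kibana source.
type ClusterCheck = () => Promise<boolean>;

type Raced<T> = { timedOut: true } | { timedOut: false; value: T };

type HasEsDataResult =
  | { ok: true; hasEsData: boolean }
  | { ok: false; failureReason: 'remote_data_timeout' };

// Resolves with `timedOut: true` if `promise` has not settled after `ms` milliseconds.
function raceWithTimeout<T>(promise: Promise<T>, ms: number): Promise<Raced<T>> {
  return new Promise<Raced<T>>((resolve, reject) => {
    const timer = setTimeout(() => resolve({ timedOut: true }), ms);
    promise.then(
      (value) => {
        clearTimeout(timer);
        resolve({ timedOut: false, value });
      },
      (error) => {
        clearTimeout(timer);
        reject(error);
      }
    );
  });
}

async function hasEsData(
  checkLocal: ClusterCheck,
  checkRemote: ClusterCheck,
  timeoutMs = 5000 // mirrors the 5 second default described above
): Promise<HasEsDataResult> {
  // Local data first: when it exists, remote clusters are never contacted.
  if (await checkLocal()) {
    return { ok: true, hasEsData: true };
  }
  // Remote data second, bounded by the timeout.
  const remote = await raceWithTimeout(checkRemote(), timeoutMs);
  if (remote.timedOut) {
    return { ok: false, failureReason: 'remote_data_timeout' };
  }
  return { ok: true, hasEsData: remote.value };
}
```

With a `checkRemote` that never settles, `hasEsData` resolves after roughly `timeoutMs` with the timeout failure reason instead of hanging, which is the situation the front end then turns into a toast.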

[Screenshot: CleanShot 2024-11-18 at 23 47 34@2x]

Fixes #200280.

Notes

  • Modifying the existing version of the has_es_data endpoint in this way should be backward compatible since the behaviour should remain unchanged from before when the client and server versions don't match (please validate if this seems accurate during review).
  • For a long-term fix, the ES team is investigating the issue with resolve/cluster and will aim to have it behave like resolve/index, which fails quickly when remote clusters are unresponsive. They may also implement other mitigations like a configurable timeout in ES (see "[Resolve Clusters API] Add option to configure cluster timeout", elasticsearch#114020). The purpose of this PR is to provide an immediate solution in Kibana that mitigates the issue as much as possible.
  • If ES ends up providing another performant method for checking if indices exist instead of resolve/cluster, Kibana should migrate to that. More details in Need performant method of determining whether there are indices elasticsearch#112307.

Testing notes

To reproduce the issue locally, follow these steps:

  • Follow these instructions (https://gist.github.com/lukasolson/d0861aa3e6ee476ac8dd7189ed476756) to set up a local cross-cluster search (CCS) environment.
  • Stop the remote cluster process.
  • Use Netcat on the remote cluster port to listen to requests but not respond (e.g. on macOS: nc -l 9600), simulating an unresponsive cluster. See CCS: Should timeout parameter be honored? elasticsearch#32678 for more context.
  • Navigate to Discover and observe that the has_es_data request hangs. When testing in this PR branch, the request will only wait for 5 seconds before assuming data exists and displaying a toast.

Checklist

  • [x] Any text added follows EUI's writing guidelines, uses sentence case text and includes i18n support
  • [ ] Documentation was added for features that require explanation or tutorials
  • [x] Unit or functional tests were updated or added to match the most common scenarios
  • [ ] If a plugin configuration key changed, check if it needs to be allowlisted in the cloud and added to the docker list
  • [x] This was checked for breaking HTTP API changes, and any breaking changes have been approved by the breaking-change committee. The release_note:breaking label should be applied in these situations.
  • [ ] Flaky Test Runner was used on any tests changed
  • [x] The PR description includes the appropriate Release Notes section, and the correct release_note:* label is applied per the guidelines

@davismcphee added the release_note:fix, Team:DataDiscovery, and backport:prev-major labels Nov 18, 2024
@davismcphee self-assigned this Nov 18, 2024
@davismcphee force-pushed the fix-has-es-data-hanging branch from 958cf78 to 6d29a2a November 19, 2024 02:23
@davismcphee marked this pull request as ready for review November 19, 2024 05:31
@davismcphee requested a review from a team as a code owner November 19, 2024 05:31
@elasticmachine

Pinging @elastic/kibana-data-discovery (Team:DataDiscovery)

@lukasolson left a comment

Left a couple of minor notes below

e.body?.statusCode === 400 &&
e.body?.attributes?.failureReason === HasEsDataFailureReason.remoteDataTimeout
) {
core.notifications.toasts.addDanger({
Member

From an API perspective, is it possible that consumers will want to swallow toasts/error messages? Does it make sense to have consumers pass in a onRemoteDataTimeout function that defaults to this behavior, but would also allow consumers to handle it in different ways?

(I think toasts are a decent default behavior but I don't think every possible consumer of this API will want to show a toast in these error scenarios.)

Contributor Author

Yeah I think that's reasonable. It makes sense to provide a way to override it in cases where consumers have a better way to handle it. Updated here: 47965df.
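
One way to read the suggestion in this thread: the consumer passes an optional `onRemoteDataTimeout` callback, and the toast stays as the default. A minimal sketch, assuming a hypothetical `createTimeoutHandler` helper and a plain `showToast` function standing in for `core.notifications.toasts.addDanger`. The actual change is in 47965df and may look different.

```typescript
// Hypothetical sketch; only the `onRemoteDataTimeout` name comes from the review comment.
type TimeoutHandler = (message: string) => void;

interface HasEsDataOptions {
  // Optional override; when omitted, the default toast behaviour is used.
  onRemoteDataTimeout?: TimeoutHandler;
}

function createTimeoutHandler(
  showToast: TimeoutHandler,
  options: HasEsDataOptions = {}
): (message: string) => boolean {
  const notify = options.onRemoteDataTimeout ?? showToast;
  return (message: string): boolean => {
    notify(message);
    // Assume data exists so the user is not blocked, as described in the PR summary.
    return true;
  };
}
```

A consumer that wants to swallow the message would pass `{ onRemoteDataTimeout: () => {} }`, while everyone else keeps the toast without any extra wiring.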

@@ -82,6 +106,9 @@ export class HasData {

// ES Data

private isResponseError = (e: any): e is IHttpFetchError<ResponseErrorBody> =>
Member

nit: Don't need to use any

Suggested change
- private isResponseError = (e: any): e is IHttpFetchError<ResponseErrorBody> =>
+ private isResponseError = (e: Error): e is IHttpFetchError<ResponseErrorBody> =>

Contributor Author

True! Updated: 079588b.

Member

🙌
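
For context on the nit above: typing the parameter as `Error` rather than `any` keeps the compiler checking whatever the guard does with `e`. A minimal sketch with stand-in types follows. `HttpFetchErrorLike` and `ResponseErrorBodyLike` are hypothetical simplifications of `IHttpFetchError<ResponseErrorBody>`, and the guard body is an assumption, since the diff context does not show it.

```typescript
// Hypothetical stand-ins for the Kibana types referenced in the diff.
interface ResponseErrorBodyLike {
  statusCode?: number;
  attributes?: { failureReason?: string };
}

interface HttpFetchErrorLike extends Error {
  body: ResponseErrorBodyLike;
}

// With `e: Error`, touching an undeclared property is a compile error,
// whereas `e: any` would silently allow it.
const isResponseError = (e: Error): e is HttpFetchErrorLike => 'body' in e;

function getFailureReason(e: Error): string | undefined {
  return isResponseError(e) ? e.body.attributes?.failureReason : undefined;
}
```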

Comment on lines 102 to 107
return res.badRequest({
  body: {
    message: timeoutMessage,
    attributes: { failureReason: timeoutReason },
  },
});
Member

I'm not sure a "bad request" response makes sense here... The client didn't send anything wrong. Maybe we can do a custom error with a 408 status code?

Suggested change
- return res.badRequest({
-   body: {
-     message: timeoutMessage,
-     attributes: { failureReason: timeoutReason },
-   },
- });
+ return res.customError({
+   body: {
+     statusCode: 408,
+     message: timeoutMessage,
+     attributes: { failureReason: timeoutReason },
+   },
+ });

Contributor Author

I agree, 400 isn't really appropriate for this. I originally looked at 408 too, but my understanding is that it's used when a server times out waiting on a request from the client, not when it times out trying to return a response. After reading into it a bit more, I feel like 504 Gateway Timeout might be most appropriate for this case, so I updated it here: f130736.

I was hoping to avoid a generic 500 since it may be misleading and look like a Kibana server failure, but we could instead just go with that if 504 doesn't seem good either.

Comment on lines 112 to 117
return res.badRequest({
  body: {
    message: errorMessage,
    attributes: { failureReason: HasEsDataFailureReason.unknown },
  },
});
Member

This one I'm not sure what to do... We can probably leave as is. Are there any known cases we might fail here? If so, we might want to check e.meta.statusCode and use it.

Contributor Author

Yeah, also not a good case for 400. And nope, no known cases... Which makes me realize it probably makes sense to just return a 500 here since it's unexpected. I think this is good enough for the client, and we log the underlying error if needed for further investigation. Updated here: f130736.
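
The outcome of the two threads above, as described in the replies (504 for the remote timeout, 500 for unexpected failures), could look roughly like this. The sketch is framework-free: `FailureReason`, `ErrorResponse`, and `toErrorResponse` are hypothetical names, and the string values of the failure reasons are assumptions rather than the values used in f130736.

```typescript
// Hypothetical sketch; the failure reason strings are assumed values.
type FailureReason = 'remote_data_timeout' | 'unknown';

interface ErrorResponse {
  statusCode: 500 | 504;
  body: { message: string; attributes: { failureReason: FailureReason } };
}

function toErrorResponse(failureReason: FailureReason, message: string): ErrorResponse {
  // 504 Gateway Timeout: the server gave up waiting on an upstream (the remote cluster).
  // 500 Internal Server Error: anything unexpected.
  const statusCode = failureReason === 'remote_data_timeout' ? 504 : 500;
  return { statusCode, body: { message, attributes: { failureReason } } };
}
```

Keeping `failureReason` in the body lets the client tell the timeout apart from other failures without relying on the status code alone.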

@davismcphee left a comment

@lukasolson Thanks for the feedback, and I made some updates.

@elasticmachine

💚 Build Succeeded

Metrics [docs]

Module Count

Fewer modules lead to a faster build time

| id | before | after | diff |
| --- | --- | --- | --- |
| dataViews | 53 | 55 | +2 |

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

| id | before | after | diff |
| --- | --- | --- | --- |
| dataViews | 1.9KB | 1.9KB | -1.0B |

Page load bundle

Size of the bundles that are downloaded on every page load. Target size is below 100kb

| id | before | after | diff |
| --- | --- | --- | --- |
| dataViews | 61.6KB | 62.6KB | +999.0B |

Unknown metric groups

API count

| id | before | after | diff |
| --- | --- | --- | --- |
| dataViews | 1224 | 1225 | +1 |

ESLint disabled line counts

| id | before | after | diff |
| --- | --- | --- | --- |
| dataViews | 12 | 13 | +1 |

Total ESLint disabled count

| id | before | after | diff |
| --- | --- | --- | --- |
| dataViews | 14 | 15 | +1 |

History

cc @davismcphee

@lukasolson left a comment

Latest changes LGTM!

@davismcphee merged commit 96fd4b6 into elastic:main Nov 20, 2024
25 checks passed
@davismcphee deleted the fix-has-es-data-hanging branch November 20, 2024 18:52
@kibanamachine

Starting backport for target branches: 8.15, 8.16, 8.x

https://github.com/elastic/kibana/actions/runs/11939870117

kibanamachine pushed a commit to kibanamachine/kibana that referenced this pull request Nov 20, 2024
…a to hang (elastic#200476)

## Summary

This PR mitigates an issue where the `has_es_data` check can hang when
some remote clusters are unresponsive, leaving users stuck in a loading
state in some apps (e.g. Discover and Dashboard) until the request times
out. There are two main changes that help mitigate this issue:
- The `resolve/cluster` request in the `has_es_data` endpoint has been
split into two requests -- one for local data first, then another for
remote data second. In cases where remote clusters are unresponsive but
there is data available in the local cluster, the remote check is never
performed and the check completes quickly. This likely resolves the
majority of cases and is also likely faster in general than checking
both local and remote clusters in a single request.
- In cases where there is no local data and the remote `resolve/cluster`
request hangs, a new `data_views.hasEsDataTimeout` config has been added
to `kibana.yml` (defaults to 5 seconds) to abort the request after a
short delay. This scenario is handled in the front end by displaying an
error toast to the user informing them of the issue, and assuming there
is data available to avoid blocking them. When this occurs, a warning is
also logged to the Kibana server logs.

![CleanShot 2024-11-18 at 23 47
34@2x](https://github.com/user-attachments/assets/6ea14869-b6b6-4d89-a90c-8150d6e6b043)

Fixes elastic#200280.

### Notes
- Modifying the existing version of the `has_es_data` endpoint in this
way should be backward compatible since the behaviour should remain
unchanged from before when the client and server versions don't match
(please validate if this seems accurate during review).
- For a long term fix, the ES team is investigating the issue with
`resolve/cluster` and will aim to have it behave like `resolve/index`,
which fails quickly when remote clusters are unresponsive. They may also
implement other mitigations like a configurable timeout in ES:
elastic/elasticsearch#114020. The purpose of
this PR is to provide an immediate solution in Kibana that mitigates the
issue as much as possible.
- If ES ends up providing another performant method for checking if
indices exist instead of `resolve/cluster`, Kibana should migrate to
that. More details in
elastic/elasticsearch#112307.

### Testing notes

To reproduce the issue locally, follow these steps:
- Follow [these
instructions](https://gist.github.com/lukasolson/d0861aa3e6ee476ac8dd7189ed476756)
to set up a local CCS environment.
- Stop the remote cluster process.
- Use Netcat on the remote cluster port to listen to requests but not
respond (e.g. on macOS: `nc -l 9600`), simulating an unresponsive
cluster. See elastic/elasticsearch#32678 for
more context.
- Navigate to Discover and observe that the `has_es_data` request hangs.
When testing in this PR branch, the request will only wait for 5 seconds
before assuming data exists and displaying a toast.

### Checklist

- [x] Any text added follows [EUI's writing
guidelines](https://elastic.github.io/eui/#/guidelines/writing), uses
sentence case text and includes [i18n
support](https://github.com/elastic/kibana/blob/main/packages/kbn-i18n/README.md)
- [ ]
[Documentation](https://www.elastic.co/guide/en/kibana/master/development-documentation.html)
was added for features that require explanation or tutorials
- [x] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios
- [ ] If a plugin configuration key changed, check if it needs to be
allowlisted in the cloud and added to the [docker
list](https://github.com/elastic/kibana/blob/main/src/dev/build/tasks/os_packages/docker_generator/resources/base/bin/kibana-docker)
- [x] This was checked for breaking HTTP API changes, and any breaking
changes have been approved by the breaking-change committee. The
`release_note:breaking` label should be applied in these situations.
- [ ] [Flaky Test
Runner](https://ci-stats.kibana.dev/trigger_flaky_test_runner/1) was
used on any tests changed
- [x] The PR description includes the appropriate Release Notes section,
and the correct `release_note:*` label is applied per the
[guidelines](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)

---------

Co-authored-by: kibanamachine <[email protected]>
(cherry picked from commit 96fd4b6)
kibanamachine pushed a commit to kibanamachine/kibana that referenced this pull request Nov 20, 2024
…a to hang (elastic#200476)

kibanamachine pushed a commit to kibanamachine/kibana that referenced this pull request Nov 20, 2024
…a to hang (elastic#200476)

@kibanamachine

💚 All backports created successfully

Status Branch Result
8.15
8.16
8.x

Note: Successful backport PRs will be merged automatically after passing CI.

Questions ?

Please refer to the Backport tool documentation

kibanamachine added a commit that referenced this pull request Nov 20, 2024
… can cause Kibana to hang (#200476) (#201025)

# Backport

This will backport the following commits from `main` to `8.x`:
- [[Data Views] Mitigate issue where `has_es_data` check can
cause Kibana to hang
(#200476)](#200476)

<!--- Backport version: 9.4.3 -->

### Questions ?
Please refer to the [Backport tool
documentation](https://github.com/sqren/backport)

<!--BACKPORT [{"author":{"name":"Davis
McPhee","email":"[email protected]"},"sourceCommit":{"committedDate":"2024-11-20T18:52:47Z","message":"[Data
Views] Mitigate issue where `has_es_data` check can cause Kibana to hang
(#200476)\n\n## Summary\r\n\r\nThis PR mitigates an issue where the
`has_es_data` check can hang when\r\nsome remote clusters are
unresponsive, leaving users stuck in a loading\r\nstate in some apps
(e.g. Discover and Dashboard) until the request times\r\nout. There are
two main changes that help mitigate this issue:\r\n- The
`resolve/cluster` request in the `has_es_data` endpoint has
been\r\nsplit into two requests -- one for local data first, then
another for\r\nremote data second. In cases where remote clusters are
unresponsive but\r\nthere is data available in the local cluster, the
remote check is never\r\nperformed and the check completes quickly. This
likely resolves the\r\nmajority of cases and is also likely faster in
general than checking\r\nboth local and remote clusters in a single
request.\r\n- In cases where there is no local data and the remote
`resolve/cluster`\r\nrequest hangs, a new `data_views.hasEsDataTimeout`
config has been added\r\nto `kibana.yml` (defaults to 5 seconds) to
abort the request after a\r\nshort delay. This scenario is handled in
the front end by displaying an\r\nerror toast to the user informing them
of the issue, and assuming there\r\nis data available to avoid blocking
them. When this occurs, a warning is\r\nalso logged to the Kibana server
logs.\r\n\r\n![CleanShot 2024-11-18 at 23
47\r\n34@2x](https://github.com/user-attachments/assets/6ea14869-b6b6-4d89-a90c-8150d6e6b043)\r\n\r\nFixes
#200280.\r\n\r\n### Notes\r\n- Modifying the existing version of the
`has_es_data` endpoint in this\r\nway should be backward compatible
since the behaviour should remain\r\nunchanged from before when the
client and server versions don't match\r\n(please validate if this seems
accurate during review).\r\n- For a long term fix, the ES team is
investigating the issue with\r\n`resolve/cluster` and will aim to have
it behave like `resolve/index`,\r\nwhich fails quickly when remote
clusters are unresponsive. They may also\r\nimplement other mitigations
like a configurable timeout in
ES:\r\nhttps://github.com/elastic/elasticsearch/issues/114020. The
purpose of\r\nthis PR is to provide an immediate solution in Kibana that
mitigates the\r\nissue as much as possible.\r\n- If ES ends up providing
another performant method for checking if\r\nindices exist instead of
`resolve/cluster`, Kibana should migrate to\r\nthat. More details
in\r\nhttps://github.com/elastic/elasticsearch/issues/112307.\r\n\r\n###
Testing notes\r\n\r\nTo reproduce the issue locally, follow these
steps:\r\n- Follow
[these\r\ninstructions](https://gist.github.com/lukasolson/d0861aa3e6ee476ac8dd7189ed476756)\r\nto
set up a local CCS environment.\r\n- Stop the remote cluster
process.\r\n- Use Netcat on the remote cluster port to listen to
requests but not\r\nrespond (e.g. on macOS: `nc -l 9600`), simulating an
unresponsive\r\ncluster. See
elastic/elasticsearch#32678 for\r\nmore
context.\r\n- Navigate to Discover and observe that the `has_es_data`
request hangs.\r\nWhen testing in this PR branch, the request will only
wait for 5 seconds\r\nbefore assuming data exists and displaying a
toast.\r\n\r\n### Checklist\r\n\r\n- [x] Any text added follows [EUI's
writing\r\nguidelines](https://elastic.github.io/eui/#/guidelines/writing),
uses\r\nsentence case text and includes
[i18n\r\nsupport](https://github.com/elastic/kibana/blob/main/packages/kbn-i18n/README.md)\r\n-
[
]\r\n[Documentation](https://www.elastic.co/guide/en/kibana/master/development-documentation.html)\r\nwas
added for features that require explanation or tutorials\r\n- [x] [Unit
or
functional\r\ntests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)\r\nwere
updated or added to match the most common scenarios\r\n- [ ] If a plugin
configuration key changed, check if it needs to be\r\nallowlisted in the
cloud and added to the
[docker\r\nlist](https://github.com/elastic/kibana/blob/main/src/dev/build/tasks/os_packages/docker_generator/resources/base/bin/kibana-docker)\r\n-
[x] This was checked for breaking HTTP API changes, and any
breaking\r\nchanges have been approved by the breaking-change committee.
The\r\n`release_note:breaking` label should be applied in these
situations.\r\n- [ ] [Flaky
Test\r\nRunner](https://ci-stats.kibana.dev/trigger_flaky_test_runner/1)
was\r\nused on any tests changed\r\n- [x] The PR description includes
the appropriate Release Notes section,\r\nand the correct
`release_node:*` label is applied per
the\r\n[guidelines](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)\r\n\r\n---------\r\n\r\nCo-authored-by:
kibanamachine
<[email protected]>","sha":"96fd4b682b77f6c1d6d1c6ab0742462d9e9d2589","branchLabelMapping":{"^v9.0.0$":"main","^v8.17.0$":"8.x","^v(\\d+).(\\d+).\\d+$":"$1.$2"}},"sourcePullRequest":{"labels":["release_note:fix","v9.0.0","Team:DataDiscovery","backport:prev-major"],"title":"[Data
Views] Mitigate issue where `has_es_data` check can cause Kibana to
hang","number":200476,"url":"https://github.com/elastic/kibana/pull/200476","mergeCommit":{"message":"[Data
Views] Mitigate issue where `has_es_data` check can cause Kibana to hang
(#200476)\n\n## Summary\r\n\r\nThis PR mitigates an issue where the
`has_es_data` check can hang when\r\nsome remote clusters are
unresponsive, leaving users stuck in a loading\r\nstate in some apps
(e.g. Discover and Dashboard) until the request times\r\nout. There are
two main changes that help mitigate this issue:\r\n- The
`resolve/cluster` request in the `has_es_data` endpoint has
been\r\nsplit into two requests -- one for local data first, then
another for\r\nremote data second. In cases where remote clusters are
unresponsive but\r\nthere is data available in the local cluster, the
remote check is never\r\nperformed and the check completes quickly. This
likely resolves the\r\nmajority of cases and is also likely faster in
general than checking\r\nboth local and remote clusters in a single
request.\r\n- In cases where there is no local data and the remote
`resolve/cluster`\r\nrequest hangs, a new `data_views.hasEsDataTimeout`
config has been added\r\nto `kibana.yml` (defaults to 5 seconds) to
abort the request after a\r\nshort delay. This scenario is handled in
the front end by displaying an\r\nerror toast to the user informing them
of the issue, and assuming there\r\nis data available to avoid blocking
them. When this occurs, a warning is\r\nalso logged to the Kibana server
logs.\r\n\r\n![CleanShot 2024-11-18 at 23
47\r\n34@2x](https://github.com/user-attachments/assets/6ea14869-b6b6-4d89-a90c-8150d6e6b043)\r\n\r\nFixes
#200280.\r\n\r\n### Notes\r\n- Modifying the existing version of the
`has_es_data` endpoint in this\r\nway should be backward compatible
since the behaviour should remain\r\nunchanged from before when the
client and server versions don't match\r\n(please validate if this seems
accurate during review).\r\n- For a long term fix, the ES team is
investigating the issue with\r\n`resolve/cluster` and will aim to have
it behave like `resolve/index`,\r\nwhich fails quickly when remote
clusters are unresponsive. They may also\r\nimplement other mitigations
like a configurable timeout in
ES:\r\nhttps://github.com/elastic/elasticsearch/issues/114020. The
purpose of\r\nthis PR is to provide an immediate solution in Kibana that
mitigates the\r\nissue as much as possible.\r\n- If ES ends up providing
another performant method for checking if\r\nindices exist instead of
`resolve/cluster`, Kibana should migrate to\r\nthat. More details
in\r\nhttps://github.com/elastic/elasticsearch/issues/112307.\r\n\r\n###
Testing notes\r\n\r\nTo reproduce the issue locally, follow these
steps:\r\n- Follow
[these\r\ninstructions](https://gist.github.com/lukasolson/d0861aa3e6ee476ac8dd7189ed476756)\r\nto
set up a local CCS environment.\r\n- Stop the remote cluster
process.\r\n- Use Netcat on the remote cluster port to listen to
requests but not\r\nrespond (e.g. on macOS: `nc -l 9600`), simulating an
unresponsive\r\ncluster. See
elastic/elasticsearch#32678 for\r\nmore
context.\r\n- Navigate to Discover and observe that the `has_es_data`
request hangs.\r\nWhen testing in this PR branch, the request will only
wait for 5 seconds\r\nbefore assuming data exists and displaying a
toast.\r\n\r\n### Checklist\r\n\r\n- [x] Any text added follows [EUI's
writing\r\nguidelines](https://elastic.github.io/eui/#/guidelines/writing),
uses\r\nsentence case text and includes
[i18n\r\nsupport](https://github.com/elastic/kibana/blob/main/packages/kbn-i18n/README.md)\r\n-
[
]\r\n[Documentation](https://www.elastic.co/guide/en/kibana/master/development-documentation.html)\r\nwas
added for features that require explanation or tutorials\r\n- [x] [Unit
or
functional\r\ntests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)\r\nwere
updated or added to match the most common scenarios\r\n- [ ] If a plugin
configuration key changed, check if it needs to be\r\nallowlisted in the
cloud and added to the
[docker\r\nlist](https://github.com/elastic/kibana/blob/main/src/dev/build/tasks/os_packages/docker_generator/resources/base/bin/kibana-docker)\r\n-
[x] This was checked for breaking HTTP API changes, and any
breaking\r\nchanges have been approved by the breaking-change committee.
The\r\n`release_note:breaking` label should be applied in these
situations.\r\n- [ ] [Flaky
Test\r\nRunner](https://ci-stats.kibana.dev/trigger_flaky_test_runner/1)
was\r\nused on any tests changed\r\n- [x] The PR description includes
the appropriate Release Notes section,\r\nand the correct
`release_node:*` label is applied per
the\r\n[guidelines](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)\r\n\r\n---------\r\n\r\nCo-authored-by:
kibanamachine
<[email protected]>","sha":"96fd4b682b77f6c1d6d1c6ab0742462d9e9d2589"}},"sourceBranch":"main","suggestedTargetBranches":[],"targetPullRequestStates":[{"branch":"main","label":"v9.0.0","branchLabelMappingKey":"^v9.0.0$","isSourceBranch":true,"state":"MERGED","url":"https://github.com/elastic/kibana/pull/200476","number":200476,"mergeCommit":{"message":"[Data
Views] Mitigate issue where `has_es_data` check can cause Kibana to hang
(#200476)\n\n## Summary\r\n\r\nThis PR mitigates an issue where the
`has_es_data` check can hang when\r\nsome remote clusters are
unresponsive, leaving users stuck in a loading\r\nstate in some apps
(e.g. Discover and Dashboard) until the request times\r\nout. There are
two main changes that help mitigate this issue:\r\n- The
`resolve/cluster` request in the `has_es_data` endpoint has
been\r\nsplit into two requests -- one for local data first, then
another for\r\nremote data second. In cases where remote clusters are
unresponsive but\r\nthere is data available in the local cluster, the
remote check is never\r\nperformed and the check completes quickly. This
likely resolves the\r\nmajority of cases and is also likely faster in
general than checking\r\nboth local and remote clusters in a single
request.\r\n- In cases where there is no local data and the remote
`resolve/cluster`\r\nrequest hangs, a new `data_views.hasEsDataTimeout`
config has been added\r\nto `kibana.yml` (defaults to 5 seconds) to
abort the request after a\r\nshort delay. This scenario is handled in
the front end by displaying an\r\nerror toast to the user informing them
of the issue, and assuming there\r\nis data available to avoid blocking
them. When this occurs, a warning is\r\nalso logged to the Kibana server
logs.\r\n\r\n![CleanShot 2024-11-18 at 23
47\r\n34@2x](https://github.com/user-attachments/assets/6ea14869-b6b6-4d89-a90c-8150d6e6b043)\r\n\r\nFixes
#200280.\r\n\r\n### Notes\r\n- Modifying the existing version of the
`has_es_data` endpoint in this\r\nway should be backward compatible
since the behaviour should remain\r\nunchanged from before when the
client and server versions don't match\r\n(please validate if this seems
accurate during review).\r\n- For a long term fix, the ES team is
investigating the issue with\r\n`resolve/cluster` and will aim to have
it behave like `resolve/index`,\r\nwhich fails quickly when remote
clusters are unresponsive. They may also\r\nimplement other mitigations
like a configurable timeout in
ES:\r\nhttps://github.com/elastic/elasticsearch/issues/114020. The
purpose of\r\nthis PR is to provide an immediate solution in Kibana that
mitigates the\r\nissue as much as possible.\r\n- If ES ends up providing
another performant method for checking if\r\nindices exist instead of
`resolve/cluster`, Kibana should migrate to\r\nthat. More details
in\r\nhttps://github.com/elastic/elasticsearch/issues/112307.\r\n\r\n###
Testing notes\r\n\r\nTo reproduce the issue locally, follow these
steps:\r\n- Follow
[these\r\ninstructions](https://gist.github.com/lukasolson/d0861aa3e6ee476ac8dd7189ed476756)\r\nto
set up a local CCS environment.\r\n- Stop the remote cluster
process.\r\n- Use Netcat on the remote cluster port to listen to
requests but not\r\nrespond (e.g. on macOS: `nc -l 9600`), simulating an
unresponsive\r\ncluster. See
elastic/elasticsearch#32678 for\r\nmore
context.\r\n- Navigate to Discover and observe that the `has_es_data`
request hangs.\r\nWhen testing in this PR branch, the request will only
wait for 5 seconds\r\nbefore assuming data exists and displaying a
toast.\r\n\r\n### Checklist\r\n\r\n- [x] Any text added follows [EUI's
writing\r\nguidelines](https://elastic.github.io/eui/#/guidelines/writing),
uses\r\nsentence case text and includes
[i18n\r\nsupport](https://github.com/elastic/kibana/blob/main/packages/kbn-i18n/README.md)\r\n-
[
]\r\n[Documentation](https://www.elastic.co/guide/en/kibana/master/development-documentation.html)\r\nwas
added for features that require explanation or tutorials\r\n- [x] [Unit
or
functional\r\ntests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)\r\nwere
updated or added to match the most common scenarios\r\n- [ ] If a plugin
configuration key changed, check if it needs to be\r\nallowlisted in the
cloud and added to the
[docker\r\nlist](https://github.com/elastic/kibana/blob/main/src/dev/build/tasks/os_packages/docker_generator/resources/base/bin/kibana-docker)\r\n-
[x] This was checked for breaking HTTP API changes, and any
breaking\r\nchanges have been approved by the breaking-change committee.
The\r\n`release_note:breaking` label should be applied in these
situations.\r\n- [ ] [Flaky
Test\r\nRunner](https://ci-stats.kibana.dev/trigger_flaky_test_runner/1)
was\r\nused on any tests changed\r\n- [x] The PR description includes
the appropriate Release Notes section,\r\nand the correct
`release_node:*` label is applied per
the\r\n[guidelines](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)\r\n\r\n---------\r\n\r\nCo-authored-by:
kibanamachine
<[email protected]>","sha":"96fd4b682b77f6c1d6d1c6ab0742462d9e9d2589"}}]}]
BACKPORT-->

Co-authored-by: Davis McPhee <[email protected]>
kibanamachine added a commit that referenced this pull request Nov 20, 2024
…k can cause Kibana to hang (#200476) (#201024)

# Backport

This will backport the following commits from `main` to `8.16`:
- [[Data Views] Mitigate issue where `has_es_data` check can
cause Kibana to hang
(#200476)](#200476)

<!--- Backport version: 9.4.3 -->

### Questions?
Please refer to the [Backport tool
documentation](https://github.com/sqren/backport).


Co-authored-by: Davis McPhee <[email protected]>
kibanamachine added a commit that referenced this pull request Nov 20, 2024
…k can cause Kibana to hang (#200476) (#201023)

# Backport

This will backport the following commits from `main` to `8.15`:
- [[Data Views] Mitigate issue where `has_es_data` check can
cause Kibana to hang
(#200476)](#200476)

<!--- Backport version: 9.4.3 -->

### Questions?
Please refer to the [Backport tool
documentation](https://github.com/sqren/backport).

<!--BACKPORT [{"author":{"name":"Davis
McPhee","email":"[email protected]"},"sourceCommit":{"committedDate":"2024-11-20T18:52:47Z","message":"[Data
Views] Mitigate issue where `has_es_data` check can cause Kibana to hang
(#200476)\n\n## Summary\r\n\r\nThis PR mitigates an issue where the
`has_es_data` check can hang when\r\nsome remote clusters are
unresponsive, leaving users stuck in a loading\r\nstate in some apps
(e.g. Discover and Dashboard) until the request times\r\nout. There are
two main changes that help mitigate this issue:\r\n- The
`resolve/cluster` request in the `has_es_data` endpoint has
been\r\nsplit into two requests -- one for local data first, then
another for\r\nremote data second. In cases where remote clusters are
unresponsive but\r\nthere is data available in the local cluster, the
remote check is never\r\nperformed and the check completes quickly. This
likely resolves the\r\nmajority of cases and is also likely faster in
general than checking\r\nboth local and remote clusters in a single
request.\r\n- In cases where there is no local data and the remote
`resolve/cluster`\r\nrequest hangs, a new `data_views.hasEsDataTimeout`
config has been added\r\nto `kibana.yml` (defaults to 5 seconds) to
abort the request after a\r\nshort delay. This scenario is handled in
the front end by displaying an\r\nerror toast to the user informing them
of the issue, and assuming there\r\nis data available to avoid blocking
them. When this occurs, a warning is\r\nalso logged to the Kibana server
logs.\r\n\r\n![CleanShot 2024-11-18 at 23
47\r\n34@2x](https://github.com/user-attachments/assets/6ea14869-b6b6-4d89-a90c-8150d6e6b043)\r\n\r\nFixes
#200280.\r\n\r\n### Notes\r\n- Modifying the existing version of the
`has_es_data` endpoint in this\r\nway should be backward compatible
since the behaviour should remain\r\nunchanged from before when the
client and server versions don't match\r\n(please validate if this seems
accurate during review).\r\n- For a long term fix, the ES team is
investigating the issue with\r\n`resolve/cluster` and will aim to have
it behave like `resolve/index`,\r\nwhich fails quickly when remote
clusters are unresponsive. They may also\r\nimplement other mitigations
like a configurable timeout in
ES:\r\nhttps://github.com/elastic/elasticsearch/issues/114020. The
purpose of\r\nthis PR is to provide an immediate solution in Kibana that
mitigates the\r\nissue as much as possible.\r\n- If ES ends up providing
another performant method for checking if\r\nindices exist instead of
`resolve/cluster`, Kibana should migrate to\r\nthat. More details
in\r\nhttps://github.com/elastic/elasticsearch/issues/112307.\r\n\r\n###
Testing notes\r\n\r\nTo reproduce the issue locally, follow these
steps:\r\n- Follow
[these\r\ninstructions](https://gist.github.com/lukasolson/d0861aa3e6ee476ac8dd7189ed476756)\r\nto
set up a local CCS environment.\r\n- Stop the remote cluster
process.\r\n- Use Netcat on the remote cluster port to listen to
requests but not\r\nrespond (e.g. on macOS: `nc -l 9600`), simulating an
unresponsive\r\ncluster. See
elastic/elasticsearch#32678 for\r\nmore
context.\r\n- Navigate to Discover and observe that the `has_es_data`
request hangs.\r\nWhen testing in this PR branch, the request will only
wait for 5 seconds\r\nbefore assuming data exists and displaying a
toast.\r\n\r\n### Checklist\r\n\r\n- [x] Any text added follows [EUI's
writing\r\nguidelines](https://elastic.github.io/eui/#/guidelines/writing),
BACKPORT-->

---------

Co-authored-by: Davis McPhee <[email protected]>
@mistic
Member

mistic commented Nov 21, 2024

This PR didn't make it into the latest BC of v8.16.1. Updating the labels.

@mistic mistic added v8.16.2 and removed v8.16.1 labels Nov 21, 2024
TattdCodeMonkey pushed a commit to TattdCodeMonkey/kibana that referenced this pull request Nov 21, 2024
…a to hang (elastic#200476)

## Summary

This PR mitigates an issue where the `has_es_data` check can hang when
some remote clusters are unresponsive, leaving users stuck in a loading
state in some apps (e.g. Discover and Dashboard) until the request times
out. There are two main changes that help mitigate this issue:
- The `resolve/cluster` request in the `has_es_data` endpoint has been
split into two requests -- one for local data first, then another for
remote data. In cases where remote clusters are unresponsive but
there is data available in the local cluster, the remote check is never
performed and the check completes quickly. This likely resolves the
majority of cases and is also likely faster in general than checking
both local and remote clusters in a single request.
- In cases where there is no local data and the remote `resolve/cluster`
request hangs, a new `data_views.hasEsDataTimeout` config has been added
to `kibana.yml` (defaults to 5 seconds) to abort the request after a
short delay. This scenario is handled in the front end by displaying an
error toast to the user informing them of the issue, and assuming there
is data available to avoid blocking them. When this occurs, a warning is
also logged to the Kibana server logs.
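
The two-step flow described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `checkHasEsData`, `LocalCheck`, `RemoteCheck`, and `HasEsDataResult` are hypothetical names, and the two check functions stand in for the local and remote `resolve/cluster` requests.

```typescript
// Illustrative sketch only. All names here are hypothetical and are not
// taken from the Kibana source.
type LocalCheck = () => Promise<boolean>;
type RemoteCheck = (signal: AbortSignal) => Promise<boolean>;

interface HasEsDataResult {
  hasData: boolean;
  timedOut: boolean;
}

const TIMED_OUT = 'timed_out' as const;

async function checkHasEsData(
  checkLocal: LocalCheck,
  checkRemote: RemoteCheck,
  timeoutMs: number
): Promise<HasEsDataResult> {
  // Step 1: local cluster only. If it has data, remote clusters are never contacted.
  if (await checkLocal()) {
    return { hasData: true, timedOut: false };
  }

  // Step 2: remote clusters, bounded by a timer.
  const controller = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<typeof TIMED_OUT>((resolve) => {
    timer = setTimeout(() => {
      resolve(TIMED_OUT); // settle the race first
      controller.abort(); // then cancel the in-flight remote request
    }, timeoutMs);
  });

  try {
    const result = await Promise.race([checkRemote(controller.signal), timeout]);
    if (result === TIMED_OUT) {
      // The caller is expected to warn the user and assume data exists.
      return { hasData: true, timedOut: true };
    }
    return { hasData: result, timedOut: false };
  } finally {
    clearTimeout(timer);
  }
}
```

In this sketch a caller would pass the configured timeout as `timeoutMs`, and treat `timedOut: true` as the signal to show the toast while proceeding as if data exists.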

![CleanShot 2024-11-18 at 23 47
34@2x](https://github.com/user-attachments/assets/6ea14869-b6b6-4d89-a90c-8150d6e6b043)

Fixes elastic#200280.

### Notes
- Modifying the existing version of the `has_es_data` endpoint in this
way should be backward compatible since the behaviour should remain
unchanged from before when the client and server versions don't match
(please validate if this seems accurate during review).
- For a long-term fix, the ES team is investigating the issue with
`resolve/cluster` and will aim to have it behave like `resolve/index`,
which fails quickly when remote clusters are unresponsive. They may also
implement other mitigations like a configurable timeout in ES:
elastic/elasticsearch#114020. The purpose of
this PR is to provide an immediate solution in Kibana that mitigates the
issue as much as possible.
- If ES ends up providing another performant method for checking if
indices exist instead of `resolve/cluster`, Kibana should migrate to
that. More details in
elastic/elasticsearch#112307.

### Testing notes

To reproduce the issue locally, follow these steps:
- Follow [these
instructions](https://gist.github.com/lukasolson/d0861aa3e6ee476ac8dd7189ed476756)
to set up a local CCS environment.
- Stop the remote cluster process.
- Use Netcat on the remote cluster port to listen to requests but not
respond (e.g. on macOS: `nc -l 9600`), simulating an unresponsive
cluster. See elastic/elasticsearch#32678 for
more context.
- Navigate to Discover and observe that the `has_es_data` request hangs.
When testing in this PR branch, the request will only wait for 5 seconds
before assuming data exists and displaying a toast.
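
Where Netcat is not available, a silent listener can be approximated with a few lines of Node. This is a sketch under assumptions: `startSilentListener` is a hypothetical helper, it relies only on Node's built-in `net` module, and it binds to `127.0.0.1`, which may need adjusting for a given CCS setup.

```typescript
import * as net from 'node:net';

// Hypothetical helper: accepts TCP connections and reads whatever arrives,
// but never writes a response, roughly what `nc -l <port>` does when idle.
function startSilentListener(port: number): Promise<net.Server> {
  return new Promise((resolve, reject) => {
    const server = net.createServer((socket) => {
      socket.on('data', () => {
        // Consume incoming bytes and deliberately send nothing back.
      });
      socket.on('error', () => {
        // Ignore client resets so the listener keeps running.
      });
    });
    server.once('error', reject);
    server.listen(port, '127.0.0.1', () => resolve(server));
  });
}
```

For the scenario above, the listener would be started on the remote cluster port, e.g. `startSilentListener(9600)`, after stopping the remote cluster process.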

### Checklist

- [x] Any text added follows [EUI's writing
guidelines](https://elastic.github.io/eui/#/guidelines/writing), uses
sentence case text and includes [i18n
support](https://github.com/elastic/kibana/blob/main/packages/kbn-i18n/README.md)
- [ ]
[Documentation](https://www.elastic.co/guide/en/kibana/master/development-documentation.html)
was added for features that require explanation or tutorials
- [x] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios
- [ ] If a plugin configuration key changed, check if it needs to be
allowlisted in the cloud and added to the [docker
list](https://github.com/elastic/kibana/blob/main/src/dev/build/tasks/os_packages/docker_generator/resources/base/bin/kibana-docker)
- [x] This was checked for breaking HTTP API changes, and any breaking
changes have been approved by the breaking-change committee. The
`release_note:breaking` label should be applied in these situations.
- [ ] [Flaky Test
Runner](https://ci-stats.kibana.dev/trigger_flaky_test_runner/1) was
used on any tests changed
- [x] The PR description includes the appropriate Release Notes section,
and the correct `release_note:*` label is applied per the
[guidelines](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)

---------

Co-authored-by: kibanamachine <[email protected]>
paulinashakirova pushed a commit to paulinashakirova/kibana that referenced this pull request Nov 26, 2024
…a to hang (elastic#200476)

CAWilson94 pushed a commit to CAWilson94/kibana that referenced this pull request Dec 12, 2024
…a to hang (elastic#200476)

Labels
backport:prev-major Backport to (8.x, 8.17, 8.16) the previous major branch and other branches in development release_note:fix Team:DataDiscovery Discover, search (e.g. data plugin and KQL), data views, saved searches. For ES|QL, use Team:ES|QL. v8.15.5 v8.16.2 v8.17.0 v9.0.0
Development

Successfully merging this pull request may close these issues.

[Data Views] has_es_data request hangs when remote clusters are unresponsive
5 participants