Skip to content

Commit

Permalink
[Data Views] Mitigate issue where has_es_data check can cause Kiban…
Browse files Browse the repository at this point in the history
…a to hang (elastic#200476)

## Summary

This PR mitigates an issue where the `has_es_data` check can hang when
some remote clusters are unresponsive, leaving users stuck in a loading
state in some apps (e.g. Discover and Dashboard) until the request times
out. There are two main changes that help mitigate this issue:
- The `resolve/cluster` request in the `has_es_data` endpoint has been
split into two requests -- one for local data first, then another for
remote data second. In cases where remote clusters are unresponsive but
there is data available in the local cluster, the remote check is never
performed and the check completes quickly. This likely resolves the
majority of cases and is also likely faster in general than checking
both local and remote clusters in a single request.
- In cases where there is no local data and the remote `resolve/cluster`
request hangs, a new `data_views.hasEsDataTimeout` config has been added
to `kibana.yml` (defaults to 5 seconds) to abort the request after a
short delay. This scenario is handled in the front end by displaying an
error toast to the user informing them of the issue, and assuming there
is data available to avoid blocking them. When this occurs, a warning is
also logged to the Kibana server logs.

![CleanShot 2024-11-18 at 23 47
34@2x](https://github.com/user-attachments/assets/6ea14869-b6b6-4d89-a90c-8150d6e6b043)

Fixes elastic#200280.

### Notes
- Modifying the existing version of the `has_es_data` endpoint in this
way should be backward compatible since the behaviour should remain
unchanged from before when the client and server versions don't match
(please validate if this seems accurate during review).
- For a long term fix, the ES team is investigating the issue with
`resolve/cluster` and will aim to have it behave like `resolve/index`,
which fails quickly when remote clusters are unresponsive. They may also
implement other mitigations like a configurable timeout in ES:
elastic/elasticsearch#114020. The purpose of
this PR is to provide an immediate solution in Kibana that mitigates the
issue as much as possible.
- If ES ends up providing another performant method for checking if
indices exist instead of `resolve/cluster`, Kibana should migrate to
that. More details in
elastic/elasticsearch#112307.

### Testing notes

To reproduce the issue locally, follow these steps:
- Follow [these
instructions](https://gist.github.com/lukasolson/d0861aa3e6ee476ac8dd7189ed476756)
to set up a local CCS environment.
- Stop the remote cluster process.
- Use Netcat on the remote cluster port to listen to requests but not
respond (e.g. on macOS: `nc -l 9600`), simulating an unresponsive
cluster. See elastic/elasticsearch#32678 for
more context.
- Navigate to Discover and observe that the `has_es_data` request hangs.
When testing in this PR branch, the request will only wait for 5 seconds
before assuming data exists and displaying a toast.

### Checklist

- [x] Any text added follows [EUI's writing
guidelines](https://elastic.github.io/eui/#/guidelines/writing), uses
sentence case text and includes [i18n
support](https://github.com/elastic/kibana/blob/main/packages/kbn-i18n/README.md)
- [ ]
[Documentation](https://www.elastic.co/guide/en/kibana/master/development-documentation.html)
was added for features that require explanation or tutorials
- [x] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios
- [ ] If a plugin configuration key changed, check if it needs to be
allowlisted in the cloud and added to the [docker
list](https://github.com/elastic/kibana/blob/main/src/dev/build/tasks/os_packages/docker_generator/resources/base/bin/kibana-docker)
- [x] This was checked for breaking HTTP API changes, and any breaking
changes have been approved by the breaking-change committee. The
`release_note:breaking` label should be applied in these situations.
- [ ] [Flaky Test
Runner](https://ci-stats.kibana.dev/trigger_flaky_test_runner/1) was
used on any tests changed
- [x] The PR description includes the appropriate Release Notes section,
and the correct `release_node:*` label is applied per the
[guidelines](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)

---------

Co-authored-by: kibanamachine <[email protected]>
(cherry picked from commit 96fd4b6)
  • Loading branch information
davismcphee committed Nov 20, 2024
1 parent 63f1de7 commit eff8c2c
Show file tree
Hide file tree
Showing 11 changed files with 582 additions and 29 deletions.
9 changes: 9 additions & 0 deletions src/plugins/data_views/common/constants.ts
Original file line number Diff line number Diff line change
Expand Up @@ -79,3 +79,12 @@ export const EXISTING_INDICES_PATH = '/internal/data_views/_existing_indices';
export const DATA_VIEWS_FIELDS_EXCLUDED_TIERS = 'data_views:fields_excluded_data_tiers';

export const DEFAULT_DATA_VIEW_ID = 'defaultIndex';

/**
* Valid `failureReason` attribute values for `has_es_data` API error responses
*/
export enum HasEsDataFailureReason {
localDataTimeout = 'local_data_timeout',
remoteDataTimeout = 'remote_data_timeout',
unknown = 'unknown',
}
1 change: 1 addition & 0 deletions src/plugins/data_views/common/index.ts
Original file line number Diff line number Diff line change
Expand Up @@ -13,6 +13,7 @@ export {
META_FIELDS,
DATA_VIEW_SAVED_OBJECT_TYPE,
MAX_DATA_VIEW_FIELD_DESCRIPTION_LENGTH,
HasEsDataFailureReason,
} from './constants';

export { LATEST_VERSION } from './content_management/v1/constants';
Expand Down
1 change: 1 addition & 0 deletions src/plugins/data_views/common/types.ts
Original file line number Diff line number Diff line change
Expand Up @@ -571,4 +571,5 @@ export interface ClientConfigType {
scriptedFieldsEnabled?: boolean;
dataTiersExcludedForFields?: string;
fieldListCachingEnabled?: boolean;
hasEsDataTimeout: number;
}
73 changes: 73 additions & 0 deletions src/plugins/data_views/public/services/has_data.test.ts
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,7 @@
import { coreMock } from '@kbn/core/public/mocks';

import { HasData } from './has_data';
import { HttpFetchError } from '@kbn/core-http-browser-internal/src/http_fetch_error';

describe('when calling hasData service', () => {
describe('hasDataView', () => {
Expand Down Expand Up @@ -170,6 +171,78 @@ describe('when calling hasData service', () => {

expect(await response).toBe(false);
});

it('should return true and show an error toast when checking for remote cluster data times out', async () => {
const coreStart = coreMock.createStart();
const http = coreStart.http;

// Mock getIndices
const spy = jest.spyOn(http, 'get').mockImplementation(() =>
Promise.reject(
new HttpFetchError(
'Timeout while checking for Elasticsearch data',
'TimeoutError',
new Request(''),
undefined,
{
statusCode: 504,
message: 'Timeout while checking for Elasticsearch data',
attributes: {
failureReason: 'remote_data_timeout',
},
}
)
)
);
const hasData = new HasData();
const hasDataService = hasData.start(coreStart, true);
const response = hasDataService.hasESData();

expect(spy).toHaveBeenCalledTimes(1);
expect(await response).toBe(true);
expect(coreStart.notifications.toasts.addDanger).toHaveBeenCalledTimes(1);
expect(coreStart.notifications.toasts.addDanger).toHaveBeenCalledWith({
title: 'Remote cluster timeout',
text: 'Checking for data on remote clusters timed out. One or more remote clusters may be unavailable.',
});
});

it('should return true and not show an error toast when checking for remote cluster data times out, but onRemoteDataTimeout is overridden', async () => {
const coreStart = coreMock.createStart();
const http = coreStart.http;

// Mock getIndices
const responseBody = {
statusCode: 504,
message: 'Timeout while checking for Elasticsearch data',
attributes: {
failureReason: 'remote_data_timeout',
},
};
const spy = jest
.spyOn(http, 'get')
.mockImplementation(() =>
Promise.reject(
new HttpFetchError(
'Timeout while checking for Elasticsearch data',
'TimeoutError',
new Request(''),
undefined,
responseBody
)
)
);
const hasData = new HasData();
const hasDataService = hasData.start(coreStart, true);
const onRemoteDataTimeout = jest.fn();
const response = hasDataService.hasESData({ onRemoteDataTimeout });

expect(spy).toHaveBeenCalledTimes(1);
expect(await response).toBe(true);
expect(coreStart.notifications.toasts.addDanger).not.toHaveBeenCalled();
expect(onRemoteDataTimeout).toHaveBeenCalledTimes(1);
expect(onRemoteDataTimeout).toHaveBeenCalledWith(responseBody);
});
});

describe('resolve/cluster not available', () => {
Expand Down
56 changes: 49 additions & 7 deletions src/plugins/data_views/public/services/has_data.ts
Original file line number Diff line number Diff line change
Expand Up @@ -8,10 +8,22 @@
*/

import { CoreStart, HttpStart } from '@kbn/core/public';
import { DEFAULT_ASSETS_TO_IGNORE } from '../../common';
import { IHttpFetchError, ResponseErrorBody, isHttpFetchError } from '@kbn/core-http-browser';
import { isObject } from 'lodash';
import { i18n } from '@kbn/i18n';
import { DEFAULT_ASSETS_TO_IGNORE, HasEsDataFailureReason } from '../../common';
import { HasDataViewsResponse, IndicesViaSearchResponse } from '..';
import { IndicesResponse, IndicesResponseModified } from '../types';

export interface HasEsDataParams {
/**
* Callback to handle the case where checking for remote data times out.
* If not provided, the default behavior is to show a toast notification.
* @param body The error response body
*/
onRemoteDataTimeout?: (body: ResponseErrorBody) => void;
}

export class HasData {
private removeAliases = (source: IndicesResponseModified): boolean => !source.item.indices;

Expand All @@ -38,28 +50,55 @@ export class HasData {
return hasLocalESData;
};

const hasESDataViaResolveCluster = async () => {
const hasESDataViaResolveCluster = async (
onRemoteDataTimeout: (body: ResponseErrorBody) => void
) => {
try {
const { hasEsData } = await http.get<{ hasEsData: boolean }>(
'/internal/data_views/has_es_data',
{
version: '1',
}
{ version: '1' }
);

return hasEsData;
} catch (e) {
if (
this.isResponseError(e) &&
e.body?.statusCode === 504 &&
e.body?.attributes?.failureReason === HasEsDataFailureReason.remoteDataTimeout
) {
onRemoteDataTimeout(e.body);

// In the case of a remote cluster timeout,
// we can't be sure if there is data or not,
// so just assume there is
return true;
}

// fallback to previous implementation
return hasESDataViaResolveIndex();
}
};

const showRemoteDataTimeoutToast = () =>
core.notifications.toasts.addDanger({
title: i18n.translate('dataViews.hasData.remoteDataTimeoutTitle', {
defaultMessage: 'Remote cluster timeout',
}),
text: i18n.translate('dataViews.hasData.remoteDataTimeoutText', {
defaultMessage:
'Checking for data on remote clusters timed out. One or more remote clusters may be unavailable.',
}),
});

return {
/**
* Check to see if ES data exists
*/
hasESData: async (): Promise<boolean> => {
hasESData: async ({
onRemoteDataTimeout = showRemoteDataTimeoutToast,
}: HasEsDataParams = {}): Promise<boolean> => {
if (callResolveCluster) {
return hasESDataViaResolveCluster();
return hasESDataViaResolveCluster(onRemoteDataTimeout);
}
return hasESDataViaResolveIndex();
},
Expand All @@ -82,6 +121,9 @@ export class HasData {

// ES Data

private isResponseError = (e: Error): e is IHttpFetchError<ResponseErrorBody> =>
isHttpFetchError(e) && isObject(e.body) && 'message' in e.body && 'statusCode' in e.body;

private responseToItemArray = (response: IndicesResponse): IndicesResponseModified[] => {
const { indices = [], aliases = [] } = response;
const source: IndicesResponseModified[] = [];
Expand Down
2 changes: 1 addition & 1 deletion src/plugins/data_views/server/index.ts
Original file line number Diff line number Diff line change
Expand Up @@ -47,7 +47,6 @@ const configSchema = schema.object({
schema.boolean({ defaultValue: false }),
schema.never()
),

dataTiersExcludedForFields: schema.conditional(
schema.contextRef('serverless'),
true,
Expand All @@ -60,6 +59,7 @@ const configSchema = schema.object({
schema.boolean({ defaultValue: false }),
schema.boolean({ defaultValue: true })
),
hasEsDataTimeout: schema.number({ defaultValue: 5000 }),
});

type ConfigType = TypeOf<typeof configSchema>;
Expand Down
2 changes: 2 additions & 0 deletions src/plugins/data_views/server/plugin.ts
Original file line number Diff line number Diff line change
Expand Up @@ -63,9 +63,11 @@ export class DataViewsServerPlugin

registerRoutes({
http: core.http,
logger: this.logger,
getStartServices: core.getStartServices,
isRollupsEnabled: () => this.rollupsEnabled,
dataViewRestCounter,
hasEsDataTimeout: config.hasEsDataTimeout,
});

expressions.registerFunction(getIndexPatternLoad({ getStartServices: core.getStartServices }));
Expand Down
Loading

0 comments on commit eff8c2c

Please sign in to comment.