
[Response Ops][Alerting] Investigate ES query rule firing unexpectedly #175980

Closed
ymao1 opened this issue Jan 31, 2024 · 16 comments · Fixed by #186332
Labels: Feature:Alerting, research, Team:ResponseOps

ymao1 commented Jan 31, 2024

We have gotten several reports from users of receiving alert notifications from the ES query rule where they were unable to trace back to the underlying documents that may have generated the alert. We need to investigate how and why that may be happening.

There seem to be several commonalities between the rule definitions that spawn the zombie alerts:

  • They are KQL queries (using the search source client to query)
  • They are querying over CCS indices (there is a theory that an alert might fire if the CCS connection is lost/broken)
  • They have `excludeMatchesFromPreviousRuns` set to true.
@ymao1 added the Feature:Alerting, Team:ResponseOps, and research labels on Jan 31, 2024
@elasticmachine (Contributor)

Pinging @elastic/response-ops (Team:ResponseOps)


pmuellr commented Feb 1, 2024

I think we should start with a code review. One possibility is that, for some reason, the date we start the search from is set further back than we want (or not set at all), so we find older documents. I already did this once and didn't see any obvious way this could happen, but it would be good to have another set of eyes. Note that the date to start querying from is derived from the dates of the documents returned, and is stored in the rule task state.

I think we could perhaps add some kind of diagnostic as well. Have the rule peek at the documents returned, and determine if they are in the search range. If not, log an error message with tons of info (search JSON, search options, results), and a unique tag we can search on, then remove those from the "active documents found".
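
For illustration, a minimal sketch of the kind of check being proposed; the helper name, the logger shape, and the flat `_source[timeField]` access are assumptions, not the real rule code:

    // Hedged sketch: drop hits whose time field falls outside the window we searched,
    // logging each one with a unique tag so the occurrences can be found later.
    interface EsHit {
      _id: string;
      _source: Record<string, unknown>;
    }

    interface Logger {
      error(message: string, meta?: { tags?: string[] }): void;
    }

    function filterOutOfRangeHits(
      hits: EsHit[],
      timeField: string,
      dateStart: string, // ISO string: start of the search window
      dateEnd: string, // ISO string: end of the search window
      logger: Logger
    ): EsHit[] {
      const start = new Date(dateStart).getTime();
      const end = new Date(dateEnd).getTime();
      return hits.filter((hit) => {
        const raw = hit._source[timeField];
        const epoch = new Date(String(raw)).getTime();
        const inRange = !isNaN(epoch) && epoch >= start && epoch <= end;
        if (!inRange) {
          logger.error(
            `hit ${hit._id} has ${timeField}=${String(raw)}, outside search range ${dateStart}..${dateEnd}`,
            { tags: ['query-result-out-of-time-range'] }
          );
        }
        return inRange;
      });
    }

Note that the diagnostic that eventually landed (#186332, discussed below) describes only logging; the filtering here follows the "remove those from the active documents found" suggestion above.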

@doakalexi moved this from Awaiting Triage to Todo in AppEx: ResponseOps - Execution & Connectors on Feb 1, 2024

pmuellr commented Feb 12, 2024

I tried out an internal AI-based search facility to see whether Elasticsearch is known to return documents that don't match the specified filter. It's AI, so take it with large grains of salt:

question:

Does elasticsearch ever return documents not matching the search filter?

answer:

Yes, Elasticsearch can sometimes return documents that do not match the search filter due to various reasons, including:

  1. Query and Filter Context Misuse: If a query is placed in a filter context or vice versa, it may not behave as expected. Filters are cached and do not contribute to scoring, while queries are scored. Using them incorrectly can lead to unexpected results.

  2. Cache Issues: The results of filters are cached for faster performance. If there's an issue with the cache, it might return incorrect results.

  3. Shard Failures: If there are issues with some of the shards and they fail to return the correct documents, the overall result set might include documents that do not match the filter.

  4. Mapping Issues: If the field mappings are not correctly defined, Elasticsearch might not be able to apply filters correctly, leading to unexpected documents in the result set.

  5. Analyzer Configuration: If the field being filtered on is analyzed, the filter will apply to the tokens produced by the analyzer, not the raw text. This can lead to unexpected behavior if the analyzer configuration is not aligned with the expectations of the filter.

  6. Version Mismatches: In a scenario where there are mixed versions of Elasticsearch nodes in a cluster, there might be inconsistencies in how filters are applied.

  7. Nested Documents: When dealing with nested documents, if the query is not correctly structured to handle the nested nature of the data, documents that do not match the filter criteria might be returned.

  8. Specific Bugs or Edge Cases: There might be specific bugs or edge cases in Elasticsearch that could cause it to return documents that do not match the search filter. For example, as mentioned in the provided context, App Search filters had unexpected behaviors when the filtered field name matched specific keywords like to, from, distance, unit, center, etc. This issue was fixed in versions 7.16 and above (as per the context from "/knowledge/view/d672e2bb").

In general, Elasticsearch is designed to return accurate results based on the provided filters. However, configuration issues, bugs, or other unforeseen circumstances can lead to unexpected results. It's important to ensure that the cluster is healthy, the mappings are correct, and the queries are well-formed to avoid such issues.

For the specific case of App Search filters returning unexpected results due to field names matching specific keywords, this was identified as a bug and fixed in later versions of Elasticsearch (as per the context from "/knowledge/view/d672e2bb").

We are currently placing the time window for the search in a range filter, so I don't think we're misusing query vs. filter context here (1).
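
For reference, a minimal sketch of what "time window in a range filter" means here; the field name and the bounds are placeholders, not the rule's actual query:

    // Hedged sketch: the time window sits in filter context as a range clause,
    // alongside the user's query. '@timestamp' and the bounds are placeholders.
    const searchBody = {
      query: {
        bool: {
          filter: [
            {
              range: {
                '@timestamp': {
                  gte: '2024-01-31T00:00:00.000Z',
                  lt: '2024-01-31T00:01:00.000Z',
                  format: 'strict_date_optional_time',
                },
              },
            },
            // ...the user's KQL/DSL query contributes additional clauses here
          ],
        },
      },
    };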

Item 4 also seems unlikely, since we only see this transiently. Presumably we'd see it more consistently if there were a mapping issue, e.g. the same old doc showing up in the search hits for subsequent rule runs. We don't.

We'd have to check whether there is an analyzer on the time fields, but it seems hard to imagine that's the problem (5). Again, we'd see the same old docs in subsequent runs, and we've not seen that in practice.

AFAIK we do not have multiple versions of ES in the mix here, ruling out 6.

These aren't nested documents, ruling out 7.

That leaves 2, 3, and 8. I have NO IDEA how accurate this answer is; it's somewhat AI-generated :-). The simplest answer, and perhaps easiest to check, is whether we had a shard failure (3). I guess look in the ES logs around the time of the rule run?

@pmuellr pmuellr self-assigned this Jun 13, 2024

pmuellr commented Jun 13, 2024

I've looked into this a few times, and come up with nothing.

One thing from https://github.com/elastic/sdh-kibana/issues/4177 is that old documents, outside the range we were searching, appear to have been returned as hits. That could explain the other referenced issues as well.

So here's one thing we can do to try to "catch" that issue: add some code after the query is run to check that the documents' time fields fall within the range we were searching for. If we find documents out of range (basically, old documents that should not have been in the search), dump a bunch of info to the logger: the query, the document IDs / timestamps, maybe some of the search result metadata, to see if there are any clues there. Perhaps we should even return an error from the rule run, so that we do NOT create alerts, and make the failure more obvious (than just being logged).

These appear to be transient issues, so the customer very likely wouldn't notice the error in the rule run since the next run is likely to succeed; I'm a little torn on throwing the error.

pmuellr added a commit to pmuellr/kibana that referenced this issue Jul 12, 2024
resolves elastic#175980

Adds a check with logging if an ES Query rule returns hits which fall
outside the time range it's searching. This shouldn't ever happen, but
seems to be happening on rare occasions, so we wanted to add some
diagnostics to try to help narrow down the problem.
pmuellr added a commit to pmuellr/kibana that referenced this issue Jul 23, 2024
[ResponseOps] log error when ES Query rules find docs out of time range (elastic#186332)

resolves elastic#175980

## Summary

Adds a check with logging if an ES Query rule returns hits which fall
outside the time range it's searching. This shouldn't ever happen, but
seems to be happening on rare occasions, so we wanted to add some
diagnostics to try to help narrow down the problem.

Note that the ES|QL flavor rule does not use this diagnostic, just
search source (KQL) and query dsl.

We check 3 things:
- ensure the `dateStart` sent to fetch was valid
- ensure the `dateEnd` sent to fetch was valid
- ensure the relevant time fields in hits are within the
dateStart/dateEnd range

These produce three different error messages:

`For rule '<rule-id>', hits were returned with invalid time range start
date '<date>' from field '<field>' using query <query>`

`For rule '<rule-id>', hits were returned with invalid time range end
date '<date>' from field '<field>' using query <query>`

`For rule '<rule-id>', the hit with date '<date>' from field '<field>'
is outside the query time range. Query: <query>. Document: <document>`

Each message has one tag on it: `query-result-out-of-time-range`

## To Verify

To test an invalid dateStart/dateEnd, hack the Kibana code to set the
values to NaNs:

https://github.com/elastic/kibana/blob/d30da09707f85d84d7fd555733ba8e0cb595228b/x-pack/plugins/stack_alerts/server/rule_types/es_query/executor.ts#L263-L264

For instance, change that to:

    const epochStart = new Date('x').getTime();
    const epochEnd = new Date('y').getTime();

To test the individual document hits, first back out the change you made
above; when those checks error, the checks we're testing below do not run.
Then hack the Kibana code to make the time out of range:

https://github.com/elastic/kibana/blob/d30da09707f85d84d7fd555733ba8e0cb595228b/x-pack/plugins/stack_alerts/server/rule_types/es_query/executor.ts#L294

For instance, change that to:

    const epochDate = epochStart - 100

For both tests, create an ES query rule (KQL or DSL), make the relevant
changes, and arrange for the rule to get hits each time. The relevant
messages should be logged in the Kibana console when the rule runs.

### Checklist

Delete any items that are not applicable to this PR.

- [x] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios

---------

Co-authored-by: Elastic Machine <[email protected]>
(cherry picked from commit e12e449)

# Conflicts:
#	x-pack/plugins/stack_alerts/server/rule_types/es_query/lib/fetch_search_source_query.ts
pmuellr referenced this issue Jul 24, 2024
[ResponseOps] log error when ES Query rules find docs out of time range (#186332) (#189019)

# Backport

This will backport the following commits from `main` to `8.15`:
- [[ResponseOps] log error when ES Query rules find docs out of time range (#186332)](#186332)


pmuellr commented Jul 24, 2024

Whoops, I didn't really mean to close this. We still haven't figured out the problem, but we do have some new diagnostics from #186332 if we see this happen again ...


pmuellr commented Nov 13, 2024

In both SDH 5049 and 5056, it appears there was ILM activity before/during the rule run. Thinking this may have something to do with it ...


pmuellr commented Nov 14, 2024

Regarding the new logging we added in #186332: it turns out it's producing a lot of false positives and is difficult to analyze by sight, so issue #200023 has been opened to improve that.

I did write a Node.js script to analyze the existing messages, though: https://gist.github.com/pmuellr/fe30b8e28261e2d22996e6ba573945b3 - it expects to be passed a file containing the results of an ES query, custom-tailored to our overview clusters. It should weed through the false positives and log any true positives (or potential positives, anyway). I ran it last night on a large set of the messages we're seeing and found no true positives.


pmuellr commented Nov 14, 2024

Discussing this issue with the team, it was suggested we could try to repro this with ILM, to see if we can catch it ourselves. Here's a basic idea:

  • create a new deployment in cloud
  • set xpack.alerting.rules.minimumScheduleInterval.value: "1s" in the ECH config overrides
  • tweak the event log's ILM policy (it's customer-editable) to roll over the EL indices as fast as we possibly can, hopefully multiple times per hour
  • write a rule which queries the event log

This puts us in a place where we have a rule querying over an index subject to ILM. What I'm not sure about is what the rule should alert on. I guess we can depend on the existing logging and script (mentioned in the comment above), and have the rule just match on all the event log docs.
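
As a hedged sketch of the last step (creating the rule), something like the following against the Kibana alerting API should work; the consumer value, the exact ES query rule params, and the API key handling may differ by stack version, so treat this as an illustration only:

    // Hypothetical repro helper: create an ES query (DSL) rule that matches every
    // event log document on every run. KIBANA_URL and the API key are placeholders.
    const KIBANA_URL = 'https://my-deployment.kb.us-east-1.aws.found.io';

    async function createEventLogRule(apiKey: string): Promise<void> {
      const res = await fetch(`${KIBANA_URL}/api/alerting/rule`, {
        method: 'POST',
        headers: {
          'kbn-xsrf': 'true',
          'Content-Type': 'application/json',
          Authorization: `ApiKey ${apiKey}`,
        },
        body: JSON.stringify({
          name: 'event log zombie-alert repro',
          rule_type_id: '.es-query',
          consumer: 'alerts', // assumption; may need a different consumer value
          schedule: { interval: '1s' }, // relies on the minimumScheduleInterval override above
          params: {
            searchType: 'esQuery',
            index: ['.kibana-event-log-ds'],
            timeField: '@timestamp',
            esQuery: JSON.stringify({ query: { match_all: {} } }), // match everything
            size: 100,
            threshold: [0],
            thresholdComparator: '>',
            timeWindowSize: 1,
            timeWindowUnit: 'm',
          },
          actions: [],
        }),
      });
      if (!res.ok) {
        throw new Error(`create rule failed: ${res.status} ${await res.text()}`);
      }
    }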


Note: I just realized we no longer have a separate ILM policy for the event log; we now use a setting in the index template that handles this:

    export function getIndexTemplate(esNames: EsNames) {
      const indexTemplateBody = {
        _meta: {
          description: 'index template for the Kibana event log',
          managed: true,
        },
        index_patterns: [esNames.dataStream],
        data_stream: {
          hidden: true,
        },
        priority: 50,
        template: {
          settings: {
            hidden: true,
            number_of_shards: 1,
            auto_expand_replicas: '0-1',
          },
          lifecycle: {
            data_retention: '90d',
          },
          mappings,
        },
      };
      return indexTemplateBody;
    }

You can use the data stream lifecycle API to modify the default 90d setting; the following seems to work:

PUT _data_stream/.kibana-event-log-ds/_lifecycle
{
  "data_retention": "5m"
}

I didn't see a way to manually roll over, and it didn't seem to happen automagically with the PUT, but some searching yielded the following, which did seem to roll over (and hopefully the new retention kicks in):

POST .kibana-event-log-ds/_rollover


pmuellr commented Nov 14, 2024

Ah, I don't think using a plain old data stream with the lifecycle `data_retention` is going to work - it doesn't actually roll over indices and retain them, it only rolls over and deletes the old one.

Still ... maybe? I'll let this go for a while, but suspect we will have to create an ILM policy, index template, and initial index. And then? I guess have the rule read from that index, and add an index connector to also write to that index - the rule will be reading and writing to the same index :-).
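
A hedged sketch of that setup (ILM policy, index template, initial index with a write alias), using the @elastic/elasticsearch 8.x client; the names are made up for the repro, and ILM's default poll interval (10m) may also need lowering via indices.lifecycle.poll_interval for rollovers this frequent:

    import { Client } from '@elastic/elasticsearch';

    async function setUpFastRolloverIndex(client: Client): Promise<void> {
      // ILM policy that rolls over aggressively, then deletes old indices
      await client.ilm.putLifecycle({
        name: 'repro-fast-rollover',
        policy: {
          phases: {
            hot: { actions: { rollover: { max_age: '5m', max_docs: 1000 } } },
            delete: { min_age: '30m', actions: { delete: {} } },
          },
        },
      });

      // Index template wiring new indices to the policy and a rollover alias
      await client.indices.putIndexTemplate({
        name: 'repro-fast-rollover-template',
        index_patterns: ['repro-rollover-*'],
        template: {
          settings: {
            'index.lifecycle.name': 'repro-fast-rollover',
            'index.lifecycle.rollover_alias': 'repro-rollover',
          },
          mappings: { properties: { '@timestamp': { type: 'date' } } },
        },
      });

      // Initial index with the write alias; the rule reads from, and the index
      // connector writes to, the 'repro-rollover' alias
      await client.indices.create({
        index: 'repro-rollover-000001',
        aliases: { 'repro-rollover': { is_write_index: true } },
      });
    }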

@Dariaus999

@pmuellr - @hisashinz is out of the office for the next two weeks, so I will act as a proxy for the case.
I read the whole thread and it still confuses me, so please let me know which plan of action I should recommend to the customer:

A) Let the data stream index roll over manually to see how it goes
B) Create an ILM policy, index template, and initial index, and also add an index connector to write to the old index


pmuellr commented Nov 18, 2024

@Dariaus999 Which case are you referring to? We would only want to do some additional logging, and maybe add an index connector, if the rule is firing unexpectedly on occasion. I think some of the cases have been one-offs, so additional logging wouldn't make sense. Please respond in the relevant SDH, and ping me on Slack so I can make sure I'm following that SDH.

The new thread I started last week, here, is mainly to discuss some additional diagnostics / testing that we would be doing in our team, to try to track down the issue, as a development task. Not something a customer would do.

I think one of the SDHs did indicate ILM was having some issue with a particular index - that's not something I can help with; you should follow what ES support says.

@pmuellr self-assigned this on Nov 20, 2024

pmuellr commented Nov 21, 2024

From the recent SDHs on this, I've done some analysis and think I've found the general problem. Write-up here, for Elasticians only: https://docs.google.com/document/d/1jbQ5OOEzSWI947D0tMzNJu-TwnATkCOVqM9kCFWmAvg/edit?usp=sharing

The basic idea is that the ES query rule, when using KQL, uses DataViews. As part of rule execution, the DataView requests fieldcaps over the index pattern it manages. When that request does not return a 200 - in the SDHs it was returning 404s and 502s - I'm thinking the rule and/or DataView is not calculating the date range correctly. Perhaps not adding the date range at all? Unclear.

Certainly in the case of the 502, you'd hope the fieldcaps request would be retried; however, you can see from the SDH that it doesn't appear to be completely transient. Perhaps there's a performance issue? A timeout?

Even in the 404 case, perhaps we should retry? After all, the rule was likely created in the UX and the data view was pointing to something at one point in time. In the SDH with the 404, though, it appears ES likely should NOT have returned a 404 - unclear ATM. It seems like a stretch to retry a 404 in this case, but we can wait for ES to chime in on whether this was a bug or is to be expected.

In the end, it appears the ES "hiccups" caused further damage downstream in the rule processing. So we need to track that down and figure out what we should do. It feels like the rule should fail in both the 404 and 502 cases (assuming the 502 can/should be retried and is being retried a few times). So more time is needed to research what's going on, and then how to fix it. Hopefully we can come up with a repro ...


pmuellr commented Nov 21, 2024

I just attempted to repro this by simulating the 404 response, changing the following code, specifically changing `index: indices` to `index: 'nope-${indices}'`:

    return await callCluster.fieldCaps(
      {
        index: indices,
        fields,
        ignore_unavailable: true,
        index_filter: indexFilter,
        expand_wildcards: expandWildcards,
        types: fieldTypes,
        include_empty_fields: includeEmptyFields ?? true,
        ...fieldCapsOptions,
      },
      { meta: true }
    );

It displays this error in the Kibana logs:

Executing Rule default:.es-query:1eb05c20-b6f0-4f85-8803-fbd2c477539b has resulted in Error: Data view with ID 964b9c01-a7c8-4d5b-8bed-dc9860dc9dad no longer contains a time field. - Error: Data view with ID 964b9c01-a7c8-4d5b-8bed-dc9860dc9dad no longer contains a time field.

That code was added here: #182883, in 8.14.x / 8.15.0.

And that's the behavior we want for the 404 - the rule should fail, in which case it would be retried later.

I can't quite figure out how to repro the 502. It does look like something is eating "error" responses, but I'm guessing the behaviour is going to be the same as with the 404 - because of the "missing" fieldcaps (an error rather than a 404), it won't be able to determine the time field, and will die with the same message.

Ideally we could provide a better message here, indicating an unexpected result from getting fieldcaps.

So I think the good news is that it seems likely this was fixed in 8.14.x (not sure which minor). Checking all the SDHs linked to this issue, they are all < 8.14.x.

@mikecote (Contributor)

> Can't quite figure out how to repro the 502. It does look like something is eating "error" responses, but I'm guessing the behaviour is going to be the same as the 404 - because of the "missing" fieldcaps (error and not 404), it won't be able to determine the time field, and die with the same message.

If you observe 502s at the proxy level and nothing handling it in Kibana, it may be because the ES client is retrying those requests at the transport layer: https://github.com/elastic/elastic-transport-js/blob/c4a4431b187d1f7af813f2ed3b123b081eaa4ab7/src/Transport.ts#L585
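
For context, a minimal sketch of where that retry count lives (assuming the @elastic/elasticsearch 8.x client; Kibana constructs its own client, so the exact options it passes are not shown here):

    import { Client } from '@elastic/elasticsearch';

    const client = new Client({
      node: 'https://localhost:9200',
      maxRetries: 3, // transport-level retries; a retried 502 may never surface to calling code
      requestTimeout: 30_000,
    });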


pmuellr commented Nov 22, 2024

> it may be because the ES client is retrying those requests

There was only one entry in the proxy logs (see the link in the GDoc to the subject proxy logs). But these are slightly old stack versions; it's possible we are retrying now. I did note that, in the case of fieldcaps returning a 404, the DataView code logged a Kibana message twice about the missing index pattern, but also only made one call.

And in the SDH with the 404, it was also noted that fetching fieldcaps for indices that exist can in fact return a 404 under special conditions. We may need to consider retrying those, though I think treating it as a failure and letting the rule re-run later is probably fine.


pmuellr commented Dec 10, 2024

From the GDoc linked in #175980 (comment), here's the final TL;DR:

It also appears that we may have "fixed" this issue, as recent stack releases now generate errors for 404s, rather than apparently interpreting the 404 badly and generating a bad (or no) time range. Presumably we do the same for 502s, but I've not been able to repro that yet - we should be doing that while resolving the issue I created ^^^.

This is the "fix" we wanted - on an error condition obtaining fieldcaps during a rule run, we want the rule run to fail, and not somehow be silently treated as not an error and misinterpreting the incorrect data when generating the query.

Looking at the logging added in #186332, shipped in 8.15.0, I'm not seeing any "true positives", but lots of "false positives" from trying to catch the condition. Which aligns with this being "fixed" in 8.14.x.

So, closing this issue for now. 🤞🏻

@pmuellr closed this as completed on Dec 10, 2024