
Kibana alert fires when it should not have due to temporary disconnect of remote CCS connection #168293

Open
henrikno opened this issue Oct 6, 2023 · 7 comments
Labels: bug (Fixes for quality problems that affect the customer experience), Team:ResponseOps (Label for the ResponseOps team, formerly the Cases and Alerting teams)

Comments

henrikno (Contributor) commented Oct 6, 2023

Kibana version:
8.10.2

Elasticsearch version:
8.10.2

Server OS version:
Elastic Cloud

Original install method (e.g. download page, yum, from source, etc.):
Elastic Cloud

Describe the bug:
We have an alert that queries, over a remote CCS connection, for a specific document showing up at least 8 times within 10 minutes. The alert triggered, but when we checked there were zero documents matching the query, and we did not delete any documents. The history does not say that the query failed; it shows up as "Succeeded", yet gives no info about what triggered the alert. The only hint that something iffy happened is that the query took 15 seconds instead of the normal 1-2 seconds.
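For context, a rule matching this description would be an "Elasticsearch query" (`.es-query`) rule created along the lines of the sketch below. This is a minimal illustration rather than the actual configuration from this report: the index pattern, query, and names are placeholders, and the exact request fields should be verified against the Kibana alerting API docs for 8.10.

```
# Sketch only: every value below is a placeholder, not the reporter's real rule.
POST /api/alerting/rule
{
  "name": "specific-doc-count-over-ccs",
  "rule_type_id": ".es-query",
  "consumer": "stackAlerts",
  "schedule": { "interval": "1m" },
  "params": {
    "searchType": "esQuery",
    "index": ["remote_cluster:my-index-*"],
    "timeField": "@timestamp",
    "esQuery": "{\"query\":{\"match\":{\"event.action\":\"the-specific-event\"}}}",
    "size": 10,
    "threshold": [8],
    "thresholdComparator": ">=",
    "timeWindowSize": 10,
    "timeWindowUnit": "m"
  },
  "actions": []
}
```

The parts relevant to this report are the CCS index pattern (`remote_cluster:...`), the per-minute schedule, and the ">= 8 documents within 10 minutes" threshold.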

Steps to reproduce:

  1. Create a Kibana alert that queries over a remote (CCS) connection every minute.
  2. Restart nodes, perform an upgrade, or otherwise disconnect the remote nodes (see the diagnostic sketch after this list).
  3. The Kibana alert triggers; the history shows "Succeeded" but gives no info about why it triggered. It does not show up as a timeout or a failed/unknown status.
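Whether a dropped remote makes the search fail loudly or return quietly depends in part on the remote cluster's `skip_unavailable` setting. As a diagnostic sketch (the `my_remote` alias and the response values are illustrative, and this is an assumption about the setup rather than something confirmed in this thread), the remote connection state can be checked while reproducing step 2:

```
# Shows connection state and skip_unavailable for each configured remote cluster
GET /_remote/info

# Example response excerpt (illustrative values):
# {
#   "my_remote": {
#     "connected": false,
#     "num_nodes_connected": 0,
#     "skip_unavailable": true,
#     "mode": "sniff"
#   }
# }
```

If `skip_unavailable` is true, a search against `my_remote:*` during the disconnect can still be reported as successful overall, with the skipped remote visible only in the `_clusters` section of the search response; that would be consistent with the rule run showing "Succeeded" rather than an error.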

Expected behavior:
I expected the alert not to fire, because there were no hits, or at least to give context that it fired because it could not get results.

Ideally it would not trigger on a transient issue, but would trigger if the issue is sustained (for a configurable time). For instance, this seems to trigger when we do an upgrade, but then resolves itself.

Screenshots (if relevant):
[screenshot attached]

Provide logs and/or server output (if relevant):

Any additional context:

henrikno added the bug label on Oct 6, 2023
botelastic bot added the needs-team label on Oct 6, 2023
jughosta added the Team:ResponseOps label on Oct 17, 2023
elasticmachine (Contributor) commented:

Pinging @elastic/response-ops (Team:ResponseOps)

pmuellr (Member) commented Jan 24, 2024

Can you provide the rule type and the parameters used in the rule?

XavierM self-assigned this Jan 25, 2024
elkargig commented:

Another case where we had this problem was using the "Elasticsearch query" rule.

[screenshot attached]

rule check: every 5 minutes

pmuellr (Member) commented Jan 25, 2024

potentially related to #168293

pmuellr (Member) commented Jan 25, 2024

The action being used was iterating over context.hits to print a field from the doc hits. We advised also printing {{_source._id}} from the hits, so that if this happens again we will see the actual document IDs that the search returned. Hopefully this will provide more background on what is happening.
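As a hedged illustration of that advice (the connector id and action group are placeholders, and the available context variables should be checked against the Elasticsearch query rule documentation), the entry in the rule's `actions` array could carry a message template that prints the raw hit metadata alongside the existing field:

```
{
  "group": "query matched",
  "id": "<connector-id>",
  "params": {
    "message": "{{rule.name}}: {{context.value}} matching documents at {{context.date}}\n{{#context.hits}}doc: {{_id}} (index: {{_index}})\n{{/context.hits}}"
  }
}
```

With that in place, any future spurious firing would record which document ids (if any) the search actually returned.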

XavierM (Contributor) commented Jan 29, 2024

@henrikno I talked to @ymao1 and @pmuellr about this issue. We have other SDHs related to this problem, but for those we do not have access to the data like we do here. To find a solution we need to investigate, and for that we need to log a bit more information in the message, such as the alertId (the _id of the document) and the timestamp of the alert.

Do you think that's possible? And will we be able to access this Kibana?

ymao1 (Contributor) commented Jan 31, 2024

Created a dedicated investigation issue for this (#175980) and linked this issue there for the rule definition.
