[Alerting][Security] Rules fail due to a security exception: missing authentication credentials for REST request #118520

gmmorris · 2021-11-15T12:43:17Z

Kibana version: 7.15.0

Looking at Kibana Server logs on cloud I've noticed a high rate of security errors causing many of our Rule Types to fail.

Specifically:

Executing Alert default:.es-query:{uuid} has resulted in Error: security_exception: [security_exception] Reason: missing authentication credentials for REST request [/_security/user/_has_privileges], caused by: ""

...appears a lot and accounts for around 200 rule execution failures per minute.

Interestingly, this seems to happen predominantly to the following Rule Types:

monitoring_shard_size
.es-query
siem.signals

So this is likely not something that's happening at the platform level, but rather specific to the implementation of these three rule types.

elasticmachine · 2021-11-15T12:43:19Z

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

elasticmachine · 2021-11-15T12:43:19Z

Pinging @elastic/infra-monitoring-ui (Team:Infra Monitoring UI)

elasticmachine · 2021-11-15T12:43:19Z

Pinging @elastic/kibana-security (Team:Security)

elasticmachine · 2021-11-15T12:43:19Z

Pinging @elastic/security-solution (Team: SecuritySolution)

dhurley14 · 2021-11-15T15:24:44Z

The security solution executor utilizes the _has_privileges api to determine if the rule can query the index patterns provided so this is probably coming from our rules. In 7.16 if the _has_privileges api yields an error, we display a partial failure banner in the rule details page of the security solution.
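
For readers unfamiliar with the check described above, here is a minimal sketch of what a `_has_privileges` read check on a rule's index patterns might look like. The request and response shapes follow the Elasticsearch has-privileges API; the helper names `buildReadPrivilegesRequest` and `unreadablePatterns` are hypothetical and are not taken from the security solution code.

```typescript
// Body accepted by POST /_security/user/_has_privileges (index section only).
interface HasPrivilegesRequest {
  index: Array<{ names: string[]; privileges: string[] }>;
}

// Subset of the response that this sketch reads.
interface HasPrivilegesResponse {
  has_all_requested: boolean;
  index: Record<string, Record<string, boolean>>;
}

// Hypothetical helper: ask whether the caller can read the given index patterns.
function buildReadPrivilegesRequest(indexPatterns: string[]): HasPrivilegesRequest {
  return { index: [{ names: indexPatterns, privileges: ['read'] }] };
}

// Hypothetical helper: list the patterns for which `read` was not granted.
function unreadablePatterns(response: HasPrivilegesResponse): string[] {
  return Object.keys(response.index).filter(
    (pattern) => response.index[pattern].read !== true
  );
}
```

Note that in the failure reported in this issue the request never gets this far: Elasticsearch rejects it with "missing authentication credentials" before any privilege is evaluated, so there is no response body to inspect.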

gmmorris · 2021-11-16T14:52:41Z

> The security solution executor utilizes the _has_privileges api to determine if the rule can query the index patterns provided so this is probably coming from our rules. In 7.16 if the _has_privileges api yields an error, we display a partial failure banner in the rule details page of the security solution.

Ah, that makes sense, thanks @dhurley14 !
So from your perspective, this is a valid error? As in, it's intentional in nature rather than an exception. 🤔

pmuellr · 2021-11-17T14:15:17Z

I'd been thinking this could be caused by the api key / task doc race condition issues: #106292 and #110096 .

Another source of this could be (see linked SDH issue above) a bad upgrade, where the original encryption key isn't available during the migration. It appears in such cases we migrate the rule with the API key set to null. Clearly we want to "disable" the rule, but we can't really, since the task document still exists and we need to delete it, but can't during the migration. We also presumably have an API key that should be invalidated, but we can't since we couldn't recover it.

This gets complicated to reason about, because for "no security" deployments, the API key WILL BE null. Somehow we need a better guard rail here - check the API key after extraction, and if it's null and it's not supposed to be null (however we check "no security"), we should disable it then and there - with hopefully some kind of notification to the user. Maybe we need a disableReason or such ... relevant code here: alerting/server/task_runner/task_runner.ts
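
The guard rail proposed above could be sketched roughly as follows. This is only an illustration under stated assumptions: `DecryptedRule`, `GuardDecision`, and `checkApiKey` are hypothetical names, the `securityEnabled` flag stands in for however "no security" ends up being detected (which is still an open question here), and none of this is the actual task runner code.

```typescript
// Hypothetical, simplified view of a rule after its API key has been extracted.
interface DecryptedRule {
  id: string;
  enabled: boolean;
  apiKey: string | null;
}

// Either run the rule, or disable it and record why.
type GuardDecision =
  | { action: 'run' }
  | { action: 'disable'; disableReason: string };

// A null API key is legitimate only when security is off. With security on,
// running the rule would fail with a security_exception, so disable it instead.
function checkApiKey(rule: DecryptedRule, securityEnabled: boolean): GuardDecision {
  if (rule.apiKey === null && securityEnabled) {
    return { action: 'disable', disableReason: 'missing_api_key' };
  }
  return { action: 'run' };
}
```

Returning a decision rather than disabling the rule inside the check keeps the "when do we disable, and how do we notify the user" question separate from the detection itself.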

In a Slack conversation, @ymao1 noted:

because we can't delete the task document in migration, we also can't set the rule to disabled in migration, as that would create another task document when it's later enabled, and then there would be two tasks for the rule
the migration logic was last updated in PR Gracefully handle decryption errors during ESO migrations #105968 - and there was some question whether we would fail the migration for cases like this. If the only problem was that the correct encryption key was not set during an upgrade, and a second migration could be run with the correct encryption key, then failing the migration would be the best solution. I still feel like there are too many "ifs" in that logic, and if we failed on every decrypt failure we could be causing migration failures that we don't need to.

pmuellr · 2021-11-18T17:00:43Z

I was able to repro this: changing the encryption key before a migration causes this error.

During migration, this was logged for every rule:

[error][encryptedSavedObjects][plugins] Failed to decrypt "apiKey" attribute: Unsupported state or unable to authenticate data
[WARN ][savedobjects-service] Decryption failed for encrypted Saved Object 
  "fbca0b70-4887-11ec-9ff1-157c47ec9f4a" of type "alert" with error: 
  Unable to decrypt attribute "apiKey". Encrypted attributes have 
  been stripped from the original document and migration will be applied but 
  this may cause errors later on.

It didn't lie! It did cause problems seconds later:

[plugins.alerting] Executing Alert default:.index-threshold:fe9efd60-4887-11ec-9ff1-157c47ec9f4a 
  has resulted in Error: security_exception: 
  [security_exception] Reason: missing authentication credentials for REST request [/_security/user/_has_privileges], 
  caused by: ""

Seems like we need to do better than logging during migration. I think we need to mark these somehow as not-runnable, and then disable them sometime after startup. I wonder if we could even do it DURING startup? Or does it need to be a cleanup task so not every Kibana will try to "fix" these?

Another possibility is fixing these as-needed - if we recognize we'll get this error because there SHOULD be an API key, but isn't, disable the rule instead of running it. But then you won't know till you try to run it.

mikecote · 2021-11-18T17:41:21Z

> if we recognize we'll get this error because there SHOULD be an API key, but isn't

One theory about what may cause this: if a user sets up alerting rules with security disabled (xpack.security.enabled: false) and later enables security, their alerting rules would run into this problem, because the apiKey field is empty and, now that security is enabled, a value is expected there.

Though I don't think this scenario is possible on Cloud (security always enabled?).

pmuellr · 2021-11-18T18:21:34Z

Ya, security is always on for cloud, but this could obviously happen on-prem. Thought about that for a second when I was doing my repro, but shoved it to the back of my mind. Obviously we need to take this into account though. We want to disable these, because they NEED an API key at that point, but I guess the question is - when do we make that call and actually disable them. And how do we notify the user that we disabled them.

mikecote · 2021-11-18T18:29:47Z

> Obviously we need to take this into account though. We want to disable these, because they NEED an API key at that point, but I guess the question is - when do we make that call and actually disable them. And how do we notify the user that we disabled them.

This overlaps well with upcoming efforts to ensure alerting rules run continuously. This becomes a scenario where rules stop running indefinitely until a user intervenes. And we'll need to find a way to notify the user in these cases. So lots TBD :)

banderror · 2021-12-27T15:29:01Z

@deepikakeshav-qasource reproduced this issue in her test Cloud environment in #120872 without doing any Kibana upgrades - this was a fresh 8.0.0 deployment.

Could this mean that the race condition mentioned by @pmuellr might be the root cause in this case?

> I'd been thinking this could be caused by the api key / task doc race condition issues: #106292 and #110096 .

I wasn't able to reproduce it though, even in the same Cloud environment where she managed to do that.

jugsofbeer · 2022-03-03T08:49:56Z

Sadly this had the side effect of killing our Kibana nodes' connection to Elasticsearch, and requests to Kibana would display TLS handshake errors.

If we restarted Kibana, it would work for 3 or 4 minutes and then the TLS errors would return.

After 2 days of chaos, we disabled all alerts, which resolved the problem temporarily: all errors stopped in our logs.

We have a few alerts that are throwing the _has_privileges error, so more investigation is needed.

We are running v7.16.3 on-premises. The deployment began life as v6.4.2 and has been upgraded through versions over the past 3 years.

Support case was opened today as well. Let me know if you want the number.

jportner · 2022-03-03T14:39:50Z

It sounds like this isn't a Platform Security issue, so I'll remove our team's label.

> Support case was opened today as well. Let me know if you want the number.

@jugsofbeer Thanks for chiming in, it's helpful to know on this issue if users are affected, and we will get the right eyes on the support case!

fopson · 2022-10-26T22:09:57Z

We had the same issue with our On-Prem deployment of 8.4.x. Some rules would produce this error for weeks. We found that if you edit the rule and re-save it, it stops failing.

Hope this helps.
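
For deployments with many affected rules, editing each one by hand may be impractical. The sketch below builds a request for Kibana's documented "update rule API key" endpoint. It rests on an assumption that this thread does not confirm: that re-saving helps because it regenerates the rule's API key, and that this endpoint would therefore have the same effect. The helper name `buildUpdateApiKeyRequest` and the `HttpRequestSpec` type are hypothetical, and only the edit-and-save workaround above has actually been reported to work.

```typescript
// Plain description of an HTTP request; sending it is left to the caller.
interface HttpRequestSpec {
  method: 'POST';
  url: string;
  headers: Record<string, string>;
}

// Hypothetical helper: build the "update rule API key" request for one rule
// in the default space. Authentication must be added by the caller.
function buildUpdateApiKeyRequest(kibanaUrl: string, ruleId: string): HttpRequestSpec {
  // Drop trailing slashes so the path is not doubled up.
  const base = kibanaUrl.replace(/\/+$/, '');
  return {
    method: 'POST',
    url: `${base}/api/alerting/rule/${encodeURIComponent(ruleId)}/_update_api_key`,
    // Kibana rejects state-changing API requests that lack this header.
    headers: { 'kbn-xsrf': 'true' },
  };
}
```

The request has to be sent as a user who is allowed to manage the rule, since the regenerated key is created on that user's behalf.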

gmmorris mentioned this issue Dec 7, 2021

Alerting rules can end up in a state where they stop running indefinitely until a user intervenes to fix the problem #119650

Closed

spong mentioned this issue Dec 9, 2021

[Security Solution] Rule is getting failed intermittently when enable or disable the rule even not able to enable the failed rule from rule details page #120872

Closed

mikecote added this to AppEx: ResponseOps - Execution & Connectors Jan 4, 2022

kobelb added the needs-team Issues missing a team label label Jan 31, 2022

botelastic bot removed the needs-team Issues missing a team label label Jan 31, 2022

jportner removed the Team:Security Team focused on: Auth, Users, Roles, Spaces, Audit Logging, and more! label Mar 3, 2022

pmuellr mentioned this issue Jun 30, 2022

Rules failed after upgrade: [security_exception] Reason: missing authentication credentials for REST request [/_security/user/_has_privileges] #135386

Closed

mikecote moved this to Todo in AppEx: ResponseOps - Execution & Connectors Sep 22, 2022

miltonhultgren added the Feature:Stack Monitoring label Jun 23, 2023

smith added Team:Monitoring Stack Monitoring team and removed Team:Infra Monitoring UI - DEPRECATED DEPRECATED - Label for the Infra Monitoring UI team. Use Team:obs-ux-infra_services labels Nov 13, 2023
