Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[alerting] API key lost when frequently disabling / enabling a rule #106292

Closed
pmuellr opened this issue Jul 20, 2021 · 18 comments · Fixed by #119239
Closed

[alerting] API key lost when frequently disabling / enabling a rule #106292

pmuellr opened this issue Jul 20, 2021 · 18 comments · Fixed by #119239
Assignees
Labels
bug Fixes for quality problems that affect the customer experience Feature:Alerting impact:high Addressing this issue will have a high level of impact on the quality/strength of our product. resilience Issues related to Platform resilience in terms of scale, performance & backwards compatibility Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams)

Comments

@pmuellr
Copy link
Member

pmuellr commented Jul 20, 2021

See issue #106276 for the repro. Click the enable / disable button, a lot, frequently, and you will likely see the following error in console (some lines deleted from original error),

Unhandled Promise rejection detected:

ResponseError: security_exception: [security_exception] Reason: missing authentication credentials for REST request [/_security/user/_has_privileges]
    at onBody (/Users/pmuellr/Projects/elastic/kibana-prs/node_modules/@elastic/elasticsearch/lib/Transport.js:337:23)
    at IncomingMessage.onEnd (/Users/pmuellr/Projects/elastic/kibana-prs/node_modules/@elastic/elasticsearch/lib/Transport.js:264:11)
  meta: {
    body: { error: [Object], status: 401 },
    statusCode: 401,
    headers: {
      'www-authenticate': 'Basic realm="security" charset="UTF-8", Bearer realm="security", ApiKey',
      'content-type': 'application/json;charset=utf-8',
      'content-length': '519'
    },
  }
}

Turning on some debug logging, you can also see the following:

[2021-07-20T12:34:57.394-04:00][DEBUG][plugins.alerting] executing alert example.always-firing:64302740-e8c9-11eb-9ca0-a7e4e62d9b7a at 2021-07-20T16:34:57.394Z
fetching api key for alert <-
GET /.kibana_8.0.0/_doc/alert%3A64302740-e8c9-11eb-9ca0-a7e4e62d9b7a
server    log   [12:34:57.404] [debug][encryptedSavedObjects][plugins] The following attributes of saved object "alert,64302740-e8c9-11eb-9ca0-a7e4e62d9b7a" should have been decrypted: apiKey, but found only:
fetching api key for alert -> null
server    log   [12:34:57.407] [debug][data][elasticsearch][query] 200
POST /.kibana_8.0.0/_search?from=0&rest_total_hits_as_int=true&size=100
{"size":100,"seq_no_primary_term":true,"from":0,"query":{"bool":{"filter":[{"bool":{"filter":[{"bool":{"should":[{"match":{"search-session.persisted":"true"}}],"minimum_should_match":1}},{"bool":{"should":[{"match":{"search-session.status":"in_progress"}}],"minimum_should_match":1}}]}},{"bool":{"should":[{"bool":{"must":[{"term":{"type":"search-session"}}],"must_not":[{"exists":{"field":"namespaces"}}]}}],"minimum_should_match":1}}]}},"sort":[{"search-session.touched":{"order":"asc","unmapped_type":"date"}}]}
server    log   [12:34:57.408] [debug][data][elasticsearch][query] 401
POST /_security/user/_has_privileges
{"index":[],"application":[{"application":"kibana-.kibana","resources":["space:default"],"privileges":["version:8.0.0","login:"]}]} [security_exception]: missing authentication credentials for REST request [/_security/user/_has_privileges]
server    log   [12:34:57.409] [debug][server][Kibana][cookie-session-storage][http] TypeError: Cannot read property 'toolkit' of undefined

Note the lines fetching api key for alert are from some console.log()s I added here, which seems to imply the API key is null.

async getApiKeyForAlertPermissions(alertId: string, spaceId: string) {
const namespace = this.context.spaceIdToNamespace(spaceId);
// Only fetch encrypted attributes here, we'll create a saved objects client
// scoped with the API key to fetch the remaining data.
const {
attributes: { apiKey },
} = await this.context.encryptedSavedObjectsClient.getDecryptedAsInternalUser<RawAlert>(
'alert',
alertId,
{ namespace }
);
return apiKey;
}

I'm not sure about that line The following attributes of saved object "alert,..." should have been decrypted: apiKey, but found only:; is that expected for a disabled alert? So we see that all the time? Or did something get broken somewhere?

I'm not sure if the simple fix is to treat this as a further indication that the rule is disabled. If we got this far into rule execution, because we think the rule is enabled, but happen to get a null api key, indicating that it is disabled, then just treat it as a disabled at that point.

But I'm also wondering if the rule is inconsistent at this point. Maybe the rule is enabled, but the apiKey is null? So we'll keep trying to run it, keep not being able to because of the API key, and ... will the user notice?

I'm also not completely sure who should be trapping that exception that is resulting in the unhandled promise rejection - that certainly needs to be fixed as well.

@pmuellr pmuellr added bug Fixes for quality problems that affect the customer experience Feature:Alerting Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) labels Jul 20, 2021
@elasticmachine
Copy link
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@mikecote mikecote added the loe:needs-research This issue requires some research before it can be worked on or estimated label Jul 21, 2021
@pmuellr
Copy link
Member Author

pmuellr commented Jul 21, 2021

I wasn't able to narrow down which function is actually throwing an exception that we're not catching - and I'm just assuming it is alerting code. So that's not good - seems like we need to find it and add a try/catch.

However, I was able to prevent the crash from happening, by adding some code here:

try {
apiKey = await this.getApiKeyForAlertPermissions(alertId, spaceId);
} catch (err) {
throw new ErrorWithReason(AlertExecutionStatusErrorReasons.Decrypt, err);
}
const [services, alertsClient] = this.getServicesWithSpaceLevelPermissions(spaceId, apiKey);

Adding the following code after the try/catch and before the call to this.getServicesWithSpaceLevelPermissions()

    if (apiKey == null) {
      throw new Error('api key is not available');
    }

At one point, I had gotten a rule in such a state that Kibana would crash whenever it tried to execute the rule (or could it even be the action - not sure!). Adding this additional check and throw seems to prevent Kibana from crashing.

It does log the following lines to the console though:

Executing Alert {{rule-id}} has resulted in Error: api key is not available

Oddly, I'm also seeing this message multiple times, as if there are multiple task records for this rule running. And occaisonally seeing the following, which would also kind of be a clue that the alert is being run multiple times, close together:

error updating alert execution status for example.always-firing:{{rule-id}} [{{rule-id}}]: version conflict, 
  required seqNo [26467], primary term [1]. current document has seqNo [26468] and primary term [1]:
  version_conflict_engine_exception: [version_conflict_engine_exception] Reason: {{message repeated}}

Checking the task manager index, it appears there are 9 task records that got created for this one alert, somehow. So it seems like maybe not just a race condition with api key / enable/disable getting out of sync, but also task creation.

I'm going to start with a clean set of indices and the new throw logic, to see if I can reproduce this behavior - hoping we won't see it, but ...

@pmuellr
Copy link
Member Author

pmuellr commented Jul 21, 2021

well dang it, even with the additional throw, I'm seeing additional task manager docs getting added. I can disable the alert, which will remove one of the docs, but the others remain, and will cause the alert to continue to execute, even though it's disabled. Or at least it tried to - it now logs the error about the api key being missing which I added.

So, seems like this is a bigger problem. I'm worried that we could potentially see other cases of this, not just with the rapid enable/disable (that we'll fix in issue #106276). Feels like we need to figure out how to make all the things that happen in enable/disable atomic, to keep things consistent.

@ymao1
Copy link
Contributor

ymao1 commented Jul 22, 2021

As part of this PR, Kibana security suggested in the case of decryption failures to strip the encrypted attribute before migrating (vs migrating the original, unencrypted document). Once I did that, I was seeing the same Unhandled Promise rejection detected: error and I assumed it was due to the apiKey being completely missing (vs set to null) but maybe it happens as well when the apiKey is null?

In that PR I added a special case for when apiKey is undefined to throw an error but maybe it should just throw if it is null or undefined?

@ymao1
Copy link
Contributor

ymao1 commented Jul 22, 2021

I also spent some time trying to figure out where exactly that exception was being thrown but was unsuccessful. It seems to be happening somewhere in the rules client get function, possible during ensureAuthorized but I tried wrapping many things inside try/catches and was still seeing this Unhandled Promised rejection

@mikecote
Copy link
Contributor

I also spent some time trying to figure out where exactly that exception was being thrown but was unsuccessful.

@ymao1 have you tried running Kibana with --trace-warnings. There's a few NodeJS options that may help (https://nodejs.org/api/cli.html#cli_trace_warnings).

@ymao1
Copy link
Contributor

ymao1 commented Jul 22, 2021

I have not...I'll give that a shot thanks!

@ymao1
Copy link
Contributor

ymao1 commented Jul 22, 2021

It looks like the Uncaught Promise rejection comes from

const { body } = await clusterClient.asScoped(request).asCurrentUser.security.hasPrivileges({
body: {
cluster: privileges.elasticsearch?.cluster,
index: Object.entries(privileges.elasticsearch?.index ?? {}).map(
([name, indexPrivileges]) => ({
names: [name],
privileges: indexPrivileges,
})
),
application: [
{ application: applicationName, resources, privileges: allApplicationPrivileges },
],
},
});

If I wrap that in a try/catch, I no longer see the Unhandled Promise rejection error

I notice that the difference in the request between a rule that has an apiKey and a rule that doesn't (or apiKey is null) is in the headers:

with api key field:

{
	"params": {},
	"query": {},
	"body": {},
	"withoutSecretHeaders": true,
	"id": "b24d57c1-697c-43f6-9338-11f03409eafc",
	"uuid": "c278ca70-1232-49c2-9546-2e0920e77220",
	"url": {
		"href": "/"
	},
	"route": {
		"path": "/",
		"options": {
			"authRequired": true,
			"xsrfRequired": true,
			"tags": [],
			"timeout": {},
			"body": {}
		}
	},
	"headers": {
		"authorization": "ApiKey eVNSVnNIb0JMTVdoSUMyTEtyc046MGZuUE1FYnlRNks2WjJ2a3hFSEtMQQ=="
	},
	"isSystemRequest": false,
	"socket": {},
	"auth": {
		"isAuthenticated": false
	}
}

no api key field:

{
	"params": {},
	"query": {},
	"body": {},
	"withoutSecretHeaders": true,
	"id": "9449407d-0cc8-4794-bec4-f984b31feb35",
	"uuid": "67aa4a1c-7ab9-4b7d-a523-f248493fa3f3",
	"url": {
		"href": "/"
	},
	"route": {
		"path": "/",
		"options": {
			"authRequired": true,
			"xsrfRequired": true,
			"tags": [],
			"timeout": {},
			"body": {}
		}
	},
	"headers": {},
	"isSystemRequest": false,
	"socket": {},
	"auth": {
		"isAuthenticated": false
	}
}

@pmuellr
Copy link
Member Author

pmuellr commented Jul 22, 2021

I ended up putting a console.trace(), at the top of this function, since I think it's the call the exception is being thrown from:

return function checkPrivilegesWithRequest(request: KibanaRequest): CheckPrivileges {
const checkPrivilegesAtResources = async (
resources: string[],
privileges: CheckPrivilegesPayload
): Promise<CheckPrivilegesResponse> => {
const kibanaPrivileges = Array.isArray(privileges.kibana)
? privileges.kibana
: privileges.kibana
? [privileges.kibana]
: [];
const allApplicationPrivileges = uniq([actions.version, actions.login, ...kibanaPrivileges]);

yielded this:

Trace: checkPrivilegesWithRequest
    at Object.checkPrivilegesWithRequest (x-pack/plugins/security/server/authorization/check_privileges.ts:42:13)
    at SecureSpacesClientWrapper.ensureAuthorizedAtSpace (x-pack/plugins/security/server/spaces/secure_spaces_client_wrapper.ts:384:48)
    at SecureSpacesClientWrapper.get (x-pack/plugins/security/server/spaces/secure_spaces_client_wrapper.ts:164:20)
    at Object.getActiveSpace (x-pack/plugins/spaces/server/spaces_service/spaces_service.ts:120:64)
    at getSpace (x-pack/plugins/alerting/server/plugin.ts:332:46)
    at new AlertingAuthorization (x-pack/plugins/alerting/server/authorization/alerting_authorization.ts:107:24)
    at AlertingAuthorizationClientFactory.create (x-pack/plugins/alerting/server/alerting_authorization_client_factory.ts:47:12)
    at AlertsClientFactory.create (x-pack/plugins/alerting/server/alerts_client_factory.ts:90:41)
    at Object.getAlertsClientWithRequest (x-pack/plugins/alerting/server/plugin.ts:360:35)
    at TaskRunner.getServicesWithSpaceLevelPermissions (x-pack/plugins/alerting/server/task_runner/task_runner.ts:159:61)
    at TaskRunner.loadAlertAttributesAndRun (x-pack/plugins/alerting/server/task_runner/task_runner.ts:462:43)
    at processTicksAndRejections (internal/process/task_queues.js:95:5)
    at errorAsAlertTaskRunResult (x-pack/plugins/alerting/server/task_runner/task_runner.ts:947:12)
    at TaskRunner.run (x-pack/plugins/alerting/server/task_runner/task_runner.ts:549:33)
    at TaskManagerRunner.run (x-pack/plugins/task_manager/server/task_running/task_runner.ts:259:22)

Looking at the place where getSpace() is called, (alerting_authorization), there's some code:

this.spaceId = getSpace(request).then((maybeSpace) => maybeSpace?.id);
this.featuresIds = getSpace(request)

so, missing a .catch() on that first getSpace() call, but we shouldn't be calling this twice if each one is checking privileges, heh. There is a .catch() on that second getSpace() call ...

@pmuellr
Copy link
Member Author

pmuellr commented Jul 22, 2021

I think what's going on in that stack trace, is that we end up calling new AlertingAuthorization() to construct an object, sync, but it's making async calls, which it can handle via .then() (can't use await in a constructor). Seems like having a constructor do that kind of thing is bad form anyway - perhaps the logic should be split so the constructor does minimal sync work, and then the async work is done in a new async method.

@ymao1
Copy link
Contributor

ymao1 commented Jul 22, 2021

The line for this.spaceId = getSpace(request).then((maybeSpace) => maybeSpace?.id); is getting changed soon to use the synchronous getSpaceId method in this PR,

But that makes sense what you're saying.

@yctercero
Copy link
Contributor

Yes! @ymao1 is correct that we're moving to use the synchronous getSpaceId as we did notice that that line you pointed out is currently causing an unhandled rejection promise to throw.

@pmuellr
Copy link
Member Author

pmuellr commented Jul 22, 2021

I should also note, the stack trace in #106292 (comment) may be a red herring. Looks like there were a number of calls to check privileges, here's another stack:

    at checkPrivilegesWithRequest (x-pack/plugins/security/server/authorization/check_privileges.ts:42:13)
    at Object.checkPrivilegesDynamicallyWithRequest (x-pack/plugins/security/server/authorization/check_privileges_dynamically.ts:30:29)
    at AlertingAuthorization.ensureAuthorized (x-pack/plugins/alerting/server/authorization/alerting_authorization.ts:190:45)
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (internal/process/task_queues.js:95:5)
    at AlertsClient.get (x-pack/plugins/alerting/server/alerts_client/alerts_client.ts:370:7)
    at TaskRunner.loadAlertAttributesAndRun (x-pack/plugins/alerting/server/task_runner/task_runner.ts:468:15)
    at errorAsAlertTaskRunResult (x-pack/plugins/alerting/server/task_runner/task_runner.ts:947:12)
    at TaskRunner.run (x-pack/plugins/alerting/server/task_runner/task_runner.ts:549:33)
    at TaskManagerRunner.run (x-pack/plugins/task_manager/server/task_running/task_runner.ts:259:22)    

Could probably try to do a little more precision try/catching in the checkPrivileges call to narrow down the exact stack ...

And I tend to think of things before/after runMicrotasks() as potentially being unrelated, given the way v8 implements this stuff.

Lastly, not sure if this helped, it didn't seem to do anything before I added the explicit console.trace(), but I'm running Kibana with the following:

export NODE_OPTIONS="--trace-uncaught --report-uncaught-exception"

@pmuellr
Copy link
Member Author

pmuellr commented Jul 27, 2021

I'm thinking that if we can cut down the number of "things" we are creating / destroying, over the various alerting rule lifecycle, we'll have fewer chances of race conditions. Even if we can't do it over the whole lifecycle, just enable/disable, would probably be a win.

There's a separate issue open discussing changing how often we regenerate API keys - #106876 . For enable/disable, it would mean that disable would not invalidate the API key, but leave it alone. And enable wouldn't then have to generate a new API key. The downside is that we use this disable/re-enable flow today specifically to regenerate a new API key, so we'd need to switch have users use "update api key" HTTP endpoint instead, and provide somewhere to do this in the UI.

Next thing that it would be nice to not have to re-create - task manager tasks! I'll have to poke around and see if it's possible to "disable" a task, as opposed to deleting and re-creating it. The situation I had with multiple tasks associated with the same rule was one that we don't really have any way at all to remediate, without explicitly deleting documents from the task manager index.

@mikecote
Copy link
Contributor

mikecote commented Aug 9, 2021

I found another way to make the rule run while losing the API key.. If you export a running rule and import w/ the overwrite option, it will overwrite the rule object with enabled:false and no apiKey. Since it's a simple overwrite, the task in task manager remains, thinking the rule is still running. The next time the task runs, it will run the rule without an apiKey and fail.

@gmmorris gmmorris added resilience Issues related to Platform resilience in terms of scale, performance & backwards compatibility estimate:needs-research Estimated as too large and requires research to break down into workable issues labels Aug 13, 2021
@gmmorris gmmorris removed the loe:needs-research This issue requires some research before it can be worked on or estimated label Sep 2, 2021
@gmmorris gmmorris added the impact:high Addressing this issue will have a high level of impact on the quality/strength of our product. label Sep 16, 2021
@gmmorris
Copy link
Contributor

gmmorris commented Nov 4, 2021

I'm thinking that if we can cut down the number of "things" we are creating / destroying, over the various alerting rule lifecycle, we'll have fewer chances of race conditions. Even if we can't do it over the whole lifecycle, just enable/disable, would probably be a win.

This really chimes with me. Sure, it doesn't fix the race condition, but:
racecondition

I'm +1 on using this issue as an opportunity to reduce the cases where we regenerate an API Key.
Any thoughts @mikecote @arisonl ?

Having chated with @mikecote I now realise these are unrelated problems... so changing the API Key behaviour in enable/disable will not help us here as much as we'd need it to.

@mikecote
Copy link
Contributor

mikecote commented Nov 4, 2021

I think the main problem here is that rules are running when the rule is disabled. From what I recall, rules can either not have an API key and be disabled, or have an API key and be enabled. If we add guardrails to ensure rules don't run when disabled, I believe this problem goes away.

We could also consider my scenario above (#106292 (comment)) when adding guardrails to ensure the task id is the same as the one associated with the rule (!rule.scheduleTaskId || rule.scheduledTaskId === task.id). !rule.scheduleTaskId is important as tasks can start running during the rule creation process while we haven't had a chance to update scheduleTaskId yet (https://github.com/elastic/kibana/blob/main/x-pack/plugins/alerting/server/rules_client/rules_client.ts#L397-L399).

@chrisronline
Copy link
Contributor

I agree with most of this comment. The second comment should be moot, as this recent PR should ensure we don't have multiple tasks for the same rule.

We need to check if the rule is enabled or not before attempting to execute it, and we can modify this function to also return the enabled property and if that's not enabled, we do not attempt to fetch the rule here (which will eventually lead to rule execution).

The outstanding question is what do we do in this scenario? Do we treat this as an error state that goes into the event log as such? Should we add a new error state for "disabled" and log it as such?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Fixes for quality problems that affect the customer experience Feature:Alerting impact:high Addressing this issue will have a high level of impact on the quality/strength of our product. resilience Issues related to Platform resilience in terms of scale, performance & backwards compatibility Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants