Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Distinguish error types(source) for actions serverless metrics #171416

Merged
merged 33 commits into from
Dec 20, 2023

Conversation

ersin-erdal
Copy link
Contributor

@ersin-erdal ersin-erdal commented Nov 16, 2023

Resolves: #168635

This PR intends to add source data (user or framework) to the errors that returned by the actions plugin and use them to create two new metrics in TaskManager.

New metrics:

USER_ERRORS = 'user_errors'
FRAMEWORK_ERRORS = 'framework_errors'

Error source types:

  FRAMEWORK = 'framework'
  USER = 'user'

I tried to keep the errorSource definition close to the place that the error is thrown as much as possible.

To check the types / numbers for these, you can curl $KB_URL/api/task_manager/metrics?reset=false to get the values that our support infrastructure will be using. To test how well this code works, you can inject errors (add code) into a connector executor that you use in an alert action, and then use the endpoint to see how the errors were counted.

@ersin-erdal ersin-erdal added release_note:skip Skip the PR/issue when compiling release notes Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v8.12.0 labels Nov 16, 2023
@ersin-erdal ersin-erdal self-assigned this Nov 16, 2023
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to move this here as we are not allowed to import something from /server to /common and i had to use TaskErrorSource in actions/common

@ersin-erdal ersin-erdal force-pushed the 168635-error-distinguish-actions branch 3 times, most recently from e2aee64 to 618b414 Compare November 20, 2023 11:37
@ersin-erdal ersin-erdal force-pushed the 168635-error-distinguish-actions branch from 618b414 to 5471902 Compare November 20, 2023 13:19
@ersin-erdal
Copy link
Contributor Author

@elasticmachine merge upstream

// increment success counters
if (success) {
if (success || ignoreError) {
this.incrementCounters(TaskRunKeys.SUCCESS, taskType, taskTypeGroup);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we add user_errors to success metric in order to subtract user_erros from SLO target.
Rather than adding them here, we can use user_errors metric in SLO (success + user_errors) as well.

Please let me know WDYT.
I can remove ignoreError if we decide to do the addition in the SLO.

Copy link
Contributor

@mikecote mikecote Nov 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer the use user_errors metric in SLO (success + user_errors) option because we are trying to categorize the errors regardless of the type. This will allow various SLOs to determine whether or not a specific type of error is considered success for them or not.

For example:

  • Alerting framework success in the future would exclude user errors and rule type implementation errors
  • Solution teams would create SLOs on their rule type and exclude framework and user errors

This allows a breakdown down the line as mentioned here: #167055

@ersin-erdal ersin-erdal marked this pull request as ready for review November 22, 2023 13:20
@ersin-erdal ersin-erdal requested a review from a team as a code owner November 22, 2023 13:20
@elasticmachine
Copy link
Contributor

Pinging @elastic/response-ops (Team:ResponseOps)

@pmuellr
Copy link
Member

pmuellr commented Nov 22, 2023

I think we are not labelling enough errors as "USER". For example, I went into the Slack executor, and added a line at the top

if (true) return errorResult(execOptions.actionId, 'rando!')

I think these sort of errors should be marked as User. Basically, if the executor returns a result, instead of throwing an error, we probably want to label it as USER. If we catch an error from the executor, label that as FRAMEWORK. It's not perfect, but I think this more or less captures how I was thinking of splitting these.

Certainly we should make it such that if an executor throws an error, that implies a framework error - but perhaps some connectors are lazy and don't return a structured error instead - we could change those by having them return a structured error instead (in subsequent PRs). Likewise, if it doesn't throw an error, but returns an actual result, label that as USER. But we may want an "override" on that later, to allow a structured result to indicate it's a framework error.

The main problem without doing this is that all of the Slack errors we see (bad token, etc) that are actually user, still end up being labelled as framework.

@ersin-erdal
Copy link
Contributor Author

ersin-erdal commented Nov 22, 2023

I think we are not labelling enough errors as "USER". For example, I went into the Slack executor, and added a line at the top

Should be ok now, I marked all the errors thrown by the action types as USER error.

@ymao1
Copy link
Contributor

ymao1 commented Dec 18, 2023

I am not sure about marking the validation errors as USER? I feel like most likely the validation issues come when we as the developers improperly modify the schema or forget a migration. WDYT?

@ersin-erdal
Copy link
Contributor Author

ersin-erdal commented Dec 18, 2023

I am not sure about marking the validation errors as USER? I feel like most likely the validation issues come when we as the developers improperly modify the schema or forget a migration. WDYT?

That was my opinion as well but we marked validation errors as user error in alerting: https://github.com/elastic/kibana/pull/169306/files/50d6d3bfd697b4d02303f2e2aa1c665039bf0e2d#diff-ce9a445b40db3d05e90a5d293e8754a57308df8fd9d9cb9af9503a255d75daf4R97

I think the idea is: an outdated schema could never go to prod, therefore a validation error could be caused by polluted data!

But I am not sure either.
@mikecote and @pmuellr WDYT?

@mikecote
Copy link
Contributor

mikecote commented Dec 19, 2023

I am not sure about marking the validation errors as USER? I feel like most likely the validation issues come when we as the developers improperly modify the schema or forget a migration. WDYT?

That was my opinion as well but we marked validation errors as user error in alerting: https://github.com/elastic/kibana/pull/169306/files/50d6d3bfd697b4d02303f2e2aa1c665039bf0e2d#diff-ce9a445b40db3d05e90a5d293e8754a57308df8fd9d9cb9af9503a255d75daf4R97

I think the idea is: an outdated schema could never go to prod, therefore a validation error could be caused by polluted data!

But I am not sure either.
@mikecote and @pmuellr WDYT?

Do we have an example in the code I can take a look at? There will be sections where it will be on the user (ex: calling the execute API, generating actions from rules) and there will be sections where it's on the framework (ex: pre-run checks where we skip tasks indefinitely). Maybe there's scenarios where ambiguity exists?

@ersin-erdal
Copy link
Contributor Author

Do we have an example in the code I can take a look at? There will be sections where it will be on the user (ex: calling the execute API, generating actions from rules) and there will be sections where it's on the framework (ex: pre-run checks where we skip tasks indefinitely). Maybe there's scenarios where ambiguity exists?

I think the one we discuss here is: https://github.com/elastic/kibana/pull/171416/files#diff-9a1c0e513210895d1de20260558acd5fadacf4b8331879fba6f37a3e96375fabR544

result of:

 validatedParams = validateParams(actionType, params, validatorServices);
 validatedConfig = validateConfig(actionType, config, validatorServices);
 validatedSecrets = validateSecrets(actionType, secrets, validatorServices);

@ymao1
Copy link
Contributor

ymao1 commented Dec 19, 2023

There will be sections where it will be on the user (ex: calling the execute API, generating actions from rules) and there will be sections where it's on the framework

That's true, it is tricky to distinguish. I guess my fear is that we'll introduce a validation error that we end up ignoring because it's marked as user error. Maybe it's fine to go with this and see if we need to make further adjustments later.

@mikecote
Copy link
Contributor

I think we should only flag user errors if we're highly confident that it's always caused by the user. In this example, we know we validate user input for config and secrets when creating or updating connectors via the HTTP APIs, so failures at this level would be caused by the system.

Regarding params, I am not sure if we validate prior to the executor being called (via execute, enqueueExecute and bulkEnqueueExecute), so this may be a case where it is a mix of system and user errors. But because we can't be highly certain on the cause, it's best to default to system errors which contribute to our SLIs and once we can distinguish better we can incorporate user errors.

So in short I think it's best to consider this system errors.

@ersin-erdal
Copy link
Contributor Author

Ok, changed validation errors to FRAMEWORK.
@mikecote @ymao1

@@ -274,6 +272,7 @@ export class ActionExecutor {
serviceMessage: err.message,
error: err,
retry: true,
errorSource: TaskErrorSource.USER,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like if an error is thrown from the actionType.executor, we assign it a user error, but if the action type executor handles its own error and gracefully returns a rawResult with an error object, no error type would be assigned (meaning it would default to a framework error). I think we need to add the errorSource in this scenario.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do strip the raw error object but inside the task_runner_factory, we do this check:

if (executorResult.status === 'error') {
   throw throwRetryableError(
            createTaskRunError(new Error(executorResult.message), executorResult.errorSource),
            executorResult.retry as boolean | Date
          );
}

so maybe we can do something here with the error source? I believe most of our executors return controlled results so we would be mis-classifying a large source of user errors if we don't handle this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can inject an errorSource in the action_executor even while stripping out the raw error, since the status is error

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

@@ -23,6 +23,7 @@ import { securityMock } from '@kbn/security-plugin/server/mocks';
import { finished } from 'stream/promises';
import { PassThrough } from 'stream';
import { SecurityConnectorFeatureId } from '../../common';
import { TaskErrorSource } from '@kbn/task-manager-plugin/common';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a test for what is returned when the actionType.executor doesn't throw an error but returns a result with an error?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -1137,6 +1140,7 @@ test('throws an error when params is invalid', async () => {
status: 'error',
retry: false,
message: `error validating action params: [param1]: expected value of type [string] but got [undefined]`,
errorSource: TaskErrorSource.FRAMEWORK,
});
});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the test below throws an error when passing isESOCanEncrypt with value of false, can we update it to test for the error source inside the error that's thrown? I think there are few other tests that could be updated this way as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is good idea.
I wrapped those errors with try-catch and marked them as FRAMEWORK explicitly where they are thrown.
Modified the tests as well. Thanks.

@ymao1
Copy link
Contributor

ymao1 commented Dec 20, 2023

Looking through the code, here is how I see the breakdown

Error type Error description
USER User is not authorized to execute system action
USER User is not authorized to execute Sentinel One action
USER Action type disabled or license error
USER Missing encryption key
USER Errors thrown inside executor
?? Errors returned gracefully inside the executor result
FRAMEWORK Error in the action_executor that are not from the action type executor
FRAMEWORK Error getting decrypted action task param SO
FRAMEWORK Error getting decrypted connector SO
FRAMEWORK Error injecting saved object references
FRAMEWORK Config/secret/params validation

Going to try to verify these scenarios and then maybe we can start a document to keep track of these different types of errors?

@ymao1
Copy link
Contributor

ymao1 commented Dec 20, 2023

Verified all of the above listed error scenarios. I think the only thing is ensuring that controlled errors from the action type executors are reported as USER not FRAMEWORK and this PR is good to go!

@ersin-erdal
Copy link
Contributor Author

Verified all of the above listed error scenarios. I think the only thing is ensuring that controlled errors from the action type executors are reported as USER not FRAMEWORK and this PR is good to go!

Patrick thinks that they should be user errors:
#171416 (comment)

@ymao1
Copy link
Contributor

ymao1 commented Dec 20, 2023

Verified all of the above listed error scenarios. I think the only thing is ensuring that controlled errors from the action type executors are reported as USER not FRAMEWORK and this PR is good to go!

Patrick thinks that they should be user errors: #171416 (comment)

I agree they should be user! Currently they are being returned as framework errors.

@kibana-ci
Copy link
Collaborator

💚 Build Succeeded

Metrics [docs]

Public APIs missing comments

Total count of every public API that lacks a comment. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats comments for more detailed information.

id before after diff
actions 263 264 +1
Unknown metric groups

API count

id before after diff
actions 269 270 +1

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @ersin-erdal

Copy link
Contributor

@ymao1 ymao1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Let's start a document to keep track of the different types of errors and what we're classifying them as, so we can refer back to it as needed.

@ersin-erdal ersin-erdal merged commit dc25e89 into elastic:main Dec 20, 2023
37 checks passed
@kibanamachine kibanamachine added the backport:skip This commit does not require backporting label Dec 20, 2023
@ersin-erdal ersin-erdal deleted the 168635-error-distinguish-actions branch December 20, 2023 18:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport:skip This commit does not require backporting release_note:skip Skip the PR/issue when compiling release notes Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v8.13.0
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

Split action metrics between total, framework, action type executor, and user errors in the serverless metrics
7 participants