Distinguish error types(source) for actions serverless metrics #171416

ersin-erdal · 2023-11-16T15:11:21Z

This PR intends to add source data (user or framework) to the errors that returned by the actions plugin and use them to create two new metrics in TaskManager.

New metrics:

USER_ERRORS = 'user_errors'
FRAMEWORK_ERRORS = 'framework_errors'

Error source types:

  FRAMEWORK = 'framework'
  USER = 'user'

I tried to keep the errorSource definition close to the place that the error is thrown as much as possible.

To check the types / numbers for these, you can curl $KB_URL/api/task_manager/metrics?reset=false to get the values that our support infrastructure will be using. To test how well this code works, you can inject errors (add code) into a connector executor that you use in an alert action, and then use the endpoint to see how the errors were counted.

ersin-erdal · 2023-11-16T17:00:27Z

x-pack/plugins/task_manager/common/constants.ts

I had to move this here as we are not allowed to import something from /server to /common and i had to use TaskErrorSource in actions/common

…-fix'

ersin-erdal · 2023-11-22T07:29:24Z

@elasticmachine merge upstream

ersin-erdal · 2023-11-22T12:37:48Z

x-pack/plugins/task_manager/server/metrics/task_run_metrics_aggregator.ts

    // increment success counters
-    if (success) {
+    if (success || ignoreError) {
      this.incrementCounters(TaskRunKeys.SUCCESS, taskType, taskTypeGroup);


Here we add user_errors to success metric in order to subtract user_erros from SLO target.
Rather than adding them here, we can use user_errors metric in SLO (success + user_errors) as well.

Please let me know WDYT.
I can remove ignoreError if we decide to do the addition in the SLO.

I prefer the use user_errors metric in SLO (success + user_errors) option because we are trying to categorize the errors regardless of the type. This will allow various SLOs to determine whether or not a specific type of error is considered success for them or not.

For example:

Alerting framework success in the future would exclude user errors and rule type implementation errors

Solution teams would create SLOs on their rule type and exclude framework and user errors

This allows a breakdown down the line as mentioned here: #167055

elasticmachine · 2023-11-22T13:20:18Z

Pinging @elastic/response-ops (Team:ResponseOps)

…' into 168635-error-distinguish-actions

pmuellr · 2023-11-22T16:59:34Z

I think we are not labelling enough errors as "USER". For example, I went into the Slack executor, and added a line at the top

if (true) return errorResult(execOptions.actionId, 'rando!')

I think these sort of errors should be marked as User. Basically, if the executor returns a result, instead of throwing an error, we probably want to label it as USER. If we catch an error from the executor, label that as FRAMEWORK. It's not perfect, but I think this more or less captures how I was thinking of splitting these.

Certainly we should make it such that if an executor throws an error, that implies a framework error - but perhaps some connectors are lazy and don't return a structured error instead - we could change those by having them return a structured error instead (in subsequent PRs). Likewise, if it doesn't throw an error, but returns an actual result, label that as USER. But we may want an "override" on that later, to allow a structured result to indicate it's a framework error.

The main problem without doing this is that all of the Slack errors we see (bad token, etc) that are actually user, still end up being labelled as framework.

ersin-erdal · 2023-11-22T18:39:54Z

I think we are not labelling enough errors as "USER". For example, I went into the Slack executor, and added a line at the top

Should be ok now, I marked all the errors thrown by the action types as USER error.

ymao1 · 2023-12-18T15:36:25Z

I am not sure about marking the validation errors as USER? I feel like most likely the validation issues come when we as the developers improperly modify the schema or forget a migration. WDYT?

ersin-erdal · 2023-12-18T16:23:25Z

I am not sure about marking the validation errors as USER? I feel like most likely the validation issues come when we as the developers improperly modify the schema or forget a migration. WDYT?

That was my opinion as well but we marked validation errors as user error in alerting: https://github.com/elastic/kibana/pull/169306/files/50d6d3bfd697b4d02303f2e2aa1c665039bf0e2d#diff-ce9a445b40db3d05e90a5d293e8754a57308df8fd9d9cb9af9503a255d75daf4R97

I think the idea is: an outdated schema could never go to prod, therefore a validation error could be caused by polluted data!

But I am not sure either.
@mikecote and @pmuellr WDYT?

mikecote · 2023-12-19T11:07:14Z

I am not sure about marking the validation errors as USER? I feel like most likely the validation issues come when we as the developers improperly modify the schema or forget a migration. WDYT?

That was my opinion as well but we marked validation errors as user error in alerting: https://github.com/elastic/kibana/pull/169306/files/50d6d3bfd697b4d02303f2e2aa1c665039bf0e2d#diff-ce9a445b40db3d05e90a5d293e8754a57308df8fd9d9cb9af9503a255d75daf4R97

I think the idea is: an outdated schema could never go to prod, therefore a validation error could be caused by polluted data!

But I am not sure either.
@mikecote and @pmuellr WDYT?

Do we have an example in the code I can take a look at? There will be sections where it will be on the user (ex: calling the execute API, generating actions from rules) and there will be sections where it's on the framework (ex: pre-run checks where we skip tasks indefinitely). Maybe there's scenarios where ambiguity exists?

ersin-erdal · 2023-12-19T12:42:39Z

Do we have an example in the code I can take a look at? There will be sections where it will be on the user (ex: calling the execute API, generating actions from rules) and there will be sections where it's on the framework (ex: pre-run checks where we skip tasks indefinitely). Maybe there's scenarios where ambiguity exists?

I think the one we discuss here is: https://github.com/elastic/kibana/pull/171416/files#diff-9a1c0e513210895d1de20260558acd5fadacf4b8331879fba6f37a3e96375fabR544

result of:

 validatedParams = validateParams(actionType, params, validatorServices);
 validatedConfig = validateConfig(actionType, config, validatorServices);
 validatedSecrets = validateSecrets(actionType, secrets, validatorServices);

ymao1 · 2023-12-19T13:06:58Z

There will be sections where it will be on the user (ex: calling the execute API, generating actions from rules) and there will be sections where it's on the framework

That's true, it is tricky to distinguish. I guess my fear is that we'll introduce a validation error that we end up ignoring because it's marked as user error. Maybe it's fine to go with this and see if we need to make further adjustments later.

mikecote · 2023-12-19T13:27:02Z

I think we should only flag user errors if we're highly confident that it's always caused by the user. In this example, we know we validate user input for config and secrets when creating or updating connectors via the HTTP APIs, so failures at this level would be caused by the system.

Regarding params, I am not sure if we validate prior to the executor being called (via execute, enqueueExecute and bulkEnqueueExecute), so this may be a case where it is a mix of system and user errors. But because we can't be highly certain on the cause, it's best to default to system errors which contribute to our SLIs and once we can distinguish better we can incorporate user errors.

So in short I think it's best to consider this system errors.

ersin-erdal · 2023-12-19T18:12:56Z

Ok, changed validation errors to FRAMEWORK.
@mikecote @ymao1

ymao1 · 2023-12-20T14:25:28Z

x-pack/plugins/actions/server/lib/action_executor.ts

@@ -274,6 +272,7 @@ export class ActionExecutor {
              serviceMessage: err.message,
              error: err,
              retry: true,
+              errorSource: TaskErrorSource.USER,


It looks like if an error is thrown from the actionType.executor, we assign it a user error, but if the action type executor handles its own error and gracefully returns a rawResult with an error object, no error type would be assigned (meaning it would default to a framework error). I think we need to add the errorSource in this scenario.

We do strip the raw error object but inside the task_runner_factory, we do this check:

if (executorResult.status === 'error') { throw throwRetryableError( createTaskRunError(new Error(executorResult.message), executorResult.errorSource), executorResult.retry as boolean | Date ); }

so maybe we can do something here with the error source? I believe most of our executors return controlled results so we would be mis-classifying a large source of user errors if we don't handle this

We can inject an errorSource in the action_executor even while stripping out the raw error, since the status is error

ymao1 · 2023-12-20T14:30:13Z

x-pack/plugins/actions/server/lib/action_executor.test.ts

@@ -23,6 +23,7 @@ import { securityMock } from '@kbn/security-plugin/server/mocks';
 import { finished } from 'stream/promises';
 import { PassThrough } from 'stream';
 import { SecurityConnectorFeatureId } from '../../common';
+import { TaskErrorSource } from '@kbn/task-manager-plugin/common';


Can we add a test for what is returned when the actionType.executor doesn't throw an error but returns a result with an error?

ymao1 · 2023-12-20T14:31:19Z

x-pack/plugins/actions/server/lib/action_executor.test.ts

@@ -1137,6 +1140,7 @@ test('throws an error when params is invalid', async () => {
    status: 'error',
    retry: false,
    message: `error validating action params: [param1]: expected value of type [string] but got [undefined]`,
+    errorSource: TaskErrorSource.FRAMEWORK,
  });
 });


In the test below throws an error when passing isESOCanEncrypt with value of false, can we update it to test for the error source inside the error that's thrown? I think there are few other tests that could be updated this way as well.

This is good idea.
I wrapped those errors with try-catch and marked them as FRAMEWORK explicitly where they are thrown.
Modified the tests as well. Thanks.

ymao1 · 2023-12-20T14:37:45Z

Looking through the code, here is how I see the breakdown

Error type	Error description
USER	User is not authorized to execute system action
USER	User is not authorized to execute Sentinel One action
USER	Action type disabled or license error
USER	Missing encryption key
USER	Errors thrown inside executor
??	Errors returned gracefully inside the executor result
FRAMEWORK	Error in the `action_executor` that are not from the action type executor
FRAMEWORK	Error getting decrypted action task param SO
FRAMEWORK	Error getting decrypted connector SO
FRAMEWORK	Error injecting saved object references
FRAMEWORK	Config/secret/params validation

Going to try to verify these scenarios and then maybe we can start a document to keep track of these different types of errors?

ymao1 · 2023-12-20T15:17:57Z

Verified all of the above listed error scenarios. I think the only thing is ensuring that controlled errors from the action type executors are reported as USER not FRAMEWORK and this PR is good to go!

…' into 168635-error-distinguish-actions

…error type

ersin-erdal · 2023-12-20T15:56:18Z

Verified all of the above listed error scenarios. I think the only thing is ensuring that controlled errors from the action type executors are reported as USER not FRAMEWORK and this PR is good to go!

Patrick thinks that they should be user errors:
#171416 (comment)

ymao1 · 2023-12-20T15:58:26Z

Verified all of the above listed error scenarios. I think the only thing is ensuring that controlled errors from the action type executors are reported as USER not FRAMEWORK and this PR is good to go!

Patrick thinks that they should be user errors: #171416 (comment)

I agree they should be user! Currently they are being returned as framework errors.

kibana-ci · 2023-12-20T18:17:19Z

💚 Build Succeeded

Buildkite Build
Commit: a6e27b2

Metrics [docs]

Public APIs missing comments

Total count of every public API that lacks a comment. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats comments for more detailed information.

id	before	after	diff
`actions`	263	264	+1

Unknown metric groups

API count

id	before	after	diff
`actions`	269	270	+1

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @ersin-erdal

ymao1

LGTM! Let's start a document to keep track of the different types of errors and what we're classifying them as, so we can refer back to it as needed.

ersin-erdal added release_note:skip Skip the PR/issue when compiling release notes Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v8.12.0 labels Nov 16, 2023

ersin-erdal self-assigned this Nov 16, 2023

ersin-erdal commented Nov 16, 2023

View reviewed changes

ersin-erdal force-pushed the 168635-error-distinguish-actions branch 3 times, most recently from e2aee64 to 618b414 Compare November 20, 2023 11:37

Distinguish error types(source) for actions serverless metrics

5471902

ersin-erdal force-pushed the 168635-error-distinguish-actions branch from 618b414 to 5471902 Compare November 20, 2023 13:19

ersin-erdal and others added 5 commits November 21, 2023 12:19

Add framework_errors and user_errors metrics

c7356ba

Add unit tests

26b5b80

merge main

d66bf7d

[CI] Auto-commit changed files from 'node scripts/eslint --no-cache -…

17a64aa

…-fix'

Merge branch 'main' into 168635-error-distinguish-actions

6a6c906

kibanamachine and others added 2 commits November 22, 2023 02:29

Merge branch 'main' into 168635-error-distinguish-actions

e62dffa

Merge branch 'main' into 168635-error-distinguish-actions

33d7a03

ersin-erdal commented Nov 22, 2023

View reviewed changes

ersin-erdal marked this pull request as ready for review November 22, 2023 13:20

ersin-erdal requested a review from a team as a code owner November 22, 2023 13:20

ersin-erdal added 3 commits November 22, 2023 15:42

Fix getErrorSource

90c4ad9

Merge remote-tracking branch 'origin/168635-error-distinguish-actions…

e9ae6fd

…' into 168635-error-distinguish-actions

fix unit test

606d97b

ersin-erdal added 2 commits November 22, 2023 18:41

refactor error helpers

c293bca

mark action type errors as USER

0192da7

fix integration test

7051457

ersin-erdal added v8.13.0 and removed v8.12.0 labels Dec 18, 2023

ersin-erdal added 2 commits December 19, 2023 17:57

Mark validation errors as USER error

ccaf4a6

Mark validation errors as USER error

cfdd62e

ersin-erdal and others added 4 commits December 20, 2023 00:19

fix failing functional tests

cea41b2

fix failing functional tests

69b955d

fix failing functional tests

9877785

Merge branch 'main' into 168635-error-distinguish-actions

95a5267

ymao1 reviewed Dec 20, 2023

View reviewed changes

ersin-erdal added 3 commits December 20, 2023 16:23

fix failing functional tests

c91ace6

Merge remote-tracking branch 'origin/168635-error-distinguish-actions…

fb31c18

…' into 168635-error-distinguish-actions

wrap the errors thrown in actionExecutor with try-catch and test the …

73c5e69

…error type

ersin-erdal added 2 commits December 20, 2023 17:53

mark actionType.executor returned errors as USER error

5fd3a0c

add unit test for actionType returned errors

a6e27b2

ymao1 approved these changes Dec 20, 2023

View reviewed changes

ersin-erdal merged commit dc25e89 into elastic:main Dec 20, 2023
37 checks passed

kibanamachine added the backport:skip This commit does not require backporting label Dec 20, 2023

ersin-erdal deleted the 168635-error-distinguish-actions branch December 20, 2023 18:51

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Distinguish error types(source) for actions serverless metrics #171416

Distinguish error types(source) for actions serverless metrics #171416

ersin-erdal commented Nov 16, 2023 •

edited by pmuellr

Loading

ersin-erdal Nov 16, 2023

ersin-erdal commented Nov 22, 2023

ersin-erdal Nov 22, 2023

mikecote Nov 23, 2023 •

edited

Loading

elasticmachine commented Nov 22, 2023

pmuellr commented Nov 22, 2023

ersin-erdal commented Nov 22, 2023 •

edited

Loading

ymao1 commented Dec 18, 2023

ersin-erdal commented Dec 18, 2023 •

edited

Loading

mikecote commented Dec 19, 2023 •

edited

Loading

ersin-erdal commented Dec 19, 2023

ymao1 commented Dec 19, 2023

mikecote commented Dec 19, 2023

ersin-erdal commented Dec 19, 2023

ymao1 Dec 20, 2023

ymao1 Dec 20, 2023

ymao1 Dec 20, 2023

ersin-erdal Dec 20, 2023

ymao1 Dec 20, 2023

ersin-erdal Dec 20, 2023

ymao1 Dec 20, 2023

ersin-erdal Dec 20, 2023

ymao1 commented Dec 20, 2023 •

edited

Loading

ymao1 commented Dec 20, 2023

ersin-erdal commented Dec 20, 2023

ymao1 commented Dec 20, 2023

kibana-ci commented Dec 20, 2023

API count

ymao1 left a comment

Distinguish error types(source) for actions serverless metrics #171416

Distinguish error types(source) for actions serverless metrics #171416

Conversation

ersin-erdal commented Nov 16, 2023 • edited by pmuellr Loading

Choose a reason for hiding this comment

ersin-erdal commented Nov 22, 2023

Choose a reason for hiding this comment

mikecote Nov 23, 2023 • edited Loading

Choose a reason for hiding this comment

elasticmachine commented Nov 22, 2023

pmuellr commented Nov 22, 2023

ersin-erdal commented Nov 22, 2023 • edited Loading

ymao1 commented Dec 18, 2023

ersin-erdal commented Dec 18, 2023 • edited Loading

mikecote commented Dec 19, 2023 • edited Loading

ersin-erdal commented Dec 19, 2023

ymao1 commented Dec 19, 2023

mikecote commented Dec 19, 2023

ersin-erdal commented Dec 19, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ymao1 commented Dec 20, 2023 • edited Loading

ymao1 commented Dec 20, 2023

ersin-erdal commented Dec 20, 2023

ymao1 commented Dec 20, 2023

kibana-ci commented Dec 20, 2023

💚 Build Succeeded

Metrics [docs]

Public APIs missing comments

API count

History

ymao1 left a comment

Choose a reason for hiding this comment

ersin-erdal commented Nov 16, 2023 •

edited by pmuellr

Loading

mikecote Nov 23, 2023 •

edited

Loading

ersin-erdal commented Nov 22, 2023 •

edited

Loading

ersin-erdal commented Dec 18, 2023 •

edited

Loading

mikecote commented Dec 19, 2023 •

edited

Loading

ymao1 commented Dec 20, 2023 •

edited

Loading