Distinguish error types(source) for actions serverless metrics #171416
I had to move this here, as we are not allowed to import anything from `/server` into `/common`, and I had to use `TaskErrorSource` in `actions/common`.
@elasticmachine merge upstream
```diff
 // increment success counters
-if (success) {
+if (success || ignoreError) {
   this.incrementCounters(TaskRunKeys.SUCCESS, taskType, taskTypeGroup);
```
Here we add `user_errors` to the `success` metric in order to subtract user errors from the SLO target. Rather than adding them here, we could instead use the `user_errors` metric in the SLO calculation (success + user_errors). Please let me know WDYT. I can remove `ignoreError` if we decide to do the addition in the SLO.
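The second option can be sketched as follows; the counter names here are illustrative assumptions, not the actual Task Manager metric fields. User-caused failures are added on the SLO side rather than folded into the success counter at increment time.

```typescript
// Illustrative counter shape; field names are assumptions, not the
// real Task Manager metrics payload.
interface TaskRunCounters {
  success: number;
  userErrors: number;
  frameworkErrors: number;
}

// SLO option discussed above: numerator = success + user_errors,
// so user-caused failures do not count against the target.
function sloRatio(c: TaskRunCounters): number {
  const total = c.success + c.userErrors + c.frameworkErrors;
  return total === 0 ? 1 : (c.success + c.userErrors) / total;
}
```

This keeps the raw `success` counter honest while still letting each SLO decide which error sources count against it.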
I prefer the "use the `user_errors` metric in the SLO (success + user_errors)" option, because we are trying to categorize the errors regardless of type. This will allow various SLOs to determine whether or not a specific type of error is considered a success for them.
For example:
- Alerting framework success in the future would exclude user errors and rule type implementation errors
- Solution teams would create SLOs on their rule types and exclude framework and user errors

This allows a breakdown down the line, as mentioned here: #167055
Pinging @elastic/response-ops (Team:ResponseOps)
I think we are not labelling enough errors as "USER". For example, I went into the Slack executor and added a line at the top.
I think these sorts of errors should be marked as USER. Basically, if the executor returns an error result instead of throwing an error, we probably want to label it as USER. If we catch an error thrown from the executor, label that as FRAMEWORK. It's not perfect, but I think this more or less captures how I was thinking of splitting these. Certainly we should make it such that if an executor throws an error, that implies a framework error. Some connectors may be lazy and throw rather than returning a structured error; we could change those by having them return a structured error instead (in subsequent PRs). Likewise, if it doesn't throw an error but returns an actual result, label that as USER. But we may want an "override" on that later, to allow a structured result to indicate it's a framework error. The main problem without doing this is that all of the Slack errors we see (bad token, etc.) that are actually user errors still end up being labelled as framework.
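The proposed split can be sketched like this; the types below are simplified stand-ins for the plugin's real interfaces, not the actual implementation:

```typescript
// Simplified stand-ins for the real plugin types.
enum TaskErrorSource {
  USER = 'user',
  FRAMEWORK = 'framework',
}

interface ExecutorResult {
  status: 'ok' | 'error';
  message?: string;
}

// Returned error result => USER; thrown error => FRAMEWORK;
// successful result => no error source.
async function classifyRun(
  executor: () => Promise<ExecutorResult>
): Promise<TaskErrorSource | undefined> {
  try {
    const result = await executor();
    return result.status === 'error' ? TaskErrorSource.USER : undefined;
  } catch {
    return TaskErrorSource.FRAMEWORK;
  }
}
```

Under this rule, a Slack "bad token" response returned as a structured error result would be counted as USER, while an unexpected throw from the executor would remain FRAMEWORK.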
Should be ok now, I marked all the errors thrown by the action types as USER errors.
I am not sure about marking the validation errors as user errors.
That was my opinion as well, but we marked validation errors as user errors in alerting: https://github.com/elastic/kibana/pull/169306/files/50d6d3bfd697b4d02303f2e2aa1c665039bf0e2d#diff-ce9a445b40db3d05e90a5d293e8754a57308df8fd9d9cb9af9503a255d75daf4R97 I think the idea is: an outdated schema could never go to prod, therefore a validation error could be caused by polluted data!
Do we have an example in the code I can take a look at? There will be sections where it will be on the user (ex: calling the execute API, generating actions from rules) and there will be sections where it's on the framework (ex: pre-run checks where we skip tasks indefinitely). Maybe there are scenarios where ambiguity exists?
I think the one we're discussing here is: https://github.com/elastic/kibana/pull/171416/files#diff-9a1c0e513210895d1de20260558acd5fadacf4b8331879fba6f37a3e96375fabR544, the result of:
That's true, it is tricky to distinguish. I guess my fear is that we'll introduce a validation error that we end up ignoring because it's marked as a user error. Maybe it's fine to go with this and see if we need to make further adjustments later.
I think we should only flag user errors if we're highly confident that they're always caused by the user. In this example, we know we validate user input for […]. Regarding […]: in short, I think it's best to consider these system errors.
```diff
@@ -274,6 +272,7 @@ export class ActionExecutor {
       serviceMessage: err.message,
       error: err,
       retry: true,
+      errorSource: TaskErrorSource.USER,
```
It looks like if an error is thrown from the `actionType.executor`, we assign it a user error, but if the action type executor handles its own error and gracefully returns a `rawResult` with an error object, no error source would be assigned (meaning it would default to a framework error). I think we need to add the `errorSource` in this scenario.
We do strip the raw error object, but inside the `task_runner_factory` we do this check:

```ts
if (executorResult.status === 'error') {
  throw throwRetryableError(
    createTaskRunError(new Error(executorResult.message), executorResult.errorSource),
    executorResult.retry as boolean | Date
  );
}
```

so maybe we can do something here with the error source? I believe most of our executors return controlled results, so we would be mis-classifying a large source of user errors if we don't handle this.
We can inject an `errorSource` in the `action_executor` even while stripping out the raw error, since the status is `error`.
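A sketch of that idea, with a result shape assumed from the discussion (the real `action_executor` result may differ): strip the raw error object but default the `errorSource` to USER for controlled error results.

```typescript
enum TaskErrorSource {
  USER = 'user',
  FRAMEWORK = 'framework',
}

// Assumed result shape, modeled on the discussion above.
interface RawResult {
  status: 'ok' | 'error';
  message?: string;
  error?: unknown; // raw error object, stripped before returning
  errorSource?: TaskErrorSource;
}

function stripRawError(result: RawResult): RawResult {
  if (result.status !== 'error') return result;
  const { error, ...rest } = result; // drop the raw error object
  // A controlled error result defaults to USER unless the executor
  // already tagged it otherwise.
  return { ...rest, errorSource: result.errorSource ?? TaskErrorSource.USER };
}
```

The `task_runner_factory` check quoted above would then always see a populated `executorResult.errorSource` for controlled errors.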
done.
```diff
@@ -23,6 +23,7 @@ import { securityMock } from '@kbn/security-plugin/server/mocks';
 import { finished } from 'stream/promises';
 import { PassThrough } from 'stream';
 import { SecurityConnectorFeatureId } from '../../common';
+import { TaskErrorSource } from '@kbn/task-manager-plugin/common';
```
Can we add a test for what is returned when the `actionType.executor` doesn't throw an error but returns a result with an error?
done
```diff
@@ -1137,6 +1140,7 @@ test('throws an error when params is invalid', async () => {
     status: 'error',
     retry: false,
     message: `error validating action params: [param1]: expected value of type [string] but got [undefined]`,
+    errorSource: TaskErrorSource.FRAMEWORK,
   });
 });
```
In the test below, `throws an error when passing isESOCanEncrypt with value of false`, can we update it to test for the error source inside the error that's thrown? I think there are a few other tests that could be updated this way as well.
This is a good idea.
I wrapped those errors with try/catch and marked them as FRAMEWORK explicitly where they are thrown.
Modified the tests as well. Thanks.
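The explicit wrapping can be sketched like this; the wrapper and error class are hypothetical stand-ins (the real code uses `createTaskRunError`), but the shape of the idea is the same:

```typescript
enum TaskErrorSource {
  USER = 'user',
  FRAMEWORK = 'framework',
}

// Hypothetical error carrier; the real plugin uses createTaskRunError.
class TaskRunError extends Error {
  constructor(cause: Error, public readonly errorSource: TaskErrorSource) {
    super(cause.message);
  }
}

// Wrap a framework-side step so anything it throws is explicitly
// tagged as a FRAMEWORK error.
function withFrameworkErrorSource<T>(fn: () => T): T {
  try {
    return fn();
  } catch (err) {
    throw new TaskRunError(err as Error, TaskErrorSource.FRAMEWORK);
  }
}
```

Tests can then assert on the `errorSource` carried by the thrown error rather than only on its message.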
Looking through the code, here is how I see the breakdown:
Going to try to verify these scenarios, and then maybe we can start a document to keep track of these different types of errors?
Verified all of the above listed error scenarios. I think the only thing left is ensuring that controlled errors from the action type executors are reported as user errors.
Patrick thinks that they should be user errors.
I agree they should be user! Currently they are being returned as framework errors.
💚 Build Succeeded
cc @ersin-erdal
LGTM! Let's start a document to keep track of the different types of errors and what we're classifying them as, so we can refer back to it as needed.
Resolves: #168635
This PR intends to add `source` data (user or framework) to the errors returned by the actions plugin, and to use it to create two new metrics in Task Manager.

New metrics:
Error source types:
I tried to keep the `errorSource` definition as close as possible to the place where the error is thrown.
To check the types / numbers for these, you can `curl $KB_URL/api/task_manager/metrics?reset=false` to get the values that our support infrastructure will be using. To test how well this code works, you can inject errors (add code) into a connector executor that you use in an alert action, and then use the endpoint to see how the errors were counted.
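As a quick offline illustration of what the split enables, here is a hypothetical sketch of computing the framework share of failures from the two counters; the payload shape is assumed for illustration and may not match the real `/api/task_manager/metrics` output:

```typescript
// Assumed payload shape for illustration only; the real metrics
// endpoint may nest these counters differently.
interface ErrorBySource {
  user: number;
  framework: number;
}

// Share of failures attributable to the framework, i.e. the part
// that should count against the SLO.
function frameworkErrorShare(errors: ErrorBySource): number {
  const total = errors.user + errors.framework;
  return total === 0 ? 0 : errors.framework / total;
}
```

With a live Kibana you would feed this from the output of the `curl` command above.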