-
Notifications
You must be signed in to change notification settings - Fork 8.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Security Solution] Classify EQL verification and ML job missing errors as user errors #180094
Conversation
b276e76
to
c0ece99
Compare
@@ -131,4 +132,26 @@ describe('ml_executor', () => { | |||
); | |||
expect(response.warningMessages.length).toEqual(1); | |||
}); | |||
|
|||
it('should report job missing errors as user errors', async () => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ML test suite is skipped, but this test works locally if you unskip the suite
Pinging @elastic/security-solution (Team: SecuritySolution) |
Pinging @elastic/security-detection-engine (Team:Detection Engine) |
/ci |
Hi @marshallmain. Jumping in in this PR since I created this ticket which I think is tangentially related to the work you're doing here. We started a discussion with @rylnd where your opinion will probably prove valuable, if you have some time to take a look: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pulled down and confirmed behavior described to be tested. Errors still show up in UI for user, but userError
is true
. LGTM!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@marshallmain this is really necessary improvement 👍
I left two comments regrading my concerns in error handling but overall PR looks good. I tested it locally and it works as expected. I approve in advance and trust you address my comments.
} | ||
return result; | ||
} catch (error) { | ||
if ((error.message as string).includes('verification_exception')) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
JS allow to throw anything. Generally speaking message
can be undefined
in rare cases. Since it's an error handler I'd recommend to add more robust check here like
if (error instanceof Error && error.message.includes('verification_exception')) {
exceptionFilter, | ||
}); | ||
} catch (error) { | ||
if ((error.message as string).endsWith('missing')) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same as above regarding the error type.
💚 Build Succeeded
Metrics [docs]
History
To update your PR or re-run it, just comment with: |
…rs as user errors (elastic#180094) ## Summary Building on elastic#180040 Detection rules commonly fail when prebuilt rules are imported and enabled without the appropriate indices (for EQL) or ML jobs (for ML rules). EQL rules fail with a `verification_exception` because the EQL search API validates the fields in the search request against the indices in the request; if there are no indices then it returns an exception. ML rules fail with a `<job name> missing` exception on the search request if the job is not found. Both of these errors do not mean that the system is overloaded or performing incorrectly in some way, but they are still showing up in large volumes on SLO dashboards. This PR builds on elastic#180040, which introduces the ability to classify errors as "user errors" when the error is not due to some kind of system malfunction, but more related to incorrect (or insufficient) user actions. ### Testing #### EQL 1. Create 2 indices, `test` and `test2` ``` PUT /test { "mappings": { "properties": { "@timestamp": { "type": "date" } } } } PUT /test2 { "mappings": { "properties": { "@timestamp": { "type": "date" }, "event.category": { "type": "keyword" } } } } ``` 2. Create (disabled) an EQL rule that queries `test*` and uses a simple query like `file where true` 3. Delete the index `test2` 4. Enable the rule. The rule will fail with a `verification_exception` because `test` does not have `event.category`. 5. Use your favorite debugging method to verify that `userError` was `true` in `addLastRunError` in `x-pack/plugins/alerting/server/monitoring/rule_result_service.ts` (hopefully this property will be added to the rule SO so we can check it there in API integration tests) #### ML rules 1. Import a prebuilt ML rule (`Unusual Process Spawned by a User`, for example) 2. Enable the rule. The rule will fail with `An error occurred during rule execution: message: "problem_child_rare_process_by_user missing"` 3. Use your favorite debugging method to verify that `userError` was `true` in `addLastRunError` in `x-pack/plugins/alerting/server/monitoring/rule_result_service.ts` (hopefully this property will be added to the rule SO so we can check it there in API integration tests)
…ing errors as user errors (elastic#180094)" This reverts commit 9242afb.
## Summary API integration tests added in #180094 fail occasionally on main, but pass consistently locally and in the flaky test runner. This PR adds a retry to the `getMetrics` request in hopes of removing the flakiness. Flake issues: #180530 #180641 --------- Co-authored-by: kibanamachine <[email protected]>
Summary
Building on #180040
Detection rules commonly fail when prebuilt rules are imported and enabled without the appropriate indices (for EQL) or ML jobs (for ML rules). EQL rules fail with a
verification_exception
because the EQL search API validates the fields in the search request against the indices in the request; if there are no indices then it returns an exception. ML rules fail with a<job name> missing
exception on the search request if the job is not found. Both of these errors do not mean that the system is overloaded or performing incorrectly in some way, but they are still showing up in large volumes on SLO dashboards.This PR builds on #180040, which introduces the ability to classify errors as "user errors" when the error is not due to some kind of system malfunction, but more related to incorrect (or insufficient) user actions.
Testing
EQL
test
andtest2
test*
and uses a simple query likefile where true
test2
verification_exception
becausetest
does not haveevent.category
.userError
wastrue
inaddLastRunError
inx-pack/plugins/alerting/server/monitoring/rule_result_service.ts
(hopefully this property will be added to the rule SO so we can check it there in API integration tests)ML rules
Unusual Process Spawned by a User
, for example)An error occurred during rule execution: message: "problem_child_rare_process_by_user missing"
userError
wastrue
inaddLastRunError
inx-pack/plugins/alerting/server/monitoring/rule_result_service.ts
(hopefully this property will be added to the rule SO so we can check it there in API integration tests)