Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Security Solution] Classify EQL verification and ML job missing errors as user errors #180094

Merged
merged 6 commits into from
Apr 10, 2024

Conversation

marshallmain
Copy link
Contributor

Summary

Building on #180040

Detection rules commonly fail when prebuilt rules are imported and enabled without the appropriate indices (for EQL) or ML jobs (for ML rules). EQL rules fail with a verification_exception because the EQL search API validates the fields in the search request against the indices in the request; if there are no indices then it returns an exception. ML rules fail with a <job name> missing exception on the search request if the job is not found. Both of these errors do not mean that the system is overloaded or performing incorrectly in some way, but they are still showing up in large volumes on SLO dashboards.

This PR builds on #180040, which introduces the ability to classify errors as "user errors" when the error is not due to some kind of system malfunction, but more related to incorrect (or insufficient) user actions.

Testing

EQL

  1. Create 2 indices, test and test2
PUT /test
{
  "mappings": {
    "properties": {
      "@timestamp": {
        "type": "date"
      }
    }
  }
}

PUT /test2
{
  "mappings": {
    "properties": {
      "@timestamp": {
        "type": "date"
      },
      "event.category": {
        "type": "keyword"
      }
    }
  }
}
  1. Create (disabled) an EQL rule that queries test* and uses a simple query like file where true
  2. Delete the index test2
  3. Enable the rule. The rule will fail with a verification_exception because test does not have event.category.
  4. Use your favorite debugging method to verify that userError was true in addLastRunError in x-pack/plugins/alerting/server/monitoring/rule_result_service.ts (hopefully this property will be added to the rule SO so we can check it there in API integration tests)

ML rules

  1. Import a prebuilt ML rule (Unusual Process Spawned by a User, for example)
  2. Enable the rule. The rule will fail with An error occurred during rule execution: message: "problem_child_rare_process_by_user missing"
  3. Use your favorite debugging method to verify that userError was true in addLastRunError in x-pack/plugins/alerting/server/monitoring/rule_result_service.ts (hopefully this property will be added to the rule SO so we can check it there in API integration tests)

@marshallmain marshallmain force-pushed the classify-user-errors branch from b276e76 to c0ece99 Compare April 5, 2024 15:44
@@ -131,4 +132,26 @@ describe('ml_executor', () => {
);
expect(response.warningMessages.length).toEqual(1);
});

it('should report job missing errors as user errors', async () => {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ML test suite is skipped, but this test works locally if you unskip the suite

@marshallmain marshallmain marked this pull request as ready for review April 5, 2024 15:48
@marshallmain marshallmain requested review from a team as code owners April 5, 2024 15:48
@marshallmain marshallmain added release_note:skip Skip the PR/issue when compiling release notes Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc. Team:Detection Engine Security Solution Detection Engine Area v8.14.0 labels Apr 5, 2024
@elasticmachine
Copy link
Contributor

Pinging @elastic/security-solution (Team: SecuritySolution)

@elasticmachine
Copy link
Contributor

Pinging @elastic/security-detection-engine (Team:Detection Engine)

@marshallmain
Copy link
Contributor Author

/ci

@jpdjere
Copy link
Contributor

jpdjere commented Apr 8, 2024

Hi @marshallmain.

Jumping in in this PR since I created this ticket which I think is tangentially related to the work you're doing here. We started a discussion with @rylnd where your opinion will probably prove valuable, if you have some time to take a look:

Copy link
Contributor

@yctercero yctercero left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pulled down and confirmed behavior described to be tested. Errors still show up in UI for user, but userError is true. LGTM!

Copy link
Contributor

@maximpn maximpn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@marshallmain this is really necessary improvement 👍

I left two comments regrading my concerns in error handling but overall PR looks good. I tested it locally and it works as expected. I approve in advance and trust you address my comments.

}
return result;
} catch (error) {
if ((error.message as string).includes('verification_exception')) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JS allow to throw anything. Generally speaking message can be undefined in rare cases. Since it's an error handler I'd recommend to add more robust check here like

if (error instanceof Error && error.message.includes('verification_exception')) {

exceptionFilter,
});
} catch (error) {
if ((error.message as string).endsWith('missing')) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same as above regarding the error type.

@kibana-ci
Copy link
Collaborator

💚 Build Succeeded

Metrics [docs]

✅ unchanged

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@marshallmain marshallmain merged commit d45157b into elastic:main Apr 10, 2024
34 checks passed
@kibanamachine kibanamachine added the backport:skip This commit does not require backporting label Apr 10, 2024
@marshallmain marshallmain deleted the classify-user-errors branch April 10, 2024 14:35
semd pushed a commit to semd/kibana that referenced this pull request Apr 10, 2024
…rs as user errors (elastic#180094)

## Summary

Building on elastic#180040

Detection rules commonly fail when prebuilt rules are imported and
enabled without the appropriate indices (for EQL) or ML jobs (for ML
rules). EQL rules fail with a `verification_exception` because the EQL
search API validates the fields in the search request against the
indices in the request; if there are no indices then it returns an
exception. ML rules fail with a `<job name> missing` exception on the
search request if the job is not found. Both of these errors do not mean
that the system is overloaded or performing incorrectly in some way, but
they are still showing up in large volumes on SLO dashboards.

This PR builds on elastic#180040, which introduces the ability to classify
errors as "user errors" when the error is not due to some kind of system
malfunction, but more related to incorrect (or insufficient) user
actions.

### Testing
#### EQL
1. Create 2 indices, `test` and `test2`
```
PUT /test
{
  "mappings": {
    "properties": {
      "@timestamp": {
        "type": "date"
      }
    }
  }
}

PUT /test2
{
  "mappings": {
    "properties": {
      "@timestamp": {
        "type": "date"
      },
      "event.category": {
        "type": "keyword"
      }
    }
  }
}
```

2. Create (disabled) an EQL rule that queries `test*` and uses a simple
query like `file where true`
3. Delete the index `test2`
4. Enable the rule. The rule will fail with a `verification_exception`
because `test` does not have `event.category`.
5. Use your favorite debugging method to verify that `userError` was
`true` in `addLastRunError` in
`x-pack/plugins/alerting/server/monitoring/rule_result_service.ts`
(hopefully this property will be added to the rule SO so we can check it
there in API integration tests)

#### ML rules
1. Import a prebuilt ML rule (`Unusual Process Spawned by a User`, for
example)
2. Enable the rule. The rule will fail with `An error occurred during
rule execution: message: "problem_child_rare_process_by_user missing"`
3. Use your favorite debugging method to verify that `userError` was
`true` in `addLastRunError` in
`x-pack/plugins/alerting/server/monitoring/rule_result_service.ts`
(hopefully this property will be added to the rule SO so we can check it
there in API integration tests)
semd added a commit to semd/kibana that referenced this pull request Apr 10, 2024
marshallmain added a commit that referenced this pull request Apr 15, 2024
## Summary

API integration tests added in #180094 fail occasionally on main, but
pass consistently locally and in the flaky test runner. This PR adds a
retry to the `getMetrics` request in hopes of removing the flakiness.

Flake issues:
#180530
#180641

---------

Co-authored-by: kibanamachine <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport:skip This commit does not require backporting release_note:skip Skip the PR/issue when compiling release notes Team:Detection Engine Security Solution Detection Engine Area Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc. v8.14.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants