Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Let addLastRunError to report user errors #180040

Merged

Conversation

ersin-erdal
Copy link
Contributor

Resolves: #180030

@ersin-erdal ersin-erdal added release_note:skip Skip the PR/issue when compiling release notes Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v8.14.0 labels Apr 4, 2024
@ersin-erdal ersin-erdal self-assigned this Apr 4, 2024
@ersin-erdal
Copy link
Contributor Author

/ci

@marshallmain
Copy link
Contributor

marshallmain commented Apr 4, 2024

For #180094, it would be convenient to have the userError property on the error object in the rule SO as well so we can verify in API integration tests that our rule logic is setting the value correctly in various use cases. Is that feasible?

@ersin-erdal
Copy link
Contributor Author

/ci

Copy link
Contributor

@mikecote mikecote left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for putting this together

warnings: string[];
outcomeMessage: string;
}

interface LastRunError {
message: string;
userError?: boolean;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

optional nit:

Suggested change
userError?: boolean;
userError: boolean;

@@ -715,10 +715,11 @@ export class TaskRunner<

const { errors: errorsFromLastRun } = this.ruleResult.getLastRunResults();
if (errorsFromLastRun.length > 0) {
const isUserError = !errorsFromLastRun.some((lastRunError) => !lastRunError.userError);
return {
taskRunError: createTaskRunError(
new Error(errorsFromLastRun.join(',')),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: do we need to map here as well?

Suggested change
new Error(errorsFromLastRun.join(',')),
new Error(errorsFromLastRun.map((lastRunError) => lastRunError.message).join(',')),

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, fixed.

@ersin-erdal ersin-erdal marked this pull request as ready for review April 5, 2024 12:41
@ersin-erdal ersin-erdal requested a review from a team as a code owner April 5, 2024 12:41
@elasticmachine
Copy link
Contributor

Pinging @elastic/response-ops (Team:ResponseOps)

@kibana-ci
Copy link
Collaborator

💛 Build succeeded, but was flaky

Failed CI Steps

Metrics [docs]

✅ unchanged

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @ersin-erdal

@ersin-erdal ersin-erdal merged commit 29fcdda into elastic:main Apr 5, 2024
34 checks passed
@kibanamachine kibanamachine added the backport:skip This commit does not require backporting label Apr 5, 2024
@ersin-erdal ersin-erdal deleted the 180030-add-last-run-user-error branch April 5, 2024 15:11
marshallmain added a commit that referenced this pull request Apr 10, 2024
…rs as user errors (#180094)

## Summary

Building on #180040

Detection rules commonly fail when prebuilt rules are imported and
enabled without the appropriate indices (for EQL) or ML jobs (for ML
rules). EQL rules fail with a `verification_exception` because the EQL
search API validates the fields in the search request against the
indices in the request; if there are no indices then it returns an
exception. ML rules fail with a `<job name> missing` exception on the
search request if the job is not found. Both of these errors do not mean
that the system is overloaded or performing incorrectly in some way, but
they are still showing up in large volumes on SLO dashboards.

This PR builds on #180040, which introduces the ability to classify
errors as "user errors" when the error is not due to some kind of system
malfunction, but more related to incorrect (or insufficient) user
actions.

### Testing
#### EQL
1. Create 2 indices, `test` and `test2`
```
PUT /test
{
  "mappings": {
    "properties": {
      "@timestamp": {
        "type": "date"
      }
    }
  }
}

PUT /test2
{
  "mappings": {
    "properties": {
      "@timestamp": {
        "type": "date"
      },
      "event.category": {
        "type": "keyword"
      }
    }
  }
}
```

2. Create (disabled) an EQL rule that queries `test*` and uses a simple
query like `file where true`
3. Delete the index `test2`
4. Enable the rule. The rule will fail with a `verification_exception`
because `test` does not have `event.category`.
5. Use your favorite debugging method to verify that `userError` was
`true` in `addLastRunError` in
`x-pack/plugins/alerting/server/monitoring/rule_result_service.ts`
(hopefully this property will be added to the rule SO so we can check it
there in API integration tests)

#### ML rules
1. Import a prebuilt ML rule (`Unusual Process Spawned by a User`, for
example)
2. Enable the rule. The rule will fail with `An error occurred during
rule execution: message: "problem_child_rare_process_by_user missing"`
3. Use your favorite debugging method to verify that `userError` was
`true` in `addLastRunError` in
`x-pack/plugins/alerting/server/monitoring/rule_result_service.ts`
(hopefully this property will be added to the rule SO so we can check it
there in API integration tests)
semd pushed a commit to semd/kibana that referenced this pull request Apr 10, 2024
…rs as user errors (elastic#180094)

## Summary

Building on elastic#180040

Detection rules commonly fail when prebuilt rules are imported and
enabled without the appropriate indices (for EQL) or ML jobs (for ML
rules). EQL rules fail with a `verification_exception` because the EQL
search API validates the fields in the search request against the
indices in the request; if there are no indices then it returns an
exception. ML rules fail with a `<job name> missing` exception on the
search request if the job is not found. Both of these errors do not mean
that the system is overloaded or performing incorrectly in some way, but
they are still showing up in large volumes on SLO dashboards.

This PR builds on elastic#180040, which introduces the ability to classify
errors as "user errors" when the error is not due to some kind of system
malfunction, but more related to incorrect (or insufficient) user
actions.

### Testing
#### EQL
1. Create 2 indices, `test` and `test2`
```
PUT /test
{
  "mappings": {
    "properties": {
      "@timestamp": {
        "type": "date"
      }
    }
  }
}

PUT /test2
{
  "mappings": {
    "properties": {
      "@timestamp": {
        "type": "date"
      },
      "event.category": {
        "type": "keyword"
      }
    }
  }
}
```

2. Create (disabled) an EQL rule that queries `test*` and uses a simple
query like `file where true`
3. Delete the index `test2`
4. Enable the rule. The rule will fail with a `verification_exception`
because `test` does not have `event.category`.
5. Use your favorite debugging method to verify that `userError` was
`true` in `addLastRunError` in
`x-pack/plugins/alerting/server/monitoring/rule_result_service.ts`
(hopefully this property will be added to the rule SO so we can check it
there in API integration tests)

#### ML rules
1. Import a prebuilt ML rule (`Unusual Process Spawned by a User`, for
example)
2. Enable the rule. The rule will fail with `An error occurred during
rule execution: message: "problem_child_rare_process_by_user missing"`
3. Use your favorite debugging method to verify that `userError` was
`true` in `addLastRunError` in
`x-pack/plugins/alerting/server/monitoring/rule_result_service.ts`
(hopefully this property will be added to the rule SO so we can check it
there in API integration tests)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport:skip This commit does not require backporting release_note:skip Skip the PR/issue when compiling release notes Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v8.14.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow ruleResultService.addLastRunError to define user errors
6 participants