Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Failing test: Detection Engine API Integration Tests - Serverless - Rule Execution Logic.x-pack/test/security_solution_api_integration/test_suites/detections_response/default_license/rule_execution_logic/execution_logic/machine_learning·ts - Rule execution logic API Execution logic @ess @serverless Machine learning type rules "before all" hook for "should create 1 alert from ML rule when record meets anomaly_threshold" #171426

Closed
kibanamachine opened this issue Nov 16, 2023 · 9 comments · Fixed by #181918
Assignees
Labels
failed-test A test failure on a tracked branch, potentially flaky-test Team:Detection Engine Security Solution Detection Engine Area Team:Detections and Resp Security Detection Response Team Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc. triage_needed

Comments

@kibanamachine
Copy link
Contributor

kibanamachine commented Nov 16, 2023

A test failed on a tracked branch

AggregateError: 
    Error: Bulk doc failure [operation=index]:
      doc: {"actual":[1],"bucket_span":900,"by_field_name":"process.name","by_field_value":"store","detector_index":0,"function":"rare","function_description":"rare","host.name":["mothra"],"influencers":[{"influencer_field_name":"user.name","influencer_field_values":["root"]},{"influencer_field_name":"process.name","influencer_field_values":["store"]},{"influencer_field_name":"host.name","influencer_field_values":["mothra"]}],"initial_record_score":33.36147565024334,"is_interim":false,"job_id":"v3_linux_anomalous_network_activity","multi_bucket_impact":0,"probability":0.007820139656036713,"process.name":["store"],"record_score":33.36147565024334,"result_type":"record","timestamp":1605567488000,"typical":[0.007820139656036711],"user.name":["root"]}
      error: {"type":"document_parsing_exception","reason":"[1:177] failed to parse field [host] of type [keyword] in document with id 'v3_linux_anomalous_network_activity_record_1586274300000_900_0_-96106189301704594950079884115725560577_5'. Preview of field's value: '{name=[mothra]}'","caused_by":{"type":"illegal_state_exception","reason":"Can't get text on a START_OBJECT at 1:156"}}
        at Array.map (<anonymous>)
        at indexDocs (index_doc_records_stream.ts:62:13)
        at processTicksAndRejections (node:internal/process/task_queues:95:5)
        at Writable.write [as _write] (index_doc_records_stream.ts:76:9)
    at indexDocs (index_doc_records_stream.ts:62:13)
    at processTicksAndRejections (node:internal/process/task_queues:95:5)
    at Writable.write [as _write] (index_doc_records_stream.ts:76:9)

First failure: CI Build - main

@kibanamachine kibanamachine added the failed-test A test failure on a tracked branch, potentially flaky-test label Nov 16, 2023
@botelastic botelastic bot added the needs-team Issues missing a team label label Nov 16, 2023
@kibanamachine
Copy link
Contributor Author

New failure: CI Build - main

@mistic mistic added the Team:Detections and Resp Security Detection Response Team label Nov 16, 2023
@elasticmachine
Copy link
Contributor

Pinging @elastic/security-detections-response (Team:Detections and Resp)

@botelastic botelastic bot removed the needs-team Issues missing a team label label Nov 16, 2023
@kibanamachine
Copy link
Contributor Author

New failure: CI Build - main

@mistic
Copy link
Member

mistic commented Nov 17, 2023

Skipped.

main: 27d2fb9

@banderror banderror added triage_needed Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc. Team:Detection Engine Security Solution Detection Engine Area labels Dec 15, 2023
@elasticmachine
Copy link
Contributor

Pinging @elastic/security-solution (Team: SecuritySolution)

@yctercero
Copy link
Contributor

Confirmed passing locally - flake not bug.

@yctercero yctercero removed the blocker label Mar 13, 2024
@rylnd
Copy link
Contributor

rylnd commented Mar 14, 2024

The error in these builds is pretty useful: it looks like the es_archiver call to insert our anomaly records failed, because host.name: ['something'] was causing a conflict with the existing mappings on insert (it appears that host was mapped as a keyword directly):

fail: Rule execution logic API Execution logic @ess @serverless Machine learning type rules "before all" hook for "should create 1 alert from ML rule when record meets anomaly_threshold"
 AggregateError: 
Error: Bulk doc failure [operation=index]:
  doc: {"actual":[1],"bucket_span":900,"by_field_name":"process.name","by_field_value":"store","detector_index":0,"function":"rare","function_description":"rare","host.name":["mothra"],"influencers":[{"influencer_field_name":"user.name","influencer_field_values":["root"]},{"influencer_field_name":"process.name","influencer_field_values":["store"]},{"influencer_field_name":"host.name","influencer_field_values":["mothra"]}],"initial_record_score":33.36147565024334,"is_interim":false,"job_id":"v3_linux_anomalous_network_activity","multi_bucket_impact":0,"probability":0.007820139656036713,"process.name":["store"],"record_score":33.36147565024334,"result_type":"record","timestamp":1605567488000,"typical":[0.007820139656036711],"user.name":["root"]}
 error: {"type":"document_parsing_exception","reason":"[1:177] failed to parse field [host] of type [keyword] in document with id 'v3_linux_anomalous_network_activity_record_1586274300000_900_0_-96106189301704594950079884115725560577_5'. Preview of field's value: '{name=[mothra]}'","caused_by":{"type":"illegal_state_exception","reason":"Can't get text on a START_OBJECT at 1:156"}}

I didn't see any errors leading up to that failure indicating that the archive mappings (which explicitly include host.name) were not applied, nor does the data have any errors, so I'm not exactly sure what went wrong, but the consequence was the above.

rylnd added a commit to rylnd/kibana that referenced this issue Apr 26, 2024
These were skipped previously due to es_archiver failing on a mapping
error (CI flake), but upon unskipping it was discovered that there were
a few mistakes in these tests (as they had been modified while skipped).

The previous commit addressed an error related to error
classification, this similarly fixes an assertion that would never have
been true: the anomaly here only has a known `user.name`, and their
`host.name` doesn't have a corresponding criticality record.

Closes elastic#171426.
rylnd added a commit that referenced this issue Apr 29, 2024
…ts (#181918)

## Summary
These tests were skipped previously due to `es_archiver`
[failing](#171426) on a mapping
error, but upon unskipping it was discovered that there were a few
mistakes in these tests, as they had been modified while skipped.

There are three main changes here:

* Fixes an incorrect assertion related to error classification
* Fixes an incorrect assertion related to asset criticality enrichment
* Adds additional `afterEach` hooks for housekeeping of generated data

Closes #171426
@rylnd
Copy link
Contributor

rylnd commented May 3, 2024

FYI I'm continuing to investigate these failures on this PR

@rylnd rylnd reopened this Jul 11, 2024
@rylnd rylnd self-assigned this Jul 11, 2024
rylnd added a commit that referenced this issue Jul 12, 2024
## Summary

The full chronicle of this endeavor can be found
[here](#182183), but [this
comment](#182183 (comment))
summarizes the identified issue:

> I [finally
found](https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/6516#01909dde-a3e8-4e47-b255-b1ff7cac8f8d/6-2368)
the cause of these failures in the response to our "setup modules"
request to ML. Attaching here for posterity:
>
> <details>
> <summary>Setup Modules Failure Response</summary>
> 
> ```json
> {
>   "jobs": [
> { "id": "v3_linux_anomalous_network_port_activity", "success": true },
>     {
>       "id": "v3_linux_anomalous_network_activity",
>       "success": false,
>       "error": {
>         "error": {
>           "root_cause": [
>             {
>               "type": "no_shard_available_action_exception",
> "reason":
"[ftr][127.0.0.1:9300][indices:data/read/search[phase/query]]"
>             }
>           ],
>           "type": "search_phase_execution_exception",
>           "reason": "all shards failed",
>           "phase": "query",
>           "grouped": true,
>           "failed_shards": [
>             {
>               "shard": 0,
> "index":
".ml-anomalies-custom-v3_linux_network_configuration_discovery",
>               "node": "dKzpvp06ScO0OxqHilETEA",
>               "reason": {
>                 "type": "no_shard_available_action_exception",
> "reason":
"[ftr][127.0.0.1:9300][indices:data/read/search[phase/query]]"
>               }
>             }
>           ]
>         },
>         "status": 503
>       }
>     }
>   ],
>   "datafeeds": [
>     {
>       "id": "datafeed-v3_linux_anomalous_network_port_activity",
>       "success": true,
>       "started": false,
>       "awaitingMlNodeAllocation": false
>     },
>     {
>       "id": "datafeed-v3_linux_anomalous_network_activity",
>       "success": false,
>       "started": false,
>       "awaitingMlNodeAllocation": false,
>       "error": {
>         "error": {
>           "root_cause": [
>             {
>               "type": "resource_not_found_exception",
> "reason": "No known job with id 'v3_linux_anomalous_network_activity'"
>             }
>           ],
>           "type": "resource_not_found_exception",
> "reason": "No known job with id 'v3_linux_anomalous_network_activity'"
>         },
>         "status": 404
>       }
>     }
>   ],
>   "kibana": {}
> }
> 
> ```
> </details>

This branch, then, fixes said issue by (relatively simply) retrying the
failed API call until it succeeds.

### Related Issues
Addresses:
- #171426
- #187478
- #187614
- #182009
- #171426

### Checklist

- [x] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios
- [x] [Flaky Test
Runner](https://ci-stats.kibana.dev/trigger_flaky_test_runner/1) was
used on any tests changed
- [x] [ESS Rule Execution FTR x
200](https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/6528)
- [x] [Serverless Rule Execution FTR x
200](https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/6529)


### For maintainers

- [x] This was checked for breaking API changes and was [labeled
appropriately](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)
@rylnd
Copy link
Contributor

rylnd commented Jul 12, 2024

Closed by #188155.

@rylnd rylnd closed this as completed Jul 12, 2024
rylnd added a commit to rylnd/kibana that referenced this issue Jul 12, 2024
## Summary

The full chronicle of this endeavor can be found
[here](elastic#182183), but [this
comment](elastic#182183 (comment))
summarizes the identified issue:

> I [finally
found](https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/6516#01909dde-a3e8-4e47-b255-b1ff7cac8f8d/6-2368)
the cause of these failures in the response to our "setup modules"
request to ML. Attaching here for posterity:
>
> <details>
> <summary>Setup Modules Failure Response</summary>
>
> ```json
> {
>   "jobs": [
> { "id": "v3_linux_anomalous_network_port_activity", "success": true },
>     {
>       "id": "v3_linux_anomalous_network_activity",
>       "success": false,
>       "error": {
>         "error": {
>           "root_cause": [
>             {
>               "type": "no_shard_available_action_exception",
> "reason":
"[ftr][127.0.0.1:9300][indices:data/read/search[phase/query]]"
>             }
>           ],
>           "type": "search_phase_execution_exception",
>           "reason": "all shards failed",
>           "phase": "query",
>           "grouped": true,
>           "failed_shards": [
>             {
>               "shard": 0,
> "index":
".ml-anomalies-custom-v3_linux_network_configuration_discovery",
>               "node": "dKzpvp06ScO0OxqHilETEA",
>               "reason": {
>                 "type": "no_shard_available_action_exception",
> "reason":
"[ftr][127.0.0.1:9300][indices:data/read/search[phase/query]]"
>               }
>             }
>           ]
>         },
>         "status": 503
>       }
>     }
>   ],
>   "datafeeds": [
>     {
>       "id": "datafeed-v3_linux_anomalous_network_port_activity",
>       "success": true,
>       "started": false,
>       "awaitingMlNodeAllocation": false
>     },
>     {
>       "id": "datafeed-v3_linux_anomalous_network_activity",
>       "success": false,
>       "started": false,
>       "awaitingMlNodeAllocation": false,
>       "error": {
>         "error": {
>           "root_cause": [
>             {
>               "type": "resource_not_found_exception",
> "reason": "No known job with id 'v3_linux_anomalous_network_activity'"
>             }
>           ],
>           "type": "resource_not_found_exception",
> "reason": "No known job with id 'v3_linux_anomalous_network_activity'"
>         },
>         "status": 404
>       }
>     }
>   ],
>   "kibana": {}
> }
>
> ```
> </details>

This branch, then, fixes said issue by (relatively simply) retrying the
failed API call until it succeeds.

### Related Issues
Addresses:
- elastic#171426
- elastic#187478
- elastic#187614
- elastic#182009
- elastic#171426

### Checklist

- [x] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios
- [x] [Flaky Test
Runner](https://ci-stats.kibana.dev/trigger_flaky_test_runner/1) was
used on any tests changed
- [x] [ESS Rule Execution FTR x
200](https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/6528)
- [x] [Serverless Rule Execution FTR x
200](https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/6529)

### For maintainers

- [x] This was checked for breaking API changes and was [labeled
appropriately](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)

(cherry picked from commit 3df635e)
rylnd referenced this issue Jul 12, 2024
… (#188259)

# Backport

This will backport the following commits from `main` to `8.15`:
- [[Detection Engine] Addresses Flakiness in ML FTR tests
(#188155)](#188155)

<!--- Backport version: 8.9.8 -->

### Questions ?
Please refer to the [Backport tool
documentation](https://github.com/sqren/backport)

<!--BACKPORT [{"author":{"name":"Ryland
Herrick","email":"[email protected]"},"sourceCommit":{"committedDate":"2024-07-12T19:10:25Z","message":"[Detection
Engine] Addresses Flakiness in ML FTR tests (#188155)\n\n##
Summary\r\n\r\nThe full chronicle of this endeavor can be
found\r\n[here](#182183), but
[this\r\ncomment](https://github.com/elastic/kibana/pull/182183#issuecomment-2221517519)\r\nsummarizes
the identified issue:\r\n\r\n> I
[finally\r\nfound](https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/6516#01909dde-a3e8-4e47-b255-b1ff7cac8f8d/6-2368)\r\nthe
cause of these failures in the response to our \"setup
modules\"\r\nrequest to ML. Attaching here for posterity:\r\n>\r\n>
<details>\r\n> <summary>Setup Modules Failure Response</summary>\r\n>
\r\n> ```json\r\n> {\r\n> \"jobs\": [\r\n> { \"id\":
\"v3_linux_anomalous_network_port_activity\", \"success\": true },\r\n>
{\r\n> \"id\": \"v3_linux_anomalous_network_activity\",\r\n>
\"success\": false,\r\n> \"error\": {\r\n> \"error\": {\r\n>
\"root_cause\": [\r\n> {\r\n> \"type\":
\"no_shard_available_action_exception\",\r\n>
\"reason\":\r\n\"[ftr][127.0.0.1:9300][indices:data/read/search[phase/query]]\"\r\n>
}\r\n> ],\r\n> \"type\": \"search_phase_execution_exception\",\r\n>
\"reason\": \"all shards failed\",\r\n> \"phase\": \"query\",\r\n>
\"grouped\": true,\r\n> \"failed_shards\": [\r\n> {\r\n> \"shard\":
0,\r\n>
\"index\":\r\n\".ml-anomalies-custom-v3_linux_network_configuration_discovery\",\r\n>
\"node\": \"dKzpvp06ScO0OxqHilETEA\",\r\n> \"reason\": {\r\n> \"type\":
\"no_shard_available_action_exception\",\r\n>
\"reason\":\r\n\"[ftr][127.0.0.1:9300][indices:data/read/search[phase/query]]\"\r\n>
}\r\n> }\r\n> ]\r\n> },\r\n> \"status\": 503\r\n> }\r\n> }\r\n> ],\r\n>
\"datafeeds\": [\r\n> {\r\n> \"id\":
\"datafeed-v3_linux_anomalous_network_port_activity\",\r\n> \"success\":
true,\r\n> \"started\": false,\r\n> \"awaitingMlNodeAllocation\":
false\r\n> },\r\n> {\r\n> \"id\":
\"datafeed-v3_linux_anomalous_network_activity\",\r\n> \"success\":
false,\r\n> \"started\": false,\r\n> \"awaitingMlNodeAllocation\":
false,\r\n> \"error\": {\r\n> \"error\": {\r\n> \"root_cause\": [\r\n>
{\r\n> \"type\": \"resource_not_found_exception\",\r\n> \"reason\": \"No
known job with id 'v3_linux_anomalous_network_activity'\"\r\n> }\r\n>
],\r\n> \"type\": \"resource_not_found_exception\",\r\n> \"reason\":
\"No known job with id 'v3_linux_anomalous_network_activity'\"\r\n>
},\r\n> \"status\": 404\r\n> }\r\n> }\r\n> ],\r\n> \"kibana\": {}\r\n>
}\r\n> \r\n> ```\r\n> </details>\r\n\r\nThis branch, then, fixes said
issue by (relatively simply) retrying the\r\nfailed API call until it
succeeds.\r\n\r\n### Related Issues\r\nAddresses:\r\n-
https://github.com/elastic/kibana/issues/171426\r\n-
https://github.com/elastic/kibana/issues/187478\r\n-
https://github.com/elastic/kibana/issues/187614\r\n-
https://github.com/elastic/kibana/issues/182009\r\n-
https://github.com/elastic/kibana/issues/171426\r\n\r\n###
Checklist\r\n\r\n- [x] [Unit or
functional\r\ntests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)\r\nwere
updated or added to match the most common scenarios\r\n- [x] [Flaky
Test\r\nRunner](https://ci-stats.kibana.dev/trigger_flaky_test_runner/1)
was\r\nused on any tests changed\r\n- [x] [ESS Rule Execution FTR
x\r\n200](https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/6528)\r\n-
[x] [Serverless Rule Execution FTR
x\r\n200](https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/6529)\r\n\r\n\r\n###
For maintainers\r\n\r\n- [x] This was checked for breaking API changes
and was
[labeled\r\nappropriately](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)","sha":"3df635ef4a8c86c41c91ac5f59198a9b67d1dc8b","branchLabelMapping":{"^v8.16.0$":"main","^v(\\d+).(\\d+).\\d+$":"$1.$2"}},"sourcePullRequest":{"labels":["release_note:skip","backport:skip","Feature:Detection
Rules","Feature:ML Rule","Feature:Security ML Jobs","Feature:Rule
Creation","Team:Detection Engine","Feature:Rule
Edit","v8.16.0"],"number":188155,"url":"https://github.com/elastic/kibana/pull/188155","mergeCommit":{"message":"[Detection
Engine] Addresses Flakiness in ML FTR tests (#188155)\n\n##
Summary\r\n\r\nThe full chronicle of this endeavor can be
found\r\n[here](#182183), but
[this\r\ncomment](https://github.com/elastic/kibana/pull/182183#issuecomment-2221517519)\r\nsummarizes
the identified issue:\r\n\r\n> I
[finally\r\nfound](https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/6516#01909dde-a3e8-4e47-b255-b1ff7cac8f8d/6-2368)\r\nthe
cause of these failures in the response to our \"setup
modules\"\r\nrequest to ML. Attaching here for posterity:\r\n>\r\n>
<details>\r\n> <summary>Setup Modules Failure Response</summary>\r\n>
\r\n> ```json\r\n> {\r\n> \"jobs\": [\r\n> { \"id\":
\"v3_linux_anomalous_network_port_activity\", \"success\": true },\r\n>
{\r\n> \"id\": \"v3_linux_anomalous_network_activity\",\r\n>
\"success\": false,\r\n> \"error\": {\r\n> \"error\": {\r\n>
\"root_cause\": [\r\n> {\r\n> \"type\":
\"no_shard_available_action_exception\",\r\n>
\"reason\":\r\n\"[ftr][127.0.0.1:9300][indices:data/read/search[phase/query]]\"\r\n>
}\r\n> ],\r\n> \"type\": \"search_phase_execution_exception\",\r\n>
\"reason\": \"all shards failed\",\r\n> \"phase\": \"query\",\r\n>
\"grouped\": true,\r\n> \"failed_shards\": [\r\n> {\r\n> \"shard\":
0,\r\n>
\"index\":\r\n\".ml-anomalies-custom-v3_linux_network_configuration_discovery\",\r\n>
\"node\": \"dKzpvp06ScO0OxqHilETEA\",\r\n> \"reason\": {\r\n> \"type\":
\"no_shard_available_action_exception\",\r\n>
\"reason\":\r\n\"[ftr][127.0.0.1:9300][indices:data/read/search[phase/query]]\"\r\n>
}\r\n> }\r\n> ]\r\n> },\r\n> \"status\": 503\r\n> }\r\n> }\r\n> ],\r\n>
\"datafeeds\": [\r\n> {\r\n> \"id\":
\"datafeed-v3_linux_anomalous_network_port_activity\",\r\n> \"success\":
true,\r\n> \"started\": false,\r\n> \"awaitingMlNodeAllocation\":
false\r\n> },\r\n> {\r\n> \"id\":
\"datafeed-v3_linux_anomalous_network_activity\",\r\n> \"success\":
false,\r\n> \"started\": false,\r\n> \"awaitingMlNodeAllocation\":
false,\r\n> \"error\": {\r\n> \"error\": {\r\n> \"root_cause\": [\r\n>
{\r\n> \"type\": \"resource_not_found_exception\",\r\n> \"reason\": \"No
known job with id 'v3_linux_anomalous_network_activity'\"\r\n> }\r\n>
],\r\n> \"type\": \"resource_not_found_exception\",\r\n> \"reason\":
\"No known job with id 'v3_linux_anomalous_network_activity'\"\r\n>
},\r\n> \"status\": 404\r\n> }\r\n> }\r\n> ],\r\n> \"kibana\": {}\r\n>
}\r\n> \r\n> ```\r\n> </details>\r\n\r\nThis branch, then, fixes said
issue by (relatively simply) retrying the\r\nfailed API call until it
succeeds.\r\n\r\n### Related Issues\r\nAddresses:\r\n-
https://github.com/elastic/kibana/issues/171426\r\n-
https://github.com/elastic/kibana/issues/187478\r\n-
https://github.com/elastic/kibana/issues/187614\r\n-
https://github.com/elastic/kibana/issues/182009\r\n-
https://github.com/elastic/kibana/issues/171426\r\n\r\n###
Checklist\r\n\r\n- [x] [Unit or
functional\r\ntests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)\r\nwere
updated or added to match the most common scenarios\r\n- [x] [Flaky
Test\r\nRunner](https://ci-stats.kibana.dev/trigger_flaky_test_runner/1)
was\r\nused on any tests changed\r\n- [x] [ESS Rule Execution FTR
x\r\n200](https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/6528)\r\n-
[x] [Serverless Rule Execution FTR
x\r\n200](https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/6529)\r\n\r\n\r\n###
For maintainers\r\n\r\n- [x] This was checked for breaking API changes
and was
[labeled\r\nappropriately](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)","sha":"3df635ef4a8c86c41c91ac5f59198a9b67d1dc8b"}},"sourceBranch":"main","suggestedTargetBranches":[],"targetPullRequestStates":[{"branch":"main","label":"v8.16.0","labelRegex":"^v8.16.0$","isSourceBranch":true,"state":"MERGED","url":"https://github.com/elastic/kibana/pull/188155","number":188155,"mergeCommit":{"message":"[Detection
Engine] Addresses Flakiness in ML FTR tests (#188155)\n\n##
Summary\r\n\r\nThe full chronicle of this endeavor can be
found\r\n[here](#182183), but
[this\r\ncomment](https://github.com/elastic/kibana/pull/182183#issuecomment-2221517519)\r\nsummarizes
the identified issue:\r\n\r\n> I
[finally\r\nfound](https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/6516#01909dde-a3e8-4e47-b255-b1ff7cac8f8d/6-2368)\r\nthe
cause of these failures in the response to our \"setup
modules\"\r\nrequest to ML. Attaching here for posterity:\r\n>\r\n>
<details>\r\n> <summary>Setup Modules Failure Response</summary>\r\n>
\r\n> ```json\r\n> {\r\n> \"jobs\": [\r\n> { \"id\":
\"v3_linux_anomalous_network_port_activity\", \"success\": true },\r\n>
{\r\n> \"id\": \"v3_linux_anomalous_network_activity\",\r\n>
\"success\": false,\r\n> \"error\": {\r\n> \"error\": {\r\n>
\"root_cause\": [\r\n> {\r\n> \"type\":
\"no_shard_available_action_exception\",\r\n>
\"reason\":\r\n\"[ftr][127.0.0.1:9300][indices:data/read/search[phase/query]]\"\r\n>
}\r\n> ],\r\n> \"type\": \"search_phase_execution_exception\",\r\n>
\"reason\": \"all shards failed\",\r\n> \"phase\": \"query\",\r\n>
\"grouped\": true,\r\n> \"failed_shards\": [\r\n> {\r\n> \"shard\":
0,\r\n>
\"index\":\r\n\".ml-anomalies-custom-v3_linux_network_configuration_discovery\",\r\n>
\"node\": \"dKzpvp06ScO0OxqHilETEA\",\r\n> \"reason\": {\r\n> \"type\":
\"no_shard_available_action_exception\",\r\n>
\"reason\":\r\n\"[ftr][127.0.0.1:9300][indices:data/read/search[phase/query]]\"\r\n>
}\r\n> }\r\n> ]\r\n> },\r\n> \"status\": 503\r\n> }\r\n> }\r\n> ],\r\n>
\"datafeeds\": [\r\n> {\r\n> \"id\":
\"datafeed-v3_linux_anomalous_network_port_activity\",\r\n> \"success\":
true,\r\n> \"started\": false,\r\n> \"awaitingMlNodeAllocation\":
false\r\n> },\r\n> {\r\n> \"id\":
\"datafeed-v3_linux_anomalous_network_activity\",\r\n> \"success\":
false,\r\n> \"started\": false,\r\n> \"awaitingMlNodeAllocation\":
false,\r\n> \"error\": {\r\n> \"error\": {\r\n> \"root_cause\": [\r\n>
{\r\n> \"type\": \"resource_not_found_exception\",\r\n> \"reason\": \"No
known job with id 'v3_linux_anomalous_network_activity'\"\r\n> }\r\n>
],\r\n> \"type\": \"resource_not_found_exception\",\r\n> \"reason\":
\"No known job with id 'v3_linux_anomalous_network_activity'\"\r\n>
},\r\n> \"status\": 404\r\n> }\r\n> }\r\n> ],\r\n> \"kibana\": {}\r\n>
}\r\n> \r\n> ```\r\n> </details>\r\n\r\nThis branch, then, fixes said
issue by (relatively simply) retrying the\r\nfailed API call until it
succeeds.\r\n\r\n### Related Issues\r\nAddresses:\r\n-
https://github.com/elastic/kibana/issues/171426\r\n-
https://github.com/elastic/kibana/issues/187478\r\n-
https://github.com/elastic/kibana/issues/187614\r\n-
https://github.com/elastic/kibana/issues/182009\r\n-
https://github.com/elastic/kibana/issues/171426\r\n\r\n###
Checklist\r\n\r\n- [x] [Unit or
functional\r\ntests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)\r\nwere
updated or added to match the most common scenarios\r\n- [x] [Flaky
Test\r\nRunner](https://ci-stats.kibana.dev/trigger_flaky_test_runner/1)
was\r\nused on any tests changed\r\n- [x] [ESS Rule Execution FTR
x\r\n200](https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/6528)\r\n-
[x] [Serverless Rule Execution FTR
x\r\n200](https://buildkite.com/elastic/kibana-flaky-test-suite-runner/builds/6529)\r\n\r\n\r\n###
For maintainers\r\n\r\n- [x] This was checked for breaking API changes
and was
[labeled\r\nappropriately](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)","sha":"3df635ef4a8c86c41c91ac5f59198a9b67d1dc8b"}}]}]
BACKPORT-->
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
failed-test A test failure on a tracked branch, potentially flaky-test Team:Detection Engine Security Solution Detection Engine Area Team:Detections and Resp Security Detection Response Team Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc. triage_needed
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants