[Cases] Case action: Error handling and retries #173012

cnasikas · 2023-12-10T16:05:58Z

Summary

This PR:

Creates the CasesConnectorError error
Separates the execution logic by moving the current logic to a new class called CasesConnectorExecutor
Lets the CasesConnector class handle only the retry logic of the connector
Implements the full jitter backoff algorithm, which is used as the retry strategy of the connector
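
For readers unfamiliar with the term: full jitter backoff is commonly described as picking a uniformly random delay between zero and an exponentially growing, capped bound. The sketch below only illustrates that idea. The function name, parameter names, and default values are assumptions and are not taken from this PR.

```typescript
// Sketch of full jitter backoff. Names and defaults are illustrative
// assumptions, not the code in retry_service.ts.
const fullJitterDelay = (
  attempt: number, // 0 for the first retry, 1 for the second, ...
  baseDelayMs: number = 100,
  maxDelayMs: number = 10000,
  random: () => number = Math.random // injectable for deterministic tests
): number => {
  // Exponential bound, capped so that it never exceeds maxDelayMs.
  const bound = Math.min(maxDelayMs, baseDelayMs * 2 ** attempt);
  // Full jitter: pick uniformly in [0, bound).
  return Math.floor(random() * bound);
};
```

Randomizing over the whole interval, rather than adding a small jitter on top of a fixed exponential delay, spreads out retries from nodes that failed at the same moment.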

Depends on: #172709

Checklist

Delete any items that are not applicable to this PR.

Unit or functional tests were updated or added to match the most common scenarios

For maintainers

This was checked for breaking API changes and was labeled appropriately

cnasikas · 2024-01-08T14:56:46Z

x-pack/plugins/cases/server/connectors/cases/cases_connector_executor.ts

This file contains the same logic as before. The only changes concern how errors are thrown: all the this.handleAndThrowErrors(restOfErrors); lines, plus the handleAndThrowErrors method itself. Nothing else changed.

cnasikas · 2024-01-08T14:57:09Z

x-pack/plugins/cases/server/connectors/cases/cases_connector_executor.test.ts

Same tests as before, plus some tests for the retry logic.

cnasikas · 2024-01-08T14:57:58Z

x-pack/plugins/cases/server/connectors/cases/cases_connector_executor.test.ts

+    });
+  });
+
+  describe('Retries', () => {


These are the new tests.

cnasikas · 2024-01-08T15:00:42Z

x-pack/plugins/cases/server/connectors/cases/cases_connector.ts

@@ -100,541 +86,34 @@ export class CasesConnector extends SubActionConnector<
  }

  public async run(params: CasesConnectorRunParams) {
-    const { alerts, groupingBy } = params;


Moved the logic to the CasesConnectorExecutor class.

cnasikas · 2024-01-08T15:01:46Z

x-pack/plugins/cases/server/connectors/cases/cases_connector.ts

-    }
-
-    return counterLastUpdatedAtAsDate < parsedDate.toDate();
+    await this.retryService.retryWithBackoff(() => this._run(params));


The CasesConnector class is responsible only for the retry logic.

cnasikas · 2024-01-08T15:05:23Z

x-pack/plugins/cases/server/connectors/cases/index.mock.ts

@@ -24,3 +27,111 @@ export const oracleRecordError: OracleRecordError = {
  statusCode: 404,
  message: 'An error',
 };
+
+export const alerts = [


Moved from the test files.

cnasikas · 2024-01-09T08:29:26Z

x-pack/plugins/cases/server/connectors/cases/cases_connector.test.ts

The tests here cover only how the executor is called and the retry logic. All the executor tests moved to cases_connector_executor.test.ts.

cnasikas · 2024-01-09T08:33:54Z

x-pack/plugins/cases/server/connectors/cases/cases_connector.ts

All logic of the executor moved to cases_connector_executor.ts.

cnasikas · 2024-01-09T08:40:58Z

x-pack/plugins/cases/server/connectors/cases/cases_connector_executor.test.ts

+        `"Conflict: getting records: mockBulkGetRecords error"`
+      );
+
+      resetCounters();


The generation of the IDs is mocked. It uses counters to get an incremental ID each time an ID is requested. We need to reset the counters before we retry the execution.
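
A minimal sketch of that idea, with hypothetical names (`makeIdGenerator`, `next`, `reset`). The PR's actual mock and its `resetCounters` helper may look different.

```typescript
// Hypothetical counter-backed ID generator, standing in for the mocked
// ID generation described above.
const makeIdGenerator = (prefix: string) => {
  let counter = 0;
  return {
    // Returns prefix-0, prefix-1, ... on successive calls.
    next: (): string => `${prefix}-${counter++}`,
    // Restarts the sequence from zero.
    reset: (): void => {
      counter = 0;
    },
  };
};

const caseIds = makeIdGenerator('mock-id');

// First execution, which fails part-way through in the test.
const firstRun = [caseIds.next(), caseIds.next()];

// Without a reset, the retry would receive mock-id-2, mock-id-3, ...
caseIds.reset();
const retriedRun = [caseIds.next(), caseIds.next()];
```

With the reset in place, both runs request the same sequence of IDs, so assertions about which case received the alerts stay stable across the retry.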

elasticmachine · 2024-01-09T09:23:35Z

Pinging @elastic/response-ops (Team:ResponseOps)

elasticmachine · 2024-01-09T09:23:36Z

Pinging @elastic/response-ops-cases (Feature:Cases)

adcoelho · 2024-01-11T10:49:25Z

x-pack/plugins/cases/server/connectors/cases/retry_service.ts

+import type { BackoffStrategy, BackoffFactory } from './types';
+
+export class CaseConnectorRetryService {
+  private maxAttempts: number = 10;


nit: does this need to be initialized here if the constructor sets it?

Good point, probably not 🙂.

adcoelho · 2024-01-11T11:18:26Z

x-pack/plugins/cases/server/connectors/cases/retry_service.ts

+  }
+
+  private stop(): void {
+    if (this.timer !== null) {


nit: I think this check is unnecessary, clearTimeout will just do nothing.

I did not know about it, thanks!

I get this TS error. clearTimeout accepts a number or undefined. I will keep it as it is, ok?

Weird, when I tried it locally it didn't show anything. Never mind then!

adcoelho · 2024-01-11T11:21:03Z

x-pack/plugins/cases/server/connectors/cases/retry_service.test.ts

+    }
+  );
+
+  it('should succeed if cb does not throws', async () => {


Suggested change

-  it('should succeed if cb does not throws', async () => {
+  it('should succeed if cb does not throw', async () => {

adcoelho · 2024-01-11T11:23:39Z

x-pack/plugins/cases/server/connectors/cases/retry_service.test.ts

+      `"My transient error"`
+    );
+
+    expect(cb).toBeCalledTimes(maxAttempts + 1);


Why maxAttempts + 1?

The execution goes as:

The cb is called for the first time. cb throws an error. The retry service retries the cb.

The cb is called for a second time. First retry. cb throws an error. The retry service retries the cb.

The cb is called for a third time. Second retry. cb throws an error. The retry service retries the cb.

The cb is called for the fourth time. Third retry. cb throws an error. The retry service does not retry and throws an error.

Basically the first execution of cb does not count as a retry.
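
The counting can be reproduced with a small stand-alone sketch. It assumes a simplified retry loop in which every error is retryable and the delay function is injectable. `retryWithBackoff` here is a hypothetical free function, not the service's real method, and `countCalls` is a helper invented for the example.

```typescript
// Sketch only: the real retry service is a class with a backoff strategy;
// both the error handling and the delays are simplified here.
const retryWithBackoff = async <T>(
  cb: () => Promise<T>,
  maxAttempts: number, // number of retries, not counting the first call
  nextDelayMs: (attempt: number) => number = () => 0,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise((resolve) => setTimeout(resolve, ms))
): Promise<T> => {
  let retries = 0;
  for (;;) {
    try {
      return await cb();
    } catch (error) {
      // Give up once every retry has been used.
      if (retries >= maxAttempts) {
        throw error;
      }
      await sleep(nextDelayMs(retries));
      retries++;
    }
  }
};

// Counts how often a callback that always fails gets invoked.
const countCalls = async (maxAttempts: number): Promise<number> => {
  let calls = 0;
  const cb = async (): Promise<never> => {
    calls++;
    throw new Error('My transient error');
  };
  await retryWithBackoff(cb, maxAttempts).catch(() => undefined);
  return calls;
};
```

Under these assumptions, `countCalls(3)` resolves to 4: one initial call plus three retries, which is the `maxAttempts + 1` in the assertion.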

adcoelho · 2024-01-11T11:25:05Z

x-pack/plugins/cases/server/connectors/cases/utils.ts

@@ -21,3 +21,14 @@ export const partitionRecordsByError = (

  return [validRecords, errors];
 };
+
+export const partitionByNonFoundErrors = <T extends Array<{ statusCode: number }>>(


Was this also moved?

No, this is a new function that separates 404 errors from the rest of the errors.
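
A rough sketch of such a partition follows. `splitNotFoundErrors` is a hypothetical stand-in written for illustration: its body, its generic signature, and the order of the returned tuple are assumptions and may differ from `partitionByNonFoundErrors` in utils.ts.

```typescript
// Hypothetical helper: splits errors into 404s and everything else.
const splitNotFoundErrors = <T extends { statusCode: number }>(
  errors: T[]
): [T[], T[]] => {
  const notFound: T[] = [];
  const rest: T[] = [];
  for (const error of errors) {
    (error.statusCode === 404 ? notFound : rest).push(error);
  }
  return [notFound, rest];
};

const [missingRecords, otherErrors] = splitNotFoundErrors([
  { statusCode: 404, message: 'An error' },
  { statusCode: 409, message: 'updating records: mockBulkUpdateRecord error' },
]);
```

Keeping the 404s separate lets the caller treat "record does not exist yet" as a normal case while still throwing on conflicts and other failures.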

adcoelho · 2024-01-11T11:30:03Z

x-pack/plugins/cases/server/connectors/cases/cases_connector_executor.test.ts

+      expectCasesToHaveTheCorrectAlertsAttachedWithGrouping(casesClientMock);
+    });
+
+    it('attaches the alerts correctly while creating a record and another node has already created it', async () => {


Isn't this fundamentally the same as 'attaches the alerts correctly when bulkCreateRecord fails'?

adcoelho · 2024-01-11T11:37:13Z

x-pack/plugins/cases/server/connectors/cases/cases_connector_executor.test.ts

+      // conflict error. Another node had updated the record.
+      mockBulkUpdateRecord.mockResolvedValueOnce([
+        {
+          id: groupedAlertsWithOracleKey[0].oracleKey,
+          type: CASE_ORACLE_SAVED_OBJECT,
+          message: 'updating records: mockBulkUpdateRecord error',
+          statusCode: 409,
+          error: 'Conflict',
+        },
+      ]);
+
+      await expect(() =>
+        connectorExecutor.execute({
+          alerts,
+          groupingBy,
+          owner,
+          rule,
+          timeWindow,
+          reopenClosedCases,
+        })
+      ).rejects.toThrowErrorMatchingInlineSnapshot(
+        `"Conflict: updating records: mockBulkUpdateRecord error"`
+      );


I get that since we are testing retries we want to actually call the same thing (connectorExecutor.execute) twice.

How relevant is that though?

We always have the same chunks in these tests. We mock some error for mockBulk*Record and expect some snapshot for connectorExecutor.execute. Are we really ensuring something or is this just overhead?

I don't know, food for thought. Maybe some integration tests would be more useful.

The execution of the connector is designed to be as idempotent as it can be. Retrying should not affect the proper execution of the connector; alerts should always be attached to the correct cases. In the RFC you can see how the connector is modeled as a state machine where each error and retry leads to the correct state each time, at least in theory. The tests try to simulate multiple nodes making changes to cases at the same time (race conditions), which unfortunately cannot be tested with integration tests. When the system actions are ready we will write a lot of integration tests, but only for the logic of one node executing the connector.

On to the tests: the mock functions we are interested in have a chain of mockResolvedValueOnce(...).mockResolvedValueOnce(...). The first mockResolvedValueOnce applies to the first execution and the second one to the second. This way we simulate different responses based on different actions on different nodes. For example, if the first mockResolvedValueOnce fails (conflict), we retry, and the second one returns different results, then we are testing that on the first try another node made some change (increased the counter, for example, or created the case for us) and that on the second try we get the new state. The new state should not affect the execution of the connector, which should still attach the alerts to the correct case. The snapshot check ensures that the correct function threw the error and not another function. After checking the snapshot, we check which case the alerts got attached to.
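
The pattern can be illustrated without Jest. The sketch below uses a hand-rolled `respondOnce` queue as a stand-in for a chain of `mockResolvedValueOnce(...)` calls. `respondOnce`, `execute`, `simulate`, the `counter` field, and the record ID are all hypothetical and heavily simplified compared to the real executor.

```typescript
type RecordResult =
  | { id: string; counter: number }
  | { id: string; statusCode: number; error: string; message: string };

// Stand-in for jest.fn().mockResolvedValueOnce(a).mockResolvedValueOnce(b):
// each call resolves with the next queued response.
const respondOnce = <T>(...responses: T[]) => {
  const queue = [...responses];
  return async (): Promise<T> => {
    const next = queue.shift();
    if (next === undefined) {
      throw new Error('No more queued responses');
    }
    return next;
  };
};

// Simplified "execution": update the records and throw if any response is
// an error, so that the caller can retry.
const execute = async (bulkUpdateRecord: () => Promise<RecordResult[]>) => {
  const results = await bulkUpdateRecord();
  for (const result of results) {
    if ('statusCode' in result) {
      throw new Error(`${result.error}: ${result.message}`);
    }
  }
  return results;
};

const simulate = async () => {
  const bulkUpdateRecord = respondOnce<RecordResult[]>(
    // First execution: another node updated the record first (conflict).
    [
      {
        id: 'record-0',
        statusCode: 409,
        error: 'Conflict',
        message: 'updating records: mockBulkUpdateRecord error',
      },
    ],
    // Second execution (the retry): the new state is visible.
    [{ id: 'record-0', counter: 2 }]
  );

  const firstError = await execute(bulkUpdateRecord).catch((e: Error) => e.message);
  const retried = await execute(bulkUpdateRecord);
  return { firstError, retried };
};
```

The first call rejects with the conflict message, which corresponds to the inline snapshot check, and the second call returns the state left behind by the other node, which is what the follow-up assertions inspect.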

kibana-ci · 2024-01-11T15:20:13Z

💔 Build Failed

Failed CI Steps

Test Failures

[job] [logs] FTR Configs #4 / cases security and spaces enabled: basic Common update_alert_status should update the status of multiple alerts attached to multiple cases using the cases client
[job] [logs] FTR Configs #37 / cases security and spaces enabled: trial Common update_alert_status should update the status of multiple alerts attached to multiple cases using the cases client
[job] [logs] Jest Tests #8 / update Total comments and alerts calls the attachment service with the right params and returns the expected comments and alerts

Metrics [docs]

‼️ ERROR: no builds found for mergeBase sha [a1a9840]

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @cnasikas

## Summary

Depends on: #166267, #170326, #169484, #173740, #173763, #178068, #178307, #178600, #180437

PRs:
- #168370
- #169229
- #171754
- #172709
- #173012
- #175107
- #175452
- #175505
- #177033
- #178277
- #177139
- #179796

Fixes: #153837

## Testing

Run Kibana with `--run-examples` if you want to use the "Always firing" rule. Create a rule with a case action in observability and the stack. The security solution is not supported. You should not be able to assign a case action in a security solution rule.

1. Test the "Reopen closed cases" configuration.
2. Test the "Grouping by" configuration. Only one field is allowed. Not all fields are persisted in alerts. If you select a field not part of the alert the case action will create a case where the grouping value is set to `unknow`.
3. Test the "Time window" feature. You can comment out the validation to test for shorter times.
4. Verify that the case action is experimental.
5. Verify that based on the rule type the case is created in the correct solution.
6. Verify that you cannot create a rule with the case action on the basic license.
7. Verify that the execution of the case action fails if you do not have permission for cases. Pending work on the system actions framework level to not allow users to create rules with system actions where they do not have permission.
8. Stress test the case action by creating multiple rules.

### Checklist

Delete any items that are not applicable to this PR.

- [x] [Documentation](https://www.elastic.co/guide/en/kibana/master/development-documentation.html) was added for features that require explanation or tutorials
- [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios

### For maintainers

- [x] This was checked for breaking API changes and was [labeled appropriately](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)

## Release notes

Automatically create cases when an alert is triggered.

---------

Co-authored-by: kibanamachine <[email protected]>
Co-authored-by: adcoelho <[email protected]>
Co-authored-by: Janki Salvi <[email protected]>

cnasikas and others added 30 commits October 9, 2023 17:16

Register the case action

0e1bfb0

Register the cases oracle

4777f9c

[CI] Auto-commit changed files from 'node scripts/check_mappings_upda…

65a88fa

…te --fix'

Calculate the hash of the record ID

eeb35a6

Merge branch 'register_case_action' of github.com:cnasikas/kibana int…

649dea9

…o register_case_action

Get oracle record

4491476

Rename folder

f4a81a1

Sort grouping definition

5822d73

Increase counter

df77537

Change grouping to record

9955239

Make the rule ID optional in the key

58bc3d6

Better types

581819a

Add version when updating

449a1b7

Improve types

c210a0c

Fix tests

c802637

Merge branch 'case_action' into register_case_action

b249032

Add model version and improve mapping

c120c97

Fix tests

043a9fe

Fix mapping test

a9db13f

[CI] Auto-commit changed files from 'node scripts/check_mappings_upda…

32af9e3

…te --fix'

Merge branch 'case_action' into register_case_action

90acabf

Merge branch 'register_case_action' of github.com:cnasikas/kibana int…

b7605b7

…o register_case_action

Define connector params initial schema

7b27008

Bulk get records

13fd013

Group alerts and bulk get oracle records

69a8778

[CI] Auto-commit changed files from 'node scripts/lint_ts_projects --…

e5c73a3

…fix'

Bulk create records

1ba6117

Merge branch 'ca_part_2' of github.com:cnasikas/kibana into ca_part_2

bf8a32d

Add TODOs

c6982a5

Move bulkGetOrCreateOracleRecords logic to the connector

f184a08

cnasikas added 5 commits December 20, 2023 19:31

Merge branch 'case_action' into ca_error_handling

9ff2617

Add tests for jitter

69e90d6

Create retry service

aac09db

Retry the execution of the connector

3b9c564

Add retry tests

8af3c0f

cnasikas force-pushed the ca_error_handling branch from 7cd3130 to 8af3c0f Compare December 24, 2023 18:44

cnasikas added 4 commits December 24, 2023 20:51

Move mock data to the mock file

f214254

Create test helpers

96afd98

Merge branch 'case_action' into ca_error_handling

cae023b

Remove formatting in current_fields

a901029

cnasikas commented Jan 8, 2024

cnasikas commented Jan 9, 2024

Small improvements

00a2871

cnasikas marked this pull request as ready for review January 9, 2024 09:23

cnasikas requested a review from a team as a code owner January 9, 2024 09:23

adcoelho approved these changes Jan 11, 2024

PR feedback

068678a

cnasikas merged commit 4f81c0c into elastic:case_action Jan 12, 2024
16 of 23 checks passed

cnasikas deleted the ca_error_handling branch January 12, 2024 12:55
