[Cases] Case action: Error handling and retries #173012

Merged
merged 83 commits into from
Jan 12, 2024

Conversation

cnasikas
Member

@cnasikas cnasikas commented Dec 10, 2023

Summary

This PR:

  1. Creates the CasesConnectorError error
  2. Separates the execution logic by moving the current logic to a new class called CasesConnectorExecutor
  3. Lets the CasesConnector class handle only the retry logic of the connector
  4. Implements the full jitter backoff algorithm, which is used as the retry strategy of the connector
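
For readers unfamiliar with item 4: full jitter is commonly described as waiting a uniformly random time between zero and an exponentially growing, capped ceiling. The following is a minimal sketch under that description, not the code from this PR. The function name, the parameter names, and the default values are assumptions chosen for illustration.

```typescript
// Hypothetical helper, not the PR's implementation.
// Full jitter: delay = random(0, min(cap, base * 2^attempt)).
function fullJitterDelay(
  attempt: number,
  baseDelayMs: number = 1000, // assumed default
  maxDelayMs: number = 60000, // assumed default cap
  random: () => number = Math.random // injectable so the result can be tested
): number {
  const ceiling = Math.min(maxDelayMs, baseDelayMs * Math.pow(2, attempt));
  return Math.floor(random() * ceiling);
}

// Print one sampled delay per attempt; values differ on every run.
for (let attempt = 0; attempt < 5; attempt++) {
  console.log(`attempt ${attempt}: wait ${fullJitterDelay(attempt)} ms`);
}
```

Randomizing over the whole interval, rather than adding a small jitter to a fixed exponential delay, spreads out retries from nodes that failed at the same moment.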

Depends on: #172709

Checklist

Delete any items that are not applicable to this PR.

For maintainers

Member Author

@cnasikas cnasikas Jan 8, 2024

This file contains the same logic as before. The only changes concern how errors are thrown: specifically, all the this.handleAndThrowErrors(restOfErrors); lines plus the handleAndThrowErrors method itself. Nothing else changed.
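
To make the idea of a central error-throwing helper concrete, here is a rough sketch. It is an assumption-laden illustration, not the method from this PR: the class name SketchConnectorError, the RecordError shape, and the choice to throw only the first error are all hypothetical. The message format merely mirrors the snapshots that appear in the tests ("Conflict: getting records: ...").

```typescript
// Hypothetical shapes; only the field names mirror the mocked errors in the tests.
interface RecordError {
  statusCode: number;
  error: string;
  message: string;
}

// Hypothetical stand-in for a connector-specific error that keeps the status code.
class SketchConnectorError extends Error {
  constructor(message: string, public readonly statusCode: number) {
    super(message);
    this.name = 'SketchConnectorError';
  }
}

// Assumption: when any error is present, surface the first one and stop.
function handleAndThrowErrors(errors: RecordError[]): void {
  if (errors.length === 0) {
    return;
  }
  const first = errors[0];
  throw new SketchConnectorError(`${first.error}: ${first.message}`, first.statusCode);
}
```

Keeping the status code on the thrown error lets a caller, such as a retry layer, decide whether the failure is worth retrying.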

Member Author

Same tests as before, plus some new tests for the retry logic.

});
});

describe('Retries', () => {
Member Author

These are the new tests.

@@ -100,541 +86,34 @@ export class CasesConnector extends SubActionConnector<
}

public async run(params: CasesConnectorRunParams) {
const { alerts, groupingBy } = params;
Member Author

Moved the logic to the CasesConnectorExecutor class.

}

return counterLastUpdatedAtAsDate < parsedDate.toDate();
await this.retryService.retryWithBackoff(() => this._run(params));
Member Author

The CasesConnector class is responsible only for the retry logic.
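
The split described here can be pictured roughly as follows. This is a sketch of the delegation pattern only. Every name in it (ThinConnector, Executor, RetryService, RunParams) is a hypothetical stand-in, and the real classes carry far more context than shown.

```typescript
// All names below are hypothetical stand-ins for the classes discussed in this PR.
interface RunParams {
  alerts: unknown[];
  groupingBy: string[];
}

interface Executor {
  execute(params: RunParams): Promise<void>;
}

interface RetryService {
  retryWithBackoff<T>(cb: () => Promise<T>): Promise<T>;
}

class ThinConnector {
  constructor(
    private readonly retryService: RetryService,
    private readonly executor: Executor
  ) {}

  // Owns only the retry concern; the actual work lives in the executor.
  public async run(params: RunParams): Promise<void> {
    await this.retryService.retryWithBackoff(() => this.executor.execute(params));
  }
}
```

Because both collaborators are injected, a test of the connector can pass a pass-through retry service and a recording executor and assert only on how the executor was called.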

@@ -24,3 +27,111 @@ export const oracleRecordError: OracleRecordError = {
statusCode: 404,
message: 'An error',
};

export const alerts = [
Member Author

Moved from the test files.

Member Author

This test file covers only how the executor is called and the retry logic. All tests of the executor logic moved to cases_connector_executor.test.ts.

Member Author

All logic of the executor moved to cases_connector_executor.ts.

`"Conflict: getting records: mockBulkGetRecords error"`
);

resetCounters();
Member Author

The generation of the IDs is mocked. It uses counters to get an incremental ID each time an ID is requested. We need to reset the counters before we retry the execution.
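
A counter-backed ID mock of the kind described might look like the sketch below. Only resetCounters appears in the diff. The counter names, the ID format, and nextMockId are assumptions made for illustration.

```typescript
// Hypothetical counter-backed ID generator for tests.
const counters = { caseId: 0, oracleKey: 0 };
type CounterName = keyof typeof counters;

// Returns a deterministic, incrementing ID per counter name.
function nextMockId(name: CounterName): string {
  const id = `mock-${name}-${counters[name]}`;
  counters[name] += 1;
  return id;
}

// Called before a retried execution so it sees the same IDs as the first run.
function resetCounters(): void {
  counters.caseId = 0;
  counters.oracleKey = 0;
}
```

Without the reset, the second execution would be handed different IDs than the first, and assertions about which case received the alerts would no longer line up.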

@cnasikas cnasikas marked this pull request as ready for review January 9, 2024 09:23
@cnasikas cnasikas requested a review from a team as a code owner January 9, 2024 09:23
@elasticmachine
Contributor

Pinging @elastic/response-ops (Team:ResponseOps)

@elasticmachine
Contributor

Pinging @elastic/response-ops-cases (Feature:Cases)

import type { BackoffStrategy, BackoffFactory } from './types';

export class CaseConnectorRetryService {
private maxAttempts: number = 10;
Contributor

nit: does this need to be initialized here if the constructor sets it?

Member Author

Good point, probably not 🙂.

}

private stop(): void {
if (this.timer !== null) {
Contributor

nit: I think this check is unnecessary, clearTimeout will just do nothing.

Member Author

I did not know about it, thanks!

Member Author

I get this TS error. clearTimeout accepts a number or undefined. I will keep it as it is, ok?

[Screenshot of the TS error, taken 2024-01-11 at 4:55:39 PM]

Contributor

Weird, when I tried it locally it didn't show anything. Never mind then!
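
The trade-off in this thread can be shown with a small sketch. It assumes the timer field is typed as nullable, which is what the quoted check suggests. The class name and its other members are hypothetical.

```typescript
// Hypothetical holder for a single pending timer.
class TimerHolder {
  // null means "no timer scheduled".
  private timer: ReturnType<typeof setTimeout> | null = null;

  start(cb: () => void, delayMs: number): void {
    this.stop();
    this.timer = setTimeout(cb, delayMs);
  }

  stop(): void {
    // The guard narrows away null so the call type-checks under strict null checks.
    // At runtime, clearTimeout ignores ids that do not refer to an active timer.
    if (this.timer !== null) {
      clearTimeout(this.timer);
      this.timer = null;
    }
  }

  isScheduled(): boolean {
    return this.timer !== null;
  }
}
```

An alternative, if the guard is unwanted, would be to type the field as possibly undefined instead of null, since the standard typings for clearTimeout accept undefined.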

}
);

it('should succeed if cb does not throws', async () => {
Contributor

Suggested change
- it('should succeed if cb does not throws', async () => {
+ it('should succeed if cb does not throw', async () => {

`"My transient error"`
);

expect(cb).toBeCalledTimes(maxAttempts + 1);
Contributor

Why maxAttempts + 1?

Member Author

The execution goes as:

  1. The cb is called for the first time. cb throws an error. The retry service retries the cb.
  2. The cb is called for the second time. First retry. cb throws an error. The retry service retries the cb.
  3. The cb is called for the third time. Second retry. cb throws an error. The retry service retries the cb.
  4. The cb is called for the fourth time. Third retry. cb throws an error. The retry service does not retry and throws an error.

Basically the first execution of cb does not count as a retry.
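
The counting above, with maxAttempts set to 3, can be reproduced with a small retry loop. This is a sketch of the counting rule only, not the CaseConnectorRetryService from this PR. The function signature and the injectable wait parameter are assumptions, and the backoff delay is stubbed out so the example runs instantly.

```typescript
// Hypothetical retry loop; the delay is injected so the example does not sleep.
async function retryWithBackoff<T>(
  cb: () => Promise<T>,
  maxAttempts: number,
  wait: (retryNumber: number) => Promise<void> = async () => {}
): Promise<T> {
  let retries = 0;
  for (;;) {
    try {
      return await cb();
    } catch (error) {
      // The first call is not a retry, so cb runs at most maxAttempts + 1 times.
      if (retries >= maxAttempts) {
        throw error;
      }
      retries += 1;
      await wait(retries);
    }
  }
}

let calls = 0;
const alwaysFails = async (): Promise<never> => {
  calls += 1;
  throw new Error('My transient error');
};

retryWithBackoff(alwaysFails, 3).catch(() => {
  console.log(`cb was called ${calls} times`); // 4 = maxAttempts + 1
});
```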

@@ -21,3 +21,14 @@ export const partitionRecordsByError = (

return [validRecords, errors];
};

export const partitionByNonFoundErrors = <T extends Array<{ statusCode: number }>>(
Contributor

Was this also moved?

Member Author

No, this is a new function that separates 404 errors from the rest of the errors.
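
A possible shape for such a partition is sketched below. Only the function name and the generic constraint come from the diff. The body, the parameter name, and the order of the returned tuple are assumptions.

```typescript
// Hypothetical body; only the name and generic constraint appear in the diff.
const partitionByNonFoundErrors = <T extends Array<{ statusCode: number }>>(
  errors: T
): [T, T] => {
  // Assumed order: 404 errors first, everything else second.
  const notFoundErrors = errors.filter((error) => error.statusCode === 404) as T;
  const restOfErrors = errors.filter((error) => error.statusCode !== 404) as T;
  return [notFoundErrors, restOfErrors];
};

const [notFound, rest] = partitionByNonFoundErrors([
  { statusCode: 404 },
  { statusCode: 409 },
  { statusCode: 404 },
]);
console.log(notFound.length, rest.length);
```

Separating the two groups lets the caller treat a missing record as an expected outcome (create it) while still throwing on the remaining errors.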

expectCasesToHaveTheCorrectAlertsAttachedWithGrouping(casesClientMock);
});

it('attaches the alerts correctly while creating a record and another node has already created it', async () => {
Contributor

Isn't this fundamentally the same as 'attaches the alerts correctly when bulkCreateRecord fails'?

Comment on lines +1485 to +1507
// conflict error. Another node had updated the record.
mockBulkUpdateRecord.mockResolvedValueOnce([
{
id: groupedAlertsWithOracleKey[0].oracleKey,
type: CASE_ORACLE_SAVED_OBJECT,
message: 'updating records: mockBulkUpdateRecord error',
statusCode: 409,
error: 'Conflict',
},
]);

await expect(() =>
connectorExecutor.execute({
alerts,
groupingBy,
owner,
rule,
timeWindow,
reopenClosedCases,
})
).rejects.toThrowErrorMatchingInlineSnapshot(
`"Conflict: updating records: mockBulkUpdateRecord error"`
);
Contributor

I get that, since we are testing retries, we want to actually call the same thing (connectorExecutor.execute) twice.

How relevant is that though?

We always have the same chunks in these tests. We mock some error for mockBulk*Record and expect some snapshot for connectorExecutor.execute. Are we really ensuring something or is this just overhead?

I don't know, food for thought. Maybe some integration tests would be more useful.

Member Author

The execution of the connector is designed to be as idempotent as possible. Retrying should not affect the proper execution of the connector: alerts should always be attached to the correct cases. In the RFC you can see how the connector is modeled as a state machine where each error and retry leads to the correct state each time, at least in theory. The tests try to simulate multiple nodes making changes to cases at the same time (race conditions), which unfortunately cannot be tested with integration tests. When the system actions are ready we will write a lot of integration tests, but only for the logic of one node executing the connector.

As for the tests: the mock functions we are interested in have a chain of mockResolvedValueOnce(...).mockResolvedValueOnce(...). The first mockResolvedValueOnce is used on the first execution and the second one on the second. This way we simulate different responses caused by different actions on different nodes. For example, if the first mockResolvedValueOnce returns a failure (conflict), we retry, and the second call returns different results. This tests the scenario where another node made a change during the first try (increased the counter, for example, or created the case for us) and on the second try we get the new state. The new state should not affect the execution of the connector, which should still attach the alerts to the correct case. The snapshot check ensures that the expected function threw the error and not another function. After checking the snapshot we check which case the alerts got attached to.
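
The queued-response idea can be illustrated without a test framework. The sketch below uses a hand-rolled stand-in for a Jest mock, not Jest itself, so that it runs on its own. The names queuedMock, OracleResult, and executeOnce are hypothetical. One deliberate difference: this stand-in throws when its queue is empty, whereas a real Jest mock falls back to its default implementation.

```typescript
// Hand-rolled stand-in for a mock with queued one-time resolved values.
function queuedMock<T>(...responses: T[]): () => Promise<T> {
  const queue = responses.slice();
  return async () => {
    if (queue.length === 0) {
      throw new Error('no more queued responses');
    }
    return queue.shift() as T;
  };
}

interface OracleResult {
  statusCode: number;
  counter?: number;
}

// First execution: conflict, as if another node changed the record.
// Second execution: the new state written by that other node.
const bulkUpdateRecord = queuedMock<OracleResult>(
  { statusCode: 409 },
  { statusCode: 200, counter: 2 }
);

// One pass of a simplified execution that fails on conflict.
async function executeOnce(): Promise<number> {
  const result = await bulkUpdateRecord();
  if (result.statusCode === 409) {
    throw new Error('Conflict: updating records');
  }
  return result.counter ?? 0;
}
```

Calling executeOnce twice in a row mirrors the tests: the first call rejects with the conflict, and the second resolves with the state left behind by the simulated other node.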

@kibana-ci
Collaborator

kibana-ci commented Jan 11, 2024

💔 Build Failed

Failed CI Steps

Test Failures

  • [job] [logs] FTR Configs #4 / cases security and spaces enabled: basic Common update_alert_status should update the status of multiple alerts attached to multiple cases using the cases client
  • [job] [logs] FTR Configs #4 / cases security and spaces enabled: basic Common update_alert_status should update the status of multiple alerts attached to multiple cases using the cases client
  • [job] [logs] FTR Configs #37 / cases security and spaces enabled: trial Common update_alert_status should update the status of multiple alerts attached to multiple cases using the cases client
  • [job] [logs] FTR Configs #37 / cases security and spaces enabled: trial Common update_alert_status should update the status of multiple alerts attached to multiple cases using the cases client
  • [job] [logs] Jest Tests #8 / update Total comments and alerts calls the attachment service with the right params and returns the expected comments and alerts
  • [job] [logs] Jest Tests #8 / update Total comments and alerts calls the attachment service with the right params and returns the expected comments and alerts

Metrics [docs]

‼️ ERROR: no builds found for mergeBase sha [a1a9840]

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @cnasikas

@cnasikas cnasikas merged commit 4f81c0c into elastic:case_action Jan 12, 2024
16 of 23 checks passed
@cnasikas cnasikas deleted the ca_error_handling branch January 12, 2024 12:55
cnasikas added a commit that referenced this pull request Apr 12, 2024
## Summary

Depends on: #166267,
#170326,
#169484,
#173740,
#173763,
#178068,
#178307,
#178600,
#180437

PRs:
- #168370
- #169229
- #171754
- #172709
- #173012
- #175107
- #175452
- #175505
- #177033
- #178277
- #177139
- #179796

Fixes: #153837

## Testing

Run Kibana with `--run-examples` if you want to use the "Always firing"
rule.

Create a rule with a case action in observability and the stack. The
security solution is not supported. You should not be able to assign a
case action in a security solution rule.

1. Test the "Reopen closed cases" configuration.
2. Test the "Grouping by" configuration. Only one field is allowed. Not
all fields are persisted in alerts. If you select a field not part of
the alert the case action will create a case where the grouping value is
set to `unknow`.
3. Test the "Time window" feature. You can comment out the validation to
test for shorter times.
4. Verify that the case action is experimental.
5. Verify that based on the rule type the case is created in the correct
solution.
6. Verify that you cannot create a rule with the case action on the
basic license.
7. Verify that the execution of the case action fails if you do not have
permission for cases. Pending work on the system actions framework level
to not allow users to create rules with system actions where they do not
have permission.
8. Stress test the case action by creating multiple rules.

### Checklist

Delete any items that are not applicable to this PR.

- [x]
[Documentation](https://www.elastic.co/guide/en/kibana/master/development-documentation.html)
was added for features that require explanation or tutorials
- [x] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios

### For maintainers

- [x] This was checked for breaking API changes and was [labeled
appropriately](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)

## Release notes

Automatically create cases when an alert is triggered.

---------

Co-authored-by: kibanamachine <[email protected]>
Co-authored-by: adcoelho <[email protected]>
Co-authored-by: Janki Salvi <[email protected]>
Labels
Feature:Cases (Cases feature), release_note:skip (Skip the PR/issue when compiling release notes), Team:ResponseOps (Label for the ResponseOps team, formerly the Cases and Alerting teams), v8.13.0

5 participants