[Cases] Case action: Error handling and retries #173012
…o register_case_action
…o register_case_action
7cd3130
to
8af3c0f
Compare
This file contains the same logic as before. The only changes concern throwing errors: specifically, all the `this.handleAndThrowErrors(restOfErrors);` lines plus the `handleAndThrowErrors` method. Nothing else changed.
Same tests as before, plus some new tests for the retry logic.
  });
});

describe('Retries', () => {
These are the new tests.
@@ -100,541 +86,34 @@ export class CasesConnector extends SubActionConnector<
  }

  public async run(params: CasesConnectorRunParams) {
    const { alerts, groupingBy } = params;
Moved the logic to the `CasesConnectorExecutor` class.
}

return counterLastUpdatedAtAsDate < parsedDate.toDate();
await this.retryService.retryWithBackoff(() => this._run(params));
The `CasesConnector` class is responsible only for the retry logic.
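For readers skimming the thread, here is a minimal sketch of that split, assuming a connector that only wraps the executor call in the retry service. Only `run`, `_run`, and `retryService.retryWithBackoff` appear in the diff; `ConnectorSketch`, `RunParams`, `RetryService`, and `Executor` are hypothetical names used for illustration and are not the actual Kibana types.

```typescript
// Hypothetical shapes for illustration; not the real Kibana interfaces.
interface RunParams {
  alerts: unknown[];
  groupingBy: string[];
}

interface RetryService {
  retryWithBackoff<T>(cb: () => Promise<T>): Promise<T>;
}

interface Executor {
  execute(params: RunParams): Promise<void>;
}

class ConnectorSketch {
  constructor(
    private readonly retryService: RetryService,
    private readonly executor: Executor
  ) {}

  // The connector owns only the retry wrapper.
  public async run(params: RunParams): Promise<void> {
    await this.retryService.retryWithBackoff(() => this._run(params));
  }

  // All business logic is delegated to the executor.
  private async _run(params: RunParams): Promise<void> {
    await this.executor.execute(params);
  }
}
```

With this shape, the connector tests only need to assert how the executor is called and how retries behave, while the executor can be tested on its own.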
@@ -24,3 +27,111 @@ export const oracleRecordError: OracleRecordError = {
  statusCode: 404,
  message: 'An error',
};

export const alerts = [
Moved from the test files.
This test covers only how the executor is called and the retry logic. All tests of the executor logic moved to `cases_connector_executor.test.ts`.
All logic of the executor moved to `cases_connector_executor.ts`.
  `"Conflict: getting records: mockBulkGetRecords error"`
);

resetCounters();
The generation of the IDs is mocked. It uses counters to get an incremental ID each time an ID is requested. We need to reset the counters before we retry the execution.
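A rough sketch of what such a counter-based mock could look like, to show why the reset matters before a retry. Only `resetCounters` is named in the diff; `nextMockId` and the per-prefix `Map` are assumptions made for illustration, not the real test helper.

```typescript
// Hypothetical mock: one counter per prefix yields incremental IDs.
const counters = new Map<string, number>();

function nextMockId(prefix: string): string {
  const next = (counters.get(prefix) ?? 0) + 1;
  counters.set(prefix, next);
  return `${prefix}-${next}`;
}

// Resetting makes a retried execution see the same IDs as the first attempt.
function resetCounters(): void {
  counters.clear();
}
```

Without the reset, the second execution would request IDs that continue from where the first one stopped, so assertions on specific IDs would no longer match.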
Pinging @elastic/response-ops (Team:ResponseOps)
Pinging @elastic/response-ops-cases (Feature:Cases)
import type { BackoffStrategy, BackoffFactory } from './types';

export class CaseConnectorRetryService {
  private maxAttempts: number = 10;
nit: does this need to be initialized here if the constructor sets it?
Good point, probably not 🙂.
}

private stop(): void {
  if (this.timer !== null) {
nit: I think this check is unnecessary, `clearTimeout` will just do nothing.
I did not know about it, thanks!
Weird, when I tried it locally it didn't show anything. Never mind then!
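To illustrate the nit, here is a small sketch of a `stop()` without the guard. It assumes the timer field is typed as `ReturnType<typeof setTimeout> | undefined` rather than `null` as in the diff, since that is what the `clearTimeout` typings accept; `TimerHolder` and `start` are hypothetical names.

```typescript
// Hypothetical holder; the real service keeps more state than this.
class TimerHolder {
  private timer: ReturnType<typeof setTimeout> | undefined;

  public start(ms: number, onFire: () => void): void {
    this.stop();
    this.timer = setTimeout(onFire, ms);
  }

  public stop(): void {
    // No guard needed: clearTimeout ignores an undefined or already-cleared handle.
    clearTimeout(this.timer);
    this.timer = undefined;
  }
}
```

Calling `stop()` before any timer was started, or twice in a row, is therefore safe.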
  }
);

it('should succeed if cb does not throws', async () => {
- it('should succeed if cb does not throws', async () => {
+ it('should succeed if cb does not throw', async () => {
  `"My transient error"`
);

expect(cb).toBeCalledTimes(maxAttempts + 1);
Why `maxAttempts + 1`?
The execution goes as follows:
- The `cb` is called for the first time. `cb` throws an error. The retry service retries the `cb`.
- The `cb` is called for a second time. First retry. `cb` throws an error. The retry service retries the `cb`.
- The `cb` is called for a third time. Second retry. `cb` throws an error. The retry service retries the `cb`.
- The `cb` is called for the fourth time. Third retry. `cb` throws an error. The retry service does not retry and throws an error.
Basically, the first execution of `cb` does not count as a retry.
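The counting above can be sketched as follows. This is a simplified stand-in and not the `CaseConnectorRetryService` from the PR: the free function, its `maxAttempts` parameter, and the `nextDelayMs` callback are assumptions chosen to show why `cb` ends up being called `maxAttempts + 1` times.

```typescript
// Hypothetical helper: `maxAttempts` counts retries, not total calls.
async function retryWithBackoff<T>(
  cb: () => Promise<T>,
  maxAttempts: number,
  nextDelayMs: (attempt: number) => number = () => 0
): Promise<T> {
  let attempt = 0; // number of retries performed so far
  for (;;) {
    try {
      return await cb();
    } catch (error) {
      if (attempt >= maxAttempts) {
        // All retries are used up: surface the last error.
        throw error;
      }
      attempt++;
      // Wait before the next retry; the delay policy is supplied by the caller.
      await new Promise<void>((resolve) => setTimeout(resolve, nextDelayMs(attempt)));
    }
  }
}
```

With `maxAttempts = 3` and a `cb` that always throws, the initial call plus three retries gives four calls in total, which matches the `toBeCalledTimes(maxAttempts + 1)` assertion.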
@@ -21,3 +21,14 @@ export const partitionRecordsByError = (

  return [validRecords, errors];
};

export const partitionByNonFoundErrors = <T extends Array<{ statusCode: number }>>(
Was this also moved?
No, this is a new function that separates 404 errors from all other errors.
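A possible shape for such a partition, as a sketch only. The generic signature is taken from the diff; the body, the `[T, T]` return type, and the order of the returned tuple (404 errors first, everything else second) are assumptions, since the diff shows only the first line of the function.

```typescript
// Sketch: split errors into "not found" (404) and everything else.
const partitionByNonFoundErrors = <T extends Array<{ statusCode: number }>>(
  errors: T
): [T, T] => {
  const nonFoundErrors = errors.filter((error) => error.statusCode === 404) as T;
  const restOfErrors = errors.filter((error) => error.statusCode !== 404) as T;

  return [nonFoundErrors, restOfErrors];
};
```

A caller can then treat the 404 errors as records that still need to be created, and pass the remaining errors to something like `handleAndThrowErrors`.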
  expectCasesToHaveTheCorrectAlertsAttachedWithGrouping(casesClientMock);
});

it('attaches the alerts correctly while creating a record and another node has already created it', async () => {
Isn't this fundamentally the same as 'attaches the alerts correctly when bulkCreateRecord fails'?
// conflict error. Another node had updated the record.
mockBulkUpdateRecord.mockResolvedValueOnce([
  {
    id: groupedAlertsWithOracleKey[0].oracleKey,
    type: CASE_ORACLE_SAVED_OBJECT,
    message: 'updating records: mockBulkUpdateRecord error',
    statusCode: 409,
    error: 'Conflict',
  },
]);

await expect(() =>
  connectorExecutor.execute({
    alerts,
    groupingBy,
    owner,
    rule,
    timeWindow,
    reopenClosedCases,
  })
).rejects.toThrowErrorMatchingInlineSnapshot(
  `"Conflict: updating records: mockBulkUpdateRecord error"`
);
I get that since we are testing retries we want to actually call the same thing (`connectorExecutor.execute`) twice.
How relevant is that though?
We always have the same chunks in these tests. We mock some error for `mockBulk*Record` and expect some snapshot for `connectorExecutor.execute`. Are we really ensuring something or is this just overhead?
I don't know, food for thought. Maybe some integration tests would be more useful.
The execution of the connector is designed to be as idempotent as possible. Retrying should not affect the proper execution of the connector: alerts should always be attached to the correct cases. In the RFC you can see how the connector is simulated as a state machine where each error and retry leads to the correct state each time, at least in theory. The tests try to simulate multiple nodes making changes to cases at the same time (race conditions), which unfortunately cannot be tested with integration tests. When the system actions are ready we will write a lot of integration tests, but only for the logic of one node executing the connector.
As for the tests: the mock functions we are interested in have a chain of `mockResolvedValueOnce(...).mockResolvedValueOnce(...)`. The first `mockResolvedValueOnce` is used on the first execution and the second one on the second. This way we try to simulate different responses based on different actions on different nodes. For example, if the first `mockResolvedValueOnce` fails (conflict), we retry, and the second call returns different results, then we are testing that on the first try another node made some change (increased the counter, for example, or created the case for us) and that on the second try we get the new state. The new state should not affect the execution of the connector, which should still attach the alerts to the correct case. The snapshot check ensures that the correct function threw the error and not another function. After checking the snapshot, we check which case the alerts got attached to.
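To make the "different response per execution" idea concrete without depending on jest, here is a self-contained sketch. `resolveOnceInOrder` is a hypothetical stand-in for a chain of `mockResolvedValueOnce(...)` calls, and `executeOnce` is a drastically simplified, hypothetical version of one executor run; neither is code from the PR.

```typescript
// Hypothetical stand-in for jest's chained mockResolvedValueOnce calls:
// each call resolves with the next queued response.
function resolveOnceInOrder<T>(...responses: T[]): () => Promise<T> {
  const queue = responses.slice();
  return async () => {
    const next = queue.shift();
    if (next === undefined) {
      throw new Error('no more mocked responses');
    }
    return next;
  };
}

type UpdateResult = { id: string; statusCode: number; error?: string; message?: string };

// Simplified single execution: throw on a conflict, otherwise report the record id.
async function executeOnce(bulkUpdate: () => Promise<UpdateResult>): Promise<string> {
  const result = await bulkUpdate();
  if (result.statusCode === 409) {
    throw new Error(`${result.error}: ${result.message}`);
  }
  return result.id;
}
```

Queuing a 409 response followed by a 200 response models another node changing the record between the first attempt and the retry: the first `executeOnce` call rejects with the conflict, and the second one sees the new state and succeeds.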
## Summary

Depends on: #166267, #170326, #169484, #173740, #173763, #178068, #178307, #178600, #180437

PRs:
- #168370
- #169229
- #171754
- #172709
- #173012
- #175107
- #175452
- #175505
- #177033
- #178277
- #177139
- #179796

Fixes: #153837

## Testing

Run Kibana with `--run-examples` if you want to use the "Always firing" rule. Create a rule with a case action in observability and the stack. The security solution is not supported. You should not be able to assign a case action in a security solution rule.

1. Test the "Reopen closed cases" configuration.
2. Test the "Grouping by" configuration. Only one field is allowed. Not all fields are persisted in alerts. If you select a field not part of the alert the case action will create a case where the grouping value is set to `unknow`.
3. Test the "Time window" feature. You can comment out the validation to test for shorter times.
4. Verify that the case action is experimental.
5. Verify that based on the rule type the case is created in the correct solution.
6. Verify that you cannot create a rule with the case action on the basic license.
7. Verify that the execution of the case action fails if you do not have permission for cases. Pending work on the system actions framework level to not allow users to create rules with system actions where they do not have permission.
8. Stress test the case action by creating multiple rules.

### Checklist

Delete any items that are not applicable to this PR.

- [x] [Documentation](https://www.elastic.co/guide/en/kibana/master/development-documentation.html) was added for features that require explanation or tutorials
- [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios

### For maintainers

- [x] This was checked for breaking API changes and was [labeled appropriately](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)

## Release notes

Automatically create cases when an alert is triggered.

---------

Co-authored-by: kibanamachine <[email protected]>
Co-authored-by: adcoelho <[email protected]>
Co-authored-by: Janki Salvi <[email protected]>
Summary
This PR:
- `CasesConnectorError` error
- `CasesConnectorExecutor`
- `CasesConnector` class handle only the retry logic of the connector

Depends on: #172709