CORE-20824: Merging forward updates from release/os/5.2 to release/os/5.3 - 2024-10-09 #6359

When lost membership requests were investigated it seems to be a problem that occurs during rebalance in the state and event pattern. After the rebalance the commits that were not successful don't get re-processed and are instead skipped over. As part of handling this, this adds a reset of the event offset position so we try processing again from the correct place after a RebalanceInProgressException.

Change CommitFailedException classification from fatal to transient. This is required as the kafka connection tests exposed that we mark CommitFailedException as fatal and therefore do not retry. A CommitFailedException means we should abort the transaction but by classifying it as fatal we will bubble this up and we will not retry on any level. A CommitFailedException is fatal at the transaction level but not at the worker level so this should be changed.

This issue appeared while running kafka connection tests which kill the kafka broker. When the connection to the kafka broker was lost, ERROR level logs related to CryptoOpsClientImpl appeared several times. This is because there was a failure in executing net.corda.data.crypto.wire.ops.rpc.queries.SupportedSchemesRpcQuery. It has been verified how this is handled in the rest worker: KeyRestResource calls listSchemes and fails at cryptoOpsClient.getSupportedSchemes. The exception handling for listSchemes will return an InternalServerException. This extends HttpApiException and the response code will be 500. As we return an error to the rest client and the system can recover, the log level for the CryptoOpsClient failing the operation should be WARN level instead.

…xception (#6299) This change ensures that the exception `CordaMessageAPIProducerRequiresReset` is correctly handled, by re-raising it to the context which owns the `CordaProducer` and resetting the producer safely. Unit tests have also been added to verify this behaviour.

…lNodeInfoProcessor (#6296) This issue appeared while running kafka connection tests which kill the kafka broker. In a flow worker compacted subscription we got a org.apache.kafka.common.errors.TimeoutException and handled it as fatal. This is because we have no error handling in our KafkaAdmin class for the exceptions that can be thrown by KafkaFuture.get() . The KafkaAdmin client is called from the VirtualNodeInfoProcessor which also does not catch this. This PR adds retry logic to the KafkaAdmin for getTopics and in the case of exhausting these retries, error handling in VirtualNodeInfoProcessor to handle this better and prevent this exception bubbling further.

…xceptionUtils (#6311) This is required so that the exception is correctly handled in the Kafka message patterns.

…cerRequiresReset (#6303) The `ProducerFencedException` in these cases is likely due to an invalid epoch being used in an operation as a result of a timeout on the broker side. The correct action in this case is to reset the producer.

…10-09

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CORE-20824: Merging forward updates from release/os/5.2 to release/os/5.3 - 2024-10-09 #6359

CORE-20824: Merging forward updates from release/os/5.2 to release/os/5.3 - 2024-10-09 #6359

Commits on Oct 8, 2024

Commits on Oct 9, 2024

CORE-20824: Merging forward updates from release/os/5.2 to release/os/5.3 - 2024-10-09 #6359

Are you sure you want to change the base?

CORE-20824: Merging forward updates from release/os/5.2 to release/os/5.3 - 2024-10-09 #6359

Commits on Oct 8, 2024

Commits on Oct 9, 2024