
Translate gRPC Canceled code to Nexus HandlerErrorTypeInternal #1680

Merged: 2 commits merged into temporalio:master on Oct 25, 2024

Conversation

bergundy (Member) opened this PR:
To ensure these errors can be retried by the Nexus machinery in the server.

bergundy requested a review from a team as a code owner on October 20, 2024 at 21:14.
```diff
@@ -462,11 +462,11 @@ func convertServiceError(err error) error {
 	errMessage := err.Error()
 
 	switch st.Code() {
-	case codes.AlreadyExists, codes.Canceled, codes.InvalidArgument, codes.FailedPrecondition, codes.OutOfRange:
+	case codes.AlreadyExists, codes.InvalidArgument, codes.FailedPrecondition, codes.OutOfRange:
```
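For context, below is a minimal sketch of the mapping this change produces. The nexus.HandlerErrorf helper and the overall function shape are assumptions for illustration, not the SDK's verbatim code; the gRPC codes and the two handler error types come from the diff and the PR title. With codes.Canceled removed from the non-retryable group, it falls through to the default branch and surfaces as HandlerErrorTypeInternal, which the server treats as retryable.

```go
package sketch

import (
	"github.com/nexus-rpc/sdk-go/nexus"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// convertServiceErrorSketch is an illustrative reconstruction, not the SDK's
// actual code: nexus.HandlerErrorf and the function shape are assumptions.
func convertServiceErrorSketch(st *status.Status) error {
	errMessage := st.Message()

	switch st.Code() {
	case codes.AlreadyExists, codes.InvalidArgument, codes.FailedPrecondition, codes.OutOfRange:
		// Non-retryable: the request is invalid as sent, so retrying won't help.
		return nexus.HandlerErrorf(nexus.HandlerErrorTypeBadRequest, "%s", errMessage)
	default:
		// Everything else, now including codes.Canceled, is surfaced as an
		// internal handler error, which the server's Nexus machinery may retry.
		return nexus.HandlerErrorf(nexus.HandlerErrorTypeInternal, "%s", errMessage)
	}
}
```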
Member commented on this diff:
(making comment here because it's easier to have threads as PR comments)

> To ensure these errors can be retried by the Nexus machinery in the server.

Don't we want gRPC Canceled errors to not be retried? They aren't retried anywhere else, and they're usually the result of a user cancel or a user-supplied timeout.

bergundy (Member, Author) on Oct 21, 2024:

Here the timeout applies to a single RPC, not the entire operation; we want to retry the RPC (to start a workflow).
We've been seeing CI failures since introducing the gRPC status translation logic, and there's definitely no intent to fail these Nexus operations. I'm not 100% sure where these Canceled errors are coming from; they could be coming internally from the server or the SDK.

cretz (Member) on Oct 21, 2024:

I think it's worth investigating. I can't think of a scenario where we want to ever retry this status code. It's almost always due to explicit cancel/timeout unless there's a scenario I'm not familiar with.

bergundy (Member, Author):

Request timeout should only fail a single Nexus HTTP request, not the entire operation.

Contributor:

If a Nexus request is timing out on the worker, we shouldn't even respond.

bergundy (Member, Author):

I tried finding this error in the logs and couldn't, so I'm still not sure where it is coming from. All I know right now is that our release validation tests fail with "400 Bad Request": context canceled when running an SDK version with this error translation enabled, where they didn't fail before.

bergundy (Member, Author):

I found this case in the logs. I can definitely see the server reporting that the SDK is returning a BadRequest error, along with what seems to be the corresponding StartWorkflowExecution RPC ending with a gRPC Canceled status.

We also know that the Nexus handler context in the SDK wasn't canceled, because otherwise it would not have sent a response back and the server would have reported this as a timeout. I don't see a strong enough reason to fail a Nexus operation because a single RPC ended up as gRPC Canceled.

Contributor:

Canceled can also be returned on the client side if the user cancelled some gRPC request they are making in a handler. I understand that in this case it's desirable to retry it, but it isn't obvious that all users would want to map this to HandlerErrorTypeInternal. I think we should prioritize how we are going to allow users to customize this policy; updating the SDK because a client needs a different policy is not a scalable solution.

https://grpc.github.io/grpc/core/md_doc_statuscodes.html

cretz (Member) on Oct 22, 2024:

To me, a handler implementer making a gRPC call in a handler and then canceling it (which may be due to the context being otherwise canceled), causing the gRPC call to return Canceled, is the same as an application error, both in Nexus handlers and in Temporal activities. But if it was due to an actual context cancel from the caller, it should not be retryable, like Temporal activities, which is what the eager ctx.Err() checks seem to accomplish (a rough sketch of that check follows at the end of this comment). Though even those may be wrong if a context cancel is not always a caller cancel (e.g. it could be worker shutdown), so I'd have to see integration tests for both.

So I think I agree kinda with what this PR is doing.

I think the confusion may be the fact that Nexus chose to split application error into bad request and internal just based on retryability when really just because an error is "retryable" doesn't mean it's "internal".

Maybe the following integration tests (if they don't already exist) would help confirm:

  • Caller of operation cancels operation mid-gRPC call in handler, what does caller see. I expect operation cancel.
  • Operation is canceled due to worker shutdown mid-gRPC call in handler, what does caller see. I expect retryable error.
  • Handler implementer has their own context and calls cancel on it mid-gRPC call in handler, what does caller see. I expect retryable error (I think?).
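Below is a rough, self-contained sketch of the eager ctx.Err() check described in the comment above; handleStart and convert are hypothetical names, and this is not the SDK's actual code. The idea: if the handler's own context is already done (caller cancel, deadline, or worker shutdown wired through that context), that error is surfaced directly instead of translating whatever the in-flight gRPC call returned.

```go
package main

import (
	"context"
	"errors"
	"fmt"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// handleStart sketches the "eager ctx.Err() check": when a downstream gRPC
// call fails, first look at the handler's own context. If it is already done,
// surface that directly; only otherwise fall back to translating the error.
func handleStart(ctx context.Context, call func(context.Context) error) error {
	err := call(ctx)
	if err == nil {
		return nil
	}
	if ctxErr := ctx.Err(); ctxErr != nil {
		return ctxErr // handler context cancellation wins over translation
	}
	return convert(err)
}

// convert stands in for the SDK's gRPC-status-to-Nexus-error translation.
func convert(err error) error {
	return fmt.Errorf("translated (retryable) handler error: %w", err)
}

func main() {
	// Downstream RPC returned gRPC Canceled, but the handler context is fine:
	// the error is translated (and, per this PR, would be retryable).
	fmt.Println(handleStart(context.Background(), func(context.Context) error {
		return status.Error(codes.Canceled, "rpc canceled")
	}))

	// The handler context itself was canceled: the context error is returned.
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	fmt.Println(handleStart(ctx, func(context.Context) error {
		return errors.New("rpc aborted")
	}))
}
```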

bergundy (Member, Author):

> I think the confusion may be the fact that Nexus chose to split application error into bad request and internal just based on retryability when really just because an error is "retryable" doesn't mean it's "internal".

Agree, we may need more statuses to better express more scenarios.

> Caller of operation cancels operation mid-gRPC call in handler, what does caller see. I expect operation cancel.

What do you mean by "cancels operation"? Are you referring to the CancelOperation RPC from the Nexus spec?
If so, yes, that is only applicable to async operations (support for sync operations, and for canceling operations that haven't started yet, is coming soon). If you're referring to canceling the RPC, the caller would get a ctx.Err() and may choose to handle it however they see fit. They probably don't want to consider the operation canceled though; once the first request is sent, it's important to see the operation through and not abandon it (unless the caller explicitly wants to abandon it).

> Operation is canceled due to worker shutdown mid-gRPC call in handler, what does caller see. I expect retryable error.

Seems like you're referring to the RPC here; yes, the RPC will return a retryable error with this change.

> Handler implementer has their own context and calls cancel on it mid-gRPC call in handler, what does caller see. I expect retryable error (I think?).

That would result in either context.Canceled or a gRPC Canceled status error; both of these are retryable, and should be, IMHO (this is the same case as above).
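For reference, here is a small self-contained helper, assuming only the standard library and grpc-go, that recognizes both forms mentioned here: Go's context.Canceled and a gRPC status error with code Canceled. The isCanceled name is illustrative, not part of the SDK.

```go
package main

import (
	"context"
	"errors"
	"fmt"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// isCanceled reports whether err is Go's context.Canceled or a gRPC error
// carrying the Canceled status code, the two forms a canceled in-handler RPC
// can take.
func isCanceled(err error) bool {
	if errors.Is(err, context.Canceled) {
		return true
	}
	st, ok := status.FromError(err)
	return ok && st.Code() == codes.Canceled
}

func main() {
	fmt.Println(isCanceled(context.Canceled))                         // true
	fmt.Println(isCanceled(status.Error(codes.Canceled, "canceled"))) // true
	fmt.Println(isCanceled(errors.New("something else")))             // false
}
```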

bergundy (Member, Author):

Anything blocking this PR now that we've clarified the behavior?

bergundy (Member, Author):

I'm going to potentially revisit this auto error translation soon. I think these are good defaults, but we'll see how this behaves in the wild. We may need to allow users to opt out of this, or maybe only do auto error translation for temporalnexus primitives like ExecuteWorkflow and others that will come later.

bergundy enabled auto-merge (squash) on October 25, 2024 at 00:05.
bergundy merged commit c0a1b59 into temporalio:master on Oct 25, 2024; 13 checks passed.
bergundy deleted the nexus-grpc-canceled-to-internal branch on October 25, 2024 at 00:31.