
Translate gRPC Canceled code to Nexus HandlerErrorTypeInternal #1680

Merged: 2 commits merged into temporalio:master on Oct 25, 2024

Conversation

bergundy (Member) opened this PR:
To ensure these errors can be retried by the Nexus machinery in the server.

bergundy requested a review from a team as a code owner on October 20, 2024 at 21:14.
```diff
@@ -462,11 +462,11 @@ func convertServiceError(err error) error {
 	errMessage := err.Error()
 
 	switch st.Code() {
-	case codes.AlreadyExists, codes.Canceled, codes.InvalidArgument, codes.FailedPrecondition, codes.OutOfRange:
+	case codes.AlreadyExists, codes.InvalidArgument, codes.FailedPrecondition, codes.OutOfRange:
```
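For context, below is a minimal sketch of the mapping this change produces. The nexus.HandlerErrorf helper and the overall function shape are assumptions for illustration, not the SDK's verbatim code; the gRPC codes and the two handler error types come from the diff and the PR title. With codes.Canceled removed from the non-retryable group, it falls through to the default branch and surfaces as HandlerErrorTypeInternal, which the server treats as retryable.

```go
package sketch

import (
	"github.com/nexus-rpc/sdk-go/nexus"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// convertServiceErrorSketch is an illustrative reconstruction, not the SDK's
// actual code: nexus.HandlerErrorf and the function shape are assumptions.
func convertServiceErrorSketch(st *status.Status) error {
	errMessage := st.Message()

	switch st.Code() {
	case codes.AlreadyExists, codes.InvalidArgument, codes.FailedPrecondition, codes.OutOfRange:
		// Non-retryable: the request is invalid as sent, so retrying won't help.
		return nexus.HandlerErrorf(nexus.HandlerErrorTypeBadRequest, "%s", errMessage)
	default:
		// Everything else, now including codes.Canceled, is surfaced as an
		// internal handler error, which the server's Nexus machinery may retry.
		return nexus.HandlerErrorf(nexus.HandlerErrorTypeInternal, "%s", errMessage)
	}
}
```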
Member commented on this diff:
(making comment here because it's easier to have threads as PR comments)

> To ensure these errors can be retried by the Nexus machinery in the server.

Don't we want gRPC Canceled errors to not be retried? They aren't retried anywhere else, and they're usually the result of a user cancel or a user-supplied timeout.

bergundy (Member, Author) on Oct 21, 2024:

Here the timeout applies to a single RPC, not the entire operation; we want to retry the RPC (to start a workflow).
We've been seeing CI failures since introducing the gRPC status translation logic, and there's definitely no intent to fail these Nexus operations. I'm not 100% sure where these Canceled errors are coming from; they could be coming internally from the server or the SDK.

cretz (Member) on Oct 21, 2024:

I think it's worth investigating. I can't think of a scenario where we want to ever retry this status code. It's almost always due to explicit cancel/timeout unless there's a scenario I'm not familiar with.

bergundy (Member, Author):

Request timeout should only fail a single Nexus HTTP request, not the entire operation.

Contributor:

If a Nexus request is timing out on the worker, we shouldn't even respond.

bergundy (Member, Author):

I tried finding this error in the logs and couldn't, so I'm still not sure where it is coming from. All I know right now is that our release validation tests fail with "400 Bad Request": context canceled when running an SDK version with this error translation enabled, where they didn't fail before.

bergundy (Member, Author):

I found this case in the logs. I can definitely see the server reporting that the SDK is returning a BadRequest error, along with what seems to be the corresponding StartWorkflowExecution RPC ending with a gRPC Canceled status.

We also know that the Nexus handler context in the SDK wasn't canceled, because otherwise it would not have sent a response back and the server would have reported this as a timeout. I don't see a strong enough reason to fail a Nexus operation because a single RPC ended up as gRPC Canceled.

Contributor:

Canceled can also be returned on the client side if the user cancelled some gRPC request they are making in a handler. I understand that in this case it's desirable to retry it, but it isn't obvious that all users would want to map this to HandlerErrorTypeInternal. I think we should prioritize how we are going to allow users to customize this policy; updating the SDK because a client needs a different policy is not a scalable solution.

https://grpc.github.io/grpc/core/md_doc_statuscodes.html

cretz (Member) on Oct 22, 2024:

To me, a handler implementer making a gRPC call in a handler and then canceling it (which may be due to the context being otherwise canceled), causing the gRPC call to return Canceled, is the same as an application error, both in Nexus handlers and in Temporal activities. But if it was due to an actual context cancel from the caller, it should not be retryable, like Temporal activities, which is what the eager ctx.Err() checks seem to accomplish (a rough sketch of that check follows at the end of this comment). Though even those may be wrong if a context cancel is not always a caller cancel (e.g. it could be worker shutdown), so I'd have to see integration tests for both.

So I think I agree kinda with what this PR is doing.

I think the confusion may be the fact that Nexus chose to split application error into bad request and internal just based on retryability when really just because an error is "retryable" doesn't mean it's "internal".

Maybe the following integration tests (if they don't already exist) would help confirm:

  • Caller of operation cancels operation mid-gRPC call in handler, what does caller see. I expect operation cancel.
  • Operation is canceled due to worker shutdown mid-gRPC call in handler, what does caller see. I expect retryable error.
  • Handler implementer has their own context and calls cancel on it mid-gRPC call in handler, what does caller see. I expect retryable error (I think?).
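Below is a rough, self-contained sketch of the eager ctx.Err() check described in the comment above; handleStart and convert are hypothetical names, and this is not the SDK's actual code. The idea: if the handler's own context is already done (caller cancel, deadline, or worker shutdown wired through that context), that error is surfaced directly instead of translating whatever the in-flight gRPC call returned.

```go
package main

import (
	"context"
	"errors"
	"fmt"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// handleStart sketches the "eager ctx.Err() check": when a downstream gRPC
// call fails, first look at the handler's own context. If it is already done,
// surface that directly; only otherwise fall back to translating the error.
func handleStart(ctx context.Context, call func(context.Context) error) error {
	err := call(ctx)
	if err == nil {
		return nil
	}
	if ctxErr := ctx.Err(); ctxErr != nil {
		return ctxErr // handler context cancellation wins over translation
	}
	return convert(err)
}

// convert stands in for the SDK's gRPC-status-to-Nexus-error translation.
func convert(err error) error {
	return fmt.Errorf("translated (retryable) handler error: %w", err)
}

func main() {
	// Downstream RPC returned gRPC Canceled, but the handler context is fine:
	// the error is translated (and, per this PR, would be retryable).
	fmt.Println(handleStart(context.Background(), func(context.Context) error {
		return status.Error(codes.Canceled, "rpc canceled")
	}))

	// The handler context itself was canceled: the context error is returned.
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	fmt.Println(handleStart(ctx, func(context.Context) error {
		return errors.New("rpc aborted")
	}))
}
```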

bergundy (Member, Author):

> I think the confusion may be the fact that Nexus chose to split application error into bad request and internal just based on retryability when really just because an error is "retryable" doesn't mean it's "internal".

Agree, we may need more statuses to better express more scenarios.

> Caller of operation cancels operation mid-gRPC call in handler, what does caller see. I expect operation cancel.

What do you mean by "cancels operation"? Are you referring to the CancelOperation RPC from the Nexus spec?
If so, yes, that is only applicable to async operations (support for sync operations, and for canceling operations that haven't started yet, is coming soon). If you're referring to canceling the RPC, the caller would get a ctx.Err() and may choose to handle it however they see fit. They probably don't want to consider the operation canceled though; once the first request is sent, it's important to see the operation through and not abandon it (unless the caller explicitly wants to abandon it).

> Operation is canceled due to worker shutdown mid-gRPC call in handler, what does caller see. I expect retryable error.

Seems like you're referring to the RPC here; yes, the RPC will return a retryable error with this change.

> Handler implementer has their own context and calls cancel on it mid-gRPC call in handler, what does caller see. I expect retryable error (I think?).

That would result in either context.Canceled or a gRPC Canceled status error; both of these are retryable, and should be, IMHO (this is the same case as above).
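For reference, here is a small self-contained helper, assuming only the standard library and grpc-go, that recognizes both forms mentioned here: Go's context.Canceled and a gRPC status error with code Canceled. The isCanceled name is illustrative, not part of the SDK.

```go
package main

import (
	"context"
	"errors"
	"fmt"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// isCanceled reports whether err is Go's context.Canceled or a gRPC error
// carrying the Canceled status code, the two forms a canceled in-handler RPC
// can take.
func isCanceled(err error) bool {
	if errors.Is(err, context.Canceled) {
		return true
	}
	st, ok := status.FromError(err)
	return ok && st.Code() == codes.Canceled
}

func main() {
	fmt.Println(isCanceled(context.Canceled))                         // true
	fmt.Println(isCanceled(status.Error(codes.Canceled, "canceled"))) // true
	fmt.Println(isCanceled(errors.New("something else")))             // false
}
```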

bergundy (Member, Author):

Anything blocking this PR now that we've clarified the behavior?

bergundy (Member, Author):

I'm going to potentially revisit this auto error translation soon. I think these are good defaults, but we'll see how this behaves in the wild. We may need to allow users to opt out of this, or maybe only do auto error translation for temporalnexus primitives like ExecuteWorkflow and others that will come later.

bergundy enabled auto-merge (squash) on October 25, 2024 at 00:05.
bergundy merged commit c0a1b59 into temporalio:master on Oct 25, 2024; 13 checks passed.
bergundy deleted the nexus-grpc-canceled-to-internal branch on October 25, 2024 at 00:31.