The blocking nature of Concurrency API's campaign call makes it inconsistent with authentication token expiration #17623
Comments
Thanks for raising this issue.
I see. I think it would help if that were noted clearly in the documentation and in comments in the source, or, even better, if the API were clearly marked as deprecated.
Thinking out loud, it is likely that v3lock has the same issue (i.e., the other part of the concurrency gRPC API in https://etcd.io/docs/v3.2/dev-guide/api_concurrency_reference_v3/), so taking a look and confirming may be a good idea; if so, that may need a doc note too. I don't have the code in front of me right now, but I seem to remember that the client-side code called by the server also uses the waitDeletes method.
Yes, that should be true based on quick triaging. It would be great if you or anyone else could confirm it by reproducing both issues (this one and #17502). So the same proposal applies as in #17502 (comment).
Bug report criteria
What happened?
Recipe: use the election support in the concurrency API via the campaign call from two processes. One becomes the leader; the other is a follower blocked on the campaign call. Wait 5 minutes for the regular auth token lifetime to expire, then kill the leader process. The follower's campaign call fails with "etcdserver: Invalid auth token".
What did you expect to happen?
The campaign call is designed to block indefinitely in the server for clients that are not elected leader but are followers waiting for the possibility of becoming leader. I don't expect a campaign call that started successfully, because its auth token was valid at the time, to fail at the very moment it returns because the follower is about to become the leader. Retrying the call succeeds, but having to retry defeats the purpose of an API intended for implementing fault tolerance.
How can we reproduce it (as minimally and precisely as possible)?
This can be reproduced with etcdctl exactly as in the example at https://etcd.io/docs/v3.6/tutorials/how-to-conduct-elections/, combined with authentication. Just wait 5 minutes before killing the original leader. etcdctl will log the retry and the follower will become leader. This writeup argues that having to retry in the first place is a bug.
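For completeness, here is a minimal programmatic sketch of the same reproduction using the official Go client's concurrency package. The endpoint, credentials, and election prefix are placeholders, and the 5-minute figure assumes the default auth token TTL; this is illustrative only, not taken from the reporter's environment.

```go
package main

import (
	"context"
	"log"

	clientv3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/client/v3/concurrency"
)

func main() {
	// Placeholder endpoint and credentials; adjust for the cluster under test.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints: []string{"localhost:2379"},
		Username:  "root",
		Password:  "rootpw",
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// A session keeps a lease alive for the election key.
	s, err := concurrency.NewSession(cli)
	if err != nil {
		log.Fatal(err)
	}
	defer s.Close()

	e := concurrency.NewElection(s, "/my-election/")

	// Run this program twice: the first instance returns immediately as leader,
	// the second blocks here. Kill the leader more than 5 minutes later (after
	// the auth token has expired) and observe this call fail with an auth token
	// error instead of returning leadership.
	if err := e.Campaign(context.Background(), "candidate"); err != nil {
		log.Fatalf("campaign failed: %v", err)
	}
	log.Println("became leader")
}
```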
Note, however, that etcdctl uses the Go client, and the Go client implements the campaign call directly in client code rather than making a server call on the gRPC concurrency API, which is what other clients (e.g., jetcd from Java) do, and which is how my team is using this and ran into the issue. This is a separate problematic point: with two implementations, even if the semantics are the same, the details of client behavior differ. For example, in a Go client like etcdctl a follower waiting on a campaign call is blocked in a watcher (effectively a streaming gRPC call to the server), while a Java client using jetcd is blocked in a unary call. This creates at least two potential sources of problems: (1) subtle differences in behavior, e.g., a retry policy on unary calls affects the two cases differently; (2) since the Go client is likely by far the most used client, the concurrency gRPC API is not as well tested in the wild.
Anything else we need to know?
(1) While retrying the failed call does work, and retries can be configured automatically (for instance in the Java client via client builder parameters for a retry policy), it is not possible to configure a single retry policy that is sensible for every other unary call and also for campaign: campaign is the only call that can block in the server indefinitely, so any maximum retry duration that applies to campaign has to be effectively infinite. This forces us to handle the retry manually in this case; a sketch of such a retry loop appears after these notes.
(2) Given that the implementation of campaign consists of a transactional put followed by a loop of watching and issuing a get, it seems that the get issued inside the loop, after the watcher detects that a key has been deleted, is the call that fails with the invalid auth token. In the Go implementation, where the client implements all of this logic directly, that get call is (I am guessing) the one that gets retried when an auth interceptor is registered. For clients using the concurrency gRPC API (e.g., jetcd/Java), the whole campaign call fails, which, even if one manually implements retries in the client per (1) (which is what my team is doing now), seems overly wasteful given the time budget within which the failover built on the campaign call must complete. A simplified sketch of this flow follows below.
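To make point (2) concrete, below is a simplified, hypothetical paraphrase of the client-side campaign flow; it is not the actual etcd source, and names like campaignSketch and waitForDelete are invented for illustration. It shows why the failure surfaces on the get inside the wait loop: the initial transaction runs while the token is still fresh, but the get runs only after the watcher fires, possibly long after the token has expired.

```go
package election

import (
	"context"
	"fmt"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// campaignSketch is a simplified stand-in for the client-side Campaign logic:
// a transactional put of the candidate key, then a loop that looks for older
// candidates and waits for their keys to be deleted.
func campaignSketch(ctx context.Context, cli *clientv3.Client, pfx, val string, leaseID clientv3.LeaseID) error {
	key := fmt.Sprintf("%s%x", pfx, leaseID)

	// 1. Put our candidate key under the session lease (the auth token is
	//    still valid at this point, so this step succeeds).
	resp, err := cli.Txn(ctx).
		If(clientv3.Compare(clientv3.CreateRevision(key), "=", 0)).
		Then(clientv3.OpPut(key, val, clientv3.WithLease(leaseID))).
		Commit()
	if err != nil {
		return err
	}
	myRev := resp.Header.Revision

	// 2. Wait until no candidate key with a smaller create revision remains.
	for {
		// This get may run minutes later, right after the previous leader's
		// key is deleted; if the auth token expired in the meantime, this is
		// the call that fails with "invalid auth token".
		opts := append(clientv3.WithLastCreate(), clientv3.WithMaxCreateRev(myRev-1))
		gresp, err := cli.Get(ctx, pfx, opts...)
		if err != nil {
			return err
		}
		if len(gresp.Kvs) == 0 {
			return nil // no older candidates left: we are the leader
		}
		// Block on a watch until the predecessor key is deleted, then loop.
		waitForDelete(ctx, cli, string(gresp.Kvs[0].Key), gresp.Header.Revision)
	}
}

// waitForDelete is a hypothetical helper that blocks until the given key is
// deleted (or ctx ends), mirroring the waitDelete step in the real client.
func waitForDelete(ctx context.Context, cli *clientv3.Client, key string, rev int64) {
	for wresp := range cli.Watch(ctx, key, clientv3.WithRev(rev)) {
		for _, ev := range wresp.Events {
			if ev.Type == clientv3.EventTypeDelete {
				return
			}
		}
	}
}
```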
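And for point (1), here is a rough illustration, not the reporter's actual code, of the kind of manual retry described there, written in Go for consistency with the sketches above even though the team in question uses jetcd. Matching the error text by substring is an assumption made for brevity; a production client would inspect the gRPC status code instead.

```go
package election

import (
	"context"
	"strings"

	"go.etcd.io/etcd/client/v3/concurrency"
)

// campaignWithRetry keeps re-issuing Campaign whenever it fails with an auth
// token error. No finite retry budget suits a call that may legitimately block
// for hours, so the loop only stops on success, cancellation, or an unrelated
// error. The substring match below is an illustrative assumption.
func campaignWithRetry(ctx context.Context, e *concurrency.Election, val string) error {
	for {
		err := e.Campaign(ctx, val)
		if err == nil {
			return nil // we are the leader now
		}
		if ctx.Err() != nil {
			return ctx.Err() // caller cancelled or timed out: stop retrying
		}
		if !strings.Contains(err.Error(), "invalid auth token") {
			return err // unrelated failure: surface it to the caller
		}
		// The token expired while the call was blocked; the next attempt
		// re-authenticates and resumes waiting for the election.
	}
}
```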
Etcd version (please run commands below)
Etcd configuration (command line flags or environment variables)
Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)
Relevant log output
The command line sessions captured below were running concurrently.