The blocking nature of Concurrency API's campaign call makes it inconsistent with authentication token expiration #17623

Open
4 tasks done
jcferretti opened this issue Mar 21, 2024 · 4 comments

jcferretti commented Mar 21, 2024

Bug report criteria

What happened?

Recipe: Use the election support in the concurrency API via the campaign call from two processes. One becomes leader; the other is a follower, blocked in its campaign call. Wait 5 minutes, the regular period for an auth token to expire, then kill the leader process. The follower's campaign call fails with "etcdserver: invalid auth token".
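
To make the timing in the recipe concrete, here is a minimal Go sketch of it. It is a toy model, not etcd code: `tokenTTL`, `tokenValid` and `campaignOutcome` are hypothetical names, and the 5-minute TTL and the "nothing refreshes the token while the call is blocked" behavior are assumptions taken from the description above.

```go
package main

import (
	"fmt"
	"time"
)

// tokenTTL is an assumption for illustration; the issue only says that the
// token expires after roughly five minutes.
const tokenTTL = 5 * time.Minute

// tokenValid reports whether a token last used at lastUsed is still accepted
// at now, assuming nothing refreshes it while the campaign call is blocked.
func tokenValid(lastUsed, now time.Time) bool {
	return now.Sub(lastUsed) < tokenTTL
}

// campaignOutcome models a follower whose campaign call has been blocked for
// blockedSeconds at the moment the leader goes away.
func campaignOutcome(blockedSeconds int) string {
	start := time.Unix(0, 0)
	now := start.Add(time.Duration(blockedSeconds) * time.Second)
	if tokenValid(start, now) {
		return "elected"
	}
	return "etcdserver: invalid auth token"
}

func main() {
	fmt.Println(campaignOutcome(60))  // leader killed after 1 minute
	fmt.Println(campaignOutcome(310)) // leader killed after 310 seconds, as in the logs
}
```

In this model the follower is elected when the leader dies within the TTL and gets the auth error once it has been blocked longer than that, which is the inconsistency reported here.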

What did you expect to happen?

The campaign call is designed to block indefinitely in the server for clients that are not the elected leader but are followers waiting for the chance to become one. I don't expect a campaign call that was able to start, because its auth token was valid at that time, to fail at the moment it would return because the follower is about to become leader. Retrying the call succeeds, but this defeats the purpose of an API intended for implementing fault tolerance.

How can we reproduce it (as minimally and precisely as possible)?

This can be reproduced with etcdctl exactly as in the example given in https://etcd.io/docs/v3.6/tutorials/how-to-conduct-elections/, when used together with authentication. Just wait 5 minutes before killing the original leader. etcdctl will log the retry and the follower will become leader. This issue writeup argues that having to retry in the first place is a bug.

Note, however, that etcdctl uses the Go client, and the Go client implements the campaign call directly in client code rather than calling the server through the gRPC concurrency API. Other clients, e.g. jetcd for Java, do call the gRPC API; that is how my team uses it and how we ran into this issue. This is a separate problem: with two implementations, even if the semantics are the same, the details of client behavior differ. For example, in a Go client like etcdctl a follower waiting in campaign is blocked in a watcher (effectively a gRPC streaming RPC to the server), while a Java client using jetcd is blocked in a unary call. This creates at least two potential sources of problems: (1) subtle differences in behavior, e.g. a retry policy on unary calls will affect the two differently; (2) since the Go client is likely by far the most used client, the concurrency gRPC API is not as well tested in the wild.

Anything else we need to know?

(1) Retrying the failed call does work, and it can be configured to happen automatically, for instance in the Java client via client builder parameters for a retry policy. However, it is not possible to configure a retry policy that makes sense both for every other unary call and for campaign: campaign is the only call that can block and wait in the server indefinitely, so a max retry duration that applies to campaign has to be effectively infinite. This forces us to handle the retry manually in this case.
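
The manual handling described in (1) could look roughly like the following Go sketch. It uses only the standard library; `campaignWithReauth`, `errInvalidAuthToken` and the `campaign`/`reauth` callbacks are hypothetical stand-ins, not part of any etcd or jetcd API. The idea is to bound the number of re-authentications rather than the total duration, since the campaign call itself may legitimately block for an unbounded time.

```go
package main

import (
	"errors"
	"fmt"
)

// errInvalidAuthToken is a hypothetical stand-in for the server's
// "etcdserver: invalid auth token" error.
var errInvalidAuthToken = errors.New("etcdserver: invalid auth token")

// campaignWithReauth runs campaign and, only when it fails with an invalid
// auth token, re-authenticates and tries again. It performs at most
// maxReauths re-authentications and returns the number of campaign attempts.
func campaignWithReauth(campaign, reauth func() error, maxReauths int) (int, error) {
	attempts := 0
	for {
		attempts++
		err := campaign()
		if !errors.Is(err, errInvalidAuthToken) {
			return attempts, err // success (nil) or an unrelated error
		}
		if attempts > maxReauths {
			return attempts, err // give up; credentials are probably broken
		}
		if rerr := reauth(); rerr != nil {
			return attempts, rerr
		}
	}
}

func main() {
	calls := 0
	campaign := func() error {
		calls++
		if calls == 1 {
			return errInvalidAuthToken // token expired while the call was blocked
		}
		return nil
	}
	reauth := func() error { return nil } // pretend a fresh token was obtained
	attempts, err := campaignWithReauth(campaign, reauth, 3)
	fmt.Println(attempts, err)
}
```

Other unary calls can keep an ordinary bounded retry policy; only campaign goes through this wrapper.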

(2) The implementation of campaign consists of a transactional put followed by a for loop that watches and then does a get. It seems that the get inside the loop, issued after the watcher detects that a key was deleted, is the call that fails with the invalid auth token. In the Go implementation, where the client runs all this logic directly, my guess is that this get is the call that gets retried when an auth interceptor is registered. For clients using the concurrency gRPC API (e.g. jetcd/Java), the whole campaign call fails. Even with manual retries in the client per (1), which is what my team is doing now, this seems overly wasteful given the time budget for the failover implementation in which campaign is used.
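
The structure described in (2) can be illustrated with a toy in-memory model in Go. This is not the etcd implementation: the `election` type, its methods and the `waitDelete` callback are hypothetical, and the failure is injected by hand to mimic a token that expired while the follower was waiting.

```go
package main

import (
	"errors"
	"fmt"
)

var errInvalidAuthToken = errors.New("etcdserver: invalid auth token")

// election is a toy stand-in for the keys under an election prefix:
// candidate name -> create revision.
type election struct {
	rev      int64
	keys     map[string]int64
	getFails int // number of upcoming get calls that fail with an auth error
}

func newElection() *election { return &election{keys: map[string]int64{}} }

// put models the transactional put: register the candidate once and return
// its create revision.
func (e *election) put(name string) int64 {
	if r, ok := e.keys[name]; ok {
		return r
	}
	e.rev++
	e.keys[name] = e.rev
	return e.rev
}

// predecessor models the get: the candidate with the highest create revision
// below maxRev, if there is one.
func (e *election) predecessor(maxRev int64) (string, bool, error) {
	if e.getFails > 0 {
		e.getFails--
		return "", false, errInvalidAuthToken
	}
	best, bestRev := "", int64(0)
	for name, r := range e.keys {
		if r < maxRev && r > bestRev {
			best, bestRev = name, r
		}
	}
	return best, best != "", nil
}

// campaign is the put followed by the get/watch loop. waitDelete stands in
// for the watch and returns once the given key has been deleted.
func (e *election) campaign(name string, waitDelete func(key string)) error {
	myRev := e.put(name)
	for {
		prev, ok, err := e.predecessor(myRev)
		if err != nil {
			return err // the whole campaign fails, as seen through the gRPC API
		}
		if !ok {
			return nil // no earlier candidate left: elected
		}
		waitDelete(prev)
	}
}

func main() {
	e := newElection()
	e.put("p1") // p1 is the current leader
	// While p2 waits, p1 goes away and p2's token expires.
	err := e.campaign("p2", func(key string) { delete(e.keys, key); e.getFails = 1 })
	fmt.Println("first attempt:", err)
	fmt.Println("retry:", e.campaign("p2", func(string) {}))
}
```

In this model the first attempt fails at the get that follows the wait, and a retry of the whole call succeeds. A client that runs the loop itself could retry only that get.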

Etcd version (please run commands below)

$ etcd --version
etcd Version: 3.5.5
Git SHA: 19002cfc6
Go Version: go1.16.15
Go OS/Arch: linux/amd64

$ etcdctl version  
etcdctl version: 3.5.5
API version: 3.5

Etcd configuration (command line flags or environment variables)

$ cat /etc/etcd/dh/c8e0d1040/config.yaml
name: etcd-1

data-dir: /var/lib/etcd/dh/c8e0d1040

listen-client-urls: https://10.128.1.172:2379,https://localhost:2379
listen-peer-urls: https://10.128.1.172:2380,https://localhost:2380

advertise-client-urls: https://10.128.1.172:2379
initial-advertise-peer-urls: https://10.128.1.172:2380

initial-cluster-state: new
initial-cluster-token: c8e0d1040
initial-cluster: etcd-1=https://10.128.1.172:2380

strict-reconfig-check: true
enable-v2: false

peer-transport-security:
  # the peer certs need to have CN=peer.etcd.deephaven.local
  cert-allowed-cn: peer.etcd.deephaven.local
  client-cert-auth: true
  trusted-ca-file: /etc/etcd/dh/c8e0d1040/ssl/peer/ca.crt
  cert-file: /etc/etcd/dh/c8e0d1040/ssl/peer/etcd-1.public.crt
  key-file: /etc/etcd/dh/c8e0d1040/ssl/peer/etcd-1.private.key

client-transport-security:
  # clients are encrypted, but not authenticated - we rely on username / password for auth
  client-cert-auth: false
  # don't need to set CA since we aren't authenticating clients
  # trusted-ca-file: /etc/etcd/dh/c8e0d1040/ssl/server/ca.crt
  cert-file: /etc/etcd/dh/c8e0d1040/ssl/server/etcd-1.public.crt
  key-file: /etc/etcd/dh/c8e0d1040/ssl/server/etcd-1.private.key

auto-compaction-mode: periodic
auto-compaction-retention: "168"

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

$ DH_ETCD_DIR=/etc/sysconfig/illumon.d/etcd/client/root etcdctl.sh member list -w table
+------------------+---------+--------+---------------------------+---------------------------+------------+
|        ID        | STATUS  |  NAME  |        PEER ADDRS         |       CLIENT ADDRS        | IS LEARNER |
+------------------+---------+--------+---------------------------+---------------------------+------------+
| d9ef9f5e14be6e82 | started | etcd-1 | https://10.128.1.172:2380 | https://10.128.1.172:2379 |      false |
+------------------+---------+--------+---------------------------+---------------------------+------------+


$ DH_ETCD_DIR=/etc/sysconfig/illumon.d/etcd/client/root etcdctl.sh endpoint status -w table
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|         ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://10.128.1.172:2379 | d9ef9f5e14be6e82 |   3.5.5 |  6.8 MB |      true |      false |         2 |       2851 |               2851 |        |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

Relevant log output

The command line sessions captured below were running concurrently.

$ # Process #1 in machine #1
$ # The DH_ETCD_DIR env var below sets up some role for authorization in a script we use to wrap etcdctl.
$ echo = Starting at $(date); DH_ETCD_DIR=/etc/sysconfig/illumon.d/etcd/client/root etcdctl.sh elect one p1 & sleep 310; echo = Kill at $(date); kill %1
= Starting at Thu Mar 21 01:14:20 EDT 2024
[1] 86733
one/6e828e5ee0a9218a
p1
= Kill at Thu Mar 21 01:19:30 EDT 2024

................................................................................................................................................................................

$ # Process #2 in machine #2
$ # The DH_ETCD_DIR env var below sets up some role for authorization in a script we use to wrap etcdctl.
$ echo = Starting at $(date); DH_ETCD_DIR=/etc/sysconfig/illumon.d/etcd/client/root etcdctl.sh elect one p2
= Starting at Thu Mar 21 01:14:21 EDT 2024
{"level":"warn","ts":"2024-03-21T01:19:30.873-0400","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0003aac40/10.128.1.172:2379","attempt":0,"error":"rpc error: code = Unauthenticated desc = etcdserver: invalid auth token"}
one/6e828e5ee0a92193
p2

ahrtr commented Mar 24, 2024

Thanks for raising this issue,

@jcferretti

> * Actually it isn't recommended to use the [v3election](https://github.com/etcd-io/etcd/blob/main/server/etcdserver/api/v3election/v3electionpb/v3election.proto) API

I see. I think it would help if that were noted clearly in the documentation and in comments in the source. Or, even better, if the API were clearly marked as deprecated.


jcferretti commented Mar 24, 2024

> Thanks for raising this issue,
>
> * Actually it isn't recommended to use the [v3election](https://github.com/etcd-io/etcd/blob/main/server/etcdserver/api/v3election/v3electionpb/v3election.proto) API.

Thinking out loud, I think it is likely that v3lock has the same issue (i.e., the other part of the concurrency gRPC API in https://etcd.io/docs/v3.2/dev-guide/api_concurrency_reference_v3/), so taking a look and confirming may be a good idea; if so, that may need a note in the docs too. I don't have the code in front of me right now, but I seem to remember that the client-side code called by the server also uses the waitDeletes method.


ahrtr commented Mar 25, 2024

> I think it is likely that v3lock has the same issue

Yes, that should be true based on a quick triage. It would be great if you or anyone else could confirm it by reproducing the two issues (this one and #17502). So the same proposal applies as in #17502 (comment).
