The blocking nature of Concurrency API's campaign call makes it inconsistent with authentication token expiration #17623

jcferretti · 2024-03-21T04:31:32Z

Bug report criteria

This bug report is not security related, security issues should be disclosed privately via etcd maintainers.
This is not a support request or question, support requests or questions should be raised in the etcd discussion forums.
You have read the etcd bug reporting guidelines.
Existing open issues along with etcd frequently asked questions have been checked and this is not a duplicate.

What happened?

Recipe: Use the election support in concurrency via the campaign call from two processes. One becomes leader; the other is a follower, blocked in its campaign call. Wait 5 minutes, the regular period for an auth token to expire. Then kill the leader process: the follower's campaign call fails with "etcdserver: invalid auth token".
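As an illustration only, the sequence in the recipe can be reduced to a toy model. This is not etcd code: `ToyServer`, `FakeClock`, and the 300-second TTL are assumptions made for the sketch (the TTL is taken from the 5-minute wait above), and the model simply checks the token once when the call starts and once when the blocked call is about to return.

```python
"""Toy model: a token that is valid when a blocking call starts can expire
while the call is blocked. Names and the TTL value are hypothetical."""

TOKEN_TTL_SECONDS = 300  # assumed, matching the 5-minute wait in the recipe


class InvalidAuthToken(Exception):
    """Stands in for "etcdserver: invalid auth token"."""


class FakeClock:
    """Manually advanced clock so the example runs instantly."""

    def __init__(self):
        self.now = 0.0

    def advance(self, seconds):
        self.now += seconds


class ToyServer:
    """Hypothetical server that issues tokens with a fixed lifetime."""

    def __init__(self, clock):
        self.clock = clock
        self.expiry = {}  # token -> time at which it stops being valid

    def authenticate(self):
        token = f"token-{len(self.expiry)}"
        self.expiry[token] = self.clock.now + TOKEN_TTL_SECONDS
        return token

    def check(self, token):
        if self.clock.now >= self.expiry.get(token, 0):
            raise InvalidAuthToken("invalid auth token")

    def campaign(self, token, blocked_for):
        self.check(token)                # token is valid when the call starts
        self.clock.advance(blocked_for)  # follower waits for the leader to go away
        self.check(token)                # same token is checked again before returning
        return "leader"


if __name__ == "__main__":
    clock = FakeClock()
    server = ToyServer(clock)
    print(server.campaign(server.authenticate(), blocked_for=10))
    try:
        server.campaign(server.authenticate(), blocked_for=310)
    except InvalidAuthToken as err:
        print("campaign failed:", err)
```

In this model a follower blocked for 10 seconds is elected normally, while one blocked for 310 seconds fails on the second check even though nothing was wrong when the call started.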

What did you expect to happen?

The campaign call is designed to block indefinitely in the server for clients that are not elected leader and are instead followers waiting for the chance to become leader. I don't expect a campaign call that was able to start, because its auth token was valid at that time, to fail at the moment it is about to return because the follower is becoming leader. Retrying the call succeeds, but this defeats the purpose of an API that is intended for implementing fault tolerance.

How can we reproduce it (as minimally and precisely as possible)?

This can be reproduced with etcdctl, exactly as in the example given in https://etcd.io/docs/v3.6/tutorials/how-to-conduct-elections/, with authentication enabled. Just wait 5 minutes before killing the original leader. etcdctl will log the retry and the follower will become leader. This issue writeup argues that having to retry in the first place is a bug.

Note, however, that etcdctl uses the go client, and the go client implements the campaign call directly in client code instead of making a server call on the gRPC concurrency API. Other clients, eg jetcd for java, do make that server call; that is how my team is using this and how we ran into this issue. This is a separate problematic point: with two implementations, even if the semantics are the same, the details of client behavior differ. Eg, in a go client like etcdctl, a follower waiting on a call to campaign is blocked in a watcher (effectively a streaming gRPC call to the server), while a java client using jetcd is blocked in a unary call. This creates at least two potential sources of problems: (1) subtle differences in behavior, eg, a retry policy on unary calls will affect them differently; (2) since the go client is likely to be by far the most used client, the concurrency gRPC API is not as well tested in the wild.
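To make the retry-policy point concrete, here is a minimal sketch of a retry wrapper with a time budget measured from the start of the first attempt. It does not reproduce jetcd's or the go client's retry configuration; `call_with_retry`, `reauthenticate`, and `max_duration` are hypothetical names chosen for the example.

```python
import time


class InvalidAuthToken(Exception):
    """Stands in for "etcdserver: invalid auth token"."""


def call_with_retry(call, reauthenticate, max_duration=None, clock=time.monotonic):
    """Run call(token); on InvalidAuthToken, fetch a fresh token and retry.

    max_duration is a budget in seconds counted from the first attempt.
    None means "retry without a time limit".
    """
    start = clock()
    token = reauthenticate()
    while True:
        try:
            return call(token)
        except InvalidAuthToken:
            # A call that blocked for minutes has already used up any short
            # budget by the time it fails, so it is never retried.
            if max_duration is not None and clock() - start >= max_duration:
                raise
            token = reauthenticate()
```

With a budget of a few seconds, which is reasonable for an ordinary unary call, a campaign that fails after blocking for five minutes is rejected on its first failure. Only an unlimited budget (`max_duration=None`) lets it retry, which is not a setting one would want for every other unary call.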

Anything else we need to know?

(1) While retrying the failed call does work, and retries can be configured to happen automatically (for instance, in the java client via client builder parameters for a retry policy), it is not possible to configure a retry policy that makes sense both for every other unary call and for campaign. campaign is the only call that can block and wait in the server indefinitely, so a max duration for retries that can apply to campaign has to be effectively infinite. This forces us to handle the retry manually in this case.

(2) The implementation of campaign consists of a transaction put followed by a sequence (for loop) of watching and doing get. It seems that the get call inside the loop, issued after the watcher detects that a key was deleted, is the one that fails with invalid auth token. In the go implementation, where the client implements all this logic directly, that get call is the one that will be retried (is my guess) when an auth interceptor is registered. For clients using the concurrency gRPC API (eg, jetcd/java), the whole campaign call fails. Even if one manually implements retries in the client per (1), which is what my team is doing now, this seems overly wasteful given the limited time budget of the failover implementation in which the campaign call is used.
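The difference described in (2) can be sketched with a toy in-memory model. It assumes the campaign logic is a put followed by a loop that does a get and then waits for an older key to be deleted. `ToyStore`, `campaign`, and `retry_failed_get` are hypothetical names; the real implementations involve leases, revisions, and watchers that are not modeled here.

```python
class InvalidAuthToken(Exception):
    """Stands in for "etcdserver: invalid auth token"."""


class ToyStore:
    """Hypothetical in-memory store; every operation checks the token."""

    def __init__(self):
        self.keys = {}             # key -> create revision
        self.revision = 0
        self.issued = 0
        self.valid_tokens = set()
        self.ops = []              # operations that passed the token check

    def new_token(self):
        self.issued += 1
        token = f"token-{self.issued}"
        self.valid_tokens.add(token)
        return token

    def expire_all_tokens(self):
        self.valid_tokens.clear()

    def _check(self, token, op):
        if token not in self.valid_tokens:
            raise InvalidAuthToken(op)
        self.ops.append(op)

    def put_if_absent(self, token, key):
        self._check(token, "put")
        if key not in self.keys:
            self.revision += 1
            self.keys[key] = self.revision
        return self.keys[key]

    def older_keys(self, token, my_revision):
        self._check(token, "get")
        older = [k for k, rev in self.keys.items() if rev < my_revision]
        return sorted(older, key=self.keys.get)

    def delete(self, key):
        self.keys.pop(key, None)


def campaign(store, key, wait_for_delete, retry_failed_get):
    """Put our key, then loop: get older candidates, wait for the newest to go."""
    token = store.new_token()
    my_revision = store.put_if_absent(token, key)
    while True:
        try:
            older = store.older_keys(token, my_revision)
        except InvalidAuthToken:
            if not retry_failed_get:
                raise                  # the whole campaign call fails
            token = store.new_token()  # refresh the token and redo only the get
            continue
        if not older:
            return my_revision         # no older candidate left: elected
        wait_for_delete(older[-1])     # may block for a long time
```

If `wait_for_delete` outlives the token, the variant with `retry_failed_get=True` repeats only the get, while the variant with `retry_failed_get=False` surfaces the error and the caller has to start the campaign over, repeating the put as well.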

Etcd version (please run commands below)

$ etcd --version
etcd Version: 3.5.5
Git SHA: 19002cfc6
Go Version: go1.16.15
Go OS/Arch: linux/amd64

$ etcdctl version  
etcdctl version: 3.5.5
API version: 3.5

Etcd configuration (command line flags or environment variables)

$ cat /etc/etcd/dh/c8e0d1040/config.yaml
name: etcd-1

data-dir: /var/lib/etcd/dh/c8e0d1040

listen-client-urls: https://10.128.1.172:2379,https://localhost:2379
listen-peer-urls: https://10.128.1.172:2380,https://localhost:2380

advertise-client-urls: https://10.128.1.172:2379
initial-advertise-peer-urls: https://10.128.1.172:2380

initial-cluster-state: new
initial-cluster-token: c8e0d1040
initial-cluster: etcd-1=https://10.128.1.172:2380

strict-reconfig-check: true
enable-v2: false

peer-transport-security:
  # the peer certs need to have CN=peer.etcd.deephaven.local
  cert-allowed-cn: peer.etcd.deephaven.local
  client-cert-auth: true
  trusted-ca-file: /etc/etcd/dh/c8e0d1040/ssl/peer/ca.crt
  cert-file: /etc/etcd/dh/c8e0d1040/ssl/peer/etcd-1.public.crt
  key-file: /etc/etcd/dh/c8e0d1040/ssl/peer/etcd-1.private.key

client-transport-security:
  # clients are encrypted, but not authenticated - we rely on username / password for auth
  client-cert-auth: false
  # don't need to set CA since we aren't authenticating clients
  # trusted-ca-file: /etc/etcd/dh/c8e0d1040/ssl/server/ca.crt
  cert-file: /etc/etcd/dh/c8e0d1040/ssl/server/etcd-1.public.crt
  key-file: /etc/etcd/dh/c8e0d1040/ssl/server/etcd-1.private.key

auto-compaction-mode: periodic
auto-compaction-retention: "168"

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

$ DH_ETCD_DIR=/etc/sysconfig/illumon.d/etcd/client/root etcdctl.sh member list -w table
+------------------+---------+--------+---------------------------+---------------------------+------------+
|        ID        | STATUS  |  NAME  |        PEER ADDRS         |       CLIENT ADDRS        | IS LEARNER |
+------------------+---------+--------+---------------------------+---------------------------+------------+
| d9ef9f5e14be6e82 | started | etcd-1 | https://10.128.1.172:2380 | https://10.128.1.172:2379 |      false |
+------------------+---------+--------+---------------------------+---------------------------+------------+


$ DH_ETCD_DIR=/etc/sysconfig/illumon.d/etcd/client/root etcdctl.sh endpoint status -w table
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|         ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://10.128.1.172:2379 | d9ef9f5e14be6e82 |   3.5.5 |  6.8 MB |      true |      false |         2 |       2851 |               2851 |        |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

Relevant log output

The command line sessions captured below were running concurrently.

$ # Process #1 in machine #1
$ # The DH_ETCD_DIR env var below sets up some role for authorization in a script we use to wrap etcdctl.
$ echo = Starting at $(date); DH_ETCD_DIR=/etc/sysconfig/illumon.d/etcd/client/root etcdctl.sh elect one p1 & sleep 310; echo = Kill at $(date); kill %1
= Starting at Thu Mar 21 01:14:20 EDT 2024
[1] 86733
one/6e828e5ee0a9218a
p1
= Kill at Thu Mar 21 01:19:30 EDT 2024

................................................................................................................................................................................

$ # Process #2 in machine #2
$ # The DH_ETCD_DIR env var below sets up some role for authorization in a script we use to wrap etcdctl.
$ echo = Starting at $(date); DH_ETCD_DIR=/etc/sysconfig/illumon.d/etcd/client/root etcdctl.sh elect one p2
= Starting at Thu Mar 21 01:14:21 EDT 2024
{"level":"warn","ts":"2024-03-21T01:19:30.873-0400","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0003aac40/10.128.1.172:2379","attempt":0,"error":"rpc error: code = Unauthenticated desc = etcdserver: invalid auth token"}
one/6e828e5ee0a92193
p2


ahrtr · 2024-03-24T19:02:35Z

Thanks for raising this issue.

Actually it isn't recommended to use the v3election API; please refer to the discussion in "Fouled Authentication when using Concurrency/Election by multiple clients" #17502.
I think we should try to fix the java SDK (following logic similar to the golang SDK's). Could you raise an issue in jetcd? cc @lburgazzoli

jcferretti · 2024-03-24T21:33:10Z

* Actually it isn't recommended to use the [v3election](https://github.com/etcd-io/etcd/blob/main/server/etcdserver/api/v3election/v3electionpb/v3election.proto) API

I see. I think it would help if that were noted clearly in the documentation and in comments in the source. Or, even better, if the API were clearly marked as deprecated.

jcferretti · 2024-03-24T21:49:49Z

Thanks for raising this issue,

* Actually it isn't recommended to use the [v3election](https://github.com/etcd-io/etcd/blob/main/server/etcdserver/api/v3election/v3electionpb/v3election.proto) API.

Thinking out loud: I think it is likely that v3lock has the same issue (ie, the other part of the concurrency gRPC API in https://etcd.io/docs/v3.2/dev-guide/api_concurrency_reference_v3/), so taking a look to confirm may be a good idea; if so, that may need a note in the docs too. I don't have the code in front of me right now, but I seem to remember that the client-side code called by the server also uses the waitDeletes method.

ahrtr · 2024-03-25T05:37:57Z

I think is likely v3lock also has the same issue

Yes, that should be true based on quick triaging. It would be great if you or anyone else could confirm it by reproducing the two issues (this one and #17502). So my proposal is the same as in #17502 (comment).

jcferretti added the type/bug label Mar 21, 2024

jcferretti mentioned this issue Mar 26, 2024

Fouled Authentication when using Concurrency/Election by multiple clients #17502


jcferretti mentioned this issue Jun 17, 2024

Election.Observe may report the wrong leader #18163

