Add instructions for 3.5->3.4 downgrade.

Signed-off-by: Siyuan Zhang <[email protected]>
etcd-io · May 13, 2024 · 2f084fc · 2f084fc
1 parent decc347
commit 2f084fc

Showing 3 changed files with 293 additions and 0 deletions.
diff --git a/content/en/docs/v3.5/downgrades/_index.md b/content/en/docs/v3.5/downgrades/_index.md
@@ -0,0 +1,5 @@
+---
+title: Downgrading
+weight: 6000
+description: Downgrading etcd clusters and applications
+---
diff --git a/content/en/docs/v3.5/downgrades/downgrade_3_5.md b/content/en/docs/v3.5/downgrades/downgrade_3_5.md
@@ -0,0 +1,276 @@
+---
+title: Downgrade etcd from 3.5 to 3.4
+weight: 6650
+description: Processes, checklists, and notes on downgrading etcd from 3.5 to 3.4
+---
+
+In the general case, downgrading from etcd 3.5 to 3.4 can be a zero-downtime, rolling downgrade:
+ - one by one, stop the etcd v3.5 processes and replace them with etcd v3.4 processes
+ - after starting any v3.4 processes, new features in v3.5 are no longer available to the cluster
+
+Before [starting a downgrade](#downgrade-procedure), read through the rest of this guide to prepare.
+
+### Downgrade checklists
+
+**NOTE:** If your cluster enables auth, rolling downgrade from 3.5 isn't supported because 3.5 [changes the format of WAL entries related to auth](https://github.com/etcd-io/etcd/pull/11943).
+
+Highlighted breaking changes from 3.5 to 3.4:
+
+#### Difference in flags
+
+If you are using any of the following flags in your v3.5 configurations, make sure to remove, rename, or change the default value when downgrading to v3.4.
+
+```diff
+# flags not available in 3.4
+-etcd --socket-reuse-port
+-etcd --socket-reuse-address
+-etcd --raft-read-timeout
+-etcd --raft-write-timeout
+-etcd --v2-deprecation
+-etcd --client-cert-file
+-etcd --client-key-file
+-etcd --peer-client-cert-file
+-etcd --peer-client-key-file
+-etcd --self-signed-cert-validity
+-etcd --enable-log-rotation --log-rotation-config-json=some.json
+-etcd --experimental-enable-distributed-tracing --experimental-distributed-tracing-address='localhost:4317' --experimental-distributed-tracing-service-name='etcd' --experimental-distributed-tracing-instance-id='' --experimental-distributed-tracing-sampling-rate='0'
+-etcd --experimental-compact-hash-check-enabled --experimental-compact-hash-check-time='1m'
+-etcd --experimental-downgrade-check-time
+-etcd --experimental-memory-mlock
+-etcd --experimental-txn-mode-write-with-shared-buffer
+-etcd --experimental-bootstrap-defrag-threshold-megabytes
+
+# same flag with different names
+-etcd --backend-bbolt-freelist-type=map
++etcd --experimental-backend-bbolt-freelist-type=array
+
+# same flag different defaults
+-etcd --logger=zap
++etcd --logger=capnslog
+```
+
+#### `etcd --logger zap`
+
+v3.4 defaults to `--logger=capnslog`, while v3.5 defaults to `--logger=zap`.
+
+If you want to keep using `zap`, it needs to be explicitly specified.
+
+```diff
++etcd --logger=zap --log-outputs=stderr
+
++# to write logs to stderr and a.log file at the same time
++etcd --logger=zap --log-outputs=stderr,a.log
+```
+
+#### Difference in Prometheus metrics
+
+```diff
+# metrics not available in 3.4
+-etcd_debugging_mvcc_db_compaction_last
+```
+
+### Server downgrade checklists
+
+#### Downgrade requirements
+
+To ensure a smooth rolling downgrade, the running cluster must be healthy. Check the health of the cluster by using the `etcdctl endpoint health` command before proceeding.
+
+The 3.4 version to downgrade to must be >= v3.4.32.
+
+#### Preparation
+
+Before downgrading etcd, always test the services relying on etcd in a staging environment before deploying the downgrade to the production environment.
+
+Before beginning, [download the snapshot backup](../../op-guide/maintenance/#snapshot-backup). Should something go wrong with the downgrade, it is possible to use this backup to [roll back](#rollback) to the existing etcd version. Please note that the `snapshot` command only backs up the v3 data. For v2 data, see [backing up v2 datastore](/docs/v2.3/admin_guide#backing-up-the-datastore).
+
+Before beginning, download the latest release of etcd 3.4, and make sure its version is >= v3.4.32.
+
+#### Mixed versions
+
+While downgrading, an etcd cluster supports mixed versions of etcd members, and operates with the protocol of the lowest common version. The cluster is considered downgraded once any of its members is downgraded to version 3.4. Internally, etcd members negotiate with each other to determine the overall cluster version, which controls the reported version and the supported features.
+
+#### Limitations
+
+Note: If the cluster only has v3 data and no v2 data, it is not subject to this limitation.
+
+If the cluster is serving a v2 data set larger than 50MB, each newly downgraded member may take up to two minutes to catch up with the existing cluster. Check the size of a recent snapshot to estimate the total data size. In other words, it is safest to wait for 2 minutes between downgrading each member.
+
+For a much larger total data size, 100MB or more, this one-time process might take even longer. Administrators of etcd clusters of this magnitude can contact the [etcd team][etcd-contact] before downgrading, and we'll be happy to provide advice on the procedure.
+
+#### Rollback
+
+If any member has been downgraded to v3.4, the cluster will be downgraded to v3.4, and the cluster and its operations will be "v3.4". You would need to follow the [Upgrade etcd from 3.4 to 3.5](../../upgrades/upgrade_3_5/) instructions to roll back.
+
+Please [download the snapshot backup](../../op-guide/maintenance/#snapshot-backup) to make rolling back the cluster possible even after it has been completely downgraded.
+
+### Downgrade procedure
+
+This example shows how to downgrade a 3-member v3.5 etcd cluster running on a local machine.
+
+#### Step 1: check downgrade requirements
+
+Is the cluster healthy and running v3.5.x?
+
+```bash
+etcdctl --endpoints=localhost:2379,localhost:22379,localhost:32379 endpoint health
+<<COMMENT
+localhost:2379 is healthy: successfully committed proposal: took = 2.118638ms
+localhost:22379 is healthy: successfully committed proposal: took = 3.631388ms
+localhost:32379 is healthy: successfully committed proposal: took = 2.157051ms
+COMMENT
+
+curl http://localhost:2379/version
+<<COMMENT
+{"etcdserver":"3.5.0","etcdcluster":"3.5.0"}
+COMMENT
+
+curl http://localhost:22379/version
+<<COMMENT
+{"etcdserver":"3.5.0","etcdcluster":"3.5.0"}
+COMMENT
+
+curl http://localhost:32379/version
+<<COMMENT
+{"etcdserver":"3.5.0","etcdcluster":"3.5.0"}
+COMMENT
+```
+
+#### Step 2: download snapshot backup from leader
+
+[Download the snapshot backup](../../op-guide/maintenance/#snapshot-backup) to provide a downgrade path should any problems occur.
+
+The etcd leader is guaranteed to have the latest application data, so fetch the snapshot from the leader:
+
+```bash
+curl -sL http://localhost:2379/metrics | grep etcd_server_is_leader
+<<COMMENT
+# HELP etcd_server_is_leader Whether or not this member is a leader. 1 if is, 0 otherwise.
+# TYPE etcd_server_is_leader gauge
+etcd_server_is_leader 1
+COMMENT
+
+curl -sL http://localhost:22379/metrics | grep etcd_server_is_leader
+<<COMMENT
+etcd_server_is_leader 0
+COMMENT
+
+curl -sL http://localhost:32379/metrics | grep etcd_server_is_leader
+<<COMMENT
+etcd_server_is_leader 0
+COMMENT
+
+etcdctl --endpoints=localhost:2379 snapshot save backup.db
+<<COMMENT
+{"level":"info","ts":1526585787.148433,"caller":"snapshot/v3_snapshot.go:109","msg":"created temporary db file","path":"backup.db.part"}
+{"level":"info","ts":1526585787.1485257,"caller":"snapshot/v3_snapshot.go:120","msg":"fetching snapshot","endpoint":"localhost:2379"}
+{"level":"info","ts":1526585787.1519694,"caller":"snapshot/v3_snapshot.go:133","msg":"fetched snapshot","endpoint":"localhost:2379","took":0.003502721}
+{"level":"info","ts":1526585787.1520295,"caller":"snapshot/v3_snapshot.go:142","msg":"saved","path":"backup.db"}
+Snapshot saved at backup.db
+COMMENT
+```
+
+#### Step 3: stop one existing etcd server
+
+When each etcd process is stopped, expected errors will be logged by other cluster members. This is normal since a cluster member connection has been (temporarily) broken:
+
+```bash
+{"level":"info","ts":"2024-05-13T20:52:34.175402Z","caller":"membership/cluster.go:576","msg":"updated cluster version","cluster-id":"ef37ad9dc622a7c4","local-member-id":"91bc3c398fb3c146","from":"3.0","to":"3.5"}
+
+^C{"level":"info","ts":"2024-05-13T20:55:58.220609Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"91bc3c398fb3c146 [term 2] received MsgTimeoutNow from 8211f1d0f64f3269 and starts an election to get leadership."}
+{"level":"info","ts":"2024-05-13T20:55:58.220661Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"91bc3c398fb3c146 is starting a new election at term 2"}
+{"level":"info","ts":"2024-05-13T20:55:58.22068Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"91bc3c398fb3c146 became candidate at term 3"}
+{"level":"info","ts":"2024-05-13T20:55:58.220696Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"91bc3c398fb3c146 received MsgVoteResp from 91bc3c398fb3c146 at term 3"}
+{"level":"info","ts":"2024-05-13T20:55:58.220711Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"91bc3c398fb3c146 [logterm: 2, index: 13] sent MsgVote request to 8211f1d0f64f3269 at term 3"}
+{"level":"info","ts":"2024-05-13T20:55:58.220726Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"91bc3c398fb3c146 [logterm: 2, index: 13] sent MsgVote request to fd422379fda50e48 at term 3"}
+{"level":"info","ts":"2024-05-13T20:55:58.220736Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"raft.node: 91bc3c398fb3c146 lost leader 8211f1d0f64f3269 at term 3"}
+{"level":"info","ts":"2024-05-13T20:55:58.223316Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"91bc3c398fb3c146 received MsgVoteResp from 8211f1d0f64f3269 at term 3"}
+{"level":"info","ts":"2024-05-13T20:55:58.223381Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"91bc3c398fb3c146 has received 2 MsgVoteResp votes and 0 vote rejections"}
+{"level":"info","ts":"2024-05-13T20:55:58.223404Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"91bc3c398fb3c146 became leader at term 3"}
+{"level":"info","ts":"2024-05-13T20:55:58.223423Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"raft.node: 91bc3c398fb3c146 elected leader 91bc3c398fb3c146 at term 3"}
+{"level":"warn","ts":"2024-05-13T20:55:58.321632Z","caller":"rafthttp/stream.go:421","msg":"lost TCP streaming connection with remote peer","stream-reader-type":"stream MsgApp v2","local-member-id":"91bc3c398fb3c146","remote-peer-id":"8211f1d0f64f3269","error":"EOF"}
+{"level":"warn","ts":"2024-05-13T20:55:58.322145Z","caller":"rafthttp/stream.go:421","msg":"lost TCP streaming connection with remote peer","stream-reader-type":"stream Message","local-member-id":"91bc3c398fb3c146","remote-peer-id":"8211f1d0f64f3269","error":"EOF"}
+{"level":"warn","ts":"2024-05-13T20:55:58.323013Z","caller":"rafthttp/peer_status.go:66","msg":"peer became inactive (message send to peer failed)","peer-id":"8211f1d0f64f3269","error":"failed to dial 8211f1d0f64f3269 on stream MsgApp v2 (peer 8211f1d0f64f3269 failed to find local node 91bc3c398fb3c146)"}
+{"level":"warn","ts":"2024-05-13T20:55:58.633777Z","caller":"rafthttp/stream.go:194","msg":"lost TCP streaming connection with remote peer","stream-writer-type":"stream Message","local-member-id":"91bc3c398fb3c146","remote-peer-id":"8211f1d0f64f3269"}
+
+```
+
+#### Step 4: restart the etcd server with the same configuration + `--next-cluster-version-compatible`
+
+Restart the etcd server with the same configuration, but with the new etcd binary and the `--next-cluster-version-compatible` flag.
+
+```diff
+-etcd-old --name s1 \
++etcd-new --name s1 \
+  --data-dir /tmp/etcd/s1 \
+  --listen-client-urls http://localhost:2379 \
+  --advertise-client-urls http://localhost:2379 \
+  --listen-peer-urls http://localhost:2380 \
+  --initial-advertise-peer-urls http://localhost:2380 \
+  --initial-cluster s1=http://localhost:2380,s2=http://localhost:22380,s3=http://localhost:32380 \
+  --initial-cluster-token tkn \
+  --initial-cluster-state existing \
+  --next-cluster-version-compatible
+```
+
+The new v3.4 etcd will publish its information to the cluster. At this point, the cluster will start to operate with the v3.4 protocol, which is the lowest common version.
+
+> `{"level":"info","ts":"2024-05-13T21:05:43.981445Z","caller":"membership/cluster.go:561","msg":"set initial cluster version","cluster-id":"ef37ad9dc622a7c4","local-member-id":"8211f1d0f64f3269","cluster-version":"3.0"}`
+
+> `{"level":"info","ts":"2024-05-13T21:05:43.982188Z","caller":"api/capability.go:77","msg":"enabled capabilities for version","cluster-version":"3.0"}`
+
+> `{"level":"info","ts":"2024-05-13T21:05:43.982312Z","caller":"membership/cluster.go:549","msg":"updated cluster version","cluster-id":"ef37ad9dc622a7c4","local-member-id":"8211f1d0f64f3269","from":"3.0","from":"3.5"}`
+
+> `{"level":"info","ts":"2024-05-13T21:05:43.982376Z","caller":"api/capability.go:77","msg":"enabled capabilities for version","cluster-version":"3.5"}`
+
+> `{"level":"info","ts":"2024-05-13T21:05:44.000672Z","caller":"etcdserver/server.go:2152","msg":"published local member to cluster through raft","local-member-id":"8211f1d0f64f3269","local-member-attributes":"{Name:infra1 ClientURLs:[http://127.0.0.1:2379]}","request-path":"/0/members/8211f1d0f64f3269/attributes","cluster-id":"ef37ad9dc622a7c4","publish-timeout":"7s"}`
+
+> `{"level":"info","ts":"2024-05-13T21:05:46.452631Z","caller":"membership/cluster.go:549","msg":"updated cluster version","cluster-id":"ef37ad9dc622a7c4","local-member-id":"8211f1d0f64f3269","from":"3.5","from":"3.4"}`
+
+Verify that each member, and then the entire cluster, becomes healthy with the new v3.4 etcd binary:
+
+```bash
+etcdctl endpoint health --endpoints=localhost:2379,localhost:22379,localhost:32379
+<<COMMENT
+localhost:32379 is healthy: successfully committed proposal: took = 2.337471ms
+localhost:22379 is healthy: successfully committed proposal: took = 1.130717ms
+localhost:2379 is healthy: successfully committed proposal: took = 2.124843ms
+COMMENT
+```
+
+Members that have not yet been downgraded will log messages like the following:
+
+```
+{"level":"info","ts":"2024-05-13T21:05:46.450764Z","caller":"etcdserver/server.go:2633","msg":"updating cluster version using v2 API","from":"3.5","to":"3.4"}
+{"level":"info","ts":"2024-05-13T21:05:46.452419Z","caller":"membership/cluster.go:576","msg":"updated cluster version","cluster-id":"ef37ad9dc622a7c4","local-member-id":"91bc3c398fb3c146","from":"3.5","to":"3.4"}
+{"level":"info","ts":"2024-05-13T21:05:46.452547Z","caller":"etcdserver/server.go:2652","msg":"cluster version is updated","cluster-version":"3.4"}
+```
+
+#### Step 5: repeat *step 3* and *step 4* for the rest of the members
+
+When all members are downgraded, check the health status and version of the cluster:
+
+```bash
+etcdctl endpoint health --endpoints=localhost:2379,localhost:22379,localhost:32379
+<<COMMENT
+localhost:2379 is healthy: successfully committed proposal: took = 492.834µs
+localhost:22379 is healthy: successfully committed proposal: took = 1.015025ms
+localhost:32379 is healthy: successfully committed proposal: took = 1.853077ms
+COMMENT
+
+curl http://localhost:2379/version
+<<COMMENT
+{"etcdserver":"3.4.32","etcdcluster":"3.4.0"}
+COMMENT
+
+curl http://localhost:22379/version
+<<COMMENT
+{"etcdserver":"3.4.32","etcdcluster":"3.4.0"}
+COMMENT
+
+curl http://localhost:32379/version
+<<COMMENT
+{"etcdserver":"3.4.32","etcdcluster":"3.4.0"}
+COMMENT
+```
+
+[etcd-contact]: https://groups.google.com/g/etcd-dev
diff --git a/content/en/docs/v3.5/downgrades/downgrading-etcd.md b/content/en/docs/v3.5/downgrades/downgrading-etcd.md
@@ -0,0 +1,12 @@
+---
+title: Downgrading etcd clusters and applications
+weight: 6500
+description: Documentation list for downgrading etcd clusters and applications
+---
+
+This section contains documents specific to downgrading etcd clusters and applications.
+
+## Downgrading an etcd v3.x cluster
+* [Downgrade etcd from 3.5 to 3.4](../downgrade_3_5/)
+
+[migrate-apps]: ../../op-guide/v2-migration/
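
The downgrade guide added in this commit requires the 3.4 target to be >= v3.4.32 and verifies members through the `/version` endpoint. As a minimal sketch of how that check could be scripted (not part of the commit itself), the following bash helpers parse the `etcdserver` field from the `/version` JSON shown in the guide and compare it against a minimum version. The function names `server_version` and `version_ge` are hypothetical, and the comparison assumes plain `MAJOR.MINOR.PATCH` strings without pre-release suffixes.

```bash
#!/usr/bin/env bash

# Hypothetical helper: read /version JSON on stdin and print the
# value of the "etcdserver" field, e.g. 3.4.32.
server_version() {
  sed -n 's/.*"etcdserver":"\([^"]*\)".*/\1/p'
}

# Hypothetical helper: succeed if version $1 >= version $2.
# Assumes plain MAJOR.MINOR.PATCH strings (an optional leading "v" is stripped).
version_ge() {
  local a="${1#v}" b="${2#v}"
  local IFS=.
  # Split each version on "." into its numeric components.
  local -a x=($a) y=($b)
  local i xi yi
  for i in 0 1 2; do
    xi="${x[i]:-0}"
    yi="${y[i]:-0}"
    # Force base 10 so components with leading zeros are not read as octal.
    if (( 10#$xi > 10#$yi )); then return 0; fi
    if (( 10#$xi < 10#$yi )); then return 1; fi
  done
  return 0
}

# Example use against a member listening on localhost:2379, as in the guide:
#   v="$(curl -s http://localhost:2379/version | server_version)"
#   version_ge "$v" 3.4.32 && echo "member reports $v, meets the 3.4.32 minimum"
```

Keeping the parsing and the comparison in separate functions means the same comparison can be reused for a version string obtained some other way, for example one typed in by the operator before the new binary is installed.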