Merge pull request #1395 from PhanLe1010/5549

Add manual test for Test Node Drain Policy Setting AND remove deprecated allow-node-drain-with-last-healthy-replica setting
longhorn · Aug 30, 2023 · f3324d7 · f3324d7
2 parents a6a1a29 + 1636954
commit f3324d7
Showing 1 changed file with 143 additions and 0 deletions.
diff --git a/docs/content/manual/pre-release/upgrade/test-node-drain-policy.md b/docs/content/manual/pre-release/upgrade/test-node-drain-policy.md
@@ -0,0 +1,143 @@
+---
+title: Test Node Drain Policy Setting
+---
+
+## With `node-drain-policy` is `block-if-contains-last-replica`
+
+> Note:
+> Starting from v1.5.x, it is not necessary to check for the presence of longhorn-admission-webhook and longhorn-conversion-webhook. 
+> Please refer to the Longhorn issue [#5590](https://github.com/longhorn/longhorn/issues/5590) for more details.
+> 
+> Starting from v1.5.x, observe that the instance-manager-r and instance-manager-e are combined into instance-manager. 
+> Ref [5208](https://github.com/longhorn/longhorn/issues/5208)
+
+### 1. Basic unit tests
+
+#### 1.1 Single worker node cluster with separate master node
+1.1.1 RWO volumes
+* Deploy Longhorn
+* Verify that there is no PDB for `csi-attacher`, `csi-provisioner`, `longhorn-admission-webhook`, and `longhorn-conversion-webhook`
+* Manually create a PVC (simulate the volume which has never been attached scenario)
+* Verify that there is no PDB for `csi-attacher`, `csi-provisioner`, `longhorn-admission-webhook`, and `longhorn-conversion-webhook` because there is no attached volume
+* Create a deployment that uses one RW0 Longhorn volume.
+* Verify that there is PDB for `csi-attacher`, `csi-provisioner`, `longhorn-admission-webhook`, and `longhorn-conversion-webhook`
+* Create another deployment that uses one RWO Longhorn volume. Scale down this deployment so that the volume is detached
+* Drain the node by `kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data --force`
+* Observe that the workload pods are evited first -> PDB of `csi-attacher`, `csi-provisioner`, `longhorn-admission-webhook`, and `longhorn-conversion-webhook` are removed -> `csi-attacher`, `csi-provisioner`, `longhorn-admission-webhook`, and `longhorn-conversion-webhook`, and instance-manager-e pods are evicted -> all volumes are successfully detached
+* Observe that instance-manager-r is NOT evicted.
+
+1.1.2 RWX volume
+* Deploy Longhorn
+* Verify that there is no PDB for `csi-attacher`, `csi-provisioner`, `longhorn-admission-webhook`, and `longhorn-conversion-webhook`
+* Create a deployment of 2 pods that uses one RWX Longhorn volume.
+* Verify that there is PDB for `csi-attacher`, `csi-provisioner`, `longhorn-admission-webhook`, and `longhorn-conversion-webhook`
+* Drain the node by `kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data --force`
+* Observe that the workload pods are evited first -> PDB of  `csi-attacher`, `csi-provisioner`, `longhorn-admission-webhook`, and `longhorn-conversion-webhook` are removed -> `csi-attacher`, `csi-provisioner`, `longhorn-admission-webhook`, and `longhorn-conversion-webhook`, and instance-manager-e pods are evicted -> all volumes are successfully detached
+* Observe that instance-manager-r is NOT evicted.
+
+#### 1.2 multi-node cluster
+1.2.1 Multiple healthy replicas
+* Deploy Longhorn
+* Verify that there is no PDB for `csi-attacher`, `csi-provisioner`, `longhorn-admission-webhook`, and `longhorn-conversion-webhook`
+* Manually create a PVC (simulate the volume which has never been attached scenario)
+* Verify that there is no PDB for `csi-attacher`, `csi-provisioner`, `longhorn-admission-webhook`, and `longhorn-conversion-webhook` because there is no attached volume
+* Create a deployment that uses one RW0 Longhorn volume.
+* Verify that there is PDB for `csi-attacher`, `csi-provisioner`, `longhorn-admission-webhook`, and `longhorn-conversion-webhook`
+* Create another deployment that uses one RWO Longhorn volume. Scale down this deployment so that the volume is detached
+* Create a deployment of 2 pods that uses one RWX Longhorn volume.
+* For each node one by one by `kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data --force`
+* Verify that the drain can finish successfully
+* Uncordon the node and move to next node
+
+1.2.2 Single healthy replicas
+* Given Longhorn with 2 nodes cluster: node-1, node-2
+* Create a 5Gi volume with 1 replica. Let's say the replica is on node-2 
+* Attached the volume to node-1
+* Set `node-drain-policy` to `block-if-contains-last-replica`
+* Attempts to drain node-2 that contains the only replica.
+* The node-2 becomes cordoned.
+* All pods on node-2 are evicted except the replica instance manager pod.
+* The message like below keeps appearing.
+    ```
+    evicting pod longhorn-system/instance-manager-r-xxxxxxxx
+    error when evicting pods/"instance-manager-r-xxxxxxxx" -n "longhorn-system" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
+    ```
+
+
+
+### 2. Upgrade Kubernetes for k3s cluster with standalone System Upgrade Controller deployment
+* Deploy a 3 nodes with each node has all roles (master + worker)
+* Install the [System Upgrade Controller](https://github.com/rancher/system-upgrade-controller#deploying)
+* Deploy Longhorn
+* Manually create a PVC (simulate the volume which has never been attached scenario)
+* Create a deployment that uses one RW0 Longhorn volume.
+* Create another deployment that uses one RWO Longhorn volume. Scale down this deployment so that the volume is detached
+* Create another deployment of 2 pods that uses one RWX Longhorn volume.
+* Deploying the `plan` CR to upgrade Kubernetes similar to:
+```
+  apiVersion: upgrade.cattle.io/v1
+  kind: Plan
+  metadata:
+    name: k3s-server
+    namespace: system-upgrade
+  spec:
+    concurrency: 1
+    cordon: true
+    nodeSelector:
+      matchExpressions:
+      - key: node-role.kubernetes.io/master
+        operator: In
+        values:
+        - "true"
+    serviceAccountName: system-upgrade
+    drain:
+      force: true
+      skipWaitForDeleteTimeout: 60 # 1.18+ (honor pod disruption budgets up to 60 seconds per pod then moves on)
+    upgrade:
+      image: rancher/k3s-upgrade
+    version: v1.21.11+k3s1
+  ```
+Note that the `concurrency` should be 1 to upgrade node one by one. `version` should be a newer K3s version. And it should contains the `drain` stage
+* Verify that the upgrade went smoothly
+* Exec into workload pod and make sure that the data is still there
+* Repeat the upgrading process above 5 times to make sure
+
+### 3. Upgrade Kubernetes for imported  k3s cluster in Rancher
+* Creating a 3-node k3s cluster with each node is both master+worker role. K3s should be an old version such as `v1.21.9+k3s1` so that we can upgrade multiple times. Some instructions to create such cluster is here https://docs.k3s.io/datastore/ha-embedded
+* Import the cluster into Rancher by: go to cluster management -> create new cluster -> generic cluster -> follow the instruction over there
+* Update the upgrade strategy in cluster management -> click three dots menu on the imported cluster -> edit config -> K3s options -> close drain for both control plane and worker node like below:
+![Screenshot from 2023-03-14 17-53-24](https://user-images.githubusercontent.com/22139961/225175432-87f076ac-552c-464a-a466-42356f1ac8e2.png)
+* Install Longhorn
+* Manually create a PVC (simulate the volume which has never been attached scenario)
+* Create a deployment that uses one RW0 Longhorn volume.
+* Create another deployment that uses one RWO Longhorn volume. Scale down this deployment so that the volume is detached
+* Create another deployment of 2 pods that uses one RWX Longhorn volume.
+* Using Rancher to upgrade the cluster to a newer Kubernetes version
+* Verify that the upgrade went smoothly
+* Exec into workload pod and make sure that the data is still there
+
+### 4. Upgrade Kubernetes for provisioned k3s cluster in Rancher
+* Using Rancher to provision a k3s cluster with an old version. For example, `v1.22.11+k3s2`. The cluster has 3 nodes each node with both worker and master role. Set the upgrade strategy as below:
+![Screenshot from 2023-03-14 15-44-34](https://user-images.githubusercontent.com/22139961/225163284-51c017ed-650c-4263-849c-054a0a0abf20.png)
+* Install Longhorn
+* Manually create a PVC (simulate the volume which has never been attached scenario)
+* Create a deployment that uses one RW0 Longhorn volume.
+* Create another deployment that uses one RWO Longhorn volume. Scale down this deployment so that the volume is detached
+* Create another deployment of 2 pods that uses one RWX Longhorn volume.
+* Using Rancher to upgrade the cluster to a newer Kubernetes version
+* Verify that the upgrade went smoothly
+* Exec into workload pod and make sure that the data is still there
+
+## With `node-drain-policy` is `allow-if-replica-is-stopped`
+
+1. Repeat the test cases above. 
+1. Verify that in the test `1.1.1`, `1.1.2`, `1.2.1`, `2`,`3`, and `4`, the drain is successfully. 
+1. Verify that the test `1.2.2`, the drain is still failed
+
+
+## With `node-drain-policy` as `always-allow`
+1. Repeat the test cases above.
+1. Verify that in the test `1.1.1`, `1.1.2`, `1.2.1`, `1.2.2`, `2`,`3`, and `4`, the drain is successfully. 
+
+
+