Commit
Signed-off-by: Jack Lin <[email protected]>
Showing 136 changed files with 5,305 additions and 2,067 deletions.
docs/content/manual/Test-cases-to-reproduce-attach-detach-issues/_index.md (3 additions, 0 deletions)
---
title: Test cases to reproduce issues related to attach detach
---
...-reproduce-attach-detach-issues/attachment-detachment-issues-reproducibility.md (79 additions, 0 deletions)
---
title: Test cases to reproduce attachment-detachment issues
---
**Prerequisite:** Have an environment with just 2 worker nodes, or taint 1 of the 3 worker nodes with `NoExecute` & `NoSchedule`.
This leaves only a constrained fallback and limited room for recovery in the event of failure.
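A possible way to apply the taint (a sketch; the taint key/value is an arbitrary placeholder, and the node name comes from your cluster):
```
# Pick a worker node to constrain.
kubectl get nodes
# Taint it so nothing new schedules there and existing pods are evicted;
# "key=value" is an arbitrary placeholder taint.
kubectl taint nodes <node-name> key=value:NoSchedule key=value:NoExecute
```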
#### 1. Kill the engines and instance manager repeatedly
**Given** 1 RWO and 1 RWX volume, each attached to a pod.
And both volumes have 2 replicas.
And random data is continuously written to the volume using the command `dd if=/dev/urandom of=file1 count=100 bs=1M conv=fsync status=progress oflag=direct,sync`
**When** one replica rebuild is triggered by crashing the IM
And immediately afterwards the IM associated with another replica is crashed
And after crashing the IMs, detaching the volume is attempted either by pod deletion or via the Longhorn UI
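One way to crash the IMs in the steps above is to delete the instance manager pods, assuming pod deletion is an acceptable stand-in for a crash; pod names are placeholders:
```
# Find the instance manager pods and the nodes they run on.
kubectl -n longhorn-system get pods -o wide | grep instance-manager
# "Crash" the IM hosting a replica by force-deleting its pod.
kubectl -n longhorn-system delete pod <instance-manager-pod> --grace-period=0 --force
```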
**Then** the volume should not get stuck in an attaching-detaching loop
**When** the volume is detached and manually attached again.
And the engine running on the node where the volume is attached is killed
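A sketch of one way to kill the engine, assuming the engine runs inside an instance manager pod on the attached node, that `ps` is available in that pod's image, and that the volume name appears in the engine process arguments; all names and the PID are placeholders:
```
# Find the instance manager pod on the node where the volume is attached.
kubectl -n longhorn-system get pods -o wide | grep instance-manager
# Locate the engine process for the volume ...
kubectl -n longhorn-system exec <instance-manager-pod> -- ps -ef | grep <volume-name>
# ... and kill it to simulate an engine crash.
kubectl -n longhorn-system exec <instance-manager-pod> -- kill -9 <engine-pid>
```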
**Then** the volume should recover once the engine is back online.
#### 2. Illegal values in Volume/Snap.meta
**Given** 1 RWO and 1 RWX volume, each attached to a pod.
And both volumes have 2 replicas.
**When** some random values are set in the volume/snapshot meta file
And replica rebuilding is triggered while the IM associated with another replica is also crashed
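A sketch of the corruption step, assuming the default Longhorn data path `/var/lib/longhorn` on the replica's node; the replica directory name is a placeholder:
```
# On the node hosting a replica of the volume:
cd /var/lib/longhorn/replicas/<volume-name>-<suffix>
cat volume.meta                          # inspect the current metadata first
echo 'some-illegal-value' > volume.meta  # overwrite it with a random value
# The same can be done with a snapshot meta file, e.g. volume-snap-<name>.img.meta
```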
**Then** the volume should not get stuck in an attaching-detaching loop
#### 3. Deletion of Volume/Snap.meta
**Given** 1 RWO and 1 RWX volume, each attached to a pod.
And both volumes have 2 replicas.
**When** the volume & snapshot meta files are deleted one by one.
And replica rebuilding is triggered while the IM associated with another replica is also crashed
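The deletion step, under the same assumption of the default data path `/var/lib/longhorn`; directory and snapshot names are placeholders:
```
# On the node hosting a replica of the volume, delete the meta files one at a time:
cd /var/lib/longhorn/replicas/<volume-name>-<suffix>
rm volume.meta
rm volume-snap-<snapshot-name>.img.meta
```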
**Then** the volume should not get stuck in an attaching-detaching loop
#### 4. Failed replica tries to rebuild from another just-crashed replica - https://github.com/longhorn/longhorn/issues/4212
**Given** 1 RWO and 1 RWX volume, each attached to a pod.
And both volumes have 2 replicas.
And random data is continuously written to the volume using the command `dd if=/dev/urandom of=file1 count=100 bs=1M conv=fsync status=progress oflag=direct,sync`
**When** one replica rebuild is triggered by crashing the IM
And immediately afterwards the IM associated with another replica is crashed
**Then** the volume should not get stuck in an attaching-detaching loop.
#### 5. Volume attachment modification/deletion
**Given** a Deployment and a StatefulSet are created with the same name and attached to Longhorn volumes.
And some data is written and its md5sum is computed
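A sketch of the checksum step; the mount path `/data` and the file name are assumptions carried over from the `dd` command used elsewhere in this document:
```
# Record the checksum for comparison after the workloads are recreated.
kubectl exec -it <workload-pod> -- sh -c 'md5sum /data/file1'
```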
**When** the StatefulSet and Deployment are deleted without deleting the volumes
And a new StatefulSet and Deployment with the same names are created with new PVCs.
And before the newly deployed workloads can attach to the volumes, the attached node is rebooted
**Then** after the node reboot completes, the volumes should reflect the right status.
And the newly created Deployment and StatefulSet should get attached to the volumes.
**When** the volume attachments of the above workloads are deleted.
And the above workloads are deleted and recreated immediately.
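A hedged sketch of deleting the attachment objects; depending on the Longhorn version, this can mean the cluster-scoped Kubernetes CSI `VolumeAttachment` objects, Longhorn's own `volumeattachments.longhorn.io` CRs (v1.5+), or both:
```
# Kubernetes CSI VolumeAttachment objects (cluster-scoped):
kubectl get volumeattachments.storage.k8s.io
kubectl delete volumeattachments.storage.k8s.io <name>
# Longhorn's own VolumeAttachment CRs, if present:
kubectl -n longhorn-system get volumeattachments.longhorn.io
kubectl -n longhorn-system delete volumeattachments.longhorn.io <name>
```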
**Then** no multi-attach or other errors should be observed.
#### 6. Use monitoring/WordPress/db workloads
**Given** monitoring, WordPress, and other db-related workloads are deployed in the system
And all the volumes have 2 replicas.
And random data is continuously written to the volume using the command `dd if=/dev/urandom of=file1 count=100 bs=1M conv=fsync status=progress oflag=direct,sync`
**When** one replica rebuild is triggered by crashing the IM
And immediately afterwards the IM associated with another replica is crashed
**Then** the volume should not get stuck in an attaching-detaching loop.
docs/content/manual/pre-release/upgrade/test-node-drain-policy.md (143 additions, 0 deletions)
---
title: Test Node Drain Policy Setting
---
## With `node-drain-policy` set to `block-if-contains-last-replica`
> Note:
> Starting from v1.5.x, it is not necessary to check for the presence of longhorn-admission-webhook and longhorn-conversion-webhook.
> Please refer to Longhorn issue [#5590](https://github.com/longhorn/longhorn/issues/5590) for more details.
>
> Starting from v1.5.x, observe that instance-manager-r and instance-manager-e are combined into a single instance-manager.
> Ref [#5208](https://github.com/longhorn/longhorn/issues/5208)
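One possible way to set the policy from the CLI (the Longhorn UI Settings page works as well), assuming the setting is exposed as a `settings.longhorn.io` CR named `node-drain-policy`:
```
# Set the drain policy for the tests in this section.
kubectl -n longhorn-system patch settings.longhorn.io node-drain-policy \
  --type merge -p '{"value": "block-if-contains-last-replica"}'
```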
### 1. Basic unit tests
#### 1.1 Single worker node cluster with a separate master node
1.1.1 RWO volumes
* Deploy Longhorn
* Verify that there is no PDB for `csi-attacher`, `csi-provisioner`, `longhorn-admission-webhook`, and `longhorn-conversion-webhook` (see the PDB check sketch after this list)
* Manually create a PVC (to simulate a volume that has never been attached)
* Verify that there is still no PDB for `csi-attacher`, `csi-provisioner`, `longhorn-admission-webhook`, and `longhorn-conversion-webhook` because there is no attached volume
* Create a deployment that uses one RWO Longhorn volume.
* Verify that there is a PDB for `csi-attacher`, `csi-provisioner`, `longhorn-admission-webhook`, and `longhorn-conversion-webhook`
* Create another deployment that uses one RWO Longhorn volume. Scale down this deployment so that the volume is detached
* Drain the node with `kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data --force`
* Observe that the workload pods are evicted first -> the PDBs of `csi-attacher`, `csi-provisioner`, `longhorn-admission-webhook`, and `longhorn-conversion-webhook` are removed -> the `csi-attacher`, `csi-provisioner`, `longhorn-admission-webhook`, `longhorn-conversion-webhook`, and instance-manager-e pods are evicted -> all volumes are successfully detached
* Observe that instance-manager-r is NOT evicted.
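A quick sketch of the PDB checks referenced in the list above; the exact PDB names are whatever Longhorn created in your cluster:
```
# List all PDBs in the Longhorn namespace; the csi-* and webhook PDBs should
# appear and disappear as described in the steps above.
kubectl -n longhorn-system get poddisruptionbudgets
```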
1.1.2 RWX volume
* Deploy Longhorn
* Verify that there is no PDB for `csi-attacher`, `csi-provisioner`, `longhorn-admission-webhook`, and `longhorn-conversion-webhook`
* Create a deployment of 2 pods that uses one RWX Longhorn volume.
* Verify that there is a PDB for `csi-attacher`, `csi-provisioner`, `longhorn-admission-webhook`, and `longhorn-conversion-webhook`
* Drain the node with `kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data --force`
* Observe that the workload pods are evicted first -> the PDBs of `csi-attacher`, `csi-provisioner`, `longhorn-admission-webhook`, and `longhorn-conversion-webhook` are removed -> the `csi-attacher`, `csi-provisioner`, `longhorn-admission-webhook`, `longhorn-conversion-webhook`, and instance-manager-e pods are evicted -> all volumes are successfully detached
* Observe that instance-manager-r is NOT evicted.
#### 1.2 Multi-node cluster
1.2.1 Multiple healthy replicas
* Deploy Longhorn
* Verify that there is no PDB for `csi-attacher`, `csi-provisioner`, `longhorn-admission-webhook`, and `longhorn-conversion-webhook`
* Manually create a PVC (to simulate a volume that has never been attached)
* Verify that there is still no PDB for `csi-attacher`, `csi-provisioner`, `longhorn-admission-webhook`, and `longhorn-conversion-webhook` because there is no attached volume
* Create a deployment that uses one RWO Longhorn volume.
* Verify that there is a PDB for `csi-attacher`, `csi-provisioner`, `longhorn-admission-webhook`, and `longhorn-conversion-webhook`
* Create another deployment that uses one RWO Longhorn volume. Scale down this deployment so that the volume is detached
* Create a deployment of 2 pods that uses one RWX Longhorn volume.
* Drain each node one by one with `kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data --force`
* Verify that the drain can finish successfully
* Uncordon the node and move to the next node
1.2.2 Single healthy replica
* Given a Longhorn cluster with 2 nodes: node-1, node-2
* Create a 5Gi volume with 1 replica. Let's say the replica is on node-2
* Attach the volume to node-1
* Set `node-drain-policy` to `block-if-contains-last-replica`
* Attempt to drain node-2, which contains the only replica.
* node-2 becomes cordoned.
* All pods on node-2 are evicted except the replica instance manager pod.
* A message like the one below keeps appearing:
```
evicting pod longhorn-system/instance-manager-r-xxxxxxxx
error when evicting pods/"instance-manager-r-xxxxxxxx" -n "longhorn-system" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
```
### 2. Upgrade Kubernetes for a k3s cluster with a standalone System Upgrade Controller deployment
* Deploy 3 nodes, each with all roles (master + worker)
* Install the [System Upgrade Controller](https://github.com/rancher/system-upgrade-controller#deploying)
* Deploy Longhorn
* Manually create a PVC (to simulate a volume that has never been attached)
* Create a deployment that uses one RWO Longhorn volume.
* Create another deployment that uses one RWO Longhorn volume. Scale down this deployment so that the volume is detached
* Create another deployment of 2 pods that uses one RWX Longhorn volume.
* Deploy a `Plan` CR to upgrade Kubernetes similar to:
```
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: k3s-server
  namespace: system-upgrade
spec:
  concurrency: 1
  cordon: true
  nodeSelector:
    matchExpressions:
      - key: node-role.kubernetes.io/master
        operator: In
        values:
          - "true"
  serviceAccountName: system-upgrade
  drain:
    force: true
    skipWaitForDeleteTimeout: 60 # 1.18+ (honor pod disruption budgets up to 60 seconds per pod, then move on)
  upgrade:
    image: rancher/k3s-upgrade
    version: v1.21.11+k3s1
```
Note that `concurrency` should be 1 to upgrade the nodes one by one, `version` should be a newer K3s version, and the plan should contain the `drain` stage
* Verify that the upgrade went smoothly
* Exec into a workload pod and make sure that the data is still there (see the sketch after this list)
* Repeat the upgrading process above 5 times to make sure the result is consistent
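A sketch of the data check, assuming the workload mounts the volume at `/data` and uses the file name from the earlier `dd` command:
```
# Exec into a workload pod and compare the checksum with the one recorded
# before the upgrade.
kubectl exec -it <workload-pod> -- md5sum /data/file1
```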
### 3. Upgrade Kubernetes for an imported k3s cluster in Rancher
* Create a 3-node k3s cluster where each node has both master and worker roles. K3s should be an old version such as `v1.21.9+k3s1` so that we can upgrade multiple times. Instructions to create such a cluster are here: https://docs.k3s.io/datastore/ha-embedded
* Import the cluster into Rancher: go to Cluster Management -> Create new cluster -> Generic cluster -> follow the instructions there
* Update the upgrade strategy in Cluster Management -> click the three-dots menu on the imported cluster -> Edit Config -> K3s Options -> set the drain option for both control plane and worker nodes as shown below:
![Screenshot from 2023-03-14 17-53-24](https://user-images.githubusercontent.com/22139961/225175432-87f076ac-552c-464a-a466-42356f1ac8e2.png)
* Install Longhorn
* Manually create a PVC (to simulate a volume that has never been attached)
* Create a deployment that uses one RWO Longhorn volume.
* Create another deployment that uses one RWO Longhorn volume. Scale down this deployment so that the volume is detached
* Create another deployment of 2 pods that uses one RWX Longhorn volume.
* Use Rancher to upgrade the cluster to a newer Kubernetes version
* Verify that the upgrade went smoothly
* Exec into a workload pod and make sure that the data is still there
### 4. Upgrade Kubernetes for a provisioned k3s cluster in Rancher
* Use Rancher to provision a k3s cluster with an old version, for example `v1.22.11+k3s2`. The cluster has 3 nodes, each with both worker and master roles. Set the upgrade strategy as below:
![Screenshot from 2023-03-14 15-44-34](https://user-images.githubusercontent.com/22139961/225163284-51c017ed-650c-4263-849c-054a0a0abf20.png)
* Install Longhorn
* Manually create a PVC (to simulate a volume that has never been attached)
* Create a deployment that uses one RWO Longhorn volume.
* Create another deployment that uses one RWO Longhorn volume. Scale down this deployment so that the volume is detached
* Create another deployment of 2 pods that uses one RWX Longhorn volume.
* Use Rancher to upgrade the cluster to a newer Kubernetes version
* Verify that the upgrade went smoothly
* Exec into a workload pod and make sure that the data is still there
## With `node-drain-policy` set to `allow-if-replica-is-stopped`
1. Repeat the test cases above.
1. Verify that in tests `1.1.1`, `1.1.2`, `1.2.1`, `2`, `3`, and `4`, the drain is successful.
1. Verify that in test `1.2.2`, the drain still fails
## With `node-drain-policy` set to `always-allow`
1. Repeat the test cases above.
1. Verify that in tests `1.1.1`, `1.1.2`, `1.2.1`, `1.2.2`, `2`, `3`, and `4`, the drain is successful.