Commit
Signed-off-by: Jack Lin <[email protected]>
Showing 136 changed files with 5,305 additions and 2,067 deletions.
docs/content/manual/Test-cases-to-reproduce-attach-detach-issues/_index.md (3 additions, 0 deletions)
---
title: Test cases to reproduce issues related to attach detach
---
...-reproduce-attach-detach-issues/attachment-detachment-issues-reproducibility.md (79 additions, 0 deletions)
---
title: Test cases to reproduce attachment-detachment issues
---
**Prerequisite:** Have an environment with just 2 worker nodes, or taint 1 of the 3 worker nodes with `NoExecute` & `NoSchedule`.
This leaves only a constrained fallback and limited room for recovery in the event of failure.
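A possible way to apply the taint (a sketch; the taint key/value is an arbitrary placeholder, and the node name comes from your cluster):
```
# Pick a worker node to constrain.
kubectl get nodes
# Taint it so nothing new schedules there and existing pods are evicted;
# "key=value" is an arbitrary placeholder taint.
kubectl taint nodes <node-name> key=value:NoSchedule key=value:NoExecute
```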
#### 1. Kill the engines and instance manager repeatedly
**Given** 1 RWO and 1 RWX volume, each attached to a pod.
And both volumes have 2 replicas.
And random data is continuously written to the volume using the command `dd if=/dev/urandom of=file1 count=100 bs=1M conv=fsync status=progress oflag=direct,sync`
**When** one replica rebuild is triggered by crashing the IM
And immediately afterwards the IM associated with another replica is crashed
And after crashing the IMs, detaching the volume is attempted either by pod deletion or via the Longhorn UI
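One way to crash the IMs in the steps above is to delete the instance manager pods, assuming pod deletion is an acceptable stand-in for a crash; pod names are placeholders:
```
# Find the instance manager pods and the nodes they run on.
kubectl -n longhorn-system get pods -o wide | grep instance-manager
# "Crash" the IM hosting a replica by force-deleting its pod.
kubectl -n longhorn-system delete pod <instance-manager-pod> --grace-period=0 --force
```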
**Then** the volume should not get stuck in an attaching-detaching loop
**When** the volume is detached and manually attached again.
And the engine running on the node where the volume is attached is killed
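A sketch of one way to kill the engine, assuming the engine runs inside an instance manager pod on the attached node, that `ps` is available in that pod's image, and that the volume name appears in the engine process arguments; all names and the PID are placeholders:
```
# Find the instance manager pod on the node where the volume is attached.
kubectl -n longhorn-system get pods -o wide | grep instance-manager
# Locate the engine process for the volume ...
kubectl -n longhorn-system exec <instance-manager-pod> -- ps -ef | grep <volume-name>
# ... and kill it to simulate an engine crash.
kubectl -n longhorn-system exec <instance-manager-pod> -- kill -9 <engine-pid>
```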
**Then** the volume should recover once the engine is back online.
#### 2. Illegal values in Volume/Snap.meta
**Given** 1 RWO and 1 RWX volume, each attached to a pod.
And both volumes have 2 replicas.
**When** some random values are set in the volume/snapshot meta file
And replica rebuilding is triggered while the IM associated with another replica is also crashed
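A sketch of the corruption step, assuming the default Longhorn data path `/var/lib/longhorn` on the replica's node; the replica directory name is a placeholder:
```
# On the node hosting a replica of the volume:
cd /var/lib/longhorn/replicas/<volume-name>-<suffix>
cat volume.meta                          # inspect the current metadata first
echo 'some-illegal-value' > volume.meta  # overwrite it with a random value
# The same can be done with a snapshot meta file, e.g. volume-snap-<name>.img.meta
```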
**Then** the volume should not get stuck in an attaching-detaching loop
#### 3. Deletion of Volume/Snap.meta
**Given** 1 RWO and 1 RWX volume, each attached to a pod.
And both volumes have 2 replicas.
**When** the volume & snapshot meta files are deleted one by one.
And replica rebuilding is triggered while the IM associated with another replica is also crashed
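The deletion step, under the same assumption of the default data path `/var/lib/longhorn`; directory and snapshot names are placeholders:
```
# On the node hosting a replica of the volume, delete the meta files one at a time:
cd /var/lib/longhorn/replicas/<volume-name>-<suffix>
rm volume.meta
rm volume-snap-<snapshot-name>.img.meta
```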
**Then** the volume should not get stuck in an attaching-detaching loop
#### 4. Failed replica tries to rebuild from another just-crashed replica - https://github.com/longhorn/longhorn/issues/4212
**Given** 1 RWO and 1 RWX volume, each attached to a pod.
And both volumes have 2 replicas.
And random data is continuously written to the volume using the command `dd if=/dev/urandom of=file1 count=100 bs=1M conv=fsync status=progress oflag=direct,sync`
**When** one replica rebuild is triggered by crashing the IM
And immediately afterwards the IM associated with another replica is crashed
**Then** the volume should not get stuck in an attaching-detaching loop.
#### 5. Volume attachment modification/deletion
**Given** a Deployment and a StatefulSet are created with the same name and attached to Longhorn volumes.
And some data is written and its md5sum is computed
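A sketch of the checksum step; the mount path `/data` and the file name are assumptions carried over from the `dd` command used elsewhere in this document:
```
# Record the checksum for comparison after the workloads are recreated.
kubectl exec -it <workload-pod> -- sh -c 'md5sum /data/file1'
```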
**When** the StatefulSet and Deployment are deleted without deleting the volumes
And a new StatefulSet and Deployment with the same names are created with new PVCs.
And before the newly deployed workloads can attach to the volumes, the attached node is rebooted
**Then** after the node reboot completes, the volumes should reflect the right status.
And the newly created Deployment and StatefulSet should get attached to the volumes.
**When** the volume attachments of the above workloads are deleted.
And the above workloads are deleted and recreated immediately.
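A hedged sketch of deleting the attachment objects; depending on the Longhorn version, this can mean the cluster-scoped Kubernetes CSI `VolumeAttachment` objects, Longhorn's own `volumeattachments.longhorn.io` CRs (v1.5+), or both:
```
# Kubernetes CSI VolumeAttachment objects (cluster-scoped):
kubectl get volumeattachments.storage.k8s.io
kubectl delete volumeattachments.storage.k8s.io <name>
# Longhorn's own VolumeAttachment CRs, if present:
kubectl -n longhorn-system get volumeattachments.longhorn.io
kubectl -n longhorn-system delete volumeattachments.longhorn.io <name>
```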
**Then** no multi-attach or other errors should be observed.
#### 6. Use monitoring/WordPress/db workloads
**Given** monitoring, WordPress, and other db-related workloads are deployed in the system
And all the volumes have 2 replicas.
And random data is continuously written to the volume using the command `dd if=/dev/urandom of=file1 count=100 bs=1M conv=fsync status=progress oflag=direct,sync`
**When** one replica rebuild is triggered by crashing the IM
And immediately afterwards the IM associated with another replica is crashed
**Then** the volume should not get stuck in an attaching-detaching loop.
docs/content/manual/pre-release/upgrade/test-node-drain-policy.md (143 additions, 0 deletions)
---
title: Test Node Drain Policy Setting
---
## With `node-drain-policy` set to `block-if-contains-last-replica`
> Note:
> Starting from v1.5.x, it is not necessary to check for the presence of longhorn-admission-webhook and longhorn-conversion-webhook.
> Please refer to Longhorn issue [#5590](https://github.com/longhorn/longhorn/issues/5590) for more details.
>
> Starting from v1.5.x, observe that instance-manager-r and instance-manager-e are combined into a single instance-manager.
> Ref [#5208](https://github.com/longhorn/longhorn/issues/5208)
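One possible way to set the policy from the CLI (the Longhorn UI Settings page works as well), assuming the setting is exposed as a `settings.longhorn.io` CR named `node-drain-policy`:
```
# Set the drain policy for the tests in this section.
kubectl -n longhorn-system patch settings.longhorn.io node-drain-policy \
  --type merge -p '{"value": "block-if-contains-last-replica"}'
```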
### 1. Basic unit tests
#### 1.1 Single worker node cluster with a separate master node
1.1.1 RWO volumes
* Deploy Longhorn
* Verify that there is no PDB for `csi-attacher`, `csi-provisioner`, `longhorn-admission-webhook`, and `longhorn-conversion-webhook` (see the PDB check sketch after this list)
* Manually create a PVC (to simulate a volume that has never been attached)
* Verify that there is still no PDB for `csi-attacher`, `csi-provisioner`, `longhorn-admission-webhook`, and `longhorn-conversion-webhook` because there is no attached volume
* Create a deployment that uses one RWO Longhorn volume.
* Verify that there is a PDB for `csi-attacher`, `csi-provisioner`, `longhorn-admission-webhook`, and `longhorn-conversion-webhook`
* Create another deployment that uses one RWO Longhorn volume. Scale down this deployment so that the volume is detached
* Drain the node with `kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data --force`
* Observe that the workload pods are evicted first -> the PDBs of `csi-attacher`, `csi-provisioner`, `longhorn-admission-webhook`, and `longhorn-conversion-webhook` are removed -> the `csi-attacher`, `csi-provisioner`, `longhorn-admission-webhook`, `longhorn-conversion-webhook`, and instance-manager-e pods are evicted -> all volumes are successfully detached
* Observe that instance-manager-r is NOT evicted.
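A quick sketch of the PDB checks referenced in the list above; the exact PDB names are whatever Longhorn created in your cluster:
```
# List all PDBs in the Longhorn namespace; the csi-* and webhook PDBs should
# appear and disappear as described in the steps above.
kubectl -n longhorn-system get poddisruptionbudgets
```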
1.1.2 RWX volume
* Deploy Longhorn
* Verify that there is no PDB for `csi-attacher`, `csi-provisioner`, `longhorn-admission-webhook`, and `longhorn-conversion-webhook`
* Create a deployment of 2 pods that uses one RWX Longhorn volume.
* Verify that there is a PDB for `csi-attacher`, `csi-provisioner`, `longhorn-admission-webhook`, and `longhorn-conversion-webhook`
* Drain the node with `kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data --force`
* Observe that the workload pods are evicted first -> the PDBs of `csi-attacher`, `csi-provisioner`, `longhorn-admission-webhook`, and `longhorn-conversion-webhook` are removed -> the `csi-attacher`, `csi-provisioner`, `longhorn-admission-webhook`, `longhorn-conversion-webhook`, and instance-manager-e pods are evicted -> all volumes are successfully detached
* Observe that instance-manager-r is NOT evicted.
#### 1.2 Multi-node cluster
1.2.1 Multiple healthy replicas
* Deploy Longhorn
* Verify that there is no PDB for `csi-attacher`, `csi-provisioner`, `longhorn-admission-webhook`, and `longhorn-conversion-webhook`
* Manually create a PVC (to simulate a volume that has never been attached)
* Verify that there is still no PDB for `csi-attacher`, `csi-provisioner`, `longhorn-admission-webhook`, and `longhorn-conversion-webhook` because there is no attached volume
* Create a deployment that uses one RWO Longhorn volume.
* Verify that there is a PDB for `csi-attacher`, `csi-provisioner`, `longhorn-admission-webhook`, and `longhorn-conversion-webhook`
* Create another deployment that uses one RWO Longhorn volume. Scale down this deployment so that the volume is detached
* Create a deployment of 2 pods that uses one RWX Longhorn volume.
* Drain each node one by one with `kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data --force`
* Verify that the drain can finish successfully
* Uncordon the node and move to the next node
1.2.2 Single healthy replica
* Given a Longhorn cluster with 2 nodes: node-1, node-2
* Create a 5Gi volume with 1 replica. Let's say the replica is on node-2
* Attach the volume to node-1
* Set `node-drain-policy` to `block-if-contains-last-replica`
* Attempt to drain node-2, which contains the only replica.
* node-2 becomes cordoned.
* All pods on node-2 are evicted except the replica instance manager pod.
* A message like the one below keeps appearing:
```
evicting pod longhorn-system/instance-manager-r-xxxxxxxx
error when evicting pods/"instance-manager-r-xxxxxxxx" -n "longhorn-system" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
```
### 2. Upgrade Kubernetes for a k3s cluster with a standalone System Upgrade Controller deployment
* Deploy 3 nodes, each with all roles (master + worker)
* Install the [System Upgrade Controller](https://github.com/rancher/system-upgrade-controller#deploying)
* Deploy Longhorn
* Manually create a PVC (to simulate a volume that has never been attached)
* Create a deployment that uses one RWO Longhorn volume.
* Create another deployment that uses one RWO Longhorn volume. Scale down this deployment so that the volume is detached
* Create another deployment of 2 pods that uses one RWX Longhorn volume.
* Deploy a `Plan` CR to upgrade Kubernetes similar to:
```
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: k3s-server
  namespace: system-upgrade
spec:
  concurrency: 1
  cordon: true
  nodeSelector:
    matchExpressions:
      - key: node-role.kubernetes.io/master
        operator: In
        values:
          - "true"
  serviceAccountName: system-upgrade
  drain:
    force: true
    skipWaitForDeleteTimeout: 60 # 1.18+ (honor pod disruption budgets up to 60 seconds per pod, then move on)
  upgrade:
    image: rancher/k3s-upgrade
    version: v1.21.11+k3s1
```
Note that `concurrency` should be 1 to upgrade the nodes one by one, `version` should be a newer K3s version, and the plan should contain the `drain` stage
* Verify that the upgrade went smoothly
* Exec into a workload pod and make sure that the data is still there (see the sketch after this list)
* Repeat the upgrading process above 5 times to make sure the result is consistent
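A sketch of the data check, assuming the workload mounts the volume at `/data` and uses the file name from the earlier `dd` command:
```
# Exec into a workload pod and compare the checksum with the one recorded
# before the upgrade.
kubectl exec -it <workload-pod> -- md5sum /data/file1
```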
### 3. Upgrade Kubernetes for an imported k3s cluster in Rancher
* Create a 3-node k3s cluster where each node has both master and worker roles. K3s should be an old version such as `v1.21.9+k3s1` so that we can upgrade multiple times. Instructions to create such a cluster are here: https://docs.k3s.io/datastore/ha-embedded
* Import the cluster into Rancher: go to Cluster Management -> Create new cluster -> Generic cluster -> follow the instructions there
* Update the upgrade strategy in Cluster Management -> click the three-dots menu on the imported cluster -> Edit Config -> K3s Options -> set the drain option for both control plane and worker nodes as shown below:
![Screenshot from 2023-03-14 17-53-24](https://user-images.githubusercontent.com/22139961/225175432-87f076ac-552c-464a-a466-42356f1ac8e2.png)
* Install Longhorn
* Manually create a PVC (to simulate a volume that has never been attached)
* Create a deployment that uses one RWO Longhorn volume.
* Create another deployment that uses one RWO Longhorn volume. Scale down this deployment so that the volume is detached
* Create another deployment of 2 pods that uses one RWX Longhorn volume.
* Use Rancher to upgrade the cluster to a newer Kubernetes version
* Verify that the upgrade went smoothly
* Exec into a workload pod and make sure that the data is still there
### 4. Upgrade Kubernetes for a provisioned k3s cluster in Rancher
* Use Rancher to provision a k3s cluster with an old version, for example `v1.22.11+k3s2`. The cluster has 3 nodes, each with both worker and master roles. Set the upgrade strategy as below:
![Screenshot from 2023-03-14 15-44-34](https://user-images.githubusercontent.com/22139961/225163284-51c017ed-650c-4263-849c-054a0a0abf20.png)
* Install Longhorn
* Manually create a PVC (to simulate a volume that has never been attached)
* Create a deployment that uses one RWO Longhorn volume.
* Create another deployment that uses one RWO Longhorn volume. Scale down this deployment so that the volume is detached
* Create another deployment of 2 pods that uses one RWX Longhorn volume.
* Use Rancher to upgrade the cluster to a newer Kubernetes version
* Verify that the upgrade went smoothly
* Exec into a workload pod and make sure that the data is still there
## With `node-drain-policy` set to `allow-if-replica-is-stopped`
1. Repeat the test cases above.
1. Verify that in tests `1.1.1`, `1.1.2`, `1.2.1`, `2`, `3`, and `4`, the drain is successful.
1. Verify that in test `1.2.2`, the drain still fails
## With `node-drain-policy` set to `always-allow`
1. Repeat the test cases above.
1. Verify that in tests `1.1.1`, `1.1.2`, `1.2.1`, `1.2.2`, `2`, `3`, and `4`, the drain is successful.