Merge branch 'master' into khushboo-rancher-patch-1
khushboo-rancher authored Nov 1, 2023
2 parents 4062e0d + 7ea8d89 commit 535e115
Showing 50 changed files with 1,146 additions and 270 deletions.
@@ -0,0 +1,3 @@
---
title: Test cases to reproduce issues related to attach detach
---
@@ -0,0 +1,79 @@
---
title: Test cases to reproduce attachment-detachment issues
---
**Prerequisite:** Have an environment with only 2 worker nodes, or taint 1 out of 3 worker nodes with `NoExecute` & `NoSchedule`.
This constrains the cluster so that it has only a limited fallback and little room for recovery in the event of failure.
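A hedged example of applying the taint (the node name and taint key are placeholders; adjust to your cluster):
```
# Taint one worker node so workloads are neither scheduled on it nor kept running on it.
kubectl taint node <worker-node-3> longhorn-test=exclude:NoSchedule
kubectl taint node <worker-node-3> longhorn-test=exclude:NoExecute
```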


#### 1. Kill the engines and instance manager repeatedly
**Given** 1 RWO and 1 RWX volume are each attached to a pod.
And both volumes have 2 replicas.
And random data is continuously written to the volumes using the command `dd if=/dev/urandom of=file1 count=100 bs=1M conv=fsync status=progress oflag=direct,sync`

**When** replica rebuilding is triggered by crashing the instance manager (IM) of one replica
And the IM associated with another replica is crashed immediately afterwards
And after crashing the IMs, detaching the volume is attempted either by deleting the pod or via the Longhorn UI

**Then** the volume should not get stuck in an attaching-detaching loop.

**When** the volume is detached and manually attached again.
And the engine running on the node where the volume is attached is killed

**Then** the volume should recover once the engine is back online.
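For the IM-crash and detach steps above, a hedged sketch (pod and workload names are placeholders; Longhorn is assumed to be installed in the `longhorn-system` namespace):
```
# Find the instance-manager pod hosting a replica of the volume, then delete it to simulate a crash.
kubectl -n longhorn-system get pods -o wide | grep instance-manager
kubectl -n longhorn-system delete pod <instance-manager-pod>

# Attempt to detach by deleting the workload pod that uses the volume.
kubectl delete pod <workload-pod>
```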

#### 2. Illegal values in Volume/Snap.meta
**Given** 1 RWO and 1 RWX volume are each attached to a pod.
And both volumes have 2 replicas.

**When** some random values are set in the volume/snapshot meta files
And replica rebuilding is triggered and the IM associated with another replica is also crashed

**Then** the volume should not get stuck in an attaching-detaching loop.
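A hedged way to inject an illegal value (the path assumes the default Longhorn data directory and the `Size` field layout of `volume.meta`; the replica directory name is a placeholder):
```
# Run on the node hosting the replica.
REPLICA_DIR=/var/lib/longhorn/replicas/<volume-name>-<id>
# Replace the numeric Size with a bogus string as one example of an illegal value.
sed -i 's/"Size":[0-9]*/"Size":"bogus"/' ${REPLICA_DIR}/volume.meta
cat ${REPLICA_DIR}/volume.meta
```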


#### 3. Deletion of Volume/Snap.meta
**Given** 1 RWO and 1 RWX volume are each attached to a pod.
And both volumes have 2 replicas.

**When** the volume and snapshot meta files are deleted one by one.
And replica rebuilding is triggered and the IM associated with another replica is also crashed

**Then** the volume should not get stuck in an attaching-detaching loop.
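A hedged sketch of deleting the meta files (the data path and the snapshot meta file naming are assumptions based on a default Longhorn installation; names are placeholders):
```
REPLICA_DIR=/var/lib/longhorn/replicas/<volume-name>-<id>
rm ${REPLICA_DIR}/volume.meta
rm ${REPLICA_DIR}/volume-snap-<snapshot-name>.img.meta
```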

#### 4. Failed replica tries to rebuild from another just-crashed replica - https://github.com/longhorn/longhorn/issues/4212
**Given** 1 RWO and 1 RWX volume are each attached to a pod.
And both volumes have 2 replicas.
And random data is continuously written to the volumes using the command `dd if=/dev/urandom of=file1 count=100 bs=1M conv=fsync status=progress oflag=direct,sync`

**When** replica rebuilding is triggered by crashing the instance manager (IM) of one replica
And the IM associated with another replica is crashed immediately afterwards

**Then** the volume should not get stuck in an attaching-detaching loop.

#### 5. Volume attachment Modification/deletion

**Given** a Deployment and a StatefulSet are created with the same name and attached to Longhorn volumes.
And some data is written and its md5sum is computed

**When** the StatefulSet and Deployment are deleted without deleting the volumes
And a new StatefulSet and Deployment with the same names are created with new PVCs.
And the attached node is rebooted before the newly deployed workloads can attach to the volumes

**Then** after the node reboot completes, the volumes should reflect the right status.
And the newly created Deployment and StatefulSet should get attached to the volumes.

**When** the volume attachments of the above workloads are deleted.
And the above workloads are deleted and recreated immediately.

**Then** no multi-attach or other errors should be observed.
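A hedged way to list and delete the Longhorn volume attachments for this step (assumes the `volumeattachments.longhorn.io` CRD introduced with the attach/detach refactor in Longhorn v1.5+; names are placeholders):
```
kubectl -n longhorn-system get volumeattachments.longhorn.io
kubectl -n longhorn-system delete volumeattachments.longhorn.io <volume-name>
```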

#### 6. Use monitoring/WordPress/DB workloads
**Given** monitoring, WordPress, and other database-related workloads are deployed in the system
And all the volumes have 2 replicas.
And random data is continuously written to the volumes using the command `dd if=/dev/urandom of=file1 count=100 bs=1M conv=fsync status=progress oflag=direct,sync`

**When** replica rebuilding is triggered by crashing the instance manager (IM) of one replica
And the IM associated with another replica is crashed immediately afterwards

**Then** the volume should not get stuck in an attaching-detaching loop.

@@ -4,9 +4,19 @@ title: Storage Network Test
## Related issue:
https://github.com/longhorn/longhorn/issues/2285

## Test Steps
## Test Multus version below v4.0.0
**Given** Set up the Longhorn environment as mentioned [here](https://longhorn.github.io/longhorn-tests/manual/release-specific/v1.3.0/test-storage-network/)

**When** Run Longhorn core tests on the environment.

**Then** All the tests should pass.

## Related issue:
https://github.com/longhorn/longhorn/issues/6953

## Test Multus version above v4.0.0
**Given** Set up the Longhorn environment as mentioned [here](https://longhorn.github.io/longhorn-tests/manual/release-specific/v1.6.0/test-storage-network/)

**When** Run Longhorn core tests on the environment.

**Then** All the tests should pass.
@@ -0,0 +1,47 @@
---
title: Test `Rebuild` in volume.meta blocks engine start
---

## Related issue
https://github.com/longhorn/longhorn/issues/6626

## Test with patched image

**Given** a patched longhorn-engine image with the following code change.
```diff
diff --git a/pkg/sync/sync.go b/pkg/sync/sync.go
index b48ddd46..c4523f11 100644
--- a/pkg/sync/sync.go
+++ b/pkg/sync/sync.go
@@ -534,9 +534,9 @@ func (t *Task) reloadAndVerify(address, instanceName string, repClient *replicaC
return err
}

- if err := repClient.SetRebuilding(false); err != nil {
- return err
- }
+ // if err := repClient.SetRebuilding(false); err != nil {
+ // return err
+ // }
return nil
}
```
**And** a patched longhorn-instance-manager image with the longhorn-engine vendor updated.
**And** Longhorn is installed with the patched images.
**And** the `data-locality` setting is set to `disabled`.
**And** the `auto-salvage` setting is set to `true`.
**And** a new StorageClass is created with `numberOfReplicas` set to `1`.
**And** a StatefulSet is created with `replicas` set to `1`.
**And** the node of the StatefulSet Pod and the node of its volume Replica are different. This is necessary to trigger rebuilding in response to the data-locality setting update later.
**And** the volume has 1 running Replica.
**And** data exists in the volume.

**When** the `data-locality` setting is set to `best-effort`.
**And** the replica rebuilding is completed.
**And** the `Rebuilding` field in the replica's `volume.meta` file is `true`.
**And** the instance manager Pod of the Replica is deleted.

**Then** the Replica should be running.
**And** the StatefulSet Pod should restart.
**And** the `Rebuilding` field in the replica's `volume.meta` file should be `false`.
**And** the data should remain intact.
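A hedged way to check the `Rebuilding` flag on the replica node (the path assumes the default Longhorn data directory; the replica directory name is a placeholder):
```
# Print only the Rebuilding field from the replica's volume.meta.
grep -o '"Rebuilding":[a-z]*' /var/lib/longhorn/replicas/<volume-name>-<id>/volume.meta
```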
216 changes: 216 additions & 0 deletions docs/content/manual/release-specific/v1.6.0/test-storage-network.md
@@ -0,0 +1,216 @@
---
title: Setup and test storage network when Multus version is above v4.0.0
---

## Related issue
https://github.com/longhorn/longhorn/issues/6953

## Test storage network

### Create AWS instances
**Given** Create a VPC (an optional AWS CLI sketch of these setup steps follows this list).
- VPC only
- IPv4 CIDR 10.0.0.0/16

*And* Create an internet gateway.
- Attach to VPC

*And* Add a route to the internet gateway in the VPC `Main route table`, under `Routes`.
- Destination 0.0.0.0/0

*And* Create 2 subnets in the VPC.
- Subnet-1: 10.0.1.0/24
- Subnet-2: 10.0.2.0/24

*And* Launch 3 EC2 instances.
- Use the created VPC
- Use subnet-1 for network interface 1
- Use subnet-2 for network interface 2
- Disable `Auto-assign public IP`
- Add security group inbound rule to allow `All traffic` from `Anywhere-IPv4`
- Stop `Source/destination check`

*And* Create 3 elastic IPs.

*And* Associate one of the Elastic IPs with network interface 1 of one of the EC2 instances.
- Repeat for the other 2 EC2 instances with the remaining Elastic IPs.
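The following is a partial, hedged AWS CLI sketch of the VPC, gateway, and subnet steps above (IDs are captured into placeholder variables; the console flow above remains the reference):
```
# Create the VPC and capture its ID.
VPC_ID=$(aws ec2 create-vpc --cidr-block 10.0.0.0/16 --query 'Vpc.VpcId' --output text)

# Create and attach an internet gateway, then route 0.0.0.0/0 through it in the main route table.
IGW_ID=$(aws ec2 create-internet-gateway --query 'InternetGateway.InternetGatewayId' --output text)
aws ec2 attach-internet-gateway --internet-gateway-id ${IGW_ID} --vpc-id ${VPC_ID}
RTB_ID=$(aws ec2 describe-route-tables --filters Name=vpc-id,Values=${VPC_ID} \
  --query 'RouteTables[0].RouteTableId' --output text)
aws ec2 create-route --route-table-id ${RTB_ID} --destination-cidr-block 0.0.0.0/0 --gateway-id ${IGW_ID}

# Create the two subnets used for the two network interfaces.
aws ec2 create-subnet --vpc-id ${VPC_ID} --cidr-block 10.0.1.0/24
aws ec2 create-subnet --vpc-id ${VPC_ID} --cidr-block 10.0.2.0/24
```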


### Setup instances

**Given** a K3s Kubernetes cluster installed on the EC2 instances.

*And* Deploy Multus DaemonSet on the control-plane node.
- Download YAML.
```
curl -O https://raw.githubusercontent.com/k8snetworkplumbingwg/multus-cni/v4.0.2/deployments/multus-daemonset.yml
```
- Edit YAML.
```
diff --git a/deployments/multus-daemonset.yml b/deployments/multus-daemonset.yml
index ab626a66..a7228942 100644
--- a/deployments/multus-daemonset.yml
+++ b/deployments/multus-daemonset.yml
@@ -145,7 +145,7 @@ data:
]
}
],
- "kubeconfig": "/etc/cni/net.d/multus.d/multus.kubeconfig"
+ "kubeconfig": "/var/lib/rancher/k3s/agent/etc/cni/net.d/multus.d/multus.kubeconfig"
}
---
apiVersion: apps/v1
@@ -179,12 +179,13 @@ spec:
serviceAccountName: multus
containers:
- name: kube-multus
- image: ghcr.io/k8snetworkplumbingwg/multus-cni:snapshot
+ image: ghcr.io/k8snetworkplumbingwg/multus-cni:v4.0.2
command: ["/thin_entrypoint"]
args:
- "--multus-conf-file=auto"
- "--multus-autoconfig-dir=/host/etc/cni/net.d"
- "--cni-conf-dir=/host/etc/cni/net.d"
+ - "--multus-kubeconfig-file-host=/var/lib/rancher/k3s/agent/etc/cni/net.d/multus.d/multus.kubeconfig"
resources:
requests:
cpu: "100m"
@@ -222,10 +223,10 @@ spec:
volumes:
- name: cni
hostPath:
- path: /etc/cni/net.d
+ path: /var/lib/rancher/k3s/agent/etc/cni/net.d
- name: cnibin
hostPath:
- path: /opt/cni/bin
+ path: /var/lib/rancher/k3s/data/current/bin
- name: multus-cfg
configMap:
name: multus-cni-config
```
- Apply YAML to K8s cluster.
```
kubectl apply -f multus-daemonset.yml.new
```

*And* Download `ipvlan` and copy it to the K3s binaries path on all cluster nodes.
```
curl -OL https://github.com/containernetworking/plugins/releases/download/v1.3.0/cni-plugins-linux-amd64-v1.3.0.tgz
tar -zxvf cni-plugins-linux-amd64-v1.3.0.tgz
cp ipvlan /var/lib/rancher/k3s/data/current/bin/
```

*And* Set up flannel on all cluster nodes.
```
# Update nodes eth1 IP to N1, N2, N3
N1="10.0.2.95"
N2="10.0.2.139"
N3="10.0.2.158"
NODES=(${N1} ${N2} ${N3})
STORAGE_NETWORK_PREFIX="192.168"
ETH1_IP=`ip a | grep eth1 | grep -Eo 'inet (addr:)?([0-9]*\.){3}[0-9]*' | awk '{print $2}'`
count=1
for n in "${NODES[@]}"; do
[[ ${ETH1_IP} != $n ]] && ((count=count+1)) && continue
NET=$count
break
done
cat << EOF > /run/flannel/multus-subnet-${STORAGE_NETWORK_PREFIX}.0.0.env
FLANNEL_NETWORK=${STORAGE_NETWORK_PREFIX}.0.0/16
FLANNEL_SUBNET=${STORAGE_NETWORK_PREFIX}.${NET}.0/24
FLANNEL_MTU=1472
FLANNEL_IPMASQ=true
EOF
```
*And* Set up routes on all cluster nodes.
```
# Update nodes eth1 IP to N1, N2, N3
N1="10.0.2.95"
N2="10.0.2.139"
N3="10.0.2.158"
STORAGE_NETWORK_PREFIX="192.168"
ACTION="add"
ETH1_IP=`ip a | grep eth1 | grep -Eo 'inet (addr:)?([0-9]*\.){3}[0-9]*' | awk '{print $2}'`
[[ ${ETH1_IP} != ${N1} ]] && ip r ${ACTION} ${STORAGE_NETWORK_PREFIX}.1.0/24 via ${N1} dev eth1
[[ ${ETH1_IP} != ${N2} ]] && ip r ${ACTION} ${STORAGE_NETWORK_PREFIX}.2.0/24 via ${N2} dev eth1
[[ ${ETH1_IP} != ${N3} ]] && ip r ${ACTION} ${STORAGE_NETWORK_PREFIX}.3.0/24 via ${N3} dev eth1
```

*And* Deploy `NetworkAttachmentDefinition`.
```
cat << EOF > nad-192-168-0-0.yaml
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
name: demo-192-168-0-0
namespace: kube-system
#namespace: longhorn-system
spec:
config: '{
"cniVersion": "0.3.1",
"type": "flannel",
"subnetFile": "/run/flannel/multus-subnet-192.168.0.0.env",
"dataDir": "/var/lib/cni/multus-subnet-192.168.0.0",
"delegate": {
"type": "ipvlan",
"master": "eth1",
"mode": "l3",
"capabilities": {
"ips": true
}
},
"kubernetes": {
"kubeconfig": "/etc/cni/net.d/multus.d/multus.kubeconfig"
}
}'
EOF
kubectl apply -f nad-192-168-0-0.yaml
```


### Test storage network
**Given** Longhorn is deployed.

**When** the storage network setting is updated to `kube-system/demo-192-168-0-0`.

**Then** Instance manager pods should restart.

*And* The instance manager pods' `k8s.v1.cni.cncf.io/network-status` annotation should include the storage network.
- There should be 2 networks in the `k8s.v1.cni.cncf.io/network-status` annotation.
- `kube-system/demo-192-168-0-0` should exist in the `k8s.v1.cni.cncf.io/network-status` annotation.
- `kube-system/demo-192-168-0-0` should use the `lhnet1` interface.
- `kube-system/demo-192-168-0-0` should be in the `192.168.0.0/16` subnet.
*And* It should be possible to create/attach/detach/delete volumes successfully.
- Example:
```
Annotations: k8s.v1.cni.cncf.io/network-status:
[{
"name": "cbr0",
"interface": "eth0",
"ips": [
"10.42.2.35"
],
"mac": "26:a7:d3:0d:af:68",
"default": true,
"dns": {}
},{
"name": "kube-system/demo-192-168-0-0",
"interface": "lhnet1",
"ips": [
"192.168.2.230"
],
"mac": "02:d3:d9:0b:2e:50",
"dns": {}
}]
k8s.v1.cni.cncf.io/networks: [{"namespace": "kube-system", "name": "demo-192-168-0-0", "interface": "lhnet1"}]
```
- The engine/replica `storageIP` should be in the `192.168.0.0/16` subnet.
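A hedged way to spot-check the annotation and storage IPs (pod names are placeholders, and the `storageIP` field location in the engine/replica custom resources is an assumption; adjust to your deployment):
```
# Inspect the network-status annotation on an instance manager pod.
kubectl -n longhorn-system describe pod <instance-manager-pod> | grep -A 20 'network-status'

# Check that engine/replica storage IPs fall in the storage network subnet.
kubectl -n longhorn-system get engines.longhorn.io -o yaml | grep storageIP
kubectl -n longhorn-system get replicas.longhorn.io -o yaml | grep storageIP
```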
5 changes: 5 additions & 0 deletions e2e/keywords/common.resource
@@ -2,9 +2,11 @@
Documentation Common keywords
Library ../libs/keywords/common_keywords.py
Library ../libs/keywords/node_keywords.py
Library ../libs/keywords/volume_keywords.py
Library ../libs/keywords/recurring_job_keywords.py
Library ../libs/keywords/workload_keywords.py
Library ../libs/keywords/network_keywords.py


*** Variables ***
@@ -21,9 +23,12 @@ Set test environment
Set Test Variable ${deployment_list}
@{statefulset_list} = Create List
Set Test Variable ${statefulset_list}
setup_control_plane_network_latency

Cleanup test resources
cleanup_control_plane_network_latency
cleanup_node_exec
cleanup_stress_helper
cleanup_recurring_jobs ${volume_list}
cleanup_volumes ${volume_list}
cleanup_deployments ${deployment_list}