Merge branch 'master' into khushboo-rancher-patch-1
khushboo-rancher authored Nov 1, 2023
2 parents 4062e0d + 7ea8d89 commit 535e115
Showing 50 changed files with 1,146 additions and 270 deletions.
@@ -0,0 +1,3 @@
---
title: Test cases to reproduce issues related to attach detach
---
@@ -0,0 +1,79 @@
---
title: Test cases to reproduce attachment-detachment issues
---
**Prerequisite:** Have an environment with only 2 worker nodes, or taint 1 out of 3 worker nodes with `NoExecute` & `NoSchedule`.
This constrains the cluster so that it has only a limited fallback and little room for recovery in the event of failure.
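A hedged example of applying the taint (the node name and taint key are placeholders; adjust to your cluster):
```
# Taint one worker node so workloads are neither scheduled on it nor kept running on it.
kubectl taint node <worker-node-3> longhorn-test=exclude:NoSchedule
kubectl taint node <worker-node-3> longhorn-test=exclude:NoExecute
```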


#### 1. Kill the engines and instance manager repeatedly
**Given** 1 RWO and 1 RWX volume are each attached to a pod.
And both volumes have 2 replicas.
And random data is continuously written to the volumes using the command `dd if=/dev/urandom of=file1 count=100 bs=1M conv=fsync status=progress oflag=direct,sync`

**When** replica rebuilding is triggered by crashing the instance manager (IM) of one replica
And the IM associated with another replica is crashed immediately afterwards
And after crashing the IMs, detaching the volume is attempted either by deleting the pod or via the Longhorn UI

**Then** the volume should not get stuck in an attaching-detaching loop.

**When** the volume is detached and manually attached again.
And the engine running on the node where the volume is attached is killed

**Then** the volume should recover once the engine is back online.
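For the IM-crash and detach steps above, a hedged sketch (pod and workload names are placeholders; Longhorn is assumed to be installed in the `longhorn-system` namespace):
```
# Find the instance-manager pod hosting a replica of the volume, then delete it to simulate a crash.
kubectl -n longhorn-system get pods -o wide | grep instance-manager
kubectl -n longhorn-system delete pod <instance-manager-pod>

# Attempt to detach by deleting the workload pod that uses the volume.
kubectl delete pod <workload-pod>
```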

#### 2. Illegal values in Volume/Snap.meta
**Given** 1 RWO and 1 RWX volume are each attached to a pod.
And both volumes have 2 replicas.

**When** some random values are set in the volume/snapshot meta files
And replica rebuilding is triggered and the IM associated with another replica is also crashed

**Then** the volume should not get stuck in an attaching-detaching loop.
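A hedged way to inject an illegal value (the path assumes the default Longhorn data directory and the `Size` field layout of `volume.meta`; the replica directory name is a placeholder):
```
# Run on the node hosting the replica.
REPLICA_DIR=/var/lib/longhorn/replicas/<volume-name>-<id>
# Replace the numeric Size with a bogus string as one example of an illegal value.
sed -i 's/"Size":[0-9]*/"Size":"bogus"/' ${REPLICA_DIR}/volume.meta
cat ${REPLICA_DIR}/volume.meta
```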


#### 3. Deletion of Volume/Snap.meta
**Given** 1 RWO and 1 RWX volume are each attached to a pod.
And both volumes have 2 replicas.

**When** the volume and snapshot meta files are deleted one by one.
And replica rebuilding is triggered and the IM associated with another replica is also crashed

**Then** the volume should not get stuck in an attaching-detaching loop.
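A hedged sketch of deleting the meta files (the data path and the snapshot meta file naming are assumptions based on a default Longhorn installation; names are placeholders):
```
REPLICA_DIR=/var/lib/longhorn/replicas/<volume-name>-<id>
rm ${REPLICA_DIR}/volume.meta
rm ${REPLICA_DIR}/volume-snap-<snapshot-name>.img.meta
```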

#### 4. Failed replica tries to rebuild from another just-crashed replica - https://github.com/longhorn/longhorn/issues/4212
**Given** 1 RWO and 1 RWX volume are each attached to a pod.
And both volumes have 2 replicas.
And random data is continuously written to the volumes using the command `dd if=/dev/urandom of=file1 count=100 bs=1M conv=fsync status=progress oflag=direct,sync`

**When** replica rebuilding is triggered by crashing the instance manager (IM) of one replica
And the IM associated with another replica is crashed immediately afterwards

**Then** the volume should not get stuck in an attaching-detaching loop.

#### 5. Volume attachment Modification/deletion

**Given** a Deployment and a StatefulSet are created with the same name and attached to Longhorn volumes.
And some data is written and its md5sum is computed

**When** the StatefulSet and Deployment are deleted without deleting the volumes
And a new StatefulSet and Deployment with the same names are created with new PVCs.
And the attached node is rebooted before the newly deployed workloads can attach to the volumes

**Then** after the node reboot completes, the volumes should reflect the right status.
And the newly created Deployment and StatefulSet should get attached to the volumes.

**When** the volume attachments of the above workloads are deleted.
And the above workloads are deleted and recreated immediately.

**Then** no multi-attach or other errors should be observed.
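A hedged way to list and delete the Longhorn volume attachments for this step (assumes the `volumeattachments.longhorn.io` CRD introduced with the attach/detach refactor in Longhorn v1.5+; names are placeholders):
```
kubectl -n longhorn-system get volumeattachments.longhorn.io
kubectl -n longhorn-system delete volumeattachments.longhorn.io <volume-name>
```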

#### 6. Use monitoring/WordPress/DB workloads
**Given** monitoring, WordPress, and other database-related workloads are deployed in the system
And all the volumes have 2 replicas.
And random data is continuously written to the volumes using the command `dd if=/dev/urandom of=file1 count=100 bs=1M conv=fsync status=progress oflag=direct,sync`

**When** replica rebuilding is triggered by crashing the instance manager (IM) of one replica
And the IM associated with another replica is crashed immediately afterwards

**Then** the volume should not get stuck in an attaching-detaching loop.

@@ -4,9 +4,19 @@ title: Storage Network Test
## Related issue:
https://github.com/longhorn/longhorn/issues/2285

## Test Steps
## Test Multus version below v4.0.0
**Given** Set up the Longhorn environment as mentioned [here](https://longhorn.github.io/longhorn-tests/manual/release-specific/v1.3.0/test-storage-network/)

**When** Run Longhorn core tests on the environment.

**Then** All the tests should pass.

## Related issue:
https://github.com/longhorn/longhorn/issues/6953

## Test Multus version above v4.0.0
**Given** Set up the Longhorn environment as mentioned [here](https://longhorn.github.io/longhorn-tests/manual/release-specific/v1.6.0/test-storage-network/)

**When** Run Longhorn core tests on the environment.

**Then** All the tests should pass.
@@ -0,0 +1,47 @@
---
title: Test `Rebuild` in volume.meta blocks engine start
---

## Related issue
https://github.com/longhorn/longhorn/issues/6626

## Test with patched image

**Given** a patched longhorn-engine image with the following code change.
```diff
diff --git a/pkg/sync/sync.go b/pkg/sync/sync.go
index b48ddd46..c4523f11 100644
--- a/pkg/sync/sync.go
+++ b/pkg/sync/sync.go
@@ -534,9 +534,9 @@ func (t *Task) reloadAndVerify(address, instanceName string, repClient *replicaC
return err
}

- if err := repClient.SetRebuilding(false); err != nil {
- return err
- }
+ // if err := repClient.SetRebuilding(false); err != nil {
+ // return err
+ // }
return nil
}
```
**And** a patched longhorn-instance-manager image with the longhorn-engine vendor updated.
**And** Longhorn is installed with the patched images.
**And** the `data-locality` setting is set to `disabled`.
**And** the `auto-salvage` setting is set to `true`.
**And** a new StorageClass is created with `numberOfReplicas` set to `1`.
**And** a StatefulSet is created with `replicas` set to `1`.
**And** the node of the StatefulSet Pod and the node of its volume Replica are different. This is necessary to trigger rebuilding in response to the data-locality setting update later.
**And** the volume has 1 running Replica.
**And** data exists in the volume.

**When** the `data-locality` setting is set to `best-effort`.
**And** the replica rebuilding is completed.
**And** the `Rebuilding` field in the replica's `volume.meta` file is `true`.
**And** the instance manager Pod of the Replica is deleted.

**Then** the Replica should be running.
**And** the StatefulSet Pod should restart.
**And** the `Rebuilding` field in the replica's `volume.meta` file should be `false`.
**And** the data should remain intact.
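A hedged way to check the `Rebuilding` flag on the replica node (the path assumes the default Longhorn data directory; the replica directory name is a placeholder):
```
# Print only the Rebuilding field from the replica's volume.meta.
grep -o '"Rebuilding":[a-z]*' /var/lib/longhorn/replicas/<volume-name>-<id>/volume.meta
```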
216 changes: 216 additions & 0 deletions docs/content/manual/release-specific/v1.6.0/test-storage-network.md
@@ -0,0 +1,216 @@
---
title: Setup and test storage network when Multus version is above v4.0.0
---

## Related issue
https://github.com/longhorn/longhorn/issues/6953

## Test storage network

### Create AWS instances
**Given** Create a VPC (an optional AWS CLI sketch of these setup steps follows this list).
- VPC only
- IPv4 CIDR 10.0.0.0/16

*And* Create an internet gateway.
- Attach to VPC

*And* Add a route to the internet gateway in the VPC `Main route table`, under `Routes`.
- Destination 0.0.0.0/0

*And* Create 2 subnets in the VPC.
- Subnet-1: 10.0.1.0/24
- Subnet-2: 10.0.2.0/24

*And* Launch 3 EC2 instances.
- Use the created VPC
- Use subnet-1 for network interface 1
- Use subnet-2 for network interface 2
- Disable `Auto-assign public IP`
- Add security group inbound rule to allow `All traffic` from `Anywhere-IPv4`
- Stop `Source/destination check`

*And* Create 3 elastic IPs.

*And* Associate one of the Elastic IPs with network interface 1 of one of the EC2 instances.
- Repeat for the other 2 EC2 instances with the remaining Elastic IPs.
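The following is a partial, hedged AWS CLI sketch of the VPC, gateway, and subnet steps above (IDs are captured into placeholder variables; the console flow above remains the reference):
```
# Create the VPC and capture its ID.
VPC_ID=$(aws ec2 create-vpc --cidr-block 10.0.0.0/16 --query 'Vpc.VpcId' --output text)

# Create and attach an internet gateway, then route 0.0.0.0/0 through it in the main route table.
IGW_ID=$(aws ec2 create-internet-gateway --query 'InternetGateway.InternetGatewayId' --output text)
aws ec2 attach-internet-gateway --internet-gateway-id ${IGW_ID} --vpc-id ${VPC_ID}
RTB_ID=$(aws ec2 describe-route-tables --filters Name=vpc-id,Values=${VPC_ID} \
  --query 'RouteTables[0].RouteTableId' --output text)
aws ec2 create-route --route-table-id ${RTB_ID} --destination-cidr-block 0.0.0.0/0 --gateway-id ${IGW_ID}

# Create the two subnets used for the two network interfaces.
aws ec2 create-subnet --vpc-id ${VPC_ID} --cidr-block 10.0.1.0/24
aws ec2 create-subnet --vpc-id ${VPC_ID} --cidr-block 10.0.2.0/24
```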


### Setup instances

**Given** a K3s Kubernetes cluster installed on the EC2 instances.

*And* Deploy Multus DaemonSet on the control-plane node.
- Download YAML.
```
curl -O https://raw.githubusercontent.com/k8snetworkplumbingwg/multus-cni/v4.0.2/deployments/multus-daemonset.yml
```
- Edit YAML.
```
diff --git a/deployments/multus-daemonset.yml b/deployments/multus-daemonset.yml
index ab626a66..a7228942 100644
--- a/deployments/multus-daemonset.yml
+++ b/deployments/multus-daemonset.yml
@@ -145,7 +145,7 @@ data:
]
}
],
- "kubeconfig": "/etc/cni/net.d/multus.d/multus.kubeconfig"
+ "kubeconfig": "/var/lib/rancher/k3s/agent/etc/cni/net.d/multus.d/multus.kubeconfig"
}
---
apiVersion: apps/v1
@@ -179,12 +179,13 @@ spec:
serviceAccountName: multus
containers:
- name: kube-multus
- image: ghcr.io/k8snetworkplumbingwg/multus-cni:snapshot
+ image: ghcr.io/k8snetworkplumbingwg/multus-cni:v4.0.2
command: ["/thin_entrypoint"]
args:
- "--multus-conf-file=auto"
- "--multus-autoconfig-dir=/host/etc/cni/net.d"
- "--cni-conf-dir=/host/etc/cni/net.d"
+ - "--multus-kubeconfig-file-host=/var/lib/rancher/k3s/agent/etc/cni/net.d/multus.d/multus.kubeconfig"
resources:
requests:
cpu: "100m"
@@ -222,10 +223,10 @@ spec:
volumes:
- name: cni
hostPath:
- path: /etc/cni/net.d
+ path: /var/lib/rancher/k3s/agent/etc/cni/net.d
- name: cnibin
hostPath:
- path: /opt/cni/bin
+ path: /var/lib/rancher/k3s/data/current/bin
- name: multus-cfg
configMap:
name: multus-cni-config
```
- Apply YAML to K8s cluster.
```
kubectl apply -f multus-daemonset.yml.new
```

*And* Download `ipvlan` and copy it to the K3s binaries path on all cluster nodes.
```
curl -OL https://github.com/containernetworking/plugins/releases/download/v1.3.0/cni-plugins-linux-amd64-v1.3.0.tgz
tar -zxvf cni-plugins-linux-amd64-v1.3.0.tgz
cp ipvlan /var/lib/rancher/k3s/data/current/bin/
```

*And* Set up flannel on all cluster nodes.
```
# Update nodes eth1 IP to N1, N2, N3
N1="10.0.2.95"
N2="10.0.2.139"
N3="10.0.2.158"
NODES=(${N1} ${N2} ${N3})
STORAGE_NETWORK_PREFIX="192.168"
ETH1_IP=`ip a | grep eth1 | grep -Eo 'inet (addr:)?([0-9]*\.){3}[0-9]*' | awk '{print $2}'`
count=1
for n in "${NODES[@]}"; do
[[ ${ETH1_IP} != $n ]] && ((count=count+1)) && continue
NET=$count
break
done
cat << EOF > /run/flannel/multus-subnet-${STORAGE_NETWORK_PREFIX}.0.0.env
FLANNEL_NETWORK=${STORAGE_NETWORK_PREFIX}.0.0/16
FLANNEL_SUBNET=${STORAGE_NETWORK_PREFIX}.${NET}.0/24
FLANNEL_MTU=1472
FLANNEL_IPMASQ=true
EOF
```
*And* Set up routes on all cluster nodes.
```
# Update nodes eth1 IP to N1, N2, N3
N1="10.0.2.95"
N2="10.0.2.139"
N3="10.0.2.158"
STORAGE_NETWORK_PREFIX="192.168"
ACTION="add"
ETH1_IP=`ip a | grep eth1 | grep -Eo 'inet (addr:)?([0-9]*\.){3}[0-9]*' | awk '{print $2}'`
[[ ${ETH1_IP} != ${N1} ]] && ip r ${ACTION} ${STORAGE_NETWORK_PREFIX}.1.0/24 via ${N1} dev eth1
[[ ${ETH1_IP} != ${N2} ]] && ip r ${ACTION} ${STORAGE_NETWORK_PREFIX}.2.0/24 via ${N2} dev eth1
[[ ${ETH1_IP} != ${N3} ]] && ip r ${ACTION} ${STORAGE_NETWORK_PREFIX}.3.0/24 via ${N3} dev eth1
```

*And* Deploy `NetworkAttachmentDefinition`.
```
cat << EOF > nad-192-168-0-0.yaml
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
name: demo-192-168-0-0
namespace: kube-system
#namespace: longhorn-system
spec:
config: '{
"cniVersion": "0.3.1",
"type": "flannel",
"subnetFile": "/run/flannel/multus-subnet-192.168.0.0.env",
"dataDir": "/var/lib/cni/multus-subnet-192.168.0.0",
"delegate": {
"type": "ipvlan",
"master": "eth1",
"mode": "l3",
"capabilities": {
"ips": true
}
},
"kubernetes": {
"kubeconfig": "/etc/cni/net.d/multus.d/multus.kubeconfig"
}
}'
EOF
kubectl apply -f nad-192-168-0-0.yaml
```


### Test storage network
**Given** Longhorn is deployed.

**When** the storage network setting is updated to `kube-system/demo-192-168-0-0`.

**Then** Instance manager pods should restart.

*And* The instance manager pods' `k8s.v1.cni.cncf.io/network-status` annotation should include the storage network.
- There should be 2 networks in the `k8s.v1.cni.cncf.io/network-status` annotation.
- `kube-system/demo-192-168-0-0` should exist in the `k8s.v1.cni.cncf.io/network-status` annotation.
- `kube-system/demo-192-168-0-0` should use the `lhnet1` interface.
- `kube-system/demo-192-168-0-0` should be in the `192.168.0.0/16` subnet.
*And* It should be possible to create/attach/detach/delete volumes successfully.
- Example:
```
Annotations: k8s.v1.cni.cncf.io/network-status:
[{
"name": "cbr0",
"interface": "eth0",
"ips": [
"10.42.2.35"
],
"mac": "26:a7:d3:0d:af:68",
"default": true,
"dns": {}
},{
"name": "kube-system/demo-192-168-0-0",
"interface": "lhnet1",
"ips": [
"192.168.2.230"
],
"mac": "02:d3:d9:0b:2e:50",
"dns": {}
}]
k8s.v1.cni.cncf.io/networks: [{"namespace": "kube-system", "name": "demo-192-168-0-0", "interface": "lhnet1"}]
```
- The engine/replica `storageIP` should be in the `192.168.0.0/16` subnet.
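A hedged way to spot-check the annotation and storage IPs (pod names are placeholders, and the `storageIP` field location in the engine/replica custom resources is an assumption; adjust to your deployment):
```
# Inspect the network-status annotation on an instance manager pod.
kubectl -n longhorn-system describe pod <instance-manager-pod> | grep -A 20 'network-status'

# Check that engine/replica storage IPs fall in the storage network subnet.
kubectl -n longhorn-system get engines.longhorn.io -o yaml | grep storageIP
kubectl -n longhorn-system get replicas.longhorn.io -o yaml | grep storageIP
```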
5 changes: 5 additions & 0 deletions e2e/keywords/common.resource
@@ -2,9 +2,11 @@
Documentation Common keywords
Library ../libs/keywords/common_keywords.py
Library ../libs/keywords/node_keywords.py
Library ../libs/keywords/volume_keywords.py
Library ../libs/keywords/recurring_job_keywords.py
Library ../libs/keywords/workload_keywords.py
Library ../libs/keywords/network_keywords.py


*** Variables ***
@@ -21,9 +23,12 @@ Set test environment
Set Test Variable ${deployment_list}
@{statefulset_list} = Create List
Set Test Variable ${statefulset_list}
setup_control_plane_network_latency

Cleanup test resources
cleanup_control_plane_network_latency
cleanup_node_exec
cleanup_stress_helper
cleanup_recurring_jobs ${volume_list}
cleanup_volumes ${volume_list}
cleanup_deployments ${deployment_list}