
[BUG] DR volume fails to reattach and faulted after node stop and start during incremental restore #9752

Open
roger-ryao opened this issue Nov 4, 2024 · 17 comments
Assignees: roger-ryao
Labels: area/cli, area/negative-testing, area/resilience (System or volume resilience), area/volume-disaster-recovery (Volume DR), backport/1.6.4, backport/1.7.3, kind/bug, reproduce/rare (< 50% reproducible), require/backport (Require backport. Only used when the specific versions to backport have not been defined.), require/doc (Require updating the longhorn.io documentation), require/qa-review-coverage (Require QA to review coverage)
Milestone: v1.9.0

@roger-ryao

Describe the bug

While working on the [TEST] [ROBOT] implementation of the manual test case 'The node the DR volume is attached to is rebooted' (#8425), I discovered this issue. When the node the DR volume is attached to is stopped during an incremental restore, we expect the DR volume to reattach to another available node. Instead, the volume goes into the Faulted state, and even after the previously stopped node is restarted, it remains Faulted and does not recover as expected.

To Reproduce

https://longhorn.github.io/longhorn-tests/manual/pre-release/node-not-ready/node-restart/dr-volume-node-rebooted/

  • Scenario 2
  1. Create a pod with a Longhorn volume.
  2. Write data to the volume and get its MD5 checksum.
  3. Create the first backup for the volume.
  4. Create a DR (Disaster Recovery) volume from this backup.
  5. Wait for the DR volume to complete the initial restore.
  6. Write more data to the original volume and get the new MD5 checksum.
  7. Create a second backup for the volume.
  8. During the incremental restore of the DR volume, immediately stop the node to which the DR volume is attached.
  9. Wait for the instance status to change to stopped, and then start the node (see the sketch of steps 8-10 after this list).
  10. Wait for the DR volume to detach and then reattach itself to the node.
  11. Wait for the DR volume to complete the restore after reattachment.
  12. Activate the DR volume and check the MD5 checksum of the data.
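
For context, here is a minimal sketch of how the node stop/start in steps 8-10 can be driven against AWS EC2 (the Environment section below notes the cluster runs on AWS). This is not part of the Longhorn test code; it assumes aws-sdk-go v1, and the region and instance ID are hypothetical placeholders.

```go
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	// Hypothetical values for illustration only.
	const region = "us-east-1"
	const instanceID = "i-0123456789abcdef0"

	sess := session.Must(session.NewSession(&aws.Config{Region: aws.String(region)}))
	svc := ec2.New(sess)
	ids := aws.StringSlice([]string{instanceID})

	// Step 8: stop the node the DR volume is attached to.
	if _, err := svc.StopInstances(&ec2.StopInstancesInput{InstanceIds: ids}); err != nil {
		panic(err)
	}

	// Step 9: wait until the instance reports "stopped", then start it again.
	if err := svc.WaitUntilInstanceStopped(&ec2.DescribeInstancesInput{InstanceIds: ids}); err != nil {
		panic(err)
	}
	if _, err := svc.StartInstances(&ec2.StartInstancesInput{InstanceIds: ids}); err != nil {
		panic(err)
	}
	if err := svc.WaitUntilInstanceRunning(&ec2.DescribeInstancesInput{InstanceIds: ids}); err != nil {
		panic(err)
	}
	fmt.Println("node stopped and started; the DR volume is expected to reattach and resume restoring")
}
```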

Expected behavior

The DR volume should reattach to an available node and continue the restore process without entering a faulted state, even after a node reboot.

Support bundle for troubleshooting

supportbundle_295c69fd-6ae1-4b69-a7be-8b0b023fbf84_2024-11-04T04-08-10Z.zip

Environment

  • Longhorn version: master-head
  • Impacted volume (PV): vol-1
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Kubectl
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: v1.31.1+k3s1
    • Number of control plane nodes in the cluster: 1
    • Number of worker nodes in the cluster: 3
  • Node config
    • OS type and version: SLES 15-sp5
    • Kernel version:
    • CPU per node:
    • Memory per node:
    • Disk type (e.g. SSD/NVMe/HDD):
    • Network bandwidth between the nodes (Gbps):
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): AWS
  • Number of Longhorn volumes in the cluster: 2

Additional context

#8425

Workaround and Mitigation

N/A

@roger-ryao roger-ryao added kind/bug reproduce/often 80 - 50% reproducible area/volume-disaster-recovery Volume DR require/qa-review-coverage Require QA to review coverage require/backport Require backport. Only used when the specific versions to backport have not been definied. area/negative-testing labels Nov 4, 2024
@roger-ryao roger-ryao added this to the v1.9.0 milestone Nov 4, 2024
@github-project-automation github-project-automation bot moved this to New Issues in Longhorn Sprint Nov 4, 2024
@derekbit
Member

derekbit commented Nov 4, 2024

@roger-ryao Can you check v1.6.3 and 1.7.2 as well? Thanks.

@roger-ryao
Author

@roger-ryao Can you check v1.6.3 and 1.7.2 as well? Thanks.

Hi @derekbit @c3y1huang
I tried a few more times and found that the issue can be reproduced if the DR volume is attached to the same node that hosts the longhorn-test-minio pod. This seems to be an environment-related issue, and I still need to think about how to prevent this from happening.

cc @longhorn/qa

@roger-ryao
Author

After consulting with @yangchiu & @chriscchien , we should use a customized backup store YAML to deploy the backup store on the control node instead of worker nodes, so it won't be affected by node reboot operations.

@derekbit
Member

derekbit commented Nov 4, 2024

After consulting with @yangchiu & @chriscchien , we should use a customized backup store YAML to deploy the backup store on the control node instead of worker nodes, so it won't be affected by node reboot operations.

@roger-ryao Let's check whether the issue remains when re-testing with an external backup target.

@derekbit derekbit moved this from New Issues to Testing in Longhorn Sprint Nov 4, 2024
@derekbit derekbit assigned roger-ryao and unassigned c3y1huang Nov 4, 2024
@roger-ryao
Author

I was able to reproduce the issue using an external backup target and will test it on v1.7.x and v1.6.x next week.

DR Volume Node Reboot During Initial Restoration 1 supportbundle :
supportbundle_295c69fd-6ae1-4b69-a7be-8b0b023fbf84_2024-11-04T13-48-16Z.zip

DR Volume Node Reboot During Incremental Restoration 1 supportbundle :
121supportbundle_295c69fd-6ae1-4b69-a7be-8b0b023fbf84_2024-11-04T14-21-10Z.zip

robotlog :
log.tar.gz

Screenshot_20241104_223304
Screenshot_20241104_223247

@derekbit derekbit moved this from Testing to New Issues in Longhorn Sprint Nov 4, 2024
@c3y1huang
Contributor

c3y1huang commented Nov 10, 2024

@roger-ryao could you share the robot case? I cannot find it in #8425.

@roger-ryao
Author

@roger-ryao could you share the robot case? I cannot find it in #8425.

Hi @c3y1huang
I submitted the case at https://github.com/roger-ryao/longhorn-tests/tree/issue9752.
I haven’t submitted a PR yet since I’m still verifying the stability of the case, but it should help you clarify the issue.

@c3y1huang c3y1huang moved this from New Issues to Analysis and Design in Longhorn Sprint Nov 11, 2024
@c3y1huang c3y1huang added the area/resilience System or volume resilience label Nov 11, 2024
@roger-ryao
Author

roger-ryao commented Nov 11, 2024

I was able to reproduce the issue on longhorn-v1.7.3-dev-20241103 using the test case (reproduced once out of four attempts), but I haven’t observed any failures in 12 attempts on longhorn-v1.6.4-dev-20241103. I’ll continue testing on v1.6.4 and v1.7.2.

For reference, the issue reproduction rate is around 25-50% on longhorn-v1.8.0-dev-20241103 & longhorn-v1.7.3-dev-20241103.

@roger-ryao
Author

I haven’t observed any failures in 15 attempts on v1.6.3, while on v1.7.2, I observed only one failure in 15 attempts.

@c3y1huang c3y1huang added the investigation-needed Need to identify the case before estimating and starting the development label Nov 12, 2024
@roger-ryao roger-ryao added reproduce/rare < 50% reproducible and removed reproduce/often 80 - 50% reproducible labels Nov 12, 2024
@roger-ryao
Author

Discussed with @c3y1huang and assisted in replicating the issue.
Here is a brief summary of the current test results:

We observed the issue on v1.7.0, v1.7.2, v1.7.3-dev-20241103, and v1.8.0-dev-20241103 (we did not test v1.7.1, but it likely has the same issue). However, when running the same case on v1.6.3, we did not observe any failures.

The support bundle for v1.7.0 (volume e2e-test-volume-1):
170-supportbundle_c022ef54-5c0f-47df-acd3-2dffce86cadb_2024-11-12T05-28-01Z.zip

Screenshot_20241112_133142

@c3y1huang
Contributor

c3y1huang commented Nov 13, 2024

Possible cause:

The cluster has a single CoreDNS pod, and the test case rebooted the node hosting it. The resulting DNS outage caused the other two restoring replicas to fail:

2024-11-13T01:22:25.009821689Z time="2024-11-13T01:22:25Z" level=info msg="Event(v1.ObjectReference{Kind:\"Volume\", Namespace:\"longhorn-system\", Name:\"e2e-test-volume-12\", UID:\"f6a2e7a3-90e2-4735-b0a8-1715126c2810\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"435147\", FieldPath:\"\"}): type: 'Warning' reason: 'FailedRestore' replica e2e-test-volume-12-r-0e46e87b failed the restore: tcp://10.42.3.57:10043: failed to get the current restoring backup info: failed to list objects with param: {\n  Bucket: \"c3y1-s3\",\n  Delimiter: \"/\",\n  Prefix: \"/\"\n} error: AWS Error:  RequestError send request failed Get \"https://c3y1-s3.s3.ap-southeast-1.amazonaws.com/?delimiter=%2F&prefix=%2F\": dial tcp: lookup c3y1-s3.s3.ap-southeast-1.amazonaws.com on 10.43.0.10:53: read udp 10.42.3.57:39714->10.43.0.10:53: read: connection refused\n" func="record.(*eventBroadcasterImpl).StartLogging.func1" file="event.go:377"
2024-11-13T01:22:25.009833704Z time="2024-11-13T01:22:25Z" level=info msg="Event(v1.ObjectReference{Kind:\"Volume\", Namespace:\"longhorn-system\", Name:\"e2e-test-volume-12\", UID:\"f6a2e7a3-90e2-4735-b0a8-1715126c2810\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"435147\", FieldPath:\"\"}): type: 'Warning' reason: 'FailedRestore' replica e2e-test-volume-12-r-e463c397 failed the restore: tcp://10.42.2.28:10222: failed to get the current restoring backup info: failed to list objects with param: {\n  Bucket: \"c3y1-s3\",\n  Delimiter: \"/\",\n  Prefix: \"/\"\n} error: AWS Error:  RequestError send request failed Get \"https://c3y1-s3.s3.ap-southeast-1.amazonaws.com/?delimiter=%2F&prefix=%2F\": dial tcp: lookup c3y1-s3.s3.ap-southeast-1.amazonaws.com on 10.43.0.10:53: read udp 10.42.3.57:39714->10.43.0.10:53: read: connection refused\n" func="record.(*eventBroadcasterImpl).StartLogging.func1" file="event.go:377"
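
To illustrate the failure mode in the log, here is a small Go sketch that forces a lookup of the S3 endpoint through the cluster DNS service. This is only a diagnostic illustration, not Longhorn code; the DNS service address and the S3 hostname are the ones that appear in the log above.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

func main() {
	// 10.43.0.10:53 is the k3s cluster DNS service address seen in the log above;
	// the hostname is the S3 endpoint from the same log.
	const clusterDNS = "10.43.0.10:53"
	const s3Endpoint = "c3y1-s3.s3.ap-southeast-1.amazonaws.com"

	r := &net.Resolver{
		PreferGo: true,
		Dial: func(ctx context.Context, network, address string) (net.Conn, error) {
			d := net.Dialer{Timeout: 2 * time.Second}
			// Force the lookup through the cluster DNS service instead of the host resolver.
			return d.DialContext(ctx, network, clusterDNS)
		},
	}

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	addrs, err := r.LookupHost(ctx, s3Endpoint)
	if err != nil {
		// With the only CoreDNS pod gone, this fails the same way the restore did
		// ("read udp ...->10.43.0.10:53: read: connection refused").
		fmt.Println("lookup failed:", err)
		return
	}
	fmt.Println("resolved:", addrs)
}
```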

Check with conditions (WIP):

  • Using NFS as backup storage to check if this issue is specific to S3, cc @roger-ryao
  • Scale-up CoreDNS to check if this is a DNS resolution issue, @c3y1huang

@roger-ryao
Author

Check with conditions (WIP):

  • Using NFS as backup storage to check if this issue is specific to S3, cc @roger-ryao

So far, after a total of 9 attempts, no failures have been observed.
I was not able to deploy the NFS backup store on the control plane node using https://raw.githubusercontent.com/longhorn/longhorn/master/deploy/backupstores/nfs-backupstore.yaml, so I used my external NFS server as the backup target instead. However, that server is addressed by IP (nfs://<nfs_server_ip>:/opt/nfs) rather than by a domain name, which might be why I couldn't reproduce the issue with NFS.

Let me see whether I can configure the NFS server to be reached through a domain name. That way, I might be able to reproduce the issue when the backup target is NFS.

@c3y1huang
Contributor

c3y1huang commented Nov 14, 2024

After checking with @roger-ryao, we found that this issue is not reproducible when using NFS as the backup storage.

The NFS list operation doesn't depend on CoreDNS, because it reads the backup info from a locally mounted path:
https://github.com/longhorn/backupstore/blob/b405e8f77dc300b23307275817390b0139dd9c15/fsops/fsops.go#L114

The S3 list operation lists objects in the backup store bucket, which relies on DNS resolution of the S3 endpoint:
https://github.com/longhorn/backupstore/blob/b405e8f77dc300b23307275817390b0139dd9c15/s3/s3.go#L114

This means the issue is specific to backup targets that require DNS resolution, such as cloud provider object storage (S3).
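
A minimal Go sketch of the contrast described above: listing a locally mounted NFS path needs no name resolution, while an S3-style list must first resolve the bucket endpoint through CoreDNS. The mount path is a hypothetical example; the endpoint is the one from the earlier log.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"os"
)

func main() {
	// NFS-style list (fsops): the backup store is a locally mounted path, so listing
	// backups is a plain filesystem call and never touches the cluster DNS.
	// "/mnt/nfs-backupstore" is a hypothetical mount point used only for illustration.
	if entries, err := os.ReadDir("/mnt/nfs-backupstore"); err == nil {
		fmt.Printf("NFS-style list succeeded with %d entries, no DNS involved\n", len(entries))
	}

	// S3-style list: before the list-objects HTTP request can even be sent, the bucket
	// endpoint hostname has to be resolved through the cluster DNS (CoreDNS).
	const s3Endpoint = "c3y1-s3.s3.ap-southeast-1.amazonaws.com" // endpoint from the log above
	if _, err := net.DefaultResolver.LookupHost(context.Background(), s3Endpoint); err != nil {
		fmt.Println("S3-style list would fail before reaching the bucket:", err)
	}
}
```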

@roger-ryao
Author

Check with conditions (WIP):

  • Using NFS as backup storage to check if this issue is specific to S3, cc @roger-ryao
  • Scale-up CoreDNS to check if this is a DNS resolution issue, @c3y1huang

After running it 20 times, I did not observe the volume becoming faulted when the backup target was an NFS server.

@c3y1huang
Contributor

I am able to reproduce this issue with v1.6.2 by scaling down CoreDNS during the node reboot, while the other 2 replicas are rebuilding.
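
For reference, a rough sketch of the CoreDNS scale-down used to force the DNS outage. It assumes client-go, a coredns Deployment in kube-system, and a hypothetical kubeconfig path; `kubectl -n kube-system scale deployment coredns --replicas=0` achieves the same thing.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Hypothetical kubeconfig path; "coredns" is the deployment name on k3s/kubeadm clusters.
	cfg, err := clientcmd.BuildConfigFromFlags("", "/root/.kube/config")
	if err != nil {
		panic(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	ctx := context.Background()

	// Scale CoreDNS to zero to simulate a total DNS outage while the node is rebooting.
	scale, err := cs.AppsV1().Deployments("kube-system").GetScale(ctx, "coredns", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	scale.Spec.Replicas = 0
	if _, err := cs.AppsV1().Deployments("kube-system").UpdateScale(ctx, "coredns", scale, metav1.UpdateOptions{}); err != nil {
		panic(err)
	}
	fmt.Println("coredns scaled to 0; remember to scale it back up after the reboot step")
}
```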

@c3y1huang
Contributor

Proposed change:

  1. Document the CoreDNS setup in the best practices guide.
  2. Add a CoreDNS check to the Longhorn CLI (a rough sketch of such a check is at the end of this comment).

cc @derekbit
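
The following is only a sketch of what such a preflight check could look like, not the actual longhornctl implementation. It assumes client-go and that the CoreDNS Deployment is named coredns in kube-system; checkCoreDNS and the kubeconfig path are hypothetical.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// checkCoreDNS is a hypothetical helper: it warns when the cluster runs a single
// CoreDNS replica, since losing that one pod during a node reboot breaks
// DNS-dependent backup targets such as S3.
func checkCoreDNS(ctx context.Context, cs kubernetes.Interface) error {
	dep, err := cs.AppsV1().Deployments("kube-system").Get(ctx, "coredns", metav1.GetOptions{})
	if err != nil {
		return fmt.Errorf("failed to get coredns deployment: %w", err)
	}
	if dep.Status.ReadyReplicas < 2 {
		fmt.Printf("WARN: only %d ready CoreDNS replica(s); consider scaling up for resilience\n",
			dep.Status.ReadyReplicas)
	}
	return nil
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", "/root/.kube/config") // hypothetical kubeconfig path
	if err != nil {
		panic(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	if err := checkCoreDNS(context.Background(), cs); err != nil {
		panic(err)
	}
}
```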

@longhorn-io-github-bot

longhorn-io-github-bot commented Nov 14, 2024

Pre Ready-For-Testing Checklist

  • Where is the reproduce steps/test steps documented?
    The reproduce steps/test steps are at:

  • Is there a workaround for the issue? If so, where is it documented?
    The workaround is:

    • Scale up the CoreDNS replica count.
  • Does the PR include the explanation for the fix or the feature?

  • Does the PR include deployment change (YAML/Chart)? If so, where are the PRs for both YAML file and Chart?
    The PR for the YAML change is at:
    The PR for the chart change is at:

  • Has the backend code been merged (Manager, Engine, Instance Manager, BackupStore, etc.) (including backport-needed/*)?
    The PR is at:

  • Which areas/issues this PR might have potential impacts on?
    Area DR volumes, CLI
    Issues

  • If labeled: require/LEP Has the Longhorn Enhancement Proposal PR submitted?
    The LEP PR is at

  • If labeled: area/ui Has the UI issue filed or ready to be merged (including backport-needed/*)?
    The UI issue/PR is at

  • If labeled: require/doc Has the necessary document PR submitted or merged (including backport-needed/*)?
    The documentation issue/PR is at: doc(1.8.0, 1.7.4, 1.6.3): add CoreDNS setup to best practices (website#1011)

  • If labeled: require/automation-e2e Has the end-to-end test plan been merged? Have QAs agreed on the automation test case? If only test case skeleton w/o implementation, have you created an implementation issue (including backport-needed/*)
    The automation skeleton PR is at
    The automation test case PR is at
    The issue of automation test case implementation is at (please create by the template)

  • If labeled: require/automation-engine Has the engine integration test been merged (including backport-needed/*)?
    The engine automation PR is at

  • If labeled: require/manual-test-plan Has the manual test plan been documented?
    The updated manual test plan is at

  • If the fix introduces the code for backward compatibility Has a separate issue been filed with the label release/obsolete-compatibility?
    The compatibility issue is filed at
