
[BUG] DR volume fails to reattach and faulted after node stop and start during incremental restore #9752

Open
roger-ryao opened this issue Nov 4, 2024 · 17 comments
Assignees: roger-ryao
Labels: area/cli, area/negative-testing, area/resilience (System or volume resilience), area/volume-disaster-recovery (Volume DR), backport/1.6.4, backport/1.7.3, kind/bug, reproduce/rare (< 50% reproducible), require/backport (Require backport. Only used when the specific versions to backport have not been defined.), require/doc (Require updating the longhorn.io documentation), require/qa-review-coverage (Require QA to review coverage)
Milestone: v1.9.0

@roger-ryao

Describe the bug

While working on the [TEST] [ROBOT] implementation of the manual test case 'The node the DR volume is attached to is rebooted' (#8425), I discovered this issue. When the node the DR volume is attached to is stopped during an incremental restore, we expect the DR volume to reattach to another available node. Instead, the volume goes into the Faulted state, and even after the previously stopped node is restarted, it remains Faulted and does not recover as expected.

To Reproduce

https://longhorn.github.io/longhorn-tests/manual/pre-release/node-not-ready/node-restart/dr-volume-node-rebooted/

  • Scenario 2
  1. Create a pod with a Longhorn volume.
  2. Write data to the volume and get its MD5 checksum.
  3. Create the first backup for the volume.
  4. Create a DR (Disaster Recovery) volume from this backup.
  5. Wait for the DR volume to complete the initial restore.
  6. Write more data to the original volume and get the new MD5 checksum.
  7. Create a second backup for the volume.
  8. During the incremental restore of the DR volume, immediately stop the node to which the DR volume is attached.
  9. Wait for the instance status to change to stopped, and then start the node (see the sketch of steps 8-10 after this list).
  10. Wait for the DR volume to detach and then reattach itself to the node.
  11. Wait for the DR volume to complete the restore after reattachment.
  12. Activate the DR volume and check the MD5 checksum of the data.
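
For context, here is a minimal sketch of how the node stop/start in steps 8-10 can be driven against AWS EC2 (the Environment section below notes the cluster runs on AWS). This is not part of the Longhorn test code; it assumes aws-sdk-go v1, and the region and instance ID are hypothetical placeholders.

```go
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	// Hypothetical values for illustration only.
	const region = "us-east-1"
	const instanceID = "i-0123456789abcdef0"

	sess := session.Must(session.NewSession(&aws.Config{Region: aws.String(region)}))
	svc := ec2.New(sess)
	ids := aws.StringSlice([]string{instanceID})

	// Step 8: stop the node the DR volume is attached to.
	if _, err := svc.StopInstances(&ec2.StopInstancesInput{InstanceIds: ids}); err != nil {
		panic(err)
	}

	// Step 9: wait until the instance reports "stopped", then start it again.
	if err := svc.WaitUntilInstanceStopped(&ec2.DescribeInstancesInput{InstanceIds: ids}); err != nil {
		panic(err)
	}
	if _, err := svc.StartInstances(&ec2.StartInstancesInput{InstanceIds: ids}); err != nil {
		panic(err)
	}
	if err := svc.WaitUntilInstanceRunning(&ec2.DescribeInstancesInput{InstanceIds: ids}); err != nil {
		panic(err)
	}
	fmt.Println("node stopped and started; the DR volume is expected to reattach and resume restoring")
}
```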

Expected behavior

The DR volume should reattach to an available node and continue the restore process without entering a faulted state, even after a node reboot.

Support bundle for troubleshooting

supportbundle_295c69fd-6ae1-4b69-a7be-8b0b023fbf84_2024-11-04T04-08-10Z.zip

Environment

  • Longhorn version: master-head
  • Impacted volume (PV): vol-1
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Kubectl
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: v1.31.1+k3s1
    • Number of control plane nodes in the cluster: 1
    • Number of worker nodes in the cluster: 3
  • Node config
    • OS type and version: SLES 15-sp5
    • Kernel version:
    • CPU per node:
    • Memory per node:
    • Disk type (e.g. SSD/NVMe/HDD):
    • Network bandwidth between the nodes (Gbps):
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): AWS
  • Number of Longhorn volumes in the cluster: 2

Additional context

#8425

Workaround and Mitigation

N/A

@roger-ryao roger-ryao added kind/bug reproduce/often 80 - 50% reproducible area/volume-disaster-recovery Volume DR require/qa-review-coverage Require QA to review coverage require/backport Require backport. Only used when the specific versions to backport have not been definied. area/negative-testing labels Nov 4, 2024
@roger-ryao roger-ryao added this to the v1.9.0 milestone Nov 4, 2024
@github-project-automation github-project-automation bot moved this to New Issues in Longhorn Sprint Nov 4, 2024
@derekbit
Member

derekbit commented Nov 4, 2024

@roger-ryao Can you check v1.6.3 and 1.7.2 as well? Thanks.

@roger-ryao
Author

@roger-ryao Can you check v1.6.3 and 1.7.2 as well? Thanks.

Hi @derekbit @c3y1huang
I tried a few more times and found that the issue can be reproduced if the DR volume is attached to the same node that hosts the longhorn-test-minio pod. This seems to be an environment-related issue, and I still need to think about how to prevent this from happening.

cc @longhorn/qa

@roger-ryao
Author

After consulting with @yangchiu & @chriscchien , we should use a customized backup store YAML to deploy the backup store on the control node instead of worker nodes, so it won't be affected by node reboot operations.

@derekbit
Member

derekbit commented Nov 4, 2024

After consulting with @yangchiu & @chriscchien , we should use a customized backup store YAML to deploy the backup store on the control node instead of worker nodes, so it won't be affected by node reboot operations.

@roger-ryao Let's check whether the issue remains when re-testing with an external backup target.

@derekbit derekbit moved this from New Issues to Testing in Longhorn Sprint Nov 4, 2024
@derekbit derekbit assigned roger-ryao and unassigned c3y1huang Nov 4, 2024
@roger-ryao
Author

I was able to reproduce the issue using an external backup target and will test it on v1.7.x and v1.6.x next week.

DR Volume Node Reboot During Initial Restoration 1 supportbundle :
supportbundle_295c69fd-6ae1-4b69-a7be-8b0b023fbf84_2024-11-04T13-48-16Z.zip

DR Volume Node Reboot During Incremental Restoration 1 supportbundle :
121supportbundle_295c69fd-6ae1-4b69-a7be-8b0b023fbf84_2024-11-04T14-21-10Z.zip

robotlog :
log.tar.gz

Screenshot_20241104_223304
Screenshot_20241104_223247

@derekbit derekbit moved this from Testing to New Issues in Longhorn Sprint Nov 4, 2024
@c3y1huang
Contributor

c3y1huang commented Nov 10, 2024

@roger-ryao could you share the robot case? I cannot find it in #8425.

@roger-ryao
Author

@roger-ryao could you share the robot case? I cannot find it in #8425.

Hi @c3y1huang
I submitted the case at https://github.com/roger-ryao/longhorn-tests/tree/issue9752.
I haven’t submitted a PR yet since I’m still verifying the stability of the case, but it should help you clarify the issue.

@c3y1huang c3y1huang moved this from New Issues to Analysis and Design in Longhorn Sprint Nov 11, 2024
@c3y1huang c3y1huang added the area/resilience System or volume resilience label Nov 11, 2024
@roger-ryao
Author

roger-ryao commented Nov 11, 2024

I was able to reproduce the issue on longhorn-v1.7.3-dev-20241103 using the test case (reproduced once out of four attempts), but I haven’t observed any failures in 12 attempts on longhorn-v1.6.4-dev-20241103. I’ll continue testing on v1.6.4 and v1.7.2.

For reference, the issue reproduction rate is around 25-50% on longhorn-v1.8.0-dev-20241103 & longhorn-v1.7.3-dev-20241103.

@roger-ryao
Author

I haven’t observed any failures in 15 attempts on v1.6.3, while on v1.7.2, I observed only one failure in 15 attempts.

@c3y1huang c3y1huang added the investigation-needed Need to identify the case before estimating and starting the development label Nov 12, 2024
@roger-ryao roger-ryao added reproduce/rare < 50% reproducible and removed reproduce/often 80 - 50% reproducible labels Nov 12, 2024
@roger-ryao
Author

Discussed with @c3y1huang and assisted in replicating the issue.
Here is a brief summary of the current test results:

We observed the issue on v1.7.0, v1.7.2, v1.7.3-dev-20241103, and v1.8.0-dev-20241103 (we did not test v1.7.1, but it likely has the same issue). However, when running the same case on v1.6.3, we did not observe any failures.

The support bundle for v1.7.0 (volume e2e-test-volume-1):
170-supportbundle_c022ef54-5c0f-47df-acd3-2dffce86cadb_2024-11-12T05-28-01Z.zip

Screenshot_20241112_133142

@c3y1huang
Contributor

c3y1huang commented Nov 13, 2024

Possible cause:

The cluster has a single CoreDNS pod, and the test case rebooted the node hosting it. The resulting DNS outage caused the other two restoring replicas to fail:

2024-11-13T01:22:25.009821689Z time="2024-11-13T01:22:25Z" level=info msg="Event(v1.ObjectReference{Kind:\"Volume\", Namespace:\"longhorn-system\", Name:\"e2e-test-volume-12\", UID:\"f6a2e7a3-90e2-4735-b0a8-1715126c2810\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"435147\", FieldPath:\"\"}): type: 'Warning' reason: 'FailedRestore' replica e2e-test-volume-12-r-0e46e87b failed the restore: tcp://10.42.3.57:10043: failed to get the current restoring backup info: failed to list objects with param: {\n  Bucket: \"c3y1-s3\",\n  Delimiter: \"/\",\n  Prefix: \"/\"\n} error: AWS Error:  RequestError send request failed Get \"https://c3y1-s3.s3.ap-southeast-1.amazonaws.com/?delimiter=%2F&prefix=%2F\": dial tcp: lookup c3y1-s3.s3.ap-southeast-1.amazonaws.com on 10.43.0.10:53: read udp 10.42.3.57:39714->10.43.0.10:53: read: connection refused\n" func="record.(*eventBroadcasterImpl).StartLogging.func1" file="event.go:377"
2024-11-13T01:22:25.009833704Z time="2024-11-13T01:22:25Z" level=info msg="Event(v1.ObjectReference{Kind:\"Volume\", Namespace:\"longhorn-system\", Name:\"e2e-test-volume-12\", UID:\"f6a2e7a3-90e2-4735-b0a8-1715126c2810\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"435147\", FieldPath:\"\"}): type: 'Warning' reason: 'FailedRestore' replica e2e-test-volume-12-r-e463c397 failed the restore: tcp://10.42.2.28:10222: failed to get the current restoring backup info: failed to list objects with param: {\n  Bucket: \"c3y1-s3\",\n  Delimiter: \"/\",\n  Prefix: \"/\"\n} error: AWS Error:  RequestError send request failed Get \"https://c3y1-s3.s3.ap-southeast-1.amazonaws.com/?delimiter=%2F&prefix=%2F\": dial tcp: lookup c3y1-s3.s3.ap-southeast-1.amazonaws.com on 10.43.0.10:53: read udp 10.42.3.57:39714->10.43.0.10:53: read: connection refused\n" func="record.(*eventBroadcasterImpl).StartLogging.func1" file="event.go:377"
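
To illustrate the failure mode in the log, here is a small Go sketch that forces a lookup of the S3 endpoint through the cluster DNS service. This is only a diagnostic illustration, not Longhorn code; the DNS service address and the S3 hostname are the ones that appear in the log above.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

func main() {
	// 10.43.0.10:53 is the k3s cluster DNS service address seen in the log above;
	// the hostname is the S3 endpoint from the same log.
	const clusterDNS = "10.43.0.10:53"
	const s3Endpoint = "c3y1-s3.s3.ap-southeast-1.amazonaws.com"

	r := &net.Resolver{
		PreferGo: true,
		Dial: func(ctx context.Context, network, address string) (net.Conn, error) {
			d := net.Dialer{Timeout: 2 * time.Second}
			// Force the lookup through the cluster DNS service instead of the host resolver.
			return d.DialContext(ctx, network, clusterDNS)
		},
	}

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	addrs, err := r.LookupHost(ctx, s3Endpoint)
	if err != nil {
		// With the only CoreDNS pod gone, this fails the same way the restore did
		// ("read udp ...->10.43.0.10:53: read: connection refused").
		fmt.Println("lookup failed:", err)
		return
	}
	fmt.Println("resolved:", addrs)
}
```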

Check with conditions (WIP):

  • Using NFS as backup storage to check if this issue is specific to S3, cc @roger-ryao
  • Scale-up CoreDNS to check if this is a DNS resolution issue, @c3y1huang

@roger-ryao
Author

Check with conditions (WIP):

  • Using NFS as backup storage to check if this issue is specific to S3, cc @roger-ryao

So far, after a total of 9 attempts, no failures have been observed.
I was not able to deploy the NFS backup store on the control plane node using https://raw.githubusercontent.com/longhorn/longhorn/master/deploy/backupstores/nfs-backupstore.yaml, so I used my external NFS server as the backup target instead. However, that server is addressed by IP (nfs://<nfs_server_ip>:/opt/nfs) rather than by a domain name, which might be why I couldn't reproduce the issue with NFS.

Let me see whether I can configure the NFS server to be reached through a domain name. That way, I might be able to reproduce the issue when the backup target is NFS.

@c3y1huang
Contributor

c3y1huang commented Nov 14, 2024

After checking with @roger-ryao, we found that this issue is not reproducible when using NFS as the backup storage.

The NFS list operation doesn't depend on CoreDNS, because it reads the backup info from a locally mounted path:
https://github.com/longhorn/backupstore/blob/b405e8f77dc300b23307275817390b0139dd9c15/fsops/fsops.go#L114

The S3 list operation lists objects in the backup store bucket, which relies on DNS resolution of the S3 endpoint:
https://github.com/longhorn/backupstore/blob/b405e8f77dc300b23307275817390b0139dd9c15/s3/s3.go#L114

This means the issue is specific to backup targets that require DNS resolution, such as cloud provider object storage (S3).
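
A minimal Go sketch of the contrast described above: listing a locally mounted NFS path needs no name resolution, while an S3-style list must first resolve the bucket endpoint through CoreDNS. The mount path is a hypothetical example; the endpoint is the one from the earlier log.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"os"
)

func main() {
	// NFS-style list (fsops): the backup store is a locally mounted path, so listing
	// backups is a plain filesystem call and never touches the cluster DNS.
	// "/mnt/nfs-backupstore" is a hypothetical mount point used only for illustration.
	if entries, err := os.ReadDir("/mnt/nfs-backupstore"); err == nil {
		fmt.Printf("NFS-style list succeeded with %d entries, no DNS involved\n", len(entries))
	}

	// S3-style list: before the list-objects HTTP request can even be sent, the bucket
	// endpoint hostname has to be resolved through the cluster DNS (CoreDNS).
	const s3Endpoint = "c3y1-s3.s3.ap-southeast-1.amazonaws.com" // endpoint from the log above
	if _, err := net.DefaultResolver.LookupHost(context.Background(), s3Endpoint); err != nil {
		fmt.Println("S3-style list would fail before reaching the bucket:", err)
	}
}
```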

@roger-ryao
Author

Check with conditions (WIP):

  • Using NFS as backup storage to check if this issue is specific to S3, cc @roger-ryao
  • Scale-up CoreDNS to check if this is a DNS resolution issue, @c3y1huang

After running it 20 times, I did not observe the volume becoming faulted when the backup target was an NFS server.

@c3y1huang
Contributor

I am able to reproduce this issue with v1.6.2 by scaling down CoreDNS during the node reboot, while the other 2 replicas are rebuilding.
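
For reference, a rough sketch of the CoreDNS scale-down used to force the DNS outage. It assumes client-go, a coredns Deployment in kube-system, and a hypothetical kubeconfig path; `kubectl -n kube-system scale deployment coredns --replicas=0` achieves the same thing.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Hypothetical kubeconfig path; "coredns" is the deployment name on k3s/kubeadm clusters.
	cfg, err := clientcmd.BuildConfigFromFlags("", "/root/.kube/config")
	if err != nil {
		panic(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	ctx := context.Background()

	// Scale CoreDNS to zero to simulate a total DNS outage while the node is rebooting.
	scale, err := cs.AppsV1().Deployments("kube-system").GetScale(ctx, "coredns", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	scale.Spec.Replicas = 0
	if _, err := cs.AppsV1().Deployments("kube-system").UpdateScale(ctx, "coredns", scale, metav1.UpdateOptions{}); err != nil {
		panic(err)
	}
	fmt.Println("coredns scaled to 0; remember to scale it back up after the reboot step")
}
```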

@c3y1huang
Contributor

Proposed change:

  1. Document the CoreDNS setup in the best practices guide.
  2. Add a CoreDNS check to the Longhorn CLI (a rough sketch of such a check is at the end of this comment).

cc @derekbit
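
The following is only a sketch of what such a preflight check could look like, not the actual longhornctl implementation. It assumes client-go and that the CoreDNS Deployment is named coredns in kube-system; checkCoreDNS and the kubeconfig path are hypothetical.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// checkCoreDNS is a hypothetical helper: it warns when the cluster runs a single
// CoreDNS replica, since losing that one pod during a node reboot breaks
// DNS-dependent backup targets such as S3.
func checkCoreDNS(ctx context.Context, cs kubernetes.Interface) error {
	dep, err := cs.AppsV1().Deployments("kube-system").Get(ctx, "coredns", metav1.GetOptions{})
	if err != nil {
		return fmt.Errorf("failed to get coredns deployment: %w", err)
	}
	if dep.Status.ReadyReplicas < 2 {
		fmt.Printf("WARN: only %d ready CoreDNS replica(s); consider scaling up for resilience\n",
			dep.Status.ReadyReplicas)
	}
	return nil
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", "/root/.kube/config") // hypothetical kubeconfig path
	if err != nil {
		panic(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	if err := checkCoreDNS(context.Background(), cs); err != nil {
		panic(err)
	}
}
```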

@longhorn-io-github-bot

longhorn-io-github-bot commented Nov 14, 2024

Pre Ready-For-Testing Checklist

  • Where is the reproduce steps/test steps documented?
    The reproduce steps/test steps are at:

  • Is there a workaround for the issue? If so, where is it documented?
    The workaround is:

    • Scale up the CoreDNS replica count.
  • Does the PR include the explanation for the fix or the feature?

  • Does the PR include deployment change (YAML/Chart)? If so, where are the PRs for both YAML file and Chart?
    The PR for the YAML change is at:
    The PR for the chart change is at:

  • Has the backend code been merged (Manager, Engine, Instance Manager, BackupStore, etc.) (including backport-needed/*)?
    The PR is at:

  • Which areas/issues this PR might have potential impacts on?
    Area DR volumes, CLI
    Issues

  • If labeled: require/LEP Has the Longhorn Enhancement Proposal PR submitted?
    The LEP PR is at

  • If labeled: area/ui Has the UI issue filed or ready to be merged (including backport-needed/*)?
    The UI issue/PR is at

  • If labeled: require/doc Has the necessary document PR submitted or merged (including backport-needed/*)?
    The documentation issue/PR is at: doc(1.8.0, 1.7.4, 1.6.3): add CoreDNS setup to best practices (website#1011)

  • If labeled: require/automation-e2e Has the end-to-end test plan been merged? Have QAs agreed on the automation test case? If only test case skeleton w/o implementation, have you created an implementation issue (including backport-needed/*)
    The automation skeleton PR is at
    The automation test case PR is at
    The issue of automation test case implementation is at (please create by the template)

  • If labeled: require/automation-engine Has the engine integration test been merged (including backport-needed/*)?
    The engine automation PR is at

  • If labeled: require/manual-test-plan Has the manual test plan been documented?
    The updated manual test plan is at

  • If the fix introduces the code for backward compatibility Has a separate issue been filed with the label release/obsolete-compatibility?
    The compatibility issue is filed at
