[BUG] DR volume fails to reattach and faulted after node stop and start during incremental restore #9752
Comments
@roger-ryao Can you check v1.6.3 and v1.7.2 as well? Thanks.
Hi @derekbit @c3y1huang cc @longhorn/qa
After consulting with @yangchiu & @chriscchien, we should use a customized backup store YAML to deploy the backup store on the …
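For illustration only, a minimal sketch of deploying a customized backup store manifest for this kind of test; the file name below is a placeholder, and the sample manifests are assumed to be the ones shipped under deploy/backupstores in the longhorn/longhorn repo.

```bash
# Sketch (placeholder file name): apply a customized copy of one of the sample
# backup store manifests, then confirm the backup store pod is running before
# pointing the Longhorn backup target at it.
kubectl apply -f ./custom-backupstore.yaml
kubectl get pods --all-namespaces | grep -iE 'nfs|minio|backupstore'
```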
@roger-ryao Let's check whether the issue remains after re-testing with an external backup target.
I was able to reproduce the issue using an external backup target and will test it on v1.7.x and v1.6.x next week.
- DR Volume Node Reboot During Initial Restoration 1, supportbundle:
- DR Volume Node Reboot During Incremental Restoration 1, supportbundle:
- robotlog:
@roger-ryao could you share the robot case? I cannot find it in #8425.
Hi @c3y1huang
I was able to reproduce the issue on … For reference, the issue reproduction rate is around 25-50% on …
I haven't observed any failures in 15 attempts on …
Discussed with @c3y1huang and assisted in replicating the issue. We observed the issue on versions … The support bundle for v1.7.0 (… volume …):
Possible cause: The cluster has a single CoreDNS pod, and the test case execution rebooted the node running that CoreDNS pod. This caused the other two restoring replicas to fail.
Check with conditions (WIP):
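A quick sketch for checking the condition described above, assuming the stock coredns deployment name and the standard k8s-app=kube-dns label:

```bash
# How many CoreDNS replicas exist, and which node hosts them?
kubectl -n kube-system get deployment coredns
kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide
# If the only CoreDNS pod sits on the node being rebooted, in-cluster DNS
# resolution is unavailable for the duration of the reboot.
```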
So far, after a total of 9 attempts, no failures have been observed. Let me see how I can configure my NFS server so it can be reached via a domain name.
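A sketch of what pointing the backup target at the NFS server by hostname could look like; the hostname and export path are placeholders, and the backup-target setting is assumed to be editable as a Longhorn Setting CR in the longhorn-system namespace.

```bash
# Placeholder hostname/path: the goal is to force the NFS backup target to go
# through DNS resolution instead of a raw IP address.
kubectl -n longhorn-system patch settings.longhorn.io backup-target \
  --type merge -p '{"value": "nfs://nfs.example.internal:/opt/backupstore"}'
```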
After checking with @roger-ryao, we found that this issue is not reproducible when using NFS as backup storage. The NFS list operation doesn't depend on CoreDNS because it verifies the local path to get the backup info. The S3 list operation checks the backup store object, which relies on DNS resolution. This means the issue is specific to cloud provider storage.
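One way to confirm the DNS dependency is to try resolving the S3 endpoint from inside the cluster while CoreDNS is down; the endpoint below is a placeholder:

```bash
# With CoreDNS unavailable, this in-cluster lookup should fail, while an NFS
# backup target addressed by IP or local mount path is unaffected.
kubectl run dns-check --rm -it --restart=Never --image=busybox:1.36 -- \
  nslookup s3.us-east-1.amazonaws.com
```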
After running it 20 times, I did not observe the volume becoming faulted when the backup target was the NFS server.
I am able to reproduce this issue with v1.6.2 by scaling down CoreDNS during the node reboot, while the other 2 replicas are rebuilding.
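For reference, a sketch of the CoreDNS scale-down used to widen the reproduction window, assuming the stock coredns deployment in kube-system:

```bash
# Remove in-cluster DNS while the node reboot and replica rebuild are in
# progress, then restore it afterwards.
kubectl -n kube-system scale deployment coredns --replicas=0
# ... reboot the node the DR volume is attached to and wait for the rebuild ...
kubectl -n kube-system scale deployment coredns --replicas=1
```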
Proposed change:
cc @derekbit
Pre Ready-For-Testing Checklist
Describe the bug
When I was working on the [TEST] [ROBOT] implementation of the manual test case for 'The node the DR volume is attached to is rebooted' (#8425), I discovered this issue. When the node with the attached DR volume is stopped during the incremental restore, we expect the DR volume to reattach to another available node. However, the DR volume goes into a Faulted state. Even after restarting the previously stopped node, the DR volume remains in a Faulted state and does not recover as expected.

To Reproduce
1. Follow the manual test case: https://longhorn.github.io/longhorn-tests/manual/pre-release/node-not-ready/node-restart/dr-volume-node-rebooted/
2. During the incremental restore, stop the node to which the DR volume is attached.
3. Wait for the node to be stopped, and then start the node (see the watch sketch after these steps).
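A sketch of how the DR volume can be watched during steps 2-3; the volume name vol-1 is taken from the Environment section below, and the resource is the standard Longhorn volumes.longhorn.io CRD.

```bash
# Watch the DR volume's state/robustness while the node is stopped and started.
kubectl -n longhorn-system get volumes.longhorn.io vol-1 -w
# Expected: the volume reattaches to another node and keeps restoring.
# Observed: it transitions to Faulted and stays there even after the node is back.
```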
Expected behavior
The DR volume should reattach to an available node and continue the restore process without entering a faulted state, even after a node reboot.
Support bundle for troubleshooting
supportbundle_295c69fd-6ae1-4b69-a7be-8b0b023fbf84_2024-11-04T04-08-10Z.zip
Environment
- Longhorn version: master-head
- Impacted volume: vol-1
- Kubernetes version: v1.31.1+k3s1
- 1
- 3
- 2
Additional context
#8425
Workaround and Mitigation
N/A