debug_assert causes leader to panic #898
Thank you for your detailed report. By the current design this should not be allowed: because the removed and then restarted node has unexpectedly changed its data, the leader's bookkeeping about that follower is no longer valid. In a distributed system like Openraft, a node must guarantee that its persisted state is never lost or rolled back. However, we understand that in certain scenarios, like testing or special deployments, such a reversion can be acceptable. To address your specific use case, I propose adding a feature flag that relaxes this check. Please let me know if you have further questions.
I guess I'm thinking of this slightly differently: in my case, all the nodes are known in advance and configured as a cluster. These clusters are then deployed as an isolated unit of 5 machines. I was thinking of production scenarios, where a node has to be replaced. In production the nodes could be VMs which can die at any time. Being able to bring a new one online to replace a failed one would make this easier to operate, and it does seem to work if we don't panic in the leader. Is there something I'm not considering here that makes this operational plan invalid or dangerous?
The feature flag seems like a good plan though. I must admit I didn't find a good place in the codepath to check for this case.
@pepinns
The standard way is to remove the failed node via a membership change and then add the replacement back as a new node. Another non-standard, tricky way is to set up a cluster of 6 nodes: the quorum is 4 nodes, so you can replace one node with empty state without losing data (a committed entry lives on at least 4 nodes, so after wiping one it still lives on at least 3, and those 3 intersect every future quorum of 4). I'll open a PR to show you where to add the patch to address the panicking issue.
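To make that quorum arithmetic concrete, here is a small standalone illustration (plain Rust, not Openraft code; the numbers are the ones from the 6-node example above):

```rust
// Illustration only, not Openraft code: why a 6-node cluster can survive
// wiping one node's state. Quorum = floor(n / 2) + 1.
fn quorum(n: usize) -> usize {
    n / 2 + 1
}

fn main() {
    let n = 6;
    let q = quorum(n); // 4: every committed entry is on at least 4 nodes
    let copies_after_wipe = q - 1; // wiping one node leaves at least 3 copies
    // Any future quorum of 4 must overlap those 3 remaining copies
    // (3 + 4 > 6), so a committed entry cannot be lost.
    assert!(copies_after_wipe + q > n);
    println!("quorum of {n} nodes = {q}");
}
```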
Thanks for clarifying, I was overlooking the fact that a log entry is committed before it's replicated everywhere.
…res loosen-follower-log-revert`: Add a new feature flag, `loosen-follower-log-revert`, to permit the follower's log to roll back to an earlier state without causing the leader to panic. Although log state reversion is typically seen as a bug, enabling it can be useful for testing or in some special scenarios. For instance, in a cluster with an even number of nodes, erasing a node's data and then rebooting it (its log reverts to empty) will not result in data loss. - Related issue: databendlabs#898
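As a rough illustration of how such a flag could gate the check, here is a minimal standalone sketch (hypothetical function and variable names, not the actual Openraft internals):

```rust
// Minimal sketch, not the actual Openraft code: a feature flag such as
// `loosen-follower-log-revert` can turn a hard assertion into a tolerated,
// logged condition so the leader keeps running when a follower's log reverts.
fn on_update_conflicting(matching: u64, conflicting: u64) {
    // Invariant under normal operation: the reported conflict point never
    // moves behind what the leader already knows is matched on the follower.
    #[cfg(not(feature = "loosen-follower-log-revert"))]
    debug_assert!(
        conflicting >= matching,
        "follower log reverted: conflicting {} < matching {}",
        conflicting,
        matching
    );

    #[cfg(feature = "loosen-follower-log-revert")]
    if conflicting < matching {
        // Tolerate the reversion: log it and let replication back off and
        // resend entries from an earlier index instead of panicking.
        eprintln!(
            "follower log reverted: conflicting {conflicting} < matching {matching}; resetting progress"
        );
    }
}

fn main() {
    // Normal case: conflict point is ahead of the matched index.
    on_update_conflicting(3, 10);
}
```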
@pepinns
Describe the bug
The `debug_assert` in the progress entry module can cause the leader to panic if a node goes offline and then comes back online after its data has been wiped such that its log is erased.
To Reproduce
Steps to reproduce the behavior:
Note that if we remove the `init()` on startup at step 5, the leader still panics, but the follower never recovers.
Expected behavior
I expect that the node coming back online would be brought up to date by the leader.
Actual behavior
The follower returns a Conflict message, which causes the leader to panic due to a `debug_assert` call. Once the leader panics, it is unrecoverable.
Env (please complete the following information):
- Openraft version: `main` branch
- Does the bug still exist in the latest version (`main` branch for the latest release branch, or `release-0.7` for the 0.7 branch)? yes
- Rust toolchain: `cargo 1.65.0-nightly (ce40690a5 2022-08-09)`, `nightly-2023-05-04`
- OS: ubuntu 20.04
- CPU: x86_64
Additional files:
Logs from my service right before the panic
Additional context
The following patch seems to resolve the issue, as it allows the leader to stay online, and the follower eventually gets updated.
The actual offending code could be in the `append_entries` handling on the follower, but I do think the leader should at the very least be resilient to faults and not panic in this case.
There was no error path in this code: it returns a `Result<T, E>`, but `E` is the same type as `T` in this case. The above was the simplest way I could think of to avoid this panic, but ideally, if the code has an error condition, there would be an actual error return path so the leader could react and perhaps fix whatever internal state is causing this situation.
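As a rough sketch of that idea (hypothetical names and a simplified signature, not the actual Openraft code), a distinct error variant would let the leader handle the fault explicitly instead of asserting:

```rust
// Hypothetical sketch of the error-path idea described above: return a
// distinct error when a follower's log appears to have reverted, so the
// caller (the leader's replication handling) can react, e.g. by resetting
// that follower's progress, rather than hitting a debug_assert.
#[derive(Debug)]
enum ProgressError {
    /// The follower reported a conflict index behind the already-matched index.
    LogReverted { matching: u64, conflicting: u64 },
}

fn update_conflicting(matching: u64, conflicting: u64) -> Result<u64, ProgressError> {
    if conflicting < matching {
        return Err(ProgressError::LogReverted { matching, conflicting });
    }
    // Normal case: move the probe window down to the conflict point.
    Ok(conflicting)
}

fn main() {
    // The leader can now handle the fault explicitly instead of panicking.
    match update_conflicting(10, 3) {
        Ok(next) => println!("probe next index {next}"),
        Err(e) => eprintln!("follower fault, resetting progress: {e:?}"),
    }
}
```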