vdev_open: clear async fault flag after reopen #16258
Conversation
If this can handle the transient USB faults on my USB 3.1 Gen 2 drive cages causing pools to go offline until reboot...
Force-pushed from 5246603 to 473e99f

Force-pushed from 0c750d8 to e0f525b
A single-disk pool should suspend when its disk fails and hold the IO. When the disk is returned, the pool should return and the IO should be reissued, leaving everything in good shape.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <[email protected]>
After c3f2f1a, vdev_fault_wanted is set on a vdev after a probe fails. An end-of-txg async task is charged with actually faulting the vdev.

In a single-disk pool, the probe failure will degrade the last disk, and then suspend the pool. However, vdev_fault_wanted is not cleared. After the pool returns, the transaction finishes and the async task runs and faults the vdev, which suspends the pool again.

The fix is simple: when reopening a vdev, clear the async fault flag. If the vdev is still failed, the startup probe will quickly notice and degrade/suspend it again. If not, all is well!

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Co-authored-by: Don Brady <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Force-pushed from e0f525b to f5b16ed
Further testing shows the bug's impact is a little wider: if multiple disks are lost in the same txg, causing the pool to suspend, they will all re-fault at the end of the txg after the pool returns, and the pool will fail again. This can happen when a disk array or backplane fails, taking out multiple disks at the same moment. Not a huge deal, and the fix here handles it in the same way.
Motivation and Context
After #15839, `vdev_fault_wanted` is set on a vdev after a probe fails. An end-of-txg async task is charged with actually faulting the vdev.

In a single-disk pool, the probe failure will degrade the last disk, and then suspend the pool. However, `vdev_fault_wanted` is not cleared. After the pool returns, the transaction finishes and the async task runs and faults the vdev, which suspends the pool again.

Description
The fix is simple: when reopening a vdev, clear the async fault flag. If the vdev is still failed, the startup probe will quickly notice and degrade/suspend it again. If not, all is well!
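To make the failure sequence concrete, here is a standalone C model of the `vdev_fault_wanted` lifecycle described above. It is not ZFS code: the struct, fields, and functions are simplifications invented for illustration; per the PR title, the real one-line change happens when the vdev is reopened (`vdev_open`).

```c
/*
 * Standalone model (not ZFS code) of the vdev_fault_wanted lifecycle.
 * Names loosely mirror the ZFS fields; everything here is simplified
 * for illustration.
 */
#include <stdbool.h>
#include <stdio.h>

struct vdev {
	bool fault_wanted;	/* async fault requested by a failed probe */
	bool faulted;		/* vdev has been faulted */
	bool healthy;		/* underlying device currently works */
};

/* Probe failure requests an async fault instead of faulting inline. */
static void probe(struct vdev *vd) {
	if (!vd->healthy)
		vd->fault_wanted = true;
}

/* End-of-txg async task: act on any pending fault request. */
static void spa_async_fault(struct vdev *vd) {
	if (vd->fault_wanted) {
		vd->faulted = true;
		vd->fault_wanted = false;
	}
}

/* Reopen after the device returns; the fix is the flag clear. */
static void vdev_reopen(struct vdev *vd, bool with_fix) {
	if (with_fix)
		vd->fault_wanted = false;	/* the one-line fix */
	vd->faulted = false;
	probe(vd);	/* startup probe re-checks the device */
}

int main(void) {
	for (int fix = 0; fix <= 1; fix++) {
		struct vdev vd = { .healthy = false };
		probe(&vd);		/* disk fails: fault requested */
		vd.healthy = true;	/* disk comes back, pool resumes */
		vdev_reopen(&vd, fix);
		spa_async_fault(&vd);	/* stale request runs at end of txg */
		printf("%s fix: faulted=%d\n",
		    fix ? "with" : "without", vd.faulted);
	}
	return 0;
}
```

Without the fix the stale flag survives the reopen and the async task re-faults the healthy vdev (`faulted=1`); with the fix the flag is dropped on reopen and the vdev stays online, matching the before/after behaviour the included test case checks.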
How Has This Been Tested?
A test case is included. It fails before the change, and passes after.
Types of changes

Bug fix (non-breaking change which fixes an issue)
Checklist:

All commit messages are properly formatted and contain `Signed-off-by`.