
vdev_open: clear async fault flag after reopen #16258

Closed
robn wants to merge 2 commits from vdev-probe-clear-fault

Conversation

@robn robn (Member) commented Jun 11, 2024

Motivation and Context

After #15839, vdev_fault_wanted is set on a vdev after a probe fails. An end-of-txg async task is charged with actually faulting the vdev.

In a single-disk pool, the probe failure will degrade the last disk, and then suspend the pool. However, vdev_fault_wanted is not cleared. After the pool returns, the transaction finishes and the async task runs and faults the vdev, which suspends the pool again.

Description

The fix is simple: when reopening a vdev, clear the async fault flag. If the vdev is still failed, the startup probe will quickly notice and degrade/suspend it again. If not, all is well!
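To make the sequence concrete, here is a minimal, self-contained sketch of the state machine described above. This is not the actual patch or OpenZFS code; the names (toy_vdev, toy_probe, toy_async_fault, toy_reopen) are invented for illustration and stand in for the real probe, end-of-txg async task, and reopen paths.

```c
/*
 * Toy userspace model of the stale-fault problem (not OpenZFS code).
 * A "vdev" carries a pending-fault flag set by a failed probe; an
 * end-of-txg async task acts on it.  Without clearing the flag on
 * reopen, a vdev that came back healthy is faulted again anyway.
 */
#include <stdbool.h>
#include <stdio.h>

struct toy_vdev {
	bool healthy;		/* device currently responds to I/O */
	bool fault_wanted;	/* async task should fault this vdev */
	bool faulted;		/* vdev has been faulted (pool suspends) */
};

/* Probe failure path: remember that a fault is wanted. */
static void toy_probe(struct toy_vdev *vd)
{
	if (!vd->healthy)
		vd->fault_wanted = true;
}

/* End-of-txg async task: act on any pending fault request. */
static void toy_async_fault(struct toy_vdev *vd)
{
	if (vd->fault_wanted) {
		vd->faulted = true;
		vd->fault_wanted = false;
	}
}

/* Reopen after the device returns; 'fixed' models this PR's change. */
static void toy_reopen(struct toy_vdev *vd, bool fixed)
{
	vd->healthy = true;
	vd->faulted = false;
	if (fixed)
		vd->fault_wanted = false;	/* drop the stale request */
	toy_probe(vd);				/* startup probe re-checks */
}

int main(void)
{
	for (int fixed = 0; fixed <= 1; fixed++) {
		struct toy_vdev vd = { .healthy = false };

		toy_probe(&vd);		/* disk fails, fault requested */
		toy_reopen(&vd, fixed);	/* disk returns, pool resumes */
		toy_async_fault(&vd);	/* txg completes, async task runs */

		printf("%s fix: vdev %s after reopen\n",
		    fixed ? "with" : "without",
		    vd.faulted ? "is re-faulted" : "stays online");
	}
	return (0);
}
```

Run as written, the model prints that the vdev is re-faulted after reopen without the fix and stays online with it, mirroring the behaviour described above: clearing the stale flag on reopen is safe because the startup probe immediately re-checks the device.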

How Has This Been Tested?

A test case is included; it fails without the fix and passes with it.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

@robn robn changed the title vdev_open: clear fault state after reopen vdev_open: clear async fault flag after reopen Jun 11, 2024
@satmandu satmandu (Contributor) commented:

If this can handle the transient USB faults on my USB 3.1 Gen 2 drive cages causing pools to go offline until reboot...

@robn robn force-pushed the vdev-probe-clear-fault branch 3 times, most recently from 0c750d8 to e0f525b on June 30, 2024 05:35
robn and others added 2 commits July 11, 2024 14:18
A single disk pool should suspend when its disk fails and hold the IO.
When the disk is returned, the pool should return and the IO be
reissued, leaving everything in good shape.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <[email protected]>

After c3f2f1a, vdev_fault_wanted is set on a vdev after a probe fails.
An end-of-txg async task is charged with actually faulting the vdev.

In a single-disk pool, the probe failure will degrade the last disk, and
then suspend the pool. However, vdev_fault_wanted is not cleared. After
the pool returns, the transaction finishes and the async task runs and
faults the vdev, which suspends the pool again.

The fix is simple: when reopening a vdev, clear the async fault flag. If
the vdev is still failed, the startup probe will quickly notice and
degrade/suspend it again. If not, all is well!

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Co-authored-by: Don Brady <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
@robn robn (Member, Author) commented Jul 17, 2024

Further testing shows the bug's impact is a little wider: if multiple disks are lost in the same txg, causing the pool to suspend, they will all re-fault at the end of that txg once the pool returns, and the pool will fail again. This can happen when a disk array or backplane fails, taking out multiple disks at the same moment. Not a huge deal, and the fix here takes care of it in the same way.

@robn robn mentioned this pull request Jul 17, 2024
@tonyhutter tonyhutter (Contributor) commented:

Merged as 393b7ad 5de3ac2

@tonyhutter tonyhutter closed this Jul 17, 2024
@robn robn deleted the vdev-probe-clear-fault branch July 20, 2024 04:51