
vdev_open: clear async fault flag after reopen #16258

Closed
robn wants to merge 2 commits from vdev-probe-clear-fault

Conversation

@robn robn (Member) commented Jun 11, 2024

Motivation and Context

After #15839, vdev_fault_wanted is set on a vdev after a probe fails. An end-of-txg async task is charged with actually faulting the vdev.

In a single-disk pool, the probe failure will degrade the last disk, and then suspend the pool. However, vdev_fault_wanted is not cleared. After the pool returns, the transaction finishes and the async task runs and faults the vdev, which suspends the pool again.

Description

The fix is simple: when reopening a vdev, clear the async fault flag. If the vdev is still failed, the startup probe will quickly notice and degrade/suspend it again. If not, all is well!
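To make the sequence concrete, here is a minimal, self-contained sketch of the state machine described above. This is not the actual patch or OpenZFS code; the names (toy_vdev, toy_probe, toy_async_fault, toy_reopen) are invented for illustration and stand in for the real probe, end-of-txg async task, and reopen paths.

```c
/*
 * Toy userspace model of the stale-fault problem (not OpenZFS code).
 * A "vdev" carries a pending-fault flag set by a failed probe; an
 * end-of-txg async task acts on it.  Without clearing the flag on
 * reopen, a vdev that came back healthy is faulted again anyway.
 */
#include <stdbool.h>
#include <stdio.h>

struct toy_vdev {
	bool healthy;		/* device currently responds to I/O */
	bool fault_wanted;	/* async task should fault this vdev */
	bool faulted;		/* vdev has been faulted (pool suspends) */
};

/* Probe failure path: remember that a fault is wanted. */
static void toy_probe(struct toy_vdev *vd)
{
	if (!vd->healthy)
		vd->fault_wanted = true;
}

/* End-of-txg async task: act on any pending fault request. */
static void toy_async_fault(struct toy_vdev *vd)
{
	if (vd->fault_wanted) {
		vd->faulted = true;
		vd->fault_wanted = false;
	}
}

/* Reopen after the device returns; 'fixed' models this PR's change. */
static void toy_reopen(struct toy_vdev *vd, bool fixed)
{
	vd->healthy = true;
	vd->faulted = false;
	if (fixed)
		vd->fault_wanted = false;	/* drop the stale request */
	toy_probe(vd);				/* startup probe re-checks */
}

int main(void)
{
	for (int fixed = 0; fixed <= 1; fixed++) {
		struct toy_vdev vd = { .healthy = false };

		toy_probe(&vd);		/* disk fails, fault requested */
		toy_reopen(&vd, fixed);	/* disk returns, pool resumes */
		toy_async_fault(&vd);	/* txg completes, async task runs */

		printf("%s fix: vdev %s after reopen\n",
		    fixed ? "with" : "without",
		    vd.faulted ? "is re-faulted" : "stays online");
	}
	return (0);
}
```

Run as written, the model prints that the vdev is re-faulted after reopen without the fix and stays online with it, mirroring the behaviour described above: clearing the stale flag on reopen is safe because the startup probe immediately re-checks the device.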

How Has This Been Tested?

A test case is included; it fails without the fix and passes with it.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

@robn robn changed the title vdev_open: clear fault state after reopen vdev_open: clear async fault flag after reopen Jun 11, 2024
@satmandu satmandu (Contributor) commented:

If this can handle the transient USB faults on my USB 3.1 Gen 2 drive cages causing pools to go offline until reboot...

@robn robn force-pushed the vdev-probe-clear-fault branch 3 times, most recently from 0c750d8 to e0f525b on June 30, 2024 05:35
robn and others added 2 commits July 11, 2024 14:18
A single disk pool should suspend when its disk fails and hold the IO.
When the disk is returned, the pool should return and the IO be
reissued, leaving everything in good shape.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <[email protected]>

After c3f2f1a, vdev_fault_wanted is set on a vdev after a probe fails.
An end-of-txg async task is charged with actually faulting the vdev.

In a single-disk pool, the probe failure will degrade the last disk, and
then suspend the pool. However, vdev_fault_wanted is not cleared. After
the pool returns, the transaction finishes and the async task runs and
faults the vdev, which suspends the pool again.

The fix is simple: when reopening a vdev, clear the async fault flag. If
the vdev is still failed, the startup probe will quickly notice and
degrade/suspend it again. If not, all is well!

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Co-authored-by: Don Brady <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
@robn robn (Member, Author) commented Jul 17, 2024

Further testing shows the bug's impact is a little wider: if multiple disks are lost in the same txg, causing the pool to suspend, they will all re-fault at the end of that txg once the pool returns, and the pool will fail again. This can happen when a disk array or backplane fails, taking out multiple disks at the same moment. Not a huge deal, and the fix here takes care of it in the same way.

@robn robn mentioned this pull request Jul 17, 2024
@tonyhutter tonyhutter (Contributor) commented:

Merged as 393b7ad 5de3ac2

@tonyhutter tonyhutter closed this Jul 17, 2024
@robn robn deleted the vdev-probe-clear-fault branch July 20, 2024 04:51