Add slow disk diagnosis to ZED #15469
Conversation
Just some quick questions as I was looking over the code -
...but that makes it sound like a top level vdev state. Does something special happen to leaf vdevs marked DEGRADED? Like we stop allocating data on them or something? Or is it functionally the same as being FAULTED?
An individual vdev is degraded when it trips the checksum SERD diagnosis. For this PR, we are extending that to include a "too many slow I/Os" diagnosis. A degraded vdev is mentioned in pool-concepts:
I will update the man page to reflect that slow I/Os can also cause a degraded state. A hot spare will replace a vdev that is degraded or faulted. We are leveraging this behavior so that a slow disk can be replaced with a spare.
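For illustration, a minimal sketch of the hot-spare side of this (pool and device names are placeholders, not taken from the PR): once a spare is attached to the pool, ZED can use it to replace a leaf vdev that becomes DEGRADED, whether from checksum errors or, with this change, from excessive slow I/Os.

```sh
# Attach a hot spare so ZED's retire agent has something to swap in
# when a leaf vdev is diagnosed as DEGRADED (names are examples only).
zpool add tank spare /dev/sdf

# After a slow-I/O (or checksum) diagnosis degrades a leaf vdev,
# the pool status should show the spare in use for that vdev.
zpool status tank
```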
Yes, with the default SERD slow I/O thresholds a generic pool would see a degrade if one, and only one, vdev is slow (delay events for I/Os taking 30+ seconds). The current behavior is that the per-vdev slow I/O counters increase and zevents are generated, but no diagnosis occurs.
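The 30-second figure is the zio_slow_io_ms threshold referenced in the PR description. A rough sketch of how to inspect or lower it for experimentation (the sysfs path assumes a Linux build of OpenZFS; see the zfs(4) module-parameters man page):

```sh
# Current slow-I/O threshold in milliseconds (default 30000).
cat /sys/module/zfs/parameters/zio_slow_io_ms

# Lowering it (e.g. while testing) makes I/Os slower than 5 seconds
# count as "slow" and emit ereport.fs.zfs.delay events.
echo 5000 > /sys/module/zfs/parameters/zio_slow_io_ms
```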
I think you may only want to enable the behavior if the user has set the new slow I/O properties.
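That opt-in model is what the final commit message below describes. A sketch of what opting in might look like, assuming the user-visible names for VDEV_PROP_SLOW_IO_N and VDEV_PROP_SLOW_IO_T are slow_io_n and slow_io_t (the names here are an assumption, not quoted from this PR):

```sh
# Opt a leaf vdev in to slow-I/O diagnosis: degrade it if 10 slow I/Os
# are seen within a 30-second window (property names assumed).
zpool set slow_io_n=10 tank sdb
zpool set slow_io_t=30 tank sdb

# Verify the per-vdev settings.
zpool get slow_io_n,slow_io_t tank sdb
```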
Force-pushed from 37c9f86 to 18dcd6c.
Rebased to resolve merge conflicts. Addressed review feedback.
Just for reference, I noticed some test failures for CentOS 7.
Is this related to the following messages I'm finding in kmsg on my QNAP?
2023-08-09 10:18:13 -07:00 <7> [ 606.733785] ----- [DISK SLOW] Pool zpool1 vd 8851305046343565969 -----
Thank you.
Force-pushed from 18dcd6c to b777a27.
Fixed the commit message to be …
I believe it is related, in that the above message is likely a summary of a zevent for a slow I/O. If you have CLI (shell) access you can confirm with …
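The exact command was elided in the comment above, but for reference, two ways a reasonably current OpenZFS exposes this information (as noted below, the QNAP port turned out to be too old for the -s option):

```sh
# Per-vdev slow I/O counts (the SLOW column from `zpool status -s`).
zpool status -s zpool1

# Raw delay events as they are delivered to ZED.
zpool events -v | grep -A 4 'ereport.fs.zfs.delay'
```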
@don-brady when you get a chance can you rebase this? I also noticed the new test case is failing on the CentOS 7 builder.
The CentOS 7 test failure was very interesting. On this OS, and only on this OS, a spa txg sync was coming in during/after the test file reads and generating a lot of slow I/Os from the txg sync writes, which in turn was causing a degrade where one was not expected for the test. The current … Also updated …
Force-pushed from f1fe8ce to 258c87e.
Another ZTS fix, for the recent Ubuntu failures.
Force-pushed from 9a447eb to a9d10d4.
Force-pushed from a9d10d4 to a46f3d3.
The ZFS on QNAP is a branch from the original Solaris release and is too old to have the '-s' option. Thank you for trying.
Force-pushed from a46f3d3 to 458977c.
Force-pushed from 458977c to 9ee6d77.
Force-pushed from 19e552d to dded2f8.
Looks good. @don-brady could you just rebase this on the latest master branch and force-update the PR. We should get a cleaner CI run.
Force-pushed from 98a1410 to ea98ab7.
Reviewed-by: Allan Jude <[email protected]>
@don-brady it looks like this change is causing at least the following ZTS failures. https://github.com/openzfs/zfs/actions/runs/7429220144?pr=15469
Investigating the ZTS failures.
Slow disk response times can be indicative of a failing drive. ZFS currently tracks slow I/Os (slower than zio_slow_io_ms) and generates events (ereport.fs.zfs.delay). However, no action is taken by ZED, as is done for checksum or I/O errors. This change adds slow disk diagnosis to ZED which is opt-in using new VDEV properties:
VDEV_PROP_SLOW_IO_N
VDEV_PROP_SLOW_IO_T
If multiple VDEVs in a pool are undergoing slow I/Os, then it skips the zpool_vdev_degrade().
Sponsored-By: OpenDrives Inc.
Sponsored-By: Klara Inc.
Co-authored-by: Rob Wing <[email protected]>
Signed-off-by: Don Brady <[email protected]>
Force-pushed from 84dfec5 to fa957c7.
Update
In the ZTS, there are ZED-related tests in … The mystery I encountered was that in one of the …
Root Cause
Normally ZED keeps persistent state (…). To fix this failure, I added a call to …
Slow disk response times can be indicative of a failing drive. ZFS currently tracks slow I/Os (slower than zio_slow_io_ms) and generates events (ereport.fs.zfs.delay). However, no action is taken by ZED, as is done for checksum or I/O errors. This change adds slow disk diagnosis to ZED which is opt-in using new VDEV properties:
VDEV_PROP_SLOW_IO_N
VDEV_PROP_SLOW_IO_T
If multiple VDEVs in a pool are undergoing slow I/Os, then it skips the zpool_vdev_degrade().
Sponsored-By: OpenDrives Inc.
Sponsored-By: Klara Inc.
Reviewed-by: Tony Hutter <[email protected]>
Reviewed-by: Allan Jude <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Co-authored-by: Rob Wing <[email protected]>
Signed-off-by: Don Brady <[email protected]>
Closes #15469
Motivation and Context
Slow disk response times can be indicative of a failing drive. ZFS currently tracks slow I/Os (slower than zio_slow_io_ms) and generates events (ereport.fs.zfs.delay). But no action is taken, as is done for checksum or I/O errors. It would be nice to have the option to degrade a slow disk that is slowing down throughput and allow for it to be replaced with a spare.
Description
Add new VDEV properties, VDEV_PROP_SLOW_IO_N and VDEV_PROP_SLOW_IO_T, that ZED can consume for the SERD engine parameters. They default to 10 slow I/Os in 30 seconds.
Add a new SERD engine type, zc_serd_slow_io, that can track these events and initiate a zpool_vdev_degrade() when the limit has been reached.
If multiple VDEVs in a pool are undergoing slow I/Os, then skip the zpool_vdev_degrade().
Also did some minor code cleanup in the zfs diagnosis module.
Sponsored-By: OpenDrives Inc.
Sponsored-By: Klara Inc.
How Has This Been Tested?
zed_slow_io: verify that override properties function as expected and a VDEV gets degraded.
zed_slow_io_many_vdevs: confirm that when multiple VDEVs are involved, no degrade occurs.
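For readers unfamiliar with how such tests provoke slow I/Os without failing hardware, here is a rough sketch using zinject's I/O delay injection; the pool and device names are placeholders and the real ZTS scripts differ in detail.

```sh
# Add an artificial 1-second delay (one lane) to every I/O on one leaf
# vdev, well above a lowered zio_slow_io_ms threshold, so that
# ereport.fs.zfs.delay events are generated for that vdev only.
zinject -d sdb -D 1000:1 tank

# Push some I/O through the pool and give ZED time to run a diagnosis.
dd if=/dev/urandom of=/tank/testfile bs=1M count=64 conv=fsync
sleep 30

# Remove all injection handlers when done.
zinject -c all
```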