
ZIO: add "vdev tracing" facility; use it for ZIL flushing #16375

Draft: robn wants to merge 5 commits into master from zil-flush-vdev-trace

Conversation

@robn (Member) commented Jul 22, 2024

[Sponsors: Klara, Inc., Wasabi Technology, Inc.]

Motivation and Context

In #16314 I noted that the change addresses the correctness issues, but introduces a performance problem when the pool is degraded. This PR resolves that.

This PR is dependent on #16314, so those commits are included here. I suggest reviewing that PR first; the additional change to that is in the top commit.

Description

A problem with zio_flush() is that it issues a flush ZIO to a top-level vdev, which then recursively issues child flush ZIOs until the real leaf devices are flushed. As usual, an error in a child ZIO results in the parent ZIO erroring too, so if a leaf device has failed, its flush ZIO will fail, and so will the entire flush operation.
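
The parent/child error propagation described above can be sketched as follows. This is an illustrative model only, not the real zio.c code; all names here are invented.

```c
/*
 * Illustrative model (not the real zio.c code): a parent ZIO adopts
 * the first nonzero error reported by any of its children, so a
 * single failed leaf flush fails the whole top-level flush.
 */
typedef struct zio_model {
	int io_error;
} zio_model_t;

/*
 * Called as each child completes; mirrors "an error in a child ZIO
 * results in the parent ZIO erroring too".
 */
void
zio_model_child_done(zio_model_t *pio, int child_error)
{
	if (pio->io_error == 0 && child_error != 0)
		pio->io_error = child_error;
}
```

With this model, one failed leaf among many healthy ones is enough to make the aggregate flush report failure, which is exactly the degraded-pool behaviour this PR is working around.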

This didn't matter when we used to ignore flush errors, but now that we propagate them, the flush error propagates into the ZIL write ZIO. This causes the ZIL to believe its write failed, and fall back to a full txg wait. This still provides correct behaviour for zil_commit() callers (eg fsync()) but it ruins performance.

We cannot simply skip flushing failed vdevs, because the associated write may have succeeded before the vdev failed, which would give the appearance the write is fully flushed when it is not. Neither can we issue a "syncing write" to the device (eg SCSI FUA), as this also degrades performance.

The answer is that we must bind writes and flushes together in a way such that we only flush the physical devices that we wrote to.

This adds a "vdev tracing" facility to ZIOs, zio_vdev_trace. When enabled on a ZIO with ZIO_FLAG_VDEV_TRACE, upon successful completion (in the _done handler) zio->io_vdev_trace_tree will hold a list of zio_vdev_trace_t objects, each describing a vdev that was involved in the successful completion of the ZIO.

A companion function, zio_vdev_trace_flush(), is included that issues a flush ZIO to the child vdevs on the given trace tree. zil_lwb_write_done() is updated to use this to bind ZIL writes and flushes together.

The tracing facility is similar in many ways to the "deferred flushing" facility inside the ZIL, to the point where it can replace it. Now, if the flush should be deferred, the trace records from the writing ZIO are captured and combined with any captured from previous writes. When it's finally time to issue the flush, we issue it to the entire accumulated set of traced vdevs.
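
A rough model of the trace-and-flush idea follows. The real code keeps zio_vdev_trace_t nodes in an AVL tree keyed by vdev guid; here a small fixed array stands in, and all names are invented. The two properties that matter survive the simplification: each vdev appears once no matter how many ZIOs touched it, and deferred traces from earlier writes simply merge into the accumulated set.

```c
#include <stdint.h>

/*
 * Rough model of the vdev-trace set (the real code uses an AVL tree
 * of zio_vdev_trace_t keyed by vdev guid; this array is a stand-in).
 */
#define	TRACE_MODEL_MAX	16

typedef struct trace_model {
	uint64_t tm_guids[TRACE_MODEL_MAX];
	int	 tm_count;
} trace_model_t;

/* Record a vdev guid once, however many ZIOs touched that vdev. */
void
trace_model_add(trace_model_t *tm, uint64_t guid)
{
	for (int i = 0; i < tm->tm_count; i++)
		if (tm->tm_guids[i] == guid)
			return;		/* already traced */
	if (tm->tm_count < TRACE_MODEL_MAX)
		tm->tm_guids[tm->tm_count++] = guid;
}

/*
 * Deferred flush: fold the traces from an earlier write into the
 * accumulated set, analogous in spirit to zio_vdev_trace_copy().
 */
void
trace_model_merge(trace_model_t *dst, const trace_model_t *src)
{
	for (int i = 0; i < src->tm_count; i++)
		trace_model_add(dst, src->tm_guids[i]);
}

/*
 * "Flush" only the traced vdevs; returns how many flushes were
 * issued, standing in for zio_vdev_trace_flush().
 */
int
trace_model_flush(const trace_model_t *tm)
{
	return (tm->tm_count);
}
```

The point of the dedup-on-insert is that two ZIL writes landing on the same leaf cost exactly one flush, while a failed vdev that was never written to is never flushed at all.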

Further reading

I presented this work at AsiaBSDCon 2024. Paper, slides and other notes available at: https://despairlabs.com/presentations/openzfs-fsync/

How Has This Been Tested?

A test is included to provide some small proof that it works. Without the change, it will fail, because it will see a txg sync fallback.

Full ZTS run passes.

Also in production at a customer site, and appears to be working well.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)


@robn force-pushed the zil-flush-vdev-trace branch 2 times, most recently from 789f6df to ca324c5 on August 1, 2024
@amotin (Member) left a comment


Aside from better error handling, I think this may improve ZIL write latency for RAIDZ and especially dRAID, where a vdev may consist of hundreds of leaves, while only a few, or at most a dozen, of them are written for a specific ZIL block.

While it generally looks good, a few thoughts below:

	vdev_t *vd;

	for (zvt = avl_first(t); zvt != NULL; zvt = AVL_NEXT(t, zvt)) {
		vd = vdev_lookup_by_guid(spa->spa_root_vdev, zvt->zvt_guid);

I am a bit worried about the cost of this lookup on a pool of 1000 vdevs. I wonder if it could be beneficial to add the top-level vdev number into zio_vdev_trace_t.
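
One hedged sketch of that suggestion: carry the top-level vdev's child index in each trace record so the flush path can index into the root vdev's children directly instead of doing a guid lookup across the whole pool. The extra field and all names here are hypothetical, not part of the PR.

```c
#include <stdint.h>

/*
 * Hypothetical extension of the trace record, per the review
 * suggestion: remember the top-level vdev's child index so a flush
 * can index an array of top-level vdevs directly instead of paying
 * for vdev_lookup_by_guid() on a large pool.
 */
typedef struct zvt_model {
	uint64_t zvt_guid;	/* existing: vdev guid */
	uint64_t zvt_top_id;	/* assumed new: top-level vdev index */
} zvt_model_t;

/*
 * Stand-in for resolving a trace record against the pool's array of
 * top-level vdev guids: O(1) indexed access rather than a tree walk.
 */
uint64_t
zvt_model_lookup(const uint64_t *top_guids, const zvt_model_t *zvt)
{
	return (top_guids[zvt->zvt_top_id]);
}
```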

Comment on lines +846 to +852
	 * If the child has itself collected trace records, copy them
	 * to ours. Note that we can't steal them, as there may be
	 * multiple parents.
	 */
	if (zio->io_flags & ZIO_FLAG_VDEV_TRACE) {
		zio_vdev_trace_copy(&zio->io_vdev_trace_tree,
		    &pio->io_vdev_trace_tree);

Do we care about the child trace if it failed? Shouldn't this also be under the zio->io_error == 0 condition above? Or could the condition actually be moved up into the enclosing if?

Comment on lines +828 to +836
	if (pio->io_flags & ZIO_FLAG_VDEV_TRACE &&
	    wait == ZIO_WAIT_DONE && zio->io_vd != NULL &&
	    ((zio->io_flags & (ZIO_FLAG_OPTIONAL | ZIO_FLAG_IO_REPAIR)) == 0)) {
		avl_tree_t *t = &pio->io_vdev_trace_tree;
		zio_vdev_trace_t *zvt, zvt_search;
		avl_index_t where;

		if (zio->io_error == 0) {
			zvt_search.zvt_guid = zio->io_vd->vdev_guid;

I am not sure that zio->io_vd != NULL is a sufficient condition to consider the vdev as leaf, and so worth flushing.

If fsync() (zil_commit()) writes successfully, but then the flush fails,
fsync() should not return success, but instead should fall into a full
transaction wait.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <[email protected]>
@robn robn marked this pull request as draft August 17, 2024 04:34
@robn (Member, Author) commented Aug 17, 2024

Making this a draft for now. I still think the technique is good for what it is, but I've been doing more work on flush response in spa_sync() and it needs a whole different mechanism (for reasons that I won't go into here). So I want to wait until that is fleshed out, then I can revisit this (and #16314) to see if it can be re-expressed in terms of whatever I come up with, or if we will need both mechanisms.

Since the beginning, ZFS' "flush" operation has always ignored
errors[1]. Write errors are captured and dealt with, but if a write
succeeds but the subsequent flush fails, the operation as a whole will
appear to succeed[2].

In the end-of-transaction uberblock+label write+flush ceremony, it's
very difficult for this situation to occur. Since all devices are
written to, typically the first write will succeed, the first flush will
fail unobserved, but then the second write will fail, and the entire
transaction is aborted. It's difficult to imagine a real-world scenario
where all the writes in that sequence could succeed even as the flushes
are failing (understanding that the OS is still seeing hardware problems
and taking devices offline).

In the ZIL however, it's another story. Since only the write response is
checked, if that write succeeds but the flush then fails, the ZIL will
believe that it succeeds, and zil_commit() (and thus fsync()) will
return success rather than the "correct" behaviour of falling back into
txg_wait_synced()[3].

This commit fixes this by adding a simple flag to zio_flush() to
indicate whether or not the caller wants to receive flush errors. This
flag is enabled for ZIL calls. The existing zio chaining inside the ZIL
and the flush handler zil_lwb_flush_vdevs_done() already has all the
necessary support to properly handle a flush failure and fail the entire
zio chain. This causes zil_commit() to correctly fall back to
txg_wait_synced() rather than returning success prematurely.
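
A minimal model of that flag, with invented names: the caller chooses whether flush errors are surfaced (the new ZIL behaviour) or swallowed (the legacy behaviour).

```c
/*
 * Minimal model (invented names) of giving zio_flush() callers a
 * choice: propagate flush errors (the ZIL wants this) or keep the
 * old behaviour of ignoring them.
 */
typedef enum {
	FLUSH_IGNORE_ERRORS,	/* legacy: errors never observed */
	FLUSH_PROPAGATE_ERRORS	/* ZIL path: error reaches the zio chain */
} flush_mode_t;

int
flush_model(int device_error, flush_mode_t mode)
{
	if (mode == FLUSH_IGNORE_ERRORS)
		return (0);
	return (device_error);
}
```

Under FLUSH_PROPAGATE_ERRORS a failed flush surfaces as a failed commit, which is what pushes zil_commit() into the txg_wait_synced() fallback instead of reporting premature success.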

1. The ZFS birth commit (illumos/illumos-gate@fa9e4066f0) had support
   for flushing devices with write caches with the DKIOCFLUSHWRITECACHE
   ioctl. No errors are checked. The comment in `zil_flush_vdevs()`
   from the time shows the thinking:

   /*
    * Wait for all the flushes to complete.  Not all devices actually
    * support the DKIOCFLUSHWRITECACHE ioctl, so it's OK if it fails.
    */

2. It's not entirely clear from the code history why this was acceptable
   for devices that _do_ have write caches. Our best guess is that this
   was an oversight: between the combination of hardware, pool topology
   and application behaviour required to hit this, it basically didn't
   come up.

3. Somewhat frustratingly, zil.c contains comments describing this exact
   behaviour, and further discussion in openzfs#12443 (September 2021). It
   appears that those involved saw the potential, but were looking at a
   different problem and so didn't have the context to recognise it for
   what it was.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <[email protected]>
The first time a device returns ENOTSUP in response to a flush request,
we set vdev_nowritecache so we don't issue flushes in the future and
instead just pretend they succeeded. However, we still return an error
for the initial flush, even though we just decided such errors are
meaningless!

So, when setting vdev_nowritecache in response to a flush error, also
reset the error code to assume success.

Along the way, it seems there's no good reason for vdev_disk & vdev_geom
to explicitly detect no support for flush and set vdev_nowritecache;
just letting the error through to zio_vdev_io_assess() will cause it all
to fall out nicely. So remove those checks.
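
The behaviour described can be modelled as below; names are illustrative, not the actual vdev code.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Illustrative model of the fix: on ENOTSUP, remember that the
 * device has no write cache worth flushing AND report success for
 * this flush too, instead of failing only the first one.
 */
typedef struct vdev_model {
	bool vd_nowritecache;
} vdev_model_t;

int
flush_assess_model(vdev_model_t *vd, int err)
{
	if (err == ENOTSUP) {
		vd->vd_nowritecache = true;	/* skip future flushes */
		err = 0;			/* this one "succeeds" too */
	}
	return (err);
}
```

Genuine flush failures (anything other than ENOTSUP) still pass through unchanged, so real errors keep propagating as the earlier commit intends.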

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <[email protected]>
If the pool is degraded, ZIL writes should still succeed without falling
back to a full txg sync.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <[email protected]>
A problem with zio_flush() is that it issues a flush ZIO to a top-level
vdev, which then recursively issues child flush ZIOs until the real leaf
devices are flushed. As usual, an error in a child ZIO results in the
parent ZIO erroring too, so if a leaf device has failed, its flush ZIO
will fail, and so will the entire flush operation.

This didn't matter when we used to ignore flush errors, but now that we
propagate them, the flush error propagates into the ZIL write ZIO. This
causes the ZIL to believe its write failed, and fall back to a full txg
wait. This still provides correct behaviour for zil_commit() callers (eg
fsync()) but it ruins performance.

We cannot simply skip flushing failed vdevs, because the associated
write may have succeeded before the vdev failed, which would give the
appearance the write is fully flushed when it is not. Neither can we
issue a "syncing write" to the device (eg SCSI FUA), as this also
degrades performance.

The answer is that we must bind writes and flushes together in a way
such that we only flush the physical devices that we wrote to.

This adds a "vdev tracing" facility to ZIOs, zio_vdev_trace. When
enabled on a ZIO with ZIO_FLAG_VDEV_TRACE, upon successful completion
(in the _done handler) zio->io_vdev_trace_tree will hold a list of
zio_vdev_trace_t objects, each describing a vdev that was involved in
the successful completion of the ZIO.

A companion function, zio_vdev_trace_flush(), is included that issues
a flush ZIO to the child vdevs on the given trace tree.
zil_lwb_write_done() is updated to use this to bind ZIL writes and
flushes together.

The tracing facility is similar in many ways to the "deferred flushing"
facility inside the ZIL, to the point where it can replace it. Now, if
the flush should be deferred, the trace records from the writing ZIO are
captured and combined with any captured from previous writes. When it's
finally time to issue the flush, we issue it to the entire accumulated
set of traced vdevs.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <[email protected]>
@behlendorf added the Status: Code Review Needed label Aug 19, 2024
@amotin added the Status: Work in Progress label Oct 29, 2024