Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vdev_disk: rewrite BIO filling machinery to avoid split pages #15588

Closed
wants to merge 9 commits into from

Commits on Mar 23, 2024

  1. linux 5.4 compat: page_size()

    Before 5.4 we have to do a little math.
    
    Signed-off-by: Rob Norris <[email protected]>
    Sponsored-by: Klara, Inc.
    Sponsored-by: Wasabi Technology, Inc.
    robn committed Mar 23, 2024
    Configuration menu
    Copy the full SHA
    222e079 View commit details
    Browse the repository at this point in the history
  2. abd: add page iterator

    The regular ABD iterators yield data buffers, so they have to map and
    unmap pages into kernel memory. If the caller only wants to count
    chunks, or can use page pointers directly, then the map/unmap is just
    unnecessary overhead.
    
    This adds adb_iterate_page_func, which yields unmapped struct page
    instead.
    
    Signed-off-by: Rob Norris <[email protected]>
    Sponsored-by: Klara, Inc.
    Sponsored-by: Wasabi Technology, Inc.
    robn committed Mar 23, 2024
    Configuration menu
    Copy the full SHA
    54af159 View commit details
    Browse the repository at this point in the history
  3. vdev_disk: rename existing functions to vdev_classic_*

    This is just renaming the existing functions we're about to replace and
    grouping them together to make the next commits easier to follow.
    
    Signed-off-by: Rob Norris <[email protected]>
    Sponsored-by: Klara, Inc.
    Sponsored-by: Wasabi Technology, Inc.
    robn committed Mar 23, 2024
    Configuration menu
    Copy the full SHA
    cb44c34 View commit details
    Browse the repository at this point in the history
  4. vdev_disk: reorganise vdev_disk_io_start

    Light reshuffle to make it a bit more linear to read and get rid of a
    bunch of args that aren't needed in all cases.
    
    Signed-off-by: Rob Norris <[email protected]>
    Sponsored-by: Klara, Inc.
    Sponsored-by: Wasabi Technology, Inc.
    robn committed Mar 23, 2024
    Configuration menu
    Copy the full SHA
    1cbe7e8 View commit details
    Browse the repository at this point in the history
  5. vdev_disk: make read/write IO function configurable

    This is just setting up for the next couple of commits, which will add a
    new IO function and a parameter to select it.
    
    Signed-off-by: Rob Norris <[email protected]>
    Sponsored-by: Klara, Inc.
    Sponsored-by: Wasabi Technology, Inc.
    robn committed Mar 23, 2024
    Configuration menu
    Copy the full SHA
    e349ad6 View commit details
    Browse the repository at this point in the history
  6. vdev_disk: rewrite BIO filling machinery to avoid split pages

    This commit tackles a number of issues in the way BIOs (`struct bio`)
    are constructed for submission to the Linux block layer.
    
    ### BIO segment limits are set incorrectly
    
    The kernel has a hard upper limit on the number of pages/segments that
    can be added to a BIO, as well as a separate limit for each device
    (related to its queue depth and other scheduling characteristics).
    
    ZFS counts the number of memory pages in the request ABD
    (`abd_nr_pages_off()`, and then uses that as the number of segments to
    put into the BIO, up to the hard upper limit. If it requires more than
    the limit, it will create multiple BIOs.
    
    Leaving aside the fact that page count method is wrong (see below), not
    limiting to the device segment max means that the device driver will
    need to split the BIO in half. This is alone is not necessarily a
    problem, but it interacts with another issue to cause a much larger
    problem.
    
    ### BIOs are filled inefficiently
    
    The kernel function to add a segment to a BIO (`bio_add_page()`) takes a
    `struct page` pointer, and offset+len within it. `struct page` can
    represent a run of contiguous memory pages (known as a "compound page").
    In can be of arbitrary length.
    
    The ZFS functions that count ABD pages and load them into the BIO
    (`abd_nr_pages_off()`, `bio_map()` and `abd_bio_map_off()`) will never
    consider a page to be more than `PAGE_SIZE` (4K), even if the `struct
    page` is for multiple pages. In this case, it will load the same `struct
    page` into the BIO multiple times, with the offset adjusted each time.
    
    With a sufficiently large ABD, this can easily lead to the BIO being
    entirely filled much earlier than it could have been. This is also
    further contributes to the problem caused by the incorrect segment limit
    calculation, as its much easier to go past the device limit, and so
    require a split.
    
    Again, this is not a problem on its own.
    
    ### Incomplete pages are submitted to BIOs
    
    The logic for "never submit more than `PAGE_SIZE`" is actually a little
    more subtle. It will actually never submit a buffer that crosses a 4K
    page boundary.
    
    In practice, this is fine, as most ABDs are scattered, that is a list of
    complete 4K pages, and so are loaded in as such.
    
    Linear ABDs are typically allocated from slabs, and for small sizes they
    are frequently not aligned to page boundaries. For example, a 12K
    allocation can span four pages, eg:
    
         -- 4K -- -- 4K -- -- 4K -- -- 4K --
        |        |        |        |        |
              :## ######## ######## ######:    [1K, 4K, 4K, 3K]
    
    Such an allocation would be loaded into a BIO as you see:
    
        [1K, 4K, 4K, 3K]
    
    This tends not to be a problem in practice, because even if the BIO were
    filled and needed to be split, each half would still have either a start
    or end aligned to the logical block size of the device (assuming 4K at
    least).
    
    ---
    
    In ideal circumstances, these shortcomings don't cause any particular
    problems. Its when they start to interact with other ZFS features that
    things get interesting.
    
    ### Aggregation
    
    Aggregation will create a "gang" ABD, which is simply a list of other
    ABDs. Iterating over a gang ABD is just iterating over each ABD within
    it in turn.
    
    Because the segments are simply loaded in order, we can end up with
    uneven segments either side of the "gap" between the two ABDs. For
    example, two 12K ABDs might be aggregated and then loaded as:
    
        [1K, 4K, 4K, 3K, 2K, 4K, 4K, 2K]
    
    Should a split occur, each individual BIO can end up either having an
    start or end offset that is not aligned to the logical block size, which
    some drivers (eg SCSI) will reject. However, this tends not to happen
    because the default aggregation limit usually keeps the BIO small enough
    to not require more than one split, and most pages are actually full 4K
    pages, so hitting an uneven gap is very rare anyway.
    
    ### Gang blocks
    
    If the pool is under particular memory pressure, then an IO can be
    broken down into a "gang block", a 512-byte block composed of a header
    and up to three block pointers. Each points to a fragment of the
    original write, or in turn, another gang block, breaking the original
    data up over and over until space can be found in the pool for each of
    them.
    
    Each gang header is a separate 512-byte memory allocation from a slab,
    that needs to be written down to disk. When the gang header is added to
    the BIO, its a single 512-byte segment.
    
    ### Aggregation with gang blocks
    
    Pulling all this together, consider a large aggregated write of gang
    blocks. This results a BIO containing lots of 512-byte segments. Given
    our tendency to overfill the BIO, a split is likely, and most possible
    split points will yield a pair of BIOs that are misaligned. Drivers that
    care, like the SCSI driver, will reject them.
    
    ---
    
    This commit is a substantial refactor and rewrite of much of `vdev_disk`
    to sort all this out.
    
    ### Configure maximum segment size for device
    
    `vdev_bio_max_segs()` now returns the ideal maximum size for the device,
    if available. There's also a tuneable `zfs_vdev_disk_max_segs` to
    override this, to assist with testing.
    
    ### ABDs checked up front for alignment
    
    We scan the ABD up front to count the number of pages within it, and to
    confirm that if we submitted all those pages to one or more BIOs, it
    could be split at any point with creating a misaligned BIO.  If the
    pages in the BIO are not usable (as in any of the above situations), the
    ABD is linearised, and then checked again. This is the same technique
    used in `vdev_geom` on FreeBSD, adjusted for Linux's variable page size
    and allocator quirks.
    
    ### Virtual block IO object
    
    `vbio_t` is a cleanup and enhancement of the old `dio_request_t`. The
    idea is simply that it can hold all the state needed to create, submit
    and return multiple BIOs, including all the refcounts, the ABD copy if
    it was needed, and so on. Apart from what I hope is a clearer interface,
    the major difference is that because we know how many BIOs we'll need up
    front, we don't need the old overflow logic that would grow the BIO
    array, throw away all the old work and restart. We can get it right from
    the start.
    
    Signed-off-by: Rob Norris <[email protected]>
    Sponsored-by: Klara, Inc.
    Sponsored-by: Wasabi Technology, Inc.
    robn committed Mar 23, 2024
    Configuration menu
    Copy the full SHA
    4da5982 View commit details
    Browse the repository at this point in the history
  7. vdev_disk: add module parameter to select BIO submission method

    This makes the submission method selectable at module load time via the
    `zfs_vdev_disk_classic` parameter, allowing this change to be backported
    to 2.2 safely, and disabled in favour of the "classic" submission method
    if new problems come up.
    
    Signed-off-by: Rob Norris <[email protected]>
    Sponsored-by: Klara, Inc.
    Sponsored-by: Wasabi Technology, Inc.
    robn committed Mar 23, 2024
    Configuration menu
    Copy the full SHA
    ebc8350 View commit details
    Browse the repository at this point in the history
  8. vdev_disk: use bio_chain() to submit multiple BIOs

    Simplifies our code a lot, so we don't have to wait for each and
    reassemble them.
    
    Signed-off-by: Rob Norris <[email protected]>
    Sponsored-by: Klara, Inc.
    Sponsored-by: Wasabi Technology, Inc.
    robn committed Mar 23, 2024
    Configuration menu
    Copy the full SHA
    63e55ee View commit details
    Browse the repository at this point in the history
  9. abd_iter_page: don't use compound heads on Linux <4.5

    Before 4.5 (specifically, torvalds/linux@ddc58f2), head and tail pages
    in a compound page were refcounted separately. This means that using the
    head page without taking a reference to it could see it cleaned up later
    before we're finished with it. Specifically, bio_add_page() would take a
    reference, and drop its reference after the bio completion callback
    returns.
    
    If the zio is executed immediately from the completion callback, this is
    usually ok, as any data is referenced through the tail page referenced
    by the ABD, and so becomes "live" that way. If there's a delay in zio
    execution (high load, error injection), then the head page can be freed,
    along with any dirty flags or other indicators that the underlying
    memory is used. Later, when the zio completes and that memory is
    accessed, its either unmapped and an unhandled fault takes down the
    entire system, or it is mapped and we end up messing around in someone
    else's memory. Both of these are very bad.
    
    The solution on these older kernels is to take a reference to the head
    page when we use it, and release it when we're done. There's not really
    a sensible way under our current structure to do this; the "best" would
    be to keep a list of head page references in the ABD, and release them
    when the ABD is freed.
    
    Since this additional overhead is totally unnecessary on 4.5+, where
    head and tail pages share refcounts, I've opted to simply not use the
    compound head in ABD page iteration there. This is theoretically less
    efficient (though cleaning up head page references would add overhead),
    but its safe, and we still get the other benefits of not mapping pages
    before adding them to a bio and not mis-splitting pages.
    
    There doesn't appear to be an obvious symbol name or config option we
    can match on to discover this behaviour in configure (and the mm/page
    APIs have changed a lot since then anyway), so I've gone with a simple
    version check.
    
    Signed-off-by: Rob Norris <[email protected]>
    Sponsored-by: Klara, Inc.
    Sponsored-by: Wasabi Technology, Inc.
    robn committed Mar 23, 2024
    Configuration menu
    Copy the full SHA
    9d1e514 View commit details
    Browse the repository at this point in the history