Fast Dedup: “flat” DDT entry format #15893

allanjude · 2024-02-14T14:12:14Z

Motivation and Context

The on-disk and in-memory dedup entry structures are larger than they need to be. Any reduction lowers memory and IO overhead for every entry, which adds up quickly in large dedup tables.

Description

This slims down the in-memory ddt_entry_t, partly by reorganizing the structure and using narrower types, and partly by moving rarely-used parts out.

This then adds a new variant of the entry format. The traditional format keeps a complete set of DVAs for each possible value of copies= (plus one for the deprecated ditto blocks feature), four sets in all. That leaves the in-memory and on-disk entry mostly empty, which is significant wasted overhead. This adds a new “flat” format which only has a single set of DVAs, but can “extend” them if a write requests more (e.g. writing a block with copies=1, setting copies=2, then copying the block).

How Has This Been Tested?

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Performance enhancement (non-breaking change which improves efficiency)
Code cleanup (non-breaking change which makes code smaller or more readable)
Breaking change (fix or feature that would cause existing functionality to change)
Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
Documentation (a change to man pages or other documentation)

Checklist:

My code follows the OpenZFS code style requirements.
I have updated the documentation accordingly.
I have read the contributing document.
I have added tests to cover my changes.
I have run the ZFS Test Suite with this change applied.
All commit messages are properly formatted and contain Signed-off-by.

module/zfs/ddt.c

module/zfs/zio.c

robn · 2024-05-15T03:47:07Z

[Fast dedup stack rebased to master 3c941d1]

robn · 2024-05-21T05:02:49Z

FYI, added one commit, that adds a birth time field to ddt_flat_phys_t. This is for the upcoming pruning feature, but we want the change here so that we don't need an additional on-disk format change.

amotin

My previous questions are still open, but they are minor, while otherwise this part looks good to me.

robn · 2024-06-14T01:40:07Z

[Fast dedup stack rebased to master c98295e]

robn · 2024-06-14T02:51:45Z

I think this is good to go; I'm just waiting for confirmation internally that this is all we need for the on-disk format changes.

robn · 2024-06-17T05:45:16Z

No format changes, but there is a good chunk of code change to make the split between traditional and flat entries a bit easier to work with. Earlier versions of this code assumed ddt_phys_t was the same in both; the change was shoehorned into flat quite late to support prune. This is a much better model for calling code to handle the differences.

We believe this is the final on-disk format, and likely the final code change to support it. We've got some stress testing to do but that's it.

amotin

Just some cosmetics:

include/sys/ddt.h

module/zfs/ddt.c

module/zfs/ddt_stats.c

include/sys/ddt_impl.h

tonyhutter

I am not very familiar with the DDT code, but I don't see any surface level issues. Please rebase on master though to pull in the latest Fedora ZTS fixes.

ixhamza · 2024-08-09T09:45:31Z

@robn - JFYI, the ztest with your latest push is still failing, similar to #15895 (comment):

leaked space: vdev 0, offset 0xe98da000, size 28672
block traversal size 610742272 != alloc 610770944 (leaked 28672)

ztest: '/sbin/zdb -bccsv -G -d -Y -e -y  -p /var/tmp/zloop-run 14775065184965730235' exit code 2

robn · 2024-08-09T10:16:35Z

Yeah, on it. The push was actually just a rebase; there's no actual difference dedup-side. The difference now is just that zdb is no longer fit for purpose. I'm right now splicing the relevant bits of the "final" FDT-aware zdb into the right PRs, and that should be that. Should be pushing in the next hour or two.

robn · 2024-08-09T14:21:26Z

@ixhamza can you confirm, which branch did you see this leak from ztest/zdb on? I understand it on fdt-rel-log, I don't understand it on fdt-rel-feature, and sorta-maybe for fdt-rel-flat (this PR).

As it is, I have some rework to do on the new zdb. I did the work described above, and then was chasing a leak for a while. I just now figured out what's wrong, but it's after midnight and I'm toast. So I'll come back to it in the morning.

ixhamza · 2024-08-09T14:32:32Z

@robn - I am just checking the logs from the PR GitHub Actions. They are shown as Zpool-logs-20.04 and Zpool-logs-22.04, which I believe is a typo. If there is no core dump in the zip file, they most probably failed due to zdb, and you can find the reason in ztest.out. truenas@302fb34 - If it's helpful, I use this just to run ztest GitHub actions in a temporary branch.

robn · 2024-08-11T11:09:56Z

Last push rebased onto the latest #15892. A small zdb change is included in this PR to properly count/claim phys blocks extended with additional DVAs.

This is the supporting infrastructure for the upcoming dedup features.

Traditionally, dedup objects live directly in the MOS root. While their details vary (checksum, type and class), they are all the same "kind" of thing - a store of dedup entries. The new features are more varied than that, and are better thought of as a set of related stores for the overall state of a dedup table.

This adds a new feature flag, SPA_FEATURE_FAST_DEDUP. Enabling this will cause new DDTs to be created as a ZAP in the MOS root, named DDT-<checksum>. This is used as the root object for the normal type/class store objects, but will also be a place for any storage required by new features.

This commit adds two new fields to ddt_t, for version and flags. These are intended to describe the structure and features of the overall dedup table, and are stored as-is in the DDT root. In this commit, flags are always zero, but the intent is that they can be used to hang optional logic or state onto for new dedup features. Version is always 1.

For a "legacy" dedup table, where no DDT root directory exists, the version will be 0.

ddt_configure() is expected to determine the version and flags features currently in operation based on whether or not the fast_dedup feature is enabled, and from what's available on disk. In this way, it's possible to support both old and new tables.

This also provides a migration path. A legacy setup can be upgraded to FDT by creating the DDT root ZAP, moving the existing objects into it, and setting version and flags appropriately. There's no support for that here, but it would be straightforward to add later and allows the possibility that newer features could be applied to existing dedup tables.

Co-authored-by: Allan Jude
Signed-off-by: Rob Norris
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
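The version selection described in this commit message can be sketched roughly as below. This is an illustration only, not the real ddt_configure(): the function `sketch_ddt_version`, its boolean parameters and the constant names are hypothetical, and the precedence (on-disk state first, feature flag only for tables that do not exist yet) is an assumption drawn from the text.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names; the text only says legacy is 0 and FDT is 1. */
enum { SKETCH_DDT_VERSION_LEGACY = 0, SKETCH_DDT_VERSION_FDT = 1 };

/*
 * Pick the table version from what is on disk and from the feature state.
 * Assumed precedence: existing on-disk state wins; the feature flag only
 * decides the layout of a table that does not exist yet.
 */
static int
sketch_ddt_version(bool have_root_dir, bool have_legacy_objects,
    bool fast_dedup_enabled)
{
	/* In this sketch a table is never in both layouts at once. */
	assert(!(have_root_dir && have_legacy_objects));

	if (have_root_dir)
		return (SKETCH_DDT_VERSION_FDT);
	if (have_legacy_objects)
		return (SKETCH_DDT_VERSION_LEGACY);
	return (fast_dedup_enabled ?
	    SKETCH_DDT_VERSION_FDT : SKETCH_DDT_VERSION_LEGACY);
}
```

Keeping the decision in one place is what lets old and new tables coexist in the same pool: callers only look at the resulting version and flags.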

Very basic coverage to make sure things appear to work, have the right format on disk, and pool upgrades and mixed table types work as expected.

Signed-off-by: Rob Norris
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.

The upcoming dedup features break the long held assumption that all blocks on disk with a 'D' dedup bit will always be present in the DDT, or will have the same set of DVA allocations on disk as in the DDT.

If the DDT is no longer a complete picture of all the dedup blocks that will be and should be on disk, then it does us no good to walk and prime it up front, since it won't necessarily match up with every block we'll see anyway.

Instead, we rework things here to be more like the BRT checks. When we see a dedup'd block, we look it up in the DDT, consume a refcount, and for the second-or-later instances, count them as duplicates.

The DDT and BRT are moved ahead of the space accounting. This will become important for the "flat" feature, which may need to count a modified version of the block.

Co-authored-by: Allan Jude
Co-authored-by: Don Brady
Signed-off-by: Rob Norris
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
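The per-block check described above (look up, consume a refcount, treat second-or-later instances as duplicates) might look roughly like this. It is a sketch, not zdb code: `sketch_dedup_seen_t` and `sketch_count_dedup_block` are hypothetical names, and treating a missing entry or an exhausted refcount as an ordinary block is an assumption.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-entry state kept by the checker; not a real zdb type. */
typedef struct sketch_dedup_seen {
	uint64_t refs_left;	/* references not yet matched to a block */
	bool seen;		/* has the first instance been counted? */
} sketch_dedup_seen_t;

/*
 * Returns true if this block is a duplicate that must not be counted
 * again. 'e' is NULL when the block is not in the table at all.
 */
static bool
sketch_count_dedup_block(sketch_dedup_seen_t *e)
{
	if (e == NULL || e->refs_left == 0)
		return (false);	/* assumed: count as an ordinary block */

	e->refs_left--;		/* consume one reference */
	if (!e->seen) {
		e->seen = true;	/* first instance: count it normally */
		return (false);
	}
	return (true);		/* second or later: a duplicate */
}
```

Because the lookup happens per block during traversal, nothing has to be primed from the table up front, which is the point of the rework.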

The "flat phys" feature will use only a single phys slot for all entries, which means the old "single", "double" etc naming now makes no sense, and more importantly, means that choosing the right slot for a given block pointer will depend on how many slots are in use for a given DDT.

This removes the old names, and adds accessor macros to decouple specific phys array indexes from any particular meaning.

(These macros look strange in isolation, mainly in the way they take the ddt_t* as an arg but don't use it. This is mostly a separate commit to introduce the concept to the reader before the "flat phys" commit extends it).

Signed-off-by: Rob Norris
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
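To make the accessor idea concrete, here is a minimal sketch of slot selection that depends on the table rather than on fixed names. The macro and type names are hypothetical, not the ones in ddt.h. The slot numbering for the traditional layout (slot 0 for ditto, slots 1 to 3 matching copies=1..3) is an assumption, and the flag check reflects what the later "flat phys" commit is described as adding.

```c
#include <stdbool.h>

/* Hypothetical stand-in for ddt_t; only the bit this sketch needs. */
typedef struct sketch_ddt {
	bool flat;	/* true when the table uses a single phys slot */
} sketch_ddt_t;

/*
 * Map a copies= value to a phys slot. Assumed traditional layout: slot 0
 * is ditto and slots 1-3 match copies=1..3. A flat table has one slot.
 */
#define	SKETCH_PHYS_FOR_COPIES(ddt, copies) \
	((ddt)->flat ? 0 : (copies))

/* Number of phys slots an entry in this table carries. */
#define	SKETCH_NPHYS(ddt)	((ddt)->flat ? 1 : 4)
```

Callers that go through accessors like these never hard-code an index, so the same calling code works for both layouts.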

The idea here is that sometimes you need the contents of an entry with no intent to modify it, and/or from a place where it's difficult to get hold of its originating ddt_t to know how to interpret it.

A lightweight entry contains everything you might need to "read" an entry - its key, type and phys contents - but none of the extras for modifying it or using it in a larger context. It also has the full complement of phys slots, so it can represent any kind of dedup entry without having to know the specific configuration of the table it came from.

Signed-off-by: Rob Norris
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.

This slims down the in-memory entry to as small as it can be. The IO-related parts are made into a separate entry, since they're relatively rarely needed.

The variable allocation for dde_phys is to support the upcoming flat format.

Signed-off-by: Rob Norris
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
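One common way to get a variable allocation like the one mentioned for dde_phys is a C99 flexible array member sized at allocation time. The sketch below shows that general technique only; the types and functions are hypothetical and much simpler than the real ddt_entry_t.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, heavily simplified phys slot. */
typedef struct sketch_phys {
	unsigned long long refcnt;
} sketch_phys_t;

/* Entry whose trailing slot array is sized when it is allocated. */
typedef struct sketch_entry {
	unsigned char nphys;	/* number of slots that follow */
	sketch_phys_t phys[];	/* C99 flexible array member */
} sketch_entry_t;

/* Bytes needed for an entry carrying 'nphys' slots. */
static size_t
sketch_entry_size(size_t nphys)
{
	return (sizeof (sketch_entry_t) + nphys * sizeof (sketch_phys_t));
}

/* Allocate a zeroed entry; returns NULL on allocation failure. */
static sketch_entry_t *
sketch_entry_alloc(size_t nphys)
{
	sketch_entry_t *e = calloc(1, sketch_entry_size(nphys));
	if (e != NULL)
		e->nphys = (unsigned char)nphys;
	return (e);
}
```

With this shape, an entry for a one-slot table pays for one slot rather than four, which is where the per-entry memory saving comes from.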

Traditional dedup keeps a separate ddt_phys_t "type" for each possible count of DVAs (that is, copies=) parameter. Each of these is tracked independently of the others, and has its own set of DVAs. This leads to an (admittedly rare) situation where you can create as many as six copies of the data, by changing the copies= parameter between copying. This is both a waste of storage on disk and a waste of space in the stored DDT entries, since there never needs to be more than three DVAs to handle all possible values of copies=.

This commit adds a new FDT feature, DDT_FLAG_FLAT. When active, only the first ddt_phys_t is used. Each time a block is written with the dedup bit set, this single phys is checked to see if it has enough DVAs to fulfill the request. If it does, the block is filled with the saved DVAs as normal. If not, an adjusted write is issued to create as many extra copies as are needed to fulfill the request, which are then saved into the entry too.

Because a single phys is no longer an all-or-nothing, but can be transitioning from fewer to more DVAs, the write path now has to keep a copy of the previous "known good" DVA set so we can revert to it in case an error occurs. zio_ddt_write() has been restructured and heavily commented to make it much easier to see what's happening.

Backwards compatibility is maintained simply by allocating four ddt_phys_t when the DDT_FLAG_FLAT flag is not set, and updating the phys selection macros to check the flag. In the old arrangement, each number of copies gets a whole phys, so it will always have either zero or all necessary DVAs filled, with no in-between, so the old behaviour naturally falls out of the new code.

Signed-off-by: Rob Norris
Co-authored-by: Don Brady
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
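The "check, extend, or roll back" behaviour described above can be sketched as follows. This is not zio_ddt_write(): every name is hypothetical, DVAs are reduced to plain integers with 0 meaning "unused", slots are assumed to fill from index 0 upward, and the `io_failed` parameter simply simulates the error case.

```c
#include <assert.h>
#include <stdint.h>

/* The text says three DVAs cover every copies= value. */
#define	SKETCH_MAX_DVAS	3

/* Hypothetical flat phys: one DVA set, 0 marks an unused slot. */
typedef struct sketch_flat_phys {
	uint64_t dva[SKETCH_MAX_DVAS];
} sketch_flat_phys_t;

/* Count filled DVAs, assuming they are filled from index 0 upward. */
static int
sketch_ndvas(const sketch_flat_phys_t *p)
{
	int n = 0;
	while (n < SKETCH_MAX_DVAS && p->dva[n] != 0)
		n++;
	return (n);
}

/* How many extra copies would an adjusted write have to create? */
static int
sketch_copies_needed(const sketch_flat_phys_t *p, int want)
{
	assert(want >= 1 && want <= SKETCH_MAX_DVAS);
	int have = sketch_ndvas(p);
	return (want > have ? want - have : 0);
}

/*
 * Add the DVAs produced by the adjusted write. If the write failed,
 * restore the saved "known good" set and report the error.
 */
static int
sketch_extend(sketch_flat_phys_t *p, const uint64_t *new_dvas, int nnew,
    int io_failed)
{
	sketch_flat_phys_t orig = *p;	/* known-good copy */
	int have = sketch_ndvas(p);

	for (int i = 0; i < nnew && have + i < SKETCH_MAX_DVAS; i++)
		p->dva[have + i] = new_dvas[i];

	if (io_failed) {
		*p = orig;		/* revert to the previous DVA set */
		return (-1);
	}
	return (0);
}
```

For example, an entry holding one DVA that receives a copies=3 write needs two more; if that write fails, the entry still holds exactly the one DVA it started with.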

behlendorf added the Status: Code Review Needed Ready for review and testing label Feb 15, 2024

robn force-pushed the fdt-rel-flat branch from 70b998c to a20dbec Compare February 15, 2024 19:58

amotin reviewed May 5, 2024

robn force-pushed the fdt-rel-flat branch from a20dbec to 2e6efcb Compare May 15, 2024 03:39

robn force-pushed the fdt-rel-flat branch from 2e6efcb to c8bdcc3 Compare May 15, 2024 07:46

amotin approved these changes Jun 11, 2024

robn force-pushed the fdt-rel-flat branch from b3b1f49 to 89c93d2 Compare June 14, 2024 01:35

robn force-pushed the fdt-rel-flat branch 2 times, most recently from 05f309f to b5dfe43 Compare June 14, 2024 02:50

robn force-pushed the fdt-rel-flat branch 2 times, most recently from 0cb2a35 to 062633d Compare June 17, 2024 05:40

amotin reviewed Jun 17, 2024

robn force-pushed the fdt-rel-flat branch from 062633d to 1a6d527 Compare June 18, 2024 04:41

amotin approved these changes Jun 18, 2024

robn force-pushed the fdt-rel-flat branch from 1a6d527 to fc703cb Compare June 20, 2024 06:39

allanjude requested a review from behlendorf June 21, 2024 00:19

robn force-pushed the fdt-rel-flat branch 3 times, most recently from 2540fe6 to ec7acc8 Compare June 25, 2024 01:32

tonyhutter reviewed Jul 16, 2024

tonyhutter reviewed Jul 16, 2024

allanjude force-pushed the fdt-rel-flat branch 3 times, most recently from 596597b to 785f991 Compare July 28, 2024 20:02

allanjude force-pushed the fdt-rel-flat branch from 785f991 to 9de88d4 Compare August 6, 2024 17:16

behlendorf mentioned this pull request Aug 8, 2024

Fast Dedup: FDT-log feature #15895

Closed

13 tasks

robn force-pushed the fdt-rel-flat branch from 9de88d4 to 8f9c87b Compare August 9, 2024 08:55

robn force-pushed the fdt-rel-flat branch from 8f9c87b to 3852876 Compare August 11, 2024 10:39

allanjude force-pushed the fdt-rel-flat branch from 3852876 to 2a46268 Compare August 11, 2024 14:03

behlendorf approved these changes Aug 12, 2024

behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Aug 12, 2024

robn force-pushed the fdt-rel-flat branch from 2a46268 to e119f94 Compare August 13, 2024 11:56

robn and others added 7 commits August 16, 2024 09:59

ZTS: tests for dedup legacy/FDT tables

e957dc8

robn force-pushed the fdt-rel-flat branch from e119f94 to eb0cb79 Compare August 16, 2024 00:00

behlendorf closed this in d17ab63 Aug 16, 2024
