Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fast Dedup: “flat” DDT entry format #15893

Closed
wants to merge 7 commits into from

Conversation

allanjude
Copy link
Contributor

Motivation and Context

The on-disk and in-memory dedup entry structures are larger than they need to be. Any reduction we can make reduces memory and IO overheads for every entry, which in large dedup tables can be huge.

Description

This slims down the in-memory ddt_entry_t, partly by reorganizing the structure and using narrower types, and partly by moving rarely-used parts out.

This then adds a new variant of the entry format. The traditional format keeps a complete set of 4x DVAs for each possible value of copies= (plus one for the deprecated ditto blocks feature), which makes the in-memory and on-disk entry mostly empty, which is significant wasted overhead. This adds a new “flat” format which only has a single set of DVAs, but can “extend” them if a write requests more (eg writing a block with copies=1, setting copies=2, then copying the block).

How Has This Been Tested?

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

module/zfs/ddt.c Outdated Show resolved Hide resolved
module/zfs/zio.c Outdated Show resolved Hide resolved
@robn
Copy link
Member

robn commented May 15, 2024

[Fast dedup stack rebased to master 3c941d1]

@robn
Copy link
Member

robn commented May 21, 2024

FYI, added one commit, that adds a birth time field to ddt_flat_phys_t. This is for the upcoming pruning feature, but we want the change here so that we don't need an additional on-disk format change.

Copy link
Member

@amotin amotin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My previous questions are still open, but they are minor, while otherwise this part looks good to me.

@robn
Copy link
Member

robn commented Jun 14, 2024

[Fast dedup stack rebased to master c98295e]

@robn robn force-pushed the fdt-rel-flat branch 2 times, most recently from 05f309f to b5dfe43 Compare June 14, 2024 02:50
@robn
Copy link
Member

robn commented Jun 14, 2024

I think this is good to go; I'm just waiting for confirmation internally that this is all we need for the on-disk format changes.

@robn robn force-pushed the fdt-rel-flat branch 2 times, most recently from 0cb2a35 to 062633d Compare June 17, 2024 05:40
@robn
Copy link
Member

robn commented Jun 17, 2024

No format changes, but there is a good chunk of code change to make the split between traditional and flat entries a bit easier to work with. Earlier versions of this code assumed ddt_phys_t is the same in both; change was shoehorned into flat quite late to support prune. This is a much better model for calling code to handle the differences.

We believe this is the final on-disk format, and likely the final code change to support it. We've got some stress testing to do but that's it.

Copy link
Member

@amotin amotin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some cosmetics:

include/sys/ddt.h Outdated Show resolved Hide resolved
include/sys/ddt.h Show resolved Hide resolved
module/zfs/ddt.c Outdated Show resolved Hide resolved
module/zfs/ddt.c Outdated Show resolved Hide resolved
module/zfs/ddt.c Outdated Show resolved Hide resolved
module/zfs/ddt.c Show resolved Hide resolved
module/zfs/ddt_stats.c Outdated Show resolved Hide resolved
include/sys/ddt_impl.h Outdated Show resolved Hide resolved
Copy link
Contributor

@tonyhutter tonyhutter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not very familiar with the DDT code, but I don't see any surface level issues. Please rebase on master though to pull in the latest Fedora ZTS fixes.

@ixhamza
Copy link
Contributor

ixhamza commented Aug 9, 2024

@robn - JFYI, the ztest with your latest push is still failing, similar to #15895 (comment):

leaked space: vdev 0, offset 0xe98da000, size 28672
block traversal size 610742272 != alloc 610770944 (leaked 28672)

ztest: '/sbin/zdb -bccsv -G -d -Y -e -y  -p /var/tmp/zloop-run 14775065184965730235' exit code 2

@robn
Copy link
Member

robn commented Aug 9, 2024

Yeah, on it. The push was actually just a rebase; there's no actual difference dedup-side. The difference now is just that zdb is no longer fit for purpose. I'm right now splicing the relevant bits of the "final" FDT-aware zdb into the right PRs, and that should be that. Should be pushing in the next hour or two.

@robn
Copy link
Member

robn commented Aug 9, 2024

@ixhamza can you confirm, which branch did you see this leak from ztest/zdb on? I understand it on fdt-rel-log, I don't understand it on fdt-rel-feature, and sorta-maybe for fdt-rel-flat (this PR).

As it is, I have some rework to do on the new zdb. I did the work described above, and then was chasing a leak for a while. I just now figured out what's wrong, but its after midnight and I'm toast. So I'll come back to it in the morning.

@ixhamza
Copy link
Contributor

ixhamza commented Aug 9, 2024

@robn - I am just checking the logs from the PR GitHub Actions. They are shown as Zpool-logs-20.04 and Zpool-logs-22.04, which I believe is a typo. If there are no core dump in the zip file, they are most probably failed due to zdb and you can find out the reason in ztest.out. truenas@302fb34 - If it's helpful, I use this just to run ztest GitHub actions in a temporary branch.

@robn
Copy link
Member

robn commented Aug 11, 2024

Last push rebased onto the latest #15892. Small zdb included in this PR to properly count/claim phys blocks extended with additional DVAs.

@behlendorf behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Aug 12, 2024
robn and others added 7 commits August 16, 2024 09:59
This is the supporting infrastructure for the upcoming dedup features.

Traditionally, dedup objects live directly in the MOS root. While their
details vary (checksum, type and class), they are all the same "kind" of
thing - a store of dedup entries.

The new features are more varied than that, and are better thought of as
a set of related stores for the overall state of a dedup table.

This adds a new feature flag, SPA_FEATURE_FAST_DEDUP. Enabling this will
cause new DDTs to be created as a ZAP in the MOS root, named
DDT-<checksum>. The is used as the root object for the normal type/class
store objects, but will also be a place for any storage required by new
features.

This commit adds two new fields to ddt_t, for version and flags. These
are intended to describe the structure and features of the overall dedup
table, and are stored as-is in the DDT root. In this commit, flags are
always zero, but the intent is that they can be used to hang optional
logic or state onto for new dedup features. Version is always 1.

For a "legacy" dedup table, where no DDT root directory exists, the
version will be 0.

ddt_configure() is expected to determine the version and flags features
currently in operation based on whether or not the fast_dedup feature is
enabled, and from what's available on disk. In this way, its possible to
support both old and new tables.

This also provides a migration path. A legacy setup can be upgraded to
FDT by creating the DDT root ZAP, moving the existing objects into it,
and setting version and flags appropriately. There's no support for that
here, but it would be straightforward to add later and allows the
possibility that newer features could be applied to existing dedup
tables.

Co-authored-by: Allan Jude <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Very basic coverage to make sure things appear to work, have the right
format on disk, and pool upgrades and mixed table types work as
expected.

Signed-off-by: Rob Norris <[email protected]>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
The upcoming dedup features break the long held assumption that all
blocks on disk with a 'D' dedup bit will always be present in the DDT,
or will have the same set of DVA allocations on disk as in the DDT.

If the DDT is no longer a complete picture of all the dedup blocks that
will be and should be on disk, then it does us no good to walk and prime
it up front, since it won't necessarily match up with every block we'll
see anyway.

Instead, we rework things here to be more like the BRT checks. When we
see a dedup'd block, we look it up in the DDT, consume a refcount, and
for the second-or-later instances, count them as duplicates.

The DDT and BRT are moved ahead of the space accounting. This will
become important for the "flat" feature, which may need to count a
modified version of the block.

Co-authored-by: Allan Jude <[email protected]>
Co-authored-by: Don Brady <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
The "flat phys" feature will use only a single phys slot for all
entries, which means the old "single", "double" etc naming now makes no
sense, and more importantly, means that choosing the right slot for a
given block pointer will depend on how many slots are in use for a given
DDT.

This removes the old names, and adds accessor macros to decouple
specific phys array indexes from any particular meaning.

(These macros look strange in isolation, mainly in the way they take the
ddt_t* as an arg but don't use it. This is mostly a separate commit to
introduce the concept to the reader before the "flat phys" commit
extends it).

Signed-off-by: Rob Norris <[email protected]>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
The idea here is that sometimes you need the contents of an entry with
no intent to modify it, and/or from a place where its difficult to get
hold of its originating ddt_t to know how to interpret it.

A lightweight entry contains everything you might need to "read" an
entry - its key, type and phys contents - but none of the extras for
modifying it or using it in a larger context. It also has the full
complement of phys slots, so it can represent any kind of dedup entry
without having to know the specific configuration of the table it came
from.

Signed-off-by: Rob Norris <[email protected]>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
This slims down the in-memory entry to as small as it can be. The
IO-related parts are made into a separate entry, since they're
relatively rarely needed.

The variable allocation for dde_phys is to support the upcoming flat
format.

Signed-off-by: Rob Norris <[email protected]>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Traditional dedup keeps a separate ddt_phys_t "type" for each possible
count of DVAs (that is, copies=) parameter. Each of these are tracked
independently of each other, and have their own set of DVAs. This leads
to an (admittedly rare) situation where you can create as many as six
copies of the data, by changing the copies= parameter between copying.
This is both a waste of storage on disk, but also a waste of space in
the stored DDT entries, since there never needs to be more than three
DVAs to handle all possible values of copies=.

This commit adds a new FDT feature, DDT_FLAG_FLAT. When active, only the
first ddt_phys_t is used. Each time a block is written with the dedup
bit set, this single phys is checked to see if it has enough DVAs to
fulfill the request. If it does, the block is filled with the saved DVAs
as normal. If not, an adjusted write is issued to create as many extra
copies as are needed to fulfill the request, which are then saved into
the entry too.

Because a single phys is no longer an all-or-nothing, but can be
transitioning from fewer to more DVAs, the write path now has to keep a
copy of the previous "known good" DVA set so we can revert to it in case
an error occurs. zio_ddt_write() has been restructured and heavily
commented to make it much easier to see what's happening.

Backwards compatibility is maintained simply by allocating four
ddt_phys_t when the DDT_FLAG_FLAT flag is not set, and updating the phys
selection macros to check the flag. In the old arrangement, each number
of copies gets a whole phys, so it will always have either zero or all
necessary DVAs filled, with no in-between, so the old behaviour
naturally falls out of the new code.

Signed-off-by: Rob Norris <[email protected]>
Co-authored-by: Don Brady <[email protected]>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
behlendorf pushed a commit that referenced this pull request Aug 16, 2024
The idea here is that sometimes you need the contents of an entry with
no intent to modify it, and/or from a place where its difficult to get
hold of its originating ddt_t to know how to interpret it.

A lightweight entry contains everything you might need to "read" an
entry - its key, type and phys contents - but none of the extras for
modifying it or using it in a larger context. It also has the full
complement of phys slots, so it can represent any kind of dedup entry
without having to know the specific configuration of the table it came
from.

Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes #15893
behlendorf pushed a commit that referenced this pull request Aug 16, 2024
This slims down the in-memory entry to as small as it can be. The
IO-related parts are made into a separate entry, since they're
relatively rarely needed.

The variable allocation for dde_phys is to support the upcoming flat
format.

Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes #15893
behlendorf pushed a commit that referenced this pull request Aug 16, 2024
Traditional dedup keeps a separate ddt_phys_t "type" for each possible
count of DVAs (that is, copies=) parameter. Each of these are tracked
independently of each other, and have their own set of DVAs. This leads
to an (admittedly rare) situation where you can create as many as six
copies of the data, by changing the copies= parameter between copying.
This is both a waste of storage on disk, but also a waste of space in
the stored DDT entries, since there never needs to be more than three
DVAs to handle all possible values of copies=.

This commit adds a new FDT feature, DDT_FLAG_FLAT. When active, only the
first ddt_phys_t is used. Each time a block is written with the dedup
bit set, this single phys is checked to see if it has enough DVAs to
fulfill the request. If it does, the block is filled with the saved DVAs
as normal. If not, an adjusted write is issued to create as many extra
copies as are needed to fulfill the request, which are then saved into
the entry too.

Because a single phys is no longer an all-or-nothing, but can be
transitioning from fewer to more DVAs, the write path now has to keep a
copy of the previous "known good" DVA set so we can revert to it in case
an error occurs. zio_ddt_write() has been restructured and heavily
commented to make it much easier to see what's happening.

Backwards compatibility is maintained simply by allocating four
ddt_phys_t when the DDT_FLAG_FLAT flag is not set, and updating the phys
selection macros to check the flag. In the old arrangement, each number
of copies gets a whole phys, so it will always have either zero or all
necessary DVAs filled, with no in-between, so the old behaviour
naturally falls out of the new code.

Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Co-authored-by: Don Brady <[email protected]>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes #15893
lundman pushed a commit to openzfsonwindows/openzfs that referenced this pull request Sep 4, 2024
The "flat phys" feature will use only a single phys slot for all
entries, which means the old "single", "double" etc naming now makes no
sense, and more importantly, means that choosing the right slot for a
given block pointer will depend on how many slots are in use for a given
DDT.

This removes the old names, and adds accessor macros to decouple
specific phys array indexes from any particular meaning.

(These macros look strange in isolation, mainly in the way they take the
ddt_t* as an arg but don't use it. This is mostly a separate commit to
introduce the concept to the reader before the "flat phys" commit
extends it).

Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes openzfs#15893
lundman pushed a commit to openzfsonwindows/openzfs that referenced this pull request Sep 4, 2024
The idea here is that sometimes you need the contents of an entry with
no intent to modify it, and/or from a place where its difficult to get
hold of its originating ddt_t to know how to interpret it.

A lightweight entry contains everything you might need to "read" an
entry - its key, type and phys contents - but none of the extras for
modifying it or using it in a larger context. It also has the full
complement of phys slots, so it can represent any kind of dedup entry
without having to know the specific configuration of the table it came
from.

Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes openzfs#15893
lundman pushed a commit to openzfsonwindows/openzfs that referenced this pull request Sep 4, 2024
This slims down the in-memory entry to as small as it can be. The
IO-related parts are made into a separate entry, since they're
relatively rarely needed.

The variable allocation for dde_phys is to support the upcoming flat
format.

Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes openzfs#15893
lundman pushed a commit to openzfsonwindows/openzfs that referenced this pull request Sep 4, 2024
Traditional dedup keeps a separate ddt_phys_t "type" for each possible
count of DVAs (that is, copies=) parameter. Each of these are tracked
independently of each other, and have their own set of DVAs. This leads
to an (admittedly rare) situation where you can create as many as six
copies of the data, by changing the copies= parameter between copying.
This is both a waste of storage on disk, but also a waste of space in
the stored DDT entries, since there never needs to be more than three
DVAs to handle all possible values of copies=.

This commit adds a new FDT feature, DDT_FLAG_FLAT. When active, only the
first ddt_phys_t is used. Each time a block is written with the dedup
bit set, this single phys is checked to see if it has enough DVAs to
fulfill the request. If it does, the block is filled with the saved DVAs
as normal. If not, an adjusted write is issued to create as many extra
copies as are needed to fulfill the request, which are then saved into
the entry too.

Because a single phys is no longer an all-or-nothing, but can be
transitioning from fewer to more DVAs, the write path now has to keep a
copy of the previous "known good" DVA set so we can revert to it in case
an error occurs. zio_ddt_write() has been restructured and heavily
commented to make it much easier to see what's happening.

Backwards compatibility is maintained simply by allocating four
ddt_phys_t when the DDT_FLAG_FLAT flag is not set, and updating the phys
selection macros to check the flag. In the old arrangement, each number
of copies gets a whole phys, so it will always have either zero or all
necessary DVAs filled, with no in-between, so the old behaviour
naturally falls out of the new code.

Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Co-authored-by: Don Brady <[email protected]>
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Closes openzfs#15893
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Accepted Ready to integrate (reviewed, tested)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants