Fast Dedup: Dedup Quota #15889

Merged: 1 commit into openzfs:master on Jul 25, 2024

Conversation

@allanjude (Contributor)

Motivation and Context

Dedup tables can grow without bound: they can consume an entire dedup vdev and then spill into the main pool, or grow too large to fit in RAM, hurting performance. This change adds options that let the administrator set a quota; once it is reached, dedup is effectively disabled for new blocks.

Description

This adds two new pool properties:

  • dedup_table_size, the total size of all DDTs on the pool; and
  • dedup_table_quota, the maximum possible size of all DDTs in the pool

When set, the quota is enforced by a check at the point where a new entry is about to be created. If the pool is over its dedup quota, the entry is not created and the corresponding write is converted to a regular non-dedup write. Note that existing entries can still be updated (i.e. their refcounts changed), since that reuses existing space rather than requiring more.

dedup_table_quota can be set to auto, which will set it based on the size of the devices backing the "dedup" allocation class. This makes it possible to limit the DDTs to the size of a dedup vdev only, such that when the device fills, no new blocks are deduplicated.
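
To make the semantics concrete, here is a minimal standalone C model (illustrative only, not ZFS code). It assumes the internal encodings mentioned later in this review, where 0 stands for 'none' and UINT64_MAX for 'auto'; all sizes below are made up:

    /* Standalone sketch of the quota semantics -- not the ZFS implementation. */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    static uint64_t
    effective_quota(uint64_t quota_prop, uint64_t dedup_class_size)
    {
            if (quota_prop == UINT64_MAX)           /* 'auto' */
                    return (dedup_class_size);      /* size of the backing class */
            return (quota_prop);                    /* explicit limit; 0 = none */
    }

    /* A brand-new DDT entry is refused once the table reaches the quota;
     * updating an existing entry's refcount is always allowed. */
    static bool
    new_entry_allowed(uint64_t ddt_size, uint64_t quota, bool entry_exists)
    {
            if (entry_exists || quota == 0)
                    return (true);
            return (ddt_size < quota);
    }

    int
    main(void)
    {
            uint64_t q = effective_quota(UINT64_MAX, 32ULL << 30); /* 32 GiB dedup vdev */

            printf("%d\n", new_entry_allowed(40ULL << 30, q, false)); /* 0: demoted to a normal write */
            printf("%d\n", new_entry_allowed(40ULL << 30, q, true));  /* 1: refcount update still allowed */
            return (0);
    }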

This replaces #10169

How Has This Been Tested?

Test added.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

@don-brady (Contributor)

Addressed review feedback and rebased to fix merge conflicts

@don-brady (Contributor)

Fixed dedup quota test for Redhat distros

@don-brady (Contributor)

Rebased after ZAP shrinking landed.

@amotin (Member) left a comment


Looks good to me, except some cosmetics in tests.

Just a thought, not directly related to this commit (except maybe the tests), but close: wouldn't it make sense to block dedup for blocks below a certain size via some tunable or property? Dedup for blocks with a 4KB physical size is, IMHO, highly questionable, and for blocks with a 512-byte physical size I'd consider it malicious, unless they somehow have a huge hit rate.
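
Purely to illustrate the reviewer's suggestion (no such tunable exists in this PR; the name and default below are invented), a cutoff could look roughly like this:

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical tunable -- not part of this PR; name and default invented. */
    static const uint64_t dedup_min_psize = 4096;

    static bool
    block_worth_deduping(uint64_t psize)
    {
            /* A 512-byte block costs almost as much to track in the DDT as
             * dedup could ever save on it, so very small blocks bypass dedup. */
            return (psize >= dedup_min_psize);
    }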

Comment on lines 185 to 192:

    ddt_get_ddt_dsize(spa_t *spa)
    {
            ddt_object_t ddo_total;

            ddt_get_dedup_object_stats(spa, &ddo_total);

            return (spa->spa_dedup_dsize);
    }
@amotin (Member) commented Jun 11, 2024


I've got a performance-overhead concern here. If the quota is set to a specific value, ddt_over_quota(), called for each new record, will call ddt_get_ddt_dsize() each time, which in turn calls ddt_get_dedup_object_stats() and recalculates space statistics for all checksums, types and classes. While the requests will likely be cached, I am worried about the CPU overhead. Do we really need to update it each time, given that load and sync already update it and it should not change otherwise? Have I missed an explanation of why we need it?

Contributor:

Yeah, there isn't a need to call it so often. Pushed a fix.

@robn (Member) commented Jun 14, 2024

[Fast dedup stack rebased to master c98295e]

@@ -1032,6 +1139,7 @@ ddt_sync_table(ddt_t *ddt, dmu_tx_t *tx, uint64_t txg)
memcpy(&ddt->ddt_histogram_cache, ddt->ddt_histogram,
sizeof (ddt->ddt_histogram));
spa->spa_dedup_dspace = ~0ULL;
spa->spa_dedup_dsize = ~0ULL;
Member:

Does spa_dedup_dsize need to be wiped anywhere aside from, maybe, ddt_load()? I think wiping it here creates a window in open context, between spa_dedup_dsize being wiped and being reset, during which dedup is disabled for no good reason. I don't think we need to wipe it, only update it to the new up-to-date value during sync.

Contributor:

The performance issue you raised was caused by a past regression in ddt_get_ddt_dsize(). The original mechanism was that spa_dedup_dsize (a cached value) was reset after syncing the DDT; the next reader of ddt_get_ddt_dsize() would see that it had been reset and recalculate it, and the cached value then stayed valid until the next txg sync.

I have restored that original behavior, made sure there are no direct consumers of spa_dedup_dsize, and retested to confirm that we only recalculate after the DDT is updated during sync.
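
A rough sketch of that restored pattern, assuming (as the quoted hunk implies) that ddt_get_dedup_object_stats() refreshes spa_dedup_dsize as a side effect; this is simplified, not the literal upstream code:

    uint64_t
    ddt_get_ddt_dsize(spa_t *spa)
    {
            ddt_object_t ddo_total;

            /* ~0ULL is the "stale" marker set at the end of DDT sync; only
             * then is the expensive recalculation done, and its result is
             * served from spa_dedup_dsize until the next sync. */
            if (spa->spa_dedup_dsize == ~0ULL)
                    ddt_get_dedup_object_stats(spa, &ddo_total);

            return (spa->spa_dedup_dsize);
    }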

Member:

Thank you. That should help with the extra requests. But it seems to me there is another issue, possibly affecting the existing spa_dedup_dspace as well: by this point the DDT ZAP has already been written but not yet synced, which means the dnode does not yet have the new used-space information. So ddt_get_ddt_dsize() may regularly fetch and cache a value that is one TXG old.

@robn (Member) commented Jun 25, 2024

The last push just tightens up number parsing in the dedup_quota test a little; no other changes.

@tonyhutter (Contributor)

dedup_table_quota can be set to auto, which will set it based on the size of the devices backing the "dedup" allocation class.

I assume if you set it to auto and you don't have an alloc class device, then it doesn't place a limit on the DDT size (basically the old dedup behavior)?

If you have a special device but no dedup device, does auto set the DDT size to the special device size, since those can handle dedup allocations?

This adds two new pool properties:
- dedup_table_size, the total size of all DDTs on the pool; and
- dedup_table_quota, the maximum possible size of all DDTs in the pool

When set, quota will be enforced by checking when a new entry is about
to be created. If the pool is over its dedup quota, the entry won't be
created, and the corresponding write will be converted to a regular
non-dedup write. Note that existing entries can be updated (ie their
refcounts changed), as that reuses the space rather than requiring more.

dedup_table_quota can be set to 'auto', which will set it based on the
size of the devices backing the "dedup" allocation device. This makes it
possible to limit the DDTs to the size of a dedup vdev only, such that
when the device fills, no new blocks are deduplicated.

Sponsored-by: iXsystems, Inc.
Sponsored-By: Klara Inc.
Co-authored-by: Rob Wing <[email protected]>
Co-authored-by: Sean Eric Fagan <[email protected]>
Co-authored-by: Allan Jude <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Signed-off-by: Don Brady <[email protected]>
@allanjude (Contributor, Author)

dedup_table_quota can be set to auto, which will set it based on the size of the devices backing the "dedup" allocation class.

I assume if you set it to auto and you don't have an alloc class device, then it doesn't place a limit on the DDT size (basically the old dedup behavior)?

If you have a special device but no dedup device, does auto set the DDT size to the special device size, since those can handle dedup allocations?

If you set it to auto and have no dedup vdev, but do have a special vdev AND have zfs_ddt_data_is_special set (the default), then yes, it will use the special device size, respecting the same limits.

       /*
        * For automatic quota, table size is limited by dedup or special class
        */
       if (ddt_special_over_quota(spa, spa_dedup_class(spa)))
               return (B_TRUE);
       else if (spa_special_has_ddt(spa) &&
           ddt_special_over_quota(spa, spa_special_class(spa)))
               return (B_TRUE);

       return (B_FALSE);

Comment on lines +27 to +31
. $STF_SUITE/include/libtest.shlib

DISK=${DISKS%% *}

default_setup $DISK
Contributor:

I don't think you ever use the default pool in dedup_quota.ksh, so I think you can get rid of setup.ksh and cleanup.ksh.

Contributor:

I believe follow-up commits add additional tests which depend on these, so there's no harm in adding them now.

@tonyhutter (Contributor)

I haven't worked much with the dedup code, but I don't see any major issues.

@tonyhutter (Contributor)

If you can fix the minor setup.ksh and cleanup.ksh stuff, then I think we can start wrapping this up.

@behlendorf (Contributor) left a comment

Thanks for addressing my previous comments. Looks good.

* If dedup quota is 0, we translate this into 'none'
* (unless literal is set). And if it is UINT64_MAX
* we translate that as 'automatic' (limit to size of
* the dedicated dedup VDEV. Otherwise, fall throught
Contributor:

Suggested change:

-        * the dedicated dedup VDEV. Otherwise, fall throught
+        * the dedicated dedup VDEV. Otherwise, fall through

@behlendorf removed the "Status: Code Review Needed" (ready for review and testing) label Jul 25, 2024
@behlendorf added the "Status: Accepted" (ready to integrate: reviewed, tested) label Jul 25, 2024
@behlendorf behlendorf merged commit c7ada64 into openzfs:master Jul 25, 2024
22 of 25 checks passed
lundman pushed a commit to openzfsonwindows/openzfs that referenced this pull request Sep 4, 2024
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Signed-off-by: Don Brady <[email protected]>
Co-authored-by: Don Brady <[email protected]>
Co-authored-by: Rob Wing <[email protected]>
Co-authored-by: Sean Eric Fagan <[email protected]>
Closes openzfs#15889
@guenther-alka commented Sep 8, 2024

Some aspects are clear to me, others are not.

Clear:

  • If you set dedup_table_quota to, e.g., 2G without a special or dedup vdev, the table is limited to 2G of RAM usage.
  • If you set auto with a dedup vdev, the quota is the dedup vdev size.

Unclear:

From the comments: "If you set it to auto and have no dedup vdev, but do have a special vdev AND have zfs_ddt_data_is_special set (the default), then yes, it will use the special device size, respecting the same limits."

  • This makes the special vdev worthless for small I/O. auto should then mean a certain percentage rather than the whole special vdev, to allow mixed use of a special vdev for small I/O and dedup by default.
  • If you set it to 100G and have a dedup vdev or a special vdev AND have zfs_ddt_data_is_special set (the default), is the dedup quota then 100G on the special or dedup vdev?
  • If you set it to auto without a special or dedup vdev, does this mean 100% of RAM, or a percentage like 10%, similar to the defaults for the ARC or write cache?
  • With more than one pool, is the RAM quota per pool or across all pools? I suppose you must add up the per-pool values to get the whole RAM consumption.
  • What happens if you remove a special or dedup vdev and RAM is then too small for the dedup table? Or is vdev removal then not possible?
