Fast Dedup: Dedup Quota #15889

Merged: 1 commit into openzfs:master on Jul 25, 2024

Conversation

@allanjude (Contributor)

Motivation and Context

Dedup tables can grow without bound: they can consume an entire dedup vdev and then spill into the main pool, or grow too large to fit in RAM, hurting performance. This change adds options that let the administrator set a quota; once it is reached, dedup is effectively disabled for new blocks.

Description

This adds two new pool properties:

  • dedup_table_size, the total size of all DDTs on the pool; and
  • dedup_table_quota, the maximum possible size of all DDTs in the pool

When set, the quota is enforced by a check at the point where a new entry is about to be created. If the pool is over its dedup quota, the entry is not created and the corresponding write is converted to a regular non-dedup write. Note that existing entries can still be updated (i.e. their refcounts changed), since that reuses existing space rather than requiring more.

dedup_table_quota can be set to auto, which will set it based on the size of the devices backing the "dedup" allocation class. This makes it possible to limit the DDTs to the size of a dedup vdev only, such that when the device fills, no new blocks are deduplicated.
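
To make the semantics concrete, here is a minimal standalone C model (illustrative only, not ZFS code). It assumes the internal encodings mentioned later in this review, where 0 stands for 'none' and UINT64_MAX for 'auto'; all sizes below are made up:

    /* Standalone sketch of the quota semantics -- not the ZFS implementation. */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    static uint64_t
    effective_quota(uint64_t quota_prop, uint64_t dedup_class_size)
    {
            if (quota_prop == UINT64_MAX)           /* 'auto' */
                    return (dedup_class_size);      /* size of the backing class */
            return (quota_prop);                    /* explicit limit; 0 = none */
    }

    /* A brand-new DDT entry is refused once the table reaches the quota;
     * updating an existing entry's refcount is always allowed. */
    static bool
    new_entry_allowed(uint64_t ddt_size, uint64_t quota, bool entry_exists)
    {
            if (entry_exists || quota == 0)
                    return (true);
            return (ddt_size < quota);
    }

    int
    main(void)
    {
            uint64_t q = effective_quota(UINT64_MAX, 32ULL << 30); /* 32 GiB dedup vdev */

            printf("%d\n", new_entry_allowed(40ULL << 30, q, false)); /* 0: demoted to a normal write */
            printf("%d\n", new_entry_allowed(40ULL << 30, q, true));  /* 1: refcount update still allowed */
            return (0);
    }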

This replaces #10169

How Has This Been Tested?

Test added.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

@don-brady (Contributor)

Addressed review feedback and rebased to fix merge conflicts

@don-brady (Contributor)

Fixed dedup quota test for Redhat distros

@don-brady (Contributor)

Rebased after ZAP shrinking landed.

@amotin (Member) left a comment


Looks good to me, except some cosmetics in tests.

Just a thought, not directly related to this commit (except maybe the tests), but close: wouldn't it make sense to block dedup for blocks below a certain size via some tunable or property? Dedup for blocks with a 4KB physical size is, IMHO, highly questionable, and for blocks with a 512-byte physical size I'd consider it malicious, unless they somehow have a huge hit rate.
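
Purely to illustrate the reviewer's suggestion (no such tunable exists in this PR; the name and default below are invented), a cutoff could look roughly like this:

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical tunable -- not part of this PR; name and default invented. */
    static const uint64_t dedup_min_psize = 4096;

    static bool
    block_worth_deduping(uint64_t psize)
    {
            /* A 512-byte block costs almost as much to track in the DDT as
             * dedup could ever save on it, so very small blocks bypass dedup. */
            return (psize >= dedup_min_psize);
    }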

Comment on lines 185 to 192:

    ddt_get_ddt_dsize(spa_t *spa)
    {
            ddt_object_t ddo_total;

            ddt_get_dedup_object_stats(spa, &ddo_total);

            return (spa->spa_dedup_dsize);
    }
@amotin (Member) commented Jun 11, 2024


I've got a performance-overhead concern here. If the quota is set to a specific value, ddt_over_quota(), called for each new record, will call ddt_get_ddt_dsize() each time, which in turn calls ddt_get_dedup_object_stats() and recalculates space statistics for all checksums, types and classes. While the requests will likely be cached, I am worried about the CPU overhead. Do we really need to update it each time, given that load and sync already update it and it should not change otherwise? Have I missed an explanation of why we need it?

Contributor:

Yeah, there isn't a need to call it so often. Pushed a fix.

@robn (Member) commented Jun 14, 2024

[Fast dedup stack rebased to master c98295e]

@@ -1032,6 +1139,7 @@ ddt_sync_table(ddt_t *ddt, dmu_tx_t *tx, uint64_t txg)
memcpy(&ddt->ddt_histogram_cache, ddt->ddt_histogram,
sizeof (ddt->ddt_histogram));
spa->spa_dedup_dspace = ~0ULL;
spa->spa_dedup_dsize = ~0ULL;
Member:

Does spa_dedup_dsize need to be wiped anywhere aside from, maybe, ddt_load()? I think wiping it here creates a window in open context, between spa_dedup_dsize being wiped and being reset, during which dedup is disabled for no good reason. I don't think we need to wipe it, only update it to the new up-to-date value during sync.

Contributor:

The performance issue you raised was caused by a past regression in ddt_get_ddt_dsize(). The original mechanism was that spa_dedup_dsize (a cached value) was reset after syncing the DDT; the next reader of ddt_get_ddt_dsize() would see that it had been reset and recalculate it, and the cached value then stayed valid until the next txg sync.

I have restored that original behavior, made sure there are no direct consumers of spa_dedup_dsize, and retested to confirm that we only recalculate after the DDT is updated during sync.
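
A rough sketch of that restored pattern, assuming (as the quoted hunk implies) that ddt_get_dedup_object_stats() refreshes spa_dedup_dsize as a side effect; this is simplified, not the literal upstream code:

    uint64_t
    ddt_get_ddt_dsize(spa_t *spa)
    {
            ddt_object_t ddo_total;

            /* ~0ULL is the "stale" marker set at the end of DDT sync; only
             * then is the expensive recalculation done, and its result is
             * served from spa_dedup_dsize until the next sync. */
            if (spa->spa_dedup_dsize == ~0ULL)
                    ddt_get_dedup_object_stats(spa, &ddo_total);

            return (spa->spa_dedup_dsize);
    }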

Member:

Thank you. That should help with the extra requests. But it seems to me there is another issue, possibly affecting the existing spa_dedup_dspace as well: by this point the DDT ZAP has already been written but not yet synced, which means the dnode does not yet have the new used-space information. So ddt_get_ddt_dsize() may regularly fetch and cache a value that is one TXG old.

@robn (Member) commented Jun 25, 2024

The last push just tightens up number parsing in the dedup_quota test a little; no other changes.

@tonyhutter (Contributor)

dedup_table_quota can be set to auto, which will set it based on the size of the devices backing the "dedup" allocation class.

I assume if you set it to auto and you don't have an alloc class device, then it doesn't place a limit on the DDT size (basically the old dedup behavior)?

If you have a special device but no dedup device, does auto set the DDT size to the special device size, since those can handle dedup allocations?

This adds two new pool properties:
- dedup_table_size, the total size of all DDTs on the pool; and
- dedup_table_quota, the maximum possible size of all DDTs in the pool

When set, quota will be enforced by checking when a new entry is about
to be created. If the pool is over its dedup quota, the entry won't be
created, and the corresponding write will be converted to a regular
non-dedup write. Note that existing entries can be updated (ie their
refcounts changed), as that reuses the space rather than requiring more.

dedup_table_quota can be set to 'auto', which will set it based on the
size of the devices backing the "dedup" allocation device. This makes it
possible to limit the DDTs to the size of a dedup vdev only, such that
when the device fills, no new blocks are deduplicated.

Sponsored-by: iXsystems, Inc.
Sponsored-By: Klara Inc.
Co-authored-by: Rob Wing <[email protected]>
Co-authored-by: Sean Eric Fagan <[email protected]>
Co-authored-by: Allan Jude <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Signed-off-by: Don Brady <[email protected]>
@allanjude (Contributor, Author)

dedup_table_quota can be set to auto, which will set it based on the size of the devices backing the "dedup" allocation class.

I assume if you set it to auto and you don't have an alloc class device, then it doesn't place a limit on the DDT size (basically the old dedup behavior)?

If you have a special device but no dedup device, does auto set the DDT size to the special device size, since those can handle dedup allocations?

If you set it to auto and have no dedup vdev, but do have a special vdev AND have zfs_ddt_data_is_special set (the default), then yes, it will use the special device size, respecting the same limits.

       /*
        * For automatic quota, table size is limited by dedup or special class
        */
       if (ddt_special_over_quota(spa, spa_dedup_class(spa)))
               return (B_TRUE);
       else if (spa_special_has_ddt(spa) &&
           ddt_special_over_quota(spa, spa_special_class(spa)))
               return (B_TRUE);

       return (B_FALSE);

Comment on lines +27 to +31
. $STF_SUITE/include/libtest.shlib

DISK=${DISKS%% *}

default_setup $DISK
Contributor:

I don't think you ever use the default pool in dedup_quota.ksh, so I think you can get rid of setup.ksh and cleanup.ksh.

Contributor:

I believe follow-up commits add additional tests which depend on these, so there's no harm in adding them now.

@tonyhutter (Contributor)

I haven't worked much with the dedup code, but I don't see any major issues.

@tonyhutter (Contributor)

If you can fix the minor setup.ksh and cleanup.ksh stuff, then I think we can start wrapping this up.

@behlendorf (Contributor) left a comment

Thanks for addressing my previous comments. Looks good.

* If dedup quota is 0, we translate this into 'none'
* (unless literal is set). And if it is UINT64_MAX
* we translate that as 'automatic' (limit to size of
* the dedicated dedup VDEV. Otherwise, fall throught
Contributor:

Suggested change:

-        * the dedicated dedup VDEV. Otherwise, fall throught
+        * the dedicated dedup VDEV. Otherwise, fall through

@behlendorf removed the "Status: Code Review Needed" (ready for review and testing) label Jul 25, 2024
@behlendorf added the "Status: Accepted" (ready to integrate: reviewed, tested) label Jul 25, 2024
@behlendorf behlendorf merged commit c7ada64 into openzfs:master Jul 25, 2024
22 of 25 checks passed
lundman pushed a commit to openzfsonwindows/openzfs that referenced this pull request Sep 4, 2024
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Signed-off-by: Don Brady <[email protected]>
Co-authored-by: Don Brady <[email protected]>
Co-authored-by: Rob Wing <[email protected]>
Co-authored-by: Sean Eric Fagan <[email protected]>
Closes openzfs#15889
@guenther-alka commented Sep 8, 2024

Some aspects are clear to me, others are not.

Clear:

  • If you set dedup_table_quota to, e.g., 2G without a special or dedup vdev, the table is limited to 2G of RAM usage.
  • If you set auto with a dedup vdev, the quota is the dedup vdev size.

Unclear:

From the comments: "If you set it to auto and have no dedup vdev, but do have a special vdev AND have zfs_ddt_data_is_special set (the default), then yes, it will use the special device size, respecting the same limits."

  • This makes the special vdev worthless for small I/O. auto should then mean a certain percentage rather than the whole special vdev, to allow mixed use of a special vdev for small I/O and dedup by default.
  • If you set it to 100G and have a dedup vdev or a special vdev AND have zfs_ddt_data_is_special set (the default), is the dedup quota then 100G on the special or dedup vdev?
  • If you set it to auto without a special or dedup vdev, does this mean 100% of RAM, or a percentage like 10%, similar to the defaults for the ARC or write cache?
  • With more than one pool, is the RAM quota per pool or across all pools? I suppose you must add up the per-pool values to get the whole RAM consumption.
  • What happens if you remove a special or dedup vdev and RAM is then too small for the dedup table? Or is vdev removal then not possible?
