
Shared L2ARC - Proof of Concept #14060

Draft: problame wants to merge 4 commits into master

Conversation

problame (Contributor) commented Oct 20, 2022

I gave a talk on this PoC at the OpenZFS Developer Summit 2022: Wiki, Slides, Recording.

The ARC dynamically shares DRAM capacity among all currently imported zpools. However, the L2ARC does not do the same for block capacity: the L2ARC vdevs of one zpool only cache buffers of that zpool. This can be undesirable on systems that host multiple zpools because it inhibits dynamic sharing of the cache device capacity, which is desirable if the need for L2ARC varies among zpools over time, or if the set of zpools imported on the system varies over time.

Shared L2ARC addresses this need by decoupling the L2ARC vdevs from the zpools that store actual data. The mechanism that we use is to place the L2ARC vdevs into a special zpool, and to adjust the L2ARC feed thread logic to use that special zpool's L2ARC vdevs for all zpools' buffers.
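To make the mechanism concrete, a minimal sketch of how this PoC would be set up, assuming the magic pool name from this patch; the device paths are hypothetical placeholders, and whether the l2arc pool needs a data vdev of its own is not specified by the PoC description:

```shell
# Create the dedicated "l2arc pool" under the magic name reserved by this
# PoC. Its cache vdevs serve buffers from ALL imported pools, not just its own.
# (Device paths are hypothetical; a small data vdev is assumed here because
# stock ZFS requires at least one data vdev per pool.)
zpool create NTNX-fsvm-local-l2arc /dev/loop0 cache /dev/nvme0n1

# Primary pools are created as usual. With this patch, the l2arc feed thread
# writes their evicted ARC buffers to the l2arc pool's cache vdevs.
zpool create tank /dev/sdb
```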

High-level changes:

  • Reserve "NTNX-fsvm-local-l2arc" as a magic zpool name. We call this "the l2arc pool". All other pools are called "primary pools".
  • Make the l2arc feed thread feed ARC buffers from any zpool to the l2arc pool. (Before this patch, the l2arc feed thread would only feed ARC buffers to L2ARC devices belonging to the same spa_t.)
  • Change the locking to ensure that the l2arc zpool cannot be removed while there are ongoing reads initiated by arc_read on one of the primary pools.

This is sufficient and retains correctness of the ARC because nothing about the fundamental operation of L2ARC changes. The only thing that changes is that the L2ARC data is stored on vdevs outside the primary pool.

Proof Of Concept => Production

This commit is a proof-of-concept.
It works, it delivers the desired performance improvement, and it is stable. But more work is needed to make it production-ready.

(1) The design is based on a version of ZFS that supports neither encryption nor Persistent L2ARC. I'm no expert in either of these features. Encryption might work just fine as long as the l2arc feed thread can access the encryption keys for l2arc_apply_transforms.
But Persistent L2ARC definitely needs more design work (multiple L2ARC headers?).

(2) Remove hard-coded magic name; use a property instead. Make it opt-in so that existing setups are not affected. Example:
zpool create -o share_l2arc_vdevs=on my-l2arc-pool

(3) Coexistence with non-shared L2ARC; also via property. Make it opt-in so that existing setups are not affected. Example:
zpool set use_shared_l2arc=on my-data-pool
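Taken together, the opt-in workflow proposed in (2) and (3) might look like this. The property names are the ones sketched above (not implemented in this PoC), and the pool and device names are hypothetical:

```shell
# Proposed future workflow (property names as sketched above, not yet real):
# 1. Create a pool whose vdevs are donated as shared L2ARC capacity.
zpool create -o share_l2arc_vdevs=on my-l2arc-pool /dev/nvme0n1

# 2. Opt each data pool in to using the shared L2ARC; pools that do not
#    set the property keep today's per-pool L2ARC behavior.
zpool set use_shared_l2arc=on my-data-pool
```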

Signed-off-by: Christian Schwarz [email protected]

behlendorf added the Status: Design Review Needed (architecture or design is under discussion) label Oct 20, 2022
jumbi77 (Contributor) commented Oct 21, 2022

Nice idea. Maybe @gamanakis or @Ornias1993 want to take a look at the high-level design, and especially at the persistent L2ARC problem? Thanks in advance to all participants.

include/libzfs.h Outdated
@@ -419,6 +419,11 @@ typedef enum {
ZPOOL_STATUS_NON_NATIVE_ASHIFT, /* (e.g. 512e dev with ashift of 9) */
ZPOOL_STATUS_COMPATIBILITY_ERR, /* bad 'compatibility' property */
ZPOOL_STATUS_INCOMPATIBLE_FEAT, /* feature set outside compatibility */
/*
* Pool won't use the given L2ARC because this software version uses
* the Nutanix shared L2ARC.
A contributor commented:
yeet branding:

Suggested change
* the Nutanix shared L2ARC.
* the shared L2ARC.

PrivatePuffin (Contributor) left a comment

L2ARC being per-pool has been plaguing the viability of multi-pool deployments (for example, a fast and a slow pool) for a while. Even when using multiple SSDs for L2ARC, it would make more sense to have them striped, instead of each serving a different pool.

In the abstract: I like the simplicity of the design.
Though we do need to add/adapt a bunch of tests, because we need to be 300% sure that all edge cases are tested against. But at <300 lines of code currently, this would be an amazing benefit to the project :)

It's also important to thoroughly test this with less common setups like dedup, metadata vdevs, L2ARC defined as metadata-only, etc. Though I do not expect big issues with this.

While at it, though I think it's an extremely niche case, it might be prudent to allow multiple shared-L2ARC groups as well.


Though I do want to highlight that we should get rid of all the brand references. For the review and discussion that follow, it might be nice to do so sooner rather than later ;-)

Now the only reference left is the special pool name.
That whole concept is going to be replaced by zpool properties
in the future.
RealFascinated commented:
Is this PR dead?

problame (Contributor, Author) commented Dec 3, 2023

Sorry for the late reply.

I currently have no plans to pursue this PR any further.

That being said, I think the idea still stands, and it's inevitable for the type of cloud ZFS setups illustrated in my dev summit talk and also @pcd1193182's talk on the shared log pool: EBS-like network disks for bulk storage, local NVMe for acceleration.

Note that similar efforts are underway for the ZIL (shared log pool).

amotin added the Status: Work in Progress (not yet ready for general review) and Status: Stale (no recent activity) labels Oct 29, 2024
The stale bot removed the Status: Stale label Oct 29, 2024