
Shared L2ARC - Proof of Concept #14060

Draft: problame wants to merge 4 commits into master

Conversation

problame (Contributor) commented Oct 20, 2022

I gave a talk on this PoC at the OpenZFS Developer Summit 2022: Wiki, Slides, Recording.

The ARC dynamically shares DRAM capacity among all currently imported zpools. However, the L2ARC does not do the same for block capacity: the L2ARC vdevs of one zpool only cache buffers of that zpool. This can be undesirable on systems that host multiple zpools because it inhibits dynamic sharing of the cache device capacity, which is desirable if the need for L2ARC varies among zpools over time, or if the set of zpools imported on the system varies over time.

Shared L2ARC addresses this need by decoupling the L2ARC vdevs from the zpools that store actual data. The mechanism that we use is to place the L2ARC vdevs into a special zpool, and to adjust the L2ARC feed thread logic to use that special zpool's L2ARC vdevs for all zpools' buffers.
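To make the mechanism concrete, a minimal sketch of how this PoC would be set up, assuming the magic pool name from this patch; the device paths are hypothetical placeholders, and whether the l2arc pool needs a data vdev of its own is not specified by the PoC description:

```shell
# Create the dedicated "l2arc pool" under the magic name reserved by this
# PoC. Its cache vdevs serve buffers from ALL imported pools, not just its own.
# (Device paths are hypothetical; a small data vdev is assumed here because
# stock ZFS requires at least one data vdev per pool.)
zpool create NTNX-fsvm-local-l2arc /dev/loop0 cache /dev/nvme0n1

# Primary pools are created as usual. With this patch, the l2arc feed thread
# writes their evicted ARC buffers to the l2arc pool's cache vdevs.
zpool create tank /dev/sdb
```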

High-level changes:

  • Reserve "NTNX-fsvm-local-l2arc" as a magic zpool name. We call this "the l2arc pool". All other pools are called "primary pools".
  • Make the l2arc feed thread feed ARC buffers from any zpool to the l2arc pool. (Before this patch, the l2arc feed thread would only feed ARC buffers to L2ARC devices belonging to the same spa_t.)
  • Change the locking to ensure that the l2arc zpool cannot be removed while there are ongoing reads initiated by arc_read on one of the primary pools.

This is sufficient and retains correctness of the ARC because nothing about the fundamental operation of L2ARC changes. The only thing that changes is that the L2ARC data is stored on vdevs outside the primary pool.

Proof Of Concept => Production

This commit is a proof-of-concept.
It works, it delivers the desired performance improvement, and it is stable. But more work is needed to make it production-ready.

(1) The design is based on a version of ZFS that supports neither encryption nor Persistent L2ARC. I'm no expert in either of these features. Encryption might work just fine as long as the l2arc feed thread can access the encryption keys for l2arc_apply_transforms.
But Persistent L2ARC definitely needs more design work (multiple L2ARC headers?).

(2) Remove hard-coded magic name; use a property instead. Make it opt-in so that existing setups are not affected. Example:
zpool create -o share_l2arc_vdevs=on my-l2arc-pool

(3) Coexistence with non-shared L2ARC; also via property. Make it opt-in so that existing setups are not affected. Example:
zpool set use_shared_l2arc=on my-data-pool
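Taken together, the opt-in workflow proposed in (2) and (3) might look like this. The property names are the ones sketched above (not implemented in this PoC), and the pool and device names are hypothetical:

```shell
# Proposed future workflow (property names as sketched above, not yet real):
# 1. Create a pool whose vdevs are donated as shared L2ARC capacity.
zpool create -o share_l2arc_vdevs=on my-l2arc-pool /dev/nvme0n1

# 2. Opt each data pool in to using the shared L2ARC; pools that do not
#    set the property keep today's per-pool L2ARC behavior.
zpool set use_shared_l2arc=on my-data-pool
```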

Signed-off-by: Christian Schwarz [email protected]

behlendorf added the Status: Design Review Needed (architecture or design is under discussion) label Oct 20, 2022
jumbi77 (Contributor) commented Oct 21, 2022

Nice idea. Maybe @gamanakis or @Ornias1993 want to take a look at the high-level design, and especially at the persistent L2ARC problem? Thanks in advance to all participants.

include/libzfs.h Outdated
@@ -419,6 +419,11 @@ typedef enum {
ZPOOL_STATUS_NON_NATIVE_ASHIFT, /* (e.g. 512e dev with ashift of 9) */
ZPOOL_STATUS_COMPATIBILITY_ERR, /* bad 'compatibility' property */
ZPOOL_STATUS_INCOMPATIBLE_FEAT, /* feature set outside compatibility */
/*
* Pool won't use the given L2ARC because this software version uses
* the Nutanix shared L2ARC.
A contributor commented:
yeet branding:

Suggested change
* the Nutanix shared L2ARC.
* the shared L2ARC.

PrivatePuffin (Contributor) left a comment

L2ARC being per-pool has been plaguing the viability of multi-pool deployments (for example, a fast and a slow pool) for a while. Even when using multiple SSDs for L2ARC, it would make more sense to have them striped, instead of each serving a different pool.

In the abstract: I like the simplicity of the design.
Though we do need to add/adapt a bunch of tests, because we need to be 300% sure that all edge cases are tested against. But at <300 lines of code currently, this would be an amazing benefit to the project :)

It's also important to thoroughly test this with less common setups like dedup, metadata vdevs, L2ARC defined as metadata-only, etc. Though I do not expect big issues with this.

While at it, though I think it's an extremely niche case, it might be prudent to allow multiple shared-L2ARC groups as well.


Though I do want to highlight that we should get rid of all the brand references. For the review and discussion that follow, it might be nice to do so sooner rather than later ;-)

Now the only reference left is the special pool name.
That whole concept is going to be replaced by zpool properties
in the future.
RealFascinated commented:
Is this PR dead?

problame (Contributor, Author) commented Dec 3, 2023

Sorry for the late reply.

I currently have no plans to pursue this PR any further.

That being said, I think the idea still stands, and it's inevitable for the type of cloud ZFS setups illustrated in my dev summit talk and also @pcd1193182's talk on the shared log pool: EBS-like network disks for bulk storage, local NVMe for acceleration.

Note that similar efforts are underway for the ZIL (shared log pool).

amotin added the Status: Work in Progress (not yet ready for general review) and Status: Stale (no recent activity) labels Oct 29, 2024
The stale bot removed the Status: Stale label Oct 29, 2024