ZFS Interface for Accelerators (Z.I.A.) #13628
base: master
Conversation
Having not looked at the code at all, "ZFS data structures are still allocated, but their contents are expected to diverge from the offloaded copy as operations are run." strikes fear into my heart. Could, instead, the existing ZFS code be made to appear as a DPUSM so that there's one source of truth and no risk of divergence?
@nwf I should have worded that better. The only data that diverges from ZFS are abd payloads. I intentionally did not deallocate …
Can you elaborate on what you mean by that? Are you saying that the existing functionality should be wrapped into a provider so that it is sort of an offload? If so, that was done to create …
force-pushed from ec83b23 to 977dbca
I'm just trying to understand this architecture... using compression as an example:
I'm confused why
* Providers and offloaders are usually separate entities. However, to
* keep things simple, the kernel offloader is compiled into this
* provider.
... but I still don't understand. If you checked it into the dpusm module, all users of dpusm could use it, not just ZFS. You could also test and develop it independently of ZFS.
@tonyhutter The software provider/kernel offloader is included with this pull request because it links with ZFS and reuses ZFS functions instead of implementing its own operations. The software provider can also be used as an example to show ZFS developers how to create other providers, such as one for the Intel QAT, if someone chooses to do so. The dpusm already has example providers, but they do not have very much code in them. Additionally, the software provider allows for Z.I.A. to be used immediately rather than requiring users to buy hardware accelerators and develop providers.
I think the whole idea is that dpusm should be implementing its own operations, since it's an external module. That's why ZFS would want to call it - because it's more optimized/efficient than ZFS's internal functions. It should be a black box one layer below ZFS. It would be nice if dpusm provided reference implementations for all of its APIs within its own module. That way we can at least functionally test against it. It looks like many of the functions are already implemented:
const dpusm_pf_t example_dpusm_provider_functions = {
.algorithms = dpusm_provider_algorithms,
.alloc = dpusm_provider_alloc,
.alloc_ref = dpusm_provider_alloc_ref,
.get_size = dpusm_provider_get_size,
.free = dpusm_provider_free,
.copy_from_mem = dpusm_provider_copy_from_mem,
.copy_to_mem = dpusm_provider_copy_to_mem,
.mem_stats = NULL,
.zero_fill = NULL,
.all_zeros = NULL,
.compress = NULL,
.decompress = NULL,
.checksum = NULL,
.raid = {
.alloc = NULL,
.free = NULL,
.gen = NULL,
.new_parity = NULL,
.cmp = NULL,
.rec = NULL,
},
.file = {
.open = NULL,
.write = NULL,
.close = NULL,
},
.disk = {
.open = NULL,
.invalidate = NULL,
.write = NULL,
.flush = NULL,
.close = NULL,
},
};
Note: the reference implementations don't have to be optimized, they just have to functionally work. checksum() could literally be a simple xor over the data, for example. compress() could just return a copy of the data with a "0% compression ratio". The other thing that came to mind when looking at all this is that ZFS already has an API for pluggable checksum algorithms (Lines 163 to 202 in cb01da6)
and compression algorithms (Lines 52 to 71 in cb01da6).
I think it would make sense to add dpusm as selectable checksum and compression algorithms as a first step, and after that's checked in, then look into integrating your other accelerated functions into ZFS.
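To make the "simple xor" suggestion above concrete, a functional-but-trivial reference checksum could look like the sketch below. The callback signature is invented for illustration and may not match the actual dpusm_pf_t checksum hook.

#include <stddef.h>
#include <stdint.h>

/* Illustration only: a plain XOR fold over the buffer. Not a real checksum
 * algorithm and not necessarily the real provider callback signature. */
static int
example_provider_checksum(const void *buf, size_t len, uint64_t *out)
{
    const uint8_t *p = buf;
    uint64_t acc = 0;
    size_t i;

    for (i = 0; i < len; i++)
        acc ^= (uint64_t)p[i] << (8 * (i % 8)); /* spread bytes across the 64-bit word */

    *out = acc;
    return (0); /* 0 == success */
}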
You are correct that providers registered to the dpusm should provide better implementations than ZFS. The software provider is special in that there is no backing hardware accelerator - it uses ZFS defined operations. It is not meant to be used for anything other than as an example and for testing. Providers do not implement their own operations. Rather, they are meant to call custom hardware accelerator APIs on behalf of the user to run operations on hardware. The software provider is special in that its "hardware accelerator API" is functions exported from ZFS.
ZFS and Z.I.A. should never reach down into the dpusm or provider to attempt to manipulate data. That is why all of the handles are opaque pointers. A few ZFS pointers do get passed into the provider, but those pointers are simple, such as arrays of handles, or are just passed along without being modified. Similarly, the dpusm, providers, and hardware accelerators never know who they are offloading for or what format the data they are offloading is in. They do not know anything about ZFS structures.
The example you copied contains the minimum set of functions required to have a valid provider. It shows how to create wrappers around opaque hardware accelerator handles. Providers are not expected to have all operations defined. A reference implementation like the one you recommend would effectively be a bunch of no-ops which may as well not exist.
That is what the software provider is, except with real operations, and shows that Z.I.A. works. The software provider is not meant for speed. If anything, it is slower than raw ZFS since it performs memcpys to move data out of ZFS memory space and then runs the same implementations of algorithms that ZFS runs.
I considered doing that early on during development. However, the goal of Z.I.A. is not to add new algorithms and end up with zpools with data encoded with proprietary algorithms. It is to provide alternative data paths for existing algorithms. When hardware accelerators fail, ZFS can still fall back to running the software code path without breaking the zpool. This additionally allows for providers/hardware accelerators to be swapped out or even removed from the system and still have usable zpools.
force-pushed from 0ade432 to 022ff85
I've a problem with this part:
There are a lot of products shipping ZFS, where the product is hardware agnostic. That being said: …
@Ornias1993 Can you elaborate on the QAT issues you have experienced? In theory, ZFS should work with or without Z.I.A. enabled (perhaps the #if guards can be removed when Z.I.A. is merged). It's just that with Z.I.A., operations are accelerated. Data modifications such as compression should always result in data compatible with the ZFS implementation, so that if stock ZFS were loaded after writing with Z.I.A., the data would still be accessible. There is no need to link against the accelerator, since that would be the provider's responsibility. All accelerators would use the same code path within ZFS, so figuring out what is broken would be obvious: either ZFS or the provider/accelerator, never accelerator-specific code in ZFS, because it wouldn't exist.
I'm not having issues. Might be best to reread what I wrote... Downstreams that include ZFS and are hardware agnostic currently do not implement QAT, for example, mainly because it needs to be enabled at build time. This is a major problem with the current QAT implementation, and the same problem is present in this design. These things should be able to be set up AFTER the binary has been built.
@Ornias1993 The configuration can be changed to always try to find dpusm symbols (allowing for it not to be found) and the include guards can be removed so that Z.I.A. always builds. ZFS + Z.I.A. without a dpusm will still run. There is a weak pointer in Z.I.A. that allows for the dpusm to be missing.
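For context, one common C pattern for this kind of optional dependency is a weak symbol that resolves to NULL when the other module is absent. The sketch below only illustrates that general pattern; the type and function names (dpusm_uf_t, dpusm_initialize) are placeholders, not the actual Z.I.A. or dpusm code.

#include <stddef.h>

typedef struct dpusm_uf dpusm_uf_t; /* placeholder for the dpusm "user API" handle */

/* Weak reference: loading succeeds even if the symbol is not provided. */
extern const dpusm_uf_t *dpusm_initialize(void) __attribute__((weak));

static const dpusm_uf_t *
zia_try_get_dpusm(void)
{
    if (dpusm_initialize == NULL) /* dpusm module not present */
        return (NULL);            /* caller falls back to stock ZFS code paths */
    return (dpusm_initialize());
}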
Could you elaborate a bit on the weak pointer bit? That sounds like something which might complicate life on hardened kernels, so I'm curious to see where/how that's implemented upstream if you happen to know. In terms of accelerators being able to fail down to ZFS internal code paths, how much runtime testing is needed when initializing those APIs before we're sure all possible offloaded computational products match the internal ones at the currently running version of ZFS? For example, if an offloader does ZSTD at v1.5 but the ZFS innards move to vX, then how safe is it to fail midstream, so to speak, and fall back to the older compressor?
force-pushed from 0c6cb5d to 6b47b51
force-pushed from 9c676fd to 8dfcc9e
Intel released QAT Zstd v0.2, can Z.I.A. be ready for testing this year? 🙂
force-pushed from 30de080 to 3070faa
force-pushed from 1d35b72 to beeb530
Motivation and Context
ZFS provides many powerful features such as compression, checksumming, and erasure coding. Such operations can be CPU/memory intensive. In particular, compressing with gzip reduces a zpool's performance significantly. Offloading data to hardware accelerators such as the Intel QAT can improve performance. However, offloading stages individually results in many data transfers to and from the accelerators. Z.I.A. provides a write path parallel to the ZIO pipeline that keeps data offloaded for as long as possible and allows for arbitrary accelerators to be used rather than integrating specific accelerators into the ZFS codebase.
Z.I.A. + DPUSM.pdf
Dec 7, 2021 OpenZFS Leadership Meeting
SDC 2022
OpenZFS Developer Summit 2023
Description
The ZIO pipeline has been modified to allow for external, alternative implementations of existing operations to be used. The original ZFS functions remain in the code as fallback in case the external implementation fails.
Definitions:
- Accelerator - an entity (usually hardware) that is intended to accelerate operations
- Offloader - synonym of accelerator; used interchangeably
- Data Processing Unit Services Module (DPUSM) - https://github.com/hpc/dpusm
  - defines a "provider API" for accelerator vendors to set up
  - defines a "user API" for accelerator consumers to call
  - maintains a list of providers and coordinates interactions between providers and consumers
- Provider - a DPUSM wrapper for an accelerator's API
- Offload - moving data from ZFS/memory to the accelerator
- Onload - the opposite of offload
In order for Z.I.A. to be extensible, it does not directly communicate with a fixed accelerator. Rather, Z.I.A. acquires a handle to a DPUSM, which is then used to acquire handles to providers.
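Schematically, that indirection might look like the sketch below; every identifier in it (zia_get_dpusm, dpusm_get_provider) is a placeholder invented for illustration, not the actual dpusm user API.

#include <stddef.h>

/* Hypothetical entry points standing in for the real dpusm user API. */
extern void *zia_get_dpusm(void);
extern void *dpusm_get_provider(void *dpusm, const char *name);

static void *
zia_acquire_provider(const char *provider_name)
{
    void *dpusm = zia_get_dpusm();  /* step 1: handle to the DPUSM */

    if (dpusm == NULL)
        return (NULL);              /* no DPUSM loaded: stay on the stock ZFS code paths */

    /* step 2: handle to a registered provider, looked up by name */
    return (dpusm_get_provider(dpusm, provider_name));
}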
Using ZFS with Z.I.A.:
1. Build and start the DPUSM
2. Implement, build, and register a provider with the DPUSM
3. Reconfigure ZFS with '--with-zia=<DPUSM root>'
4. Rebuild and start ZFS
5. Create a zpool
6. Select the provider: zpool set zia_provider=<provider name> <zpool>
7. Select operations to offload: zpool set zia_<property>=on <zpool>
The operations that have been modified are:
- compression
  - non-raw-writes only
- decompression
- checksum
  - not handling embedded checksums
  - checksum compute and checksum error call the same function
- raidz
  - generation
  - reconstruction
- vdev_file
  - open
  - write
  - close
- vdev_disk
  - open
  - invalidate
  - write
  - flush
  - close
Successful operations do not bring data back into memory after they complete, allowing subsequent offloader operations to reuse the data. This results in only one data movement per ZIO: the transfer at the beginning of the pipeline that is necessary for getting data from ZFS to the accelerator.
When errors occur and the offloaded data is still accessible, the offloaded data will be onloaded (or dropped if it still matches the in-memory copy) for that ZIO pipeline stage and processed with ZFS. This will cause thrashing if a later operation offloads data. This should not happen often, as constant errors (resulting in data movement) are not expected to be the norm.
Unrecoverable errors such as hardware failures will trigger pipeline restarts (if necessary) in order to complete the original ZIO using the software path.
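In rough pseudo-C, the error handling described in the two paragraphs above amounts to something like the following; all names here are invented for illustration.

#include <stdbool.h>

typedef struct stage_ctx stage_ctx_t; /* stand-in for per-stage ZIO state */

/* Hypothetical helpers standing in for Z.I.A. internals. */
extern bool zia_offload_matches_memory(stage_ctx_t *c);
extern void zia_drop_offloaded_copy(stage_ctx_t *c);
extern int  zia_onload(stage_ctx_t *c);
extern int  zfs_software_stage(stage_ctx_t *c);

static int
zia_stage_error_fallback(stage_ctx_t *c)
{
    if (zia_offload_matches_memory(c))
        zia_drop_offloaded_copy(c); /* in-memory copy still matches: just drop the offloaded one */
    else if (zia_onload(c) != 0)
        return (-1);                /* offloaded data unreachable: restart the pipeline in software */

    return (zfs_software_stage(c)); /* redo this stage with the ZFS implementation */
}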
The modifications to ZFS can be thought of as two sets of changes:
- The ZIO write pipeline
  - compression, checksum, RAIDZ generation, and write
  - Each stage starts by offloading data that was not previously offloaded (see the sketch after this list)
  - This allows for ZIOs to be offloaded at any point in the pipeline
- Resilver
  - vdev_raidz_io_done (RAIDZ reconstruction, checksum, and RAIDZ generation), and write
  - Because the core of resilver is vdev_raidz_io_done, data is only offloaded once at the beginning of vdev_raidz_io_done
  - Errors cause data to be onloaded, but will not re-offload in subsequent steps within resilver
  - Write is a separate ZIO pipeline stage, so it will attempt to offload data
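The per-stage offload guard mentioned in the write-pipeline list could be pictured as in the sketch below; zio_like_t, io_zia_handle, and zia_offload are placeholders, not the identifiers used in this pull request.

#include <stddef.h>

typedef struct zio_like {   /* stand-in for the real zio_t */
    void *io_zia_handle;    /* opaque offloader-side handle, NULL when nothing is offloaded */
} zio_like_t;

extern int zia_offload(zio_like_t *z); /* hypothetical: move this ZIO's data to the offloader */

static int
zia_stage_prologue(zio_like_t *z)
{
    if (z->io_zia_handle != NULL)
        return (0);          /* data already resides on the offloader; reuse it */
    return (zia_offload(z)); /* first offloaded stage for this ZIO: the one data movement */
}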
The zio_decompress function has been modified to allow for offloading, but the ZIO read pipeline as a whole has not, so it is not part of the above list.
An example provider implementation can be found in module/zia-software-provider:
- The provider's "hardware" is actually software - data is "offloaded" to memory not owned by ZFS
- Calls ZFS functions in order to not reimplement operations
- Has kernel module parameters that can be used to trigger ZIA_ACCELERATOR_DOWN states for testing pipeline restarts

abd_t, raidz_row_t, and vdev_t have each been given an additional void *<prefix>_zia_handle member. These opaque handles point to data that is located on an offloader. abds are still allocated, but their contents are expected to diverge from the offloaded copy as operations are run (see the illustration below).

Encryption and deduplication are disabled for zpools with Z.I.A. operations enabled.
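As a rough illustration of the handle members just described (only the <prefix>_zia_handle field and its meaning come from this pull request; the surrounding struct is elided):

typedef struct abd_like {
    /* ... existing abd fields elided ... */
    void *abd_zia_handle; /* NULL: no offloaded copy exists.
                           * Non-NULL: opaque handle to provider-owned data on the
                           * offloader; the in-memory payload may diverge from it
                           * as offloaded operations run. */
} abd_like_t;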
Aggregation is disabled for offloaded abds
RPMs will build with Z.I.A.

Signed-off-by: Jason Lee <[email protected]>
TODO/Need help with:
make install
How Has This Been Tested?
Testing was done using FIO and XDD with stripe and raidz 2 zpools writing to direct attached NVMes and NVMe-oF. Tests were performed on Ubuntu 20.04 and Rocky Linux 8.6 running kernels 5.13 and 5.14.
Types of changes
Checklist:
- All commit messages are properly formatted and contain Signed-off-by.