
ZFS Interface for Accelerators (Z.I.A.) #13628

Open: calccrypto wants to merge 1 commit into master

Conversation

@calccrypto (Contributor) commented Jul 5, 2022

Motivation and Context

ZFS provides many powerful features such as compression, checksumming, and erasure coding. Such operations can be CPU/memory intensive. In particular, compressing with gzip reduces a zpool's performance significantly. Offloading data to hardware accelerators such as the Intel QAT can improve performance. However, offloading stages individually results in many data transfers to and from the accelerators. Z.I.A. provides a write path parallel to the ZIO pipeline that keeps data offloaded for as long as possible and allows for arbitrary accelerators to be used rather than integrating specific accelerators into the ZFS codebase.

  • Z.I.A. + DPUSM.pdf
  • Dec 7, 2021 OpenZFS Leadership Meeting
  • SDC 2022
  • OpenZFS Developer Summit 2023

Description

The ZIO pipeline has been modified to allow for external, alternative implementations of existing operations to be used. The original ZFS functions remain in the code as fallback in case the external implementation fails.

Definitions:

Accelerator - An entity (usually hardware) that is intended to accelerate operations
Offloader - Synonym of accelerator; used interchangeably
Data Processing Unit Services Module (DPUSM)
  • https://github.com/hpc/dpusm
  • Defines a "provider API" for accelerator vendors to set up
  • Defines a "user API" for accelerator consumers to call
  • Maintains a list of providers and coordinates interactions between providers and consumers
Provider - A DPUSM wrapper for an accelerator's API
Offload - Moving data from ZFS/memory to the accelerator
Onload - The opposite of offload

In order for Z.I.A. to be extensible, it does not directly communicate with a fixed accelerator. Rather, Z.I.A. acquires a handle to a DPUSM, which is then used to acquire handles to providers.

Using ZFS with Z.I.A.:

  1. Build and start the DPUSM
  2. Implement, build, and register a provider with the DPUSM
  3. Reconfigure ZFS with --with-zia=<DPUSM root>
  4. Rebuild and start ZFS
  5. Create a zpool
  6. Select the provider
    zpool set zia_provider=<provider name> <zpool>
  7. Select operations to offload
    zpool set zia_<property>=on <zpool>

The operations that have been modified are:

  • compression
    • non-raw-writes only
  • decompression
  • checksum
    • not handling embedded checksums
    • checksum compute and checksum error call the same function
  • raidz
    • generation
    • reconstruction
  • vdev_file
    • open
    • write
    • close
  • vdev_disk
    • open
    • invalidate
    • write
    • flush
    • close

Successful operations do not bring data back into memory after they complete, allowing subsequent offloader operations to reuse the data. This results in only one data movement per ZIO: the initial transfer at the beginning of the pipeline that moves the data from ZFS to the accelerator.

When errors occur and the offloaded data is still accessible, the offloaded data is onloaded (or dropped if it still matches the in-memory copy) for that ZIO pipeline stage and processed with ZFS. This can cause thrashing if a later operation offloads the data again, but it should not happen often, since constant errors (and the resulting data movement) are not expected to be the norm.

Unrecoverable errors such as hardware failures will trigger pipeline restarts (if necessary) in order to complete the original ZIO using the software path.

The modifications to ZFS can be thought of as two sets of changes:

  • The ZIO write pipeline
    • compression, checksum, RAIDZ generation, and write
    • Each stage starts by offloading data that was not previously offloaded
      • This allows for ZIOs to be offloaded at any point in the pipeline
  • Resilver
    • vdev_raidz_io_done (RAIDZ reconstruction, checksum, and RAIDZ generation), and write
    • Because the core of resilver is vdev_raidz_io_done, data is only offloaded once at the beginning of vdev_raidz_io_done
      • Errors cause data to be onloaded, but it will not be re-offloaded in subsequent steps within resilver
      • Write is a separate ZIO pipeline stage, so it will attempt to offload data

The zio_decompress function has been modified to allow for offloading, but the ZIO read pipeline as a whole has not, so it is not part of the above list.
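
To make the per-stage write-pipeline behavior concrete, each modified stage follows roughly the pattern sketched below: offload if not already offloaded, try the provider, and fall back to the in-memory software path on error. This is illustrative only; the helper names (zia_is_offloaded, zia_offload, zia_onload, zia_provider_op, zio_stage_software) are hypothetical stand-ins for the actual Z.I.A. functions.

/*
 * Illustrative sketch of a Z.I.A.-aware write-pipeline stage.
 * All names here are hypothetical stand-ins.
 */
static int
zio_stage_with_zia(zio_t *zio)
{
    /* Offload the data if an earlier stage has not already done so. */
    if (!zia_is_offloaded(zio->io_abd) &&
        zia_offload(zio->io_abd) != 0)
        return (zio_stage_software(zio));   /* plain ZFS path */

    /* Run the stage on the provider; the data stays on the offloader. */
    if (zia_provider_op(zio) == 0)
        return (0);

    /*
     * Recoverable error: onload the data (or drop the offloaded copy
     * if it still matches memory) and use the software path.
     */
    zia_onload(zio->io_abd);
    return (zio_stage_software(zio));
}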

An example provider implementation can be found in module/zia-software-provider:

  • The provider's "hardware" is actually software - data is "offloaded" to memory not owned by ZFS
  • Calls ZFS functions in order to not reimplement operations
  • Has kernel module parameters that can be used to trigger ZIA_ACCELERATOR_DOWN states for testing pipeline restarts

abd_t, raidz_row_t, and vdev_t have each been given an additional void *<prefix>_zia_handle member. These opaque handles point to data that is located on an offloader. abds are still allocated, but their payloads are expected to diverge from the offloaded copy as operations are run.
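
For illustration only, the shape of that change for abd_t looks roughly like the following; the surrounding fields are elided and the member name simply follows the void *<prefix>_zia_handle pattern described above (raidz_row_t and vdev_t gain analogous members).

typedef struct abd {
    /* ... existing members such as abd_flags and abd_size ... */
    void *abd_zia_handle;   /* opaque handle to the offloaded copy; NULL if not offloaded */
} abd_t;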

Encryption and deduplication are disabled for zpools with Z.I.A. operations enabled.

Aggregation is disabled for offloaded abds.

RPMs will build with Z.I.A.

TODO/Need help with:

  • Fix/Clean up build system
    • autoconf
    • m4
    • rpm spec
    • make install
  • Configuring with Z.I.A. enabled in GitHub Actions
  • Move example provider into contrib?

How Has This Been Tested?

Testing was done using FIO and XDD with stripe and raidz 2 zpools writing to direct-attached NVMe devices and NVMe-oF. Tests were performed on Ubuntu 20.04 and Rocky Linux 8.6 running kernels 5.13 and 5.14.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@nwf (Contributor) commented Jul 6, 2022

Having not looked at the code at all, "ZFS data structures are still allocated, but their contents are expected to diverge from the offloaded copy as operations are run." strikes fear into my heart. Could, instead, the existing ZFS code be made to appear as a DPUSM so that there's one source of truth and no risk of divergence?

@calccrypto (Contributor, Author) commented Jul 6, 2022

@nwf I should have worded that better. The only data that diverge from ZFS are the abd payloads. I intentionally did not deallocate zio->io_abd and rc->rc_abd in order to:

  1. Reuse existing data instead of storing extra data in Z.I.A. in order to recreate the abd.
    • ZFS expects abds to have valid metadata other than the payload, such as abd_flags and abd_size.
    • Z.I.A. handles are stored in the abd_t struct.
  2. Not spend extra time deallocating/reallocating abds in the middle of a pipeline.
  3. Not invalidate reference abds.
  4. Avoid onloads when data is not modified during a ZIO stage, since most stages do not modify the source data.
  5. Maintain an already-allocated location to onload data into when an onload is needed.

Can you elaborate on what you mean by "Could, instead, the existing ZFS code be made to appear as a DPUSM"? Are you saying that the existing functionality should be wrapped into a provider so that it is sort of an offload? If so, that was done to create module/zia-software-provider. However, I do not plan on removing the existing code and leaving only Z.I.A. calls. This was done in anticipation of hardware accelerators failing in live systems: Z.I.A. will return errors, and ZFS falls back to processing with the original code path rather than failing completely.

@behlendorf added the "Type: Feature" and "Status: Design Review Needed" labels on Jul 6, 2022
@tonyhutter (Contributor) commented:

I'm just trying to understand this architecture... using compression as an example:

zio_write_compress()
	zia_compress()
		zio_compress_impl()
			dpusm->compress()
------------------------- dpusm layer -------------------------
				sw_provider_compress()
					kernel_offloader_compress()
						// does a gzip compression

I'm confused why sw_provider_compress() and below were being checked into the ZFS repository, considering they're part of the lower-level "dpusm" layer. I did see this comment:

 * Providers and offloaders are usually separate entities. However, to
 * keep things simple, the kernel offloader is compiled into this
 * provider.

... but I still don't understand. If you checked it into the dpusm module, all users of dpusm could use it, not just ZFS. You could also test and develop it independently of ZFS.

@calccrypto (Contributor, Author) commented Jul 8, 2022

@tonyhutter The software provider/kernel offloader is included with this pull request because it links with ZFS and reuses ZFS functions instead of implementing its own operations. The software provider can also be used as an example to show ZFS developers how to create other providers, such as one for the Intel QAT, if someone chooses to do so. The dpusm already has example providers, but they do not have very much code in them.

Additionally, the software provider allows for Z.I.A. to be used immediately rather than requiring users to buy hardware accelerators and develop providers.

@tonyhutter (Contributor) commented:

The software provider/kernel offloader is included with this pull request because it links with ZFS and reuses ZFS functions instead of implementing its own operations.

I think the whole idea is that dpusm should be implementing its own operations, since it's an external module. That's why ZFS would want to call it - because it's more optimized/efficient than ZFS's internal functions. It should be a black box one layer below ZFS.

It would be nice if dpusm provided reference implementations for all of its APIs within its own module. That way we can at least functionally test against it. It looks like many of the functions are already implemented:

const dpusm_pf_t example_dpusm_provider_functions = {
    .algorithms         = dpusm_provider_algorithms,
    .alloc              = dpusm_provider_alloc,
    .alloc_ref          = dpusm_provider_alloc_ref,
    .get_size           = dpusm_provider_get_size,
    .free               = dpusm_provider_free,
    .copy_from_mem      = dpusm_provider_copy_from_mem,
    .copy_to_mem        = dpusm_provider_copy_to_mem,
    .mem_stats          = NULL,
    .zero_fill          = NULL,
    .all_zeros          = NULL,
    .compress           = NULL,
    .decompress         = NULL,
    .checksum           = NULL,
    .raid               = {
                              .alloc       = NULL,
                              .free        = NULL,
                              .gen         = NULL,
                              .new_parity  = NULL,
                              .cmp         = NULL,
                              .rec         = NULL,
                          },
    .file               = {
                              .open        = NULL,
                              .write       = NULL,
                              .close       = NULL,
                          },
    .disk               = {
                              .open        = NULL,
                              .invalidate  = NULL,
                              .write       = NULL,
                              .flush       = NULL,
                              .close       = NULL,
                          },
};

Note: the reference implementations don't have to be optimized, they just have to functionally work. checksum() could literally be a simple xor over the data, for example. compress() could just return a copy of the data with a "0% compression ratio".
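
As a purely illustrative example of how trivial such a reference callback could be (the signature below is a simplified stand-in, not the actual dpusm provider prototype), an XOR-fold checksum might look like:

#include <stddef.h>
#include <stdint.h>

/* Illustrative reference checksum: XOR-fold the buffer into 8 bytes. */
static int
example_provider_checksum(const void *buf, size_t size, uint64_t *out)
{
    const uint8_t *p = buf;
    uint64_t acc = 0;
    size_t i;

    for (i = 0; i < size; i++)
        acc ^= ((uint64_t)p[i]) << (8 * (i % 8));

    *out = acc;
    return (0);
}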

The other thing that came to mind when looking at all this is that ZFS already has an API for pluggable checksum algorithms:

const zio_checksum_info_t zio_checksum_table[ZIO_CHECKSUM_FUNCTIONS] = {
	{{NULL, NULL}, NULL, NULL, 0, "inherit"},
	{{NULL, NULL}, NULL, NULL, 0, "on"},
	{{abd_checksum_off, abd_checksum_off},
	    NULL, NULL, 0, "off"},
	{{abd_checksum_SHA256, abd_checksum_SHA256},
	    NULL, NULL, ZCHECKSUM_FLAG_METADATA | ZCHECKSUM_FLAG_EMBEDDED,
	    "label"},
	{{abd_checksum_SHA256, abd_checksum_SHA256},
	    NULL, NULL, ZCHECKSUM_FLAG_METADATA | ZCHECKSUM_FLAG_EMBEDDED,
	    "gang_header"},
	{{abd_fletcher_2_native, abd_fletcher_2_byteswap},
	    NULL, NULL, ZCHECKSUM_FLAG_EMBEDDED, "zilog"},
	{{abd_fletcher_2_native, abd_fletcher_2_byteswap},
	    NULL, NULL, 0, "fletcher2"},
	{{abd_fletcher_4_native, abd_fletcher_4_byteswap},
	    NULL, NULL, ZCHECKSUM_FLAG_METADATA, "fletcher4"},
	{{abd_checksum_SHA256, abd_checksum_SHA256},
	    NULL, NULL, ZCHECKSUM_FLAG_METADATA | ZCHECKSUM_FLAG_DEDUP |
	    ZCHECKSUM_FLAG_NOPWRITE, "sha256"},
	{{abd_fletcher_4_native, abd_fletcher_4_byteswap},
	    NULL, NULL, ZCHECKSUM_FLAG_EMBEDDED, "zilog2"},
	{{abd_checksum_off, abd_checksum_off},
	    NULL, NULL, 0, "noparity"},
	{{abd_checksum_SHA512_native, abd_checksum_SHA512_byteswap},
	    NULL, NULL, ZCHECKSUM_FLAG_METADATA | ZCHECKSUM_FLAG_DEDUP |
	    ZCHECKSUM_FLAG_NOPWRITE, "sha512"},
	{{abd_checksum_skein_native, abd_checksum_skein_byteswap},
	    abd_checksum_skein_tmpl_init, abd_checksum_skein_tmpl_free,
	    ZCHECKSUM_FLAG_METADATA | ZCHECKSUM_FLAG_DEDUP |
	    ZCHECKSUM_FLAG_SALTED | ZCHECKSUM_FLAG_NOPWRITE, "skein"},
	{{abd_checksum_edonr_native, abd_checksum_edonr_byteswap},
	    abd_checksum_edonr_tmpl_init, abd_checksum_edonr_tmpl_free,
	    ZCHECKSUM_FLAG_METADATA | ZCHECKSUM_FLAG_SALTED |
	    ZCHECKSUM_FLAG_NOPWRITE, "edonr"},
	{{abd_checksum_blake3_native, abd_checksum_blake3_byteswap},
	    abd_checksum_blake3_tmpl_init, abd_checksum_blake3_tmpl_free,
	    ZCHECKSUM_FLAG_METADATA | ZCHECKSUM_FLAG_DEDUP |
	    ZCHECKSUM_FLAG_SALTED | ZCHECKSUM_FLAG_NOPWRITE, "blake3"},
};

and compression algorithms:

const zio_compress_info_t zio_compress_table[ZIO_COMPRESS_FUNCTIONS] = {
	{"inherit", 0, NULL, NULL, NULL},
	{"on", 0, NULL, NULL, NULL},
	{"uncompressed", 0, NULL, NULL, NULL},
	{"lzjb", 0, lzjb_compress, lzjb_decompress, NULL},
	{"empty", 0, NULL, NULL, NULL},
	{"gzip-1", 1, gzip_compress, gzip_decompress, NULL},
	{"gzip-2", 2, gzip_compress, gzip_decompress, NULL},
	{"gzip-3", 3, gzip_compress, gzip_decompress, NULL},
	{"gzip-4", 4, gzip_compress, gzip_decompress, NULL},
	{"gzip-5", 5, gzip_compress, gzip_decompress, NULL},
	{"gzip-6", 6, gzip_compress, gzip_decompress, NULL},
	{"gzip-7", 7, gzip_compress, gzip_decompress, NULL},
	{"gzip-8", 8, gzip_compress, gzip_decompress, NULL},
	{"gzip-9", 9, gzip_compress, gzip_decompress, NULL},
	{"zle", 64, zle_compress, zle_decompress, NULL},
	{"lz4", 0, lz4_compress_zfs, lz4_decompress_zfs, NULL},
	{"zstd", ZIO_ZSTD_LEVEL_DEFAULT, zfs_zstd_compress_wrap,
	    zfs_zstd_decompress, zfs_zstd_decompress_level},
};

I think it would make sense to add dpusm as selectable checksum and compression algorithms as a first step, and after that's checked in, then look into integrating your other accelerated functions into ZFS.

@calccrypto (Contributor, Author) commented Jul 9, 2022

I think the whole idea is that dpusm should be implementing its own operations, since it's an external module. That's why ZFS would want to call it - because it's more optimized/efficient than ZFS's internal functions.

You are correct that providers registered with the dpusm should provide better implementations than ZFS. The software provider is special in that there is no backing hardware accelerator - it uses ZFS-defined operations. It is not meant to be used for anything other than as an example and for testing.

Providers do not implement their own operations. Rather, they are meant to call custom hardware accelerator APIs on behalf of the user to run operations on hardware. The software provider is special in that its "hardware accelerator API" consists of functions exported from ZFS.

It should be a black box one layer below ZFS.

ZFS and Z.I.A. should never reach down into the dpusm or provider to attempt to manipulate data. That is why all of the handles are opaque pointers. A few ZFS pointers do get passed into the provider, but those pointers are simple, such as arrays of handles, or are just passed along without being modified.

Similarly, the dpusm, providers, and hardware accelerators never know who they are offloading for or what format the data they are offloading is in. They do not know anything about ZFS structures.

It would be nice if dpusm provided reference implementations for all of its APIs within its own module.

The example you copied contains the minimum set of functions required to have a valid provider. It shows how to create wrappers around opaque hardware accelerator handles. Providers are not expected to have all operations defined. A reference implementation like the one you recommend would effectively be a bunch of no-ops which may as well not exist.

That way we can at least functionally test against it. It looks like many of the functions are already implemented:
...
Note: the reference implementations don't have to be optimized, they just have to functionally work. checksum() could literally be a simple xor over the data, for example. compress() could just return a copy of the data with a "0% compression ratio".

That is what the software provider is, except with real operations, and shows that Z.I.A. works. The software provider is not meant for speed. If anything, it is slower than raw ZFS since it performs memcpys to move data out of ZFS memory space and then runs the same implementations of algorithms that ZFS runs.
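
Conceptually, the software provider's compress callback amounts to something like the sketch below; the buffer type and helpers (koff_buf_t, koff_ptr, koff_size) are hypothetical names for the kernel offloader's buffers, and zfs_compress_fn() stands in for the ZFS compression routine the real provider links against.

/* Illustrative only: all names and signatures below are hypothetical stand-ins. */
static int
sw_provider_compress(koff_buf_t *src, koff_buf_t *dst)
{
    /* The data was "offloaded" earlier with a memcpy into non-ZFS memory. */
    void *in = koff_ptr(src);
    void *out = koff_ptr(dst);

    /* Run the same algorithm ZFS would run, just on the offloaded copy. */
    return (zfs_compress_fn(in, out, koff_size(src), koff_size(dst)));
}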

I think it would make sense to add dpusm as selectable checksum and compression algorithms as a first step, and after that's checked in, then look into integrating your other accelerated functions into ZFS.

I considered doing that early on during development. However, the goal of Z.I.A. is not to add new algorithms and end up with zpools with data encoded with proprietary algorithms. It is to provide alternative data paths for existing algorithms. When hardware accelerators fail, ZFS can still fall back to running the software code path without breaking the zpool. This additionally allows for providers/hardware accelerators to be swapped out or even removed from the system and still have usable zpools.

@PrivatePuffin (Contributor) commented:

I've a problem with this part:

Reconfigure ZFS with --with-zia=<DPUSM root>
Rebuild and start ZFS

There are a lot of products shipping ZFS where the product is hardware-agnostic.
We already have this problem with QAT support, where ZFS has to be rebuilt to allow users to use it, which does not work well with hardware-agnostic downstreams.

That being said:
I applaud the idea of more modularity; it's just that this modularity needs to take the above into account as well.

@calccrypto (Contributor, Author) commented Jul 25, 2022

@Ornias1993 Can you elaborate on the QAT issues you have experienced? In theory, ZFS should work with or without Z.I.A. enabled (perhaps the #if guards can be removed when Z.I.A. is merged). It's just that with Z.I.A., operations are accelerated. Data modifications such as compression should always result in data compatible with the ZFS implementation, so that if stock ZFS were loaded after writing with Z.I.A., the data would still be accessible.

There is no need to link against the accelerator, since that is the provider's responsibility. All accelerators would use the same code path within ZFS, so figuring out what is broken would be obvious: either ZFS or the provider/accelerator, never accelerator-specific code in ZFS, because it wouldn't exist.

@PrivatePuffin (Contributor) commented:

I'm not having issues. Might be best to reread what I wrote....

Hardware-agnostic downstreams that ship ZFS currently do not implement QAT, for example, mainly because it needs to be enabled at build time. This is a major problem with the current QAT implementation, and the same problem is present in this design.

These things should be able to be set up AFTER the binary has been built.

@calccrypto (Contributor, Author) commented Jul 25, 2022

@Ornias1993 The configuration can be changed to always try to find dpusm symbols (allowing for it not to be found) and the include guards can be removed so that Z.I.A. always builds. ZFS + Z.I.A. without a dpusm will still run. There is a weak pointer in Z.I.A. that allows for the dpusm to be missing.
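
A minimal sketch of what that implies at runtime (names and the return code are hypothetical, not the actual Z.I.A. code): Z.I.A. looks up the dpusm at module load, and if nothing is found the pointer stays NULL and every Z.I.A. entry point immediately tells ZFS to use its normal software path.

/* Illustrative only: names and ZIA_FALLBACK are hypothetical stand-ins. */
static const void *dpusm_ops;   /* set at module load; NULL when no dpusm module is present */

int
zia_compress(void *handle /* , ... */)
{
    if (dpusm_ops == NULL)
        return (ZIA_FALLBACK);  /* hypothetical "use the software path" code */

    /* ... offload the data and run compression on the provider ... */
    return (0);
}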

@sempervictus (Contributor) commented:

@Ornias1993 The configuration can be changed to always try to find dpusm symbols (allowing for it not to be found) and the include guards can be removed so that Z.I.A. always builds. ZFS + Z.I.A. without a dpusm will still run. There is a weak pointer in Z.I.A. that allows for the dpusm to be missing.

Could you elaborate a bit on the weak pointer bit? That sounds like something which might complicate life on hardened kernels, so I'm curious to see where/how that's implemented upstream, if you happen to know.

In terms of accelerators being able to fail down to ZFS-internal code paths, how much runtime testing is needed when initializing those APIs before we're sure all possible offloaded computational products match the internal ones at the currently running version of ZFS? For example, if an offloader does ZSTD at v1.5 but the ZFS innards move to vX, then how safe is it to fail midstream, so to speak, and fall back to the older compressor?

@reneleonhardt commented:

Intel released QAT Zstd v0.2; can Z.I.A. be ready for testing this year? 🙂
