Incorporate comments on page size
abarciauskas-bgse committed Dec 17, 2024
1 parent 8c1ce26 commit a39e494
Showing 2 changed files with 46 additions and 23 deletions.
69 changes: 46 additions & 23 deletions cloud-optimized-netcdf4-hdf5/index.qmd
author: Aimee Barciauskas, Alexey Shikmonalov, Alexsander Jelenak, Luis Lopez
toc-depth: 3
---

The following provides guidance on how to assess and create cloud-optimized HDF5 and NetCDF4 files. This assumes one of those formats is a requirement, usually for archival purposes. If these formats are not a requirement, you may consider a cloud-native format like [Zarr](../zarr/intro.qmd). If these formats are a requirement, you may also consider zarr-readable chunk indexes such as [kerchunk](../kerchunk/intro.qmd) and [VirtualiZarr](https://virtualizarr.readthedocs.io/en/latest/).

::: {.callout-note}
You can skip the background and details by jumping to the [checklist](#cloud-optimized-hdfnetcdf-checklist).
NetCDF and HDF were originally designed with disk access in mind. As Matt Rocklin writes:

> The HDF format is complex and metadata is strewn throughout the file, so that a complex sequence of reads is required to reach a specific chunk of data.

## Why accessing HDF5 on the cloud is slow

In the diagram below, `R0, ..., Rn` represent the metadata requests required. A large number of these requests slows down working with these files in the cloud.

![](../images/why-hdf-on-cloud-is-slow.png)

[@barrett2024]

When reading and writing data from disk, small chunks were preferred because access was fast, and retrieving any part of a chunk involved reading the entire chunk [@h5py_developers]. However, when this same data is stored in the cloud, performance can suffer due to the high number of requests required to access both metadata and raw data. With network access, reducing the number of requests makes access much more efficient.
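
To make that access pattern concrete, here is a minimal sketch (not taken from this guide) of opening an HDF5 file directly from S3 with `s3fs` and `h5py`. The bucket and key are placeholders; every piece of metadata or raw data that h5py touches becomes one or more HTTP range requests against object storage.

```python
import h5py
import s3fs

# Assumes a publicly readable bucket; the object name is a placeholder.
fs = s3fs.S3FileSystem(anon=True)

with fs.open("s3://example-bucket/example.h5", mode="rb") as remote_file:
    with h5py.File(remote_file, mode="r") as f:
        # Listing the file's contents walks metadata scattered through the file,
        # which can translate into many small requests unless that metadata has
        # been consolidated (see below).
        f.visit(print)
```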

A detailed explanation of current best practices for cloud-optimized HDF5 and NetCDF-4 is provided below, followed by a checklist and some how-to guidance for assessing file layout.

Note: NetCDF-4 files are valid HDF5 files; see [Reading and Editing NetCDF-4 Files with HDF5].

To be considered cloud-optimized, the format should support chunking and compression. [NetCDF3](https://docs.unidata.ucar.edu/netcdf-c/current/faq.html) and [HDF4 prior to v4.1](https://docs.hdfgroup.org/archive/support/products/hdf4/HDF-FAQ.html#18) do not support chunking and chunk-level compression, and thus cannot be reformatted to be cloud optimized. The lack of support for chunking and compression along with [other limitations](https://docs.hdfgroup.org/archive/support/products/hdf5_tools/h4toh5/h4vsh5.html) led to the development of NetCDF-4 and HDF5.

## Consolidated Internal File Metadata

Consolidated metadata is a key characteristic of cloud-optimized data and enables "lazy loading" (see `Note` below). Client libraries use file metadata to understand what's in the file and where it is stored. When metadata is scattered across a file (which is the default for HDF5 writing), client applications have to make multiple requests for metadata information.

For HDF5 files, to consolidate metadata, files should be written with the paged aggregation file space management strategy (see also [H5F_FSPACE_STRATEGY_PAGE](https://support.hdfgroup.org/documentation/hdf5-docs/advanced_topics/FileSpaceManagement.html#strategies)). When using this strategy, HDF5 writes data in pages in which metadata is separated from raw data chunks. Further, only files written with paged aggregation can use the HDF5 library's page buffer cache to reduce subsequent data access requests.
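
As an illustration, here is a hedged sketch of writing a file with paged aggregation using h5py, which exposes the file space strategy through the `fs_strategy` and `fs_page_size` arguments, and then reopening it with a page buffer. The file name, page size, dataset shape, and chunking below are illustrative assumptions, not recommendations.

```python
import h5py
import numpy as np

page_size = 8 * 1024 * 1024  # 8 MiB pages; tune so the file's metadata fits in few pages

# Write with the paged aggregation strategy so metadata is packed into pages.
with h5py.File("paged.h5", "w", fs_strategy="page", fs_page_size=page_size) as f:
    f.create_dataset(
        "data",
        data=np.random.rand(100, 200, 200),
        chunks=(10, 100, 100),
        compression="gzip",
    )

# Only paged files can use the page buffer cache; page_buf_size should be at
# least the file's page size.
with h5py.File("paged.h5", "r", page_buf_size=page_size) as f:
    print(f["data"][0, :5, :5])
```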

::: {.callout-note}
### HDF5 File Space Management Strategies

How data and metadata are organized in an HDF5 file is the result of the file space management strategy used when writing the file. An explanation of the different strategies can be found in [HDF Support: File Space Management](https://support.hdfgroup.org/documentation/hdf5-docs/advanced_topics/FileSpaceManagement.html).

Here are a few additional considerations for understanding and implementing the `H5F_FSPACE_STRATEGY_PAGE` strategy:

* **Chunks vs. Pages:** In HDF5, datasets can be chunked, meaning the dataset is divided into smaller blocks of data that can be individually compressed. Pages, on the other hand, represent the smallest unit that HDF5 uses for reading and writing data. To optimize performance, the chunk size should ideally align with the page size or be a multiple thereof. However, a chunk does not have to fit within a single page. Misalignment leads to fragmented chunks spanning multiple pages, which increases read latency. Entire pages are read into memory when accessing chunks or metadata. Only the relevant data (e.g., a specific chunk) is decompressed or used, while other data in the page remains cached in its raw form. (A sketch for checking whether stored chunks cross page boundaries follows this note.)
* **Page Size Considerations:** In HDF5, the page size applies to both metadata and raw data. Therefore, the chosen page size should strike a balance: it must consolidate metadata efficiently while minimizing unused space in raw data chunks. Excess unused space can significantly increase file size, which is typically not a concern for I/O performance when accessing parts of a file. However, increased file size can become a concern for storage costs.
:::
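
As a rough illustration of the alignment point above, the sketch below (a hypothetical helper, not part of this guide) uses h5py's low-level chunk iteration to flag stored chunks that straddle a page boundary. It assumes the file's page size is already known, for example from `h5stat -S` or from the `fs_page_size` used when the file was written, and it requires a reasonably recent HDF5/h5py.

```python
import h5py

def chunks_crossing_pages(path, dataset, page_size):
    """Return the logical offsets of stored chunks that span more than one page."""
    crossing = []
    with h5py.File(path, "r") as f:
        dset_id = f[dataset].id
        for i in range(dset_id.get_num_chunks()):
            info = dset_id.get_chunk_info(i)  # stored (compressed) offset and size
            first_page = info.byte_offset // page_size
            last_page = (info.byte_offset + info.size - 1) // page_size
            if first_page != last_page:
                crossing.append(info.chunk_offset)
    return crossing

# File, dataset name, and page size are the illustrative values used earlier.
print(chunks_crossing_pages("paged.h5", "data", page_size=8 * 1024 * 1024))
```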

::: {.callout-note}
### Lazy loading
Lazy loading is a common term for first loading only metadata, and deferring reading of data values until required by computation.
:::

## Chunk Size

### How to determine chunk size

The uncompressed chunk size is calculated by multiplying the chunk dimensions by the size of the data type. For example, a 3D chunk of 10x100x100 with a float64 data type (8 bytes) results in an uncompressed chunk size of 0.8 MB.
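
The same arithmetic as a small sketch, using NumPy only to look up the data type's size; the chunk shape and dtype are the ones from the example above.

```python
import numpy as np

def uncompressed_chunk_bytes(chunk_shape, dtype):
    """Number of values in a chunk times the size of one value, in bytes."""
    return int(np.prod(chunk_shape)) * np.dtype(dtype).itemsize

print(uncompressed_chunk_bytes((10, 100, 100), "float64"))  # 800000 bytes = 0.8 MB
```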

### How to choose a chunk size

Performance greatly depends on the libraries used to access the data and how they are configured.

::: {.callout-note}
### Uncompressed Chunk Size

When choosing a chunk size, the size usually refers to the _uncompressed_ chunk. This is because:

1. **Data variability:** Because data values vary, you cannot deterministically know the size of each compressed chunk.
2. **Memory considerations:** The uncompressed size determines how much memory must be available for reading and writing each chunk.
:::

### Chunk shape vs chunk size

The chunk size must be differentiated from the chunk shape, which is the number of values stored along each dimension in a given chunk. The recommended chunk size depends on the characteristics of the storage system (such as S3) and its interaction with the data access library.

In contrast, an optimal chunk shape is use case dependent. For a 3-dimensional dataset (latitude, longitude, time) with a chunk size of 1000, chunk shapes could vary, such as:

1. 10 lat x 10 lon x 10 time,
2. 20 lat x 50 lon x 1 time, or
3. 5 lat x 5 lon x 40 time.

Larger chunks in a given dimension improve read performance in that direction: (3) is best for time-series analysis, (2) for mapping, and (1) is balanced for both. Thus, chunk shape should be chosen based on how the data is expected to be used, as there are trade-offs. A useful approach is to think in terms of "aspect ratios" of chunks, adjusting relative chunk sizes to fit the expected balance of spatial versus time-series analyses (see [https://github.com/jbusecke/dynamic_chunks](https://github.com/jbusecke/dynamic_chunks)).
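
A rough sketch of that "aspect ratio" idea (a simplified stand-in for, not a copy of, the dynamic_chunks approach linked above): scale a per-dimension ratio so the resulting chunk holds roughly a target number of values.

```python
import math

def chunk_shape(target_values, aspect_ratio):
    """Scale aspect_ratio so the product of the chunk shape is close to target_values."""
    scale = (target_values / math.prod(aspect_ratio)) ** (1 / len(aspect_ratio))
    return tuple(max(1, round(r * scale)) for r in aspect_ratio)

print(chunk_shape(1000, (1, 1, 1)))   # balanced: (10, 10, 10)
print(chunk_shape(1000, (5, 5, 40)))  # favors the time dimension: (5, 5, 40)
```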
A best practice to help determine both chunk size and shape would be to specify a target chunk size and a dimension aspect ratio suited to the expected access patterns. The following resources may help:
* [Chunking Data: Choosing Shapes](https://www.unidata.ucar.edu/blogs/developer/en/entry/chunking_data_choosing_shapes)
* The [dynamic_target_chunks_from_schema](https://github.com/jbusecke/pangeo-forge-recipes/blob/dynamic_chunks_2/pangeo_forge_recipes/dynamic_target_chunks.py#L230-L343) function written by [Julius Busecke](https://github.com/jbusecke) may help in determing a chunk shape based on a target size and dimension aspect ratio.

## Compression

Compression is the process of reducing the size of data stored on disk by encoding it with an algorithm. This can include scale and offset parameters, which reduce the number of bytes needed to store each value. There are many algorithms for compressing data, and users can even define their own compression algorithms. Data product owners should evaluate what compression algorithm is right for their data.
NASA satellite data is predominantly compressed with the zlib (a.k.a. gzip, deflate) algorithm.
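
For example, here is a hedged sketch of writing a chunked, compressed variable with h5py; zlib/deflate is exposed as `compression="gzip"`, and the dataset name, shape, chunking, and options are illustrative assumptions.

```python
import h5py
import numpy as np

with h5py.File("compressed.h5", "w") as f:
    f.create_dataset(
        "temperature",
        data=np.random.rand(365, 200, 200).astype("float32"),
        chunks=(30, 100, 100),
        compression="gzip",   # zlib/deflate: widely supported by readers
        compression_opts=4,   # deflate level (1 = fastest, 9 = smallest)
        shuffle=True,         # byte shuffle often improves compression ratios
    )
```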

## Data Product Usage Documentation (Tutorials and Examples)

Tutorials and examples are starting points for many data users. These documents should include information on how to read data directly from cloud storage (as opposed to downloading over HTTPS) and how to configure popular libraries for optimizing performance.

For example, the following library defaults will impact performance and are important to consider:

* HDF5 library: The HDF5 library's chunk cache is 1MB by default. This value is configurable. Chunks that don't fit into the chunk cache are discarded and must be re-read from the storage location each time. Learn more: [Improve HDF5 performance using caching](https://www.hdfgroup.org/2022/10/17/improve-hdf5-performance-using-caching/).
* S3FS library: The S3FS library is a popular library for accessing data on AWS's cloud object storage S3. It has a default block size of 5MB ([S3FS API docs](https://s3fs.readthedocs.io/en/stable/api.html#s3fs.core.S3FileSystem)). A configuration sketch covering both of these settings follows this list.
* Additional guidance on h5py, fsspec, and ROS3 libraries for creating and reading HDF5 can be found in @jelenak2024.
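
Here is a configuration sketch (placeholder URL and dataset name) that adjusts both defaults mentioned above: a larger s3fs block size and block cache, plus a larger HDF5 chunk cache via h5py's `rdcc_*` arguments. The specific values are illustrative, not tuned recommendations.

```python
import h5py
import s3fs

fs = s3fs.S3FileSystem(anon=True)
with fs.open(
    "s3://example-bucket/example.h5",
    mode="rb",
    block_size=8 * 1024 * 1024,   # 8 MiB blocks instead of the 5 MB default
    cache_type="blockcache",      # cache whole blocks of the remote file
) as remote_file:
    with h5py.File(
        remote_file,
        mode="r",
        rdcc_nbytes=64 * 1024 * 1024,  # 64 MiB chunk cache instead of 1 MB
        rdcc_nslots=1_000_003,         # large prime: fewer cache collisions
    ) as f:
        print(f["data"].shape)  # "data" is a hypothetical dataset name
```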

### Additional research
Here is some additional research done on caching for specific libraries and data:
- In the issue [Optimize s3fs read cache settings for the GEDI Subsetter](https://github.com/MAAP-Project/gedi-subsetter/issues/77) (findings to be formalized), Chuck Daniels found the "all" cache type (cache entire contents), a block size of 8MB, and `fill_cache=True` to deliver the best performance. NOTE: This is for non-cloud-optimized data.
- In [HDF at the Speed of Zarr](https://docs.google.com/presentation/d/1iYFvGt9Zz0iaTj0STIMbboRKcBGhpOH_LuLBLqsJAlk/edit?usp=sharing), Luis Lopez demonstrates, using ICESat-2 data, the importance of using similar arguments with fsspec (`blockcache` instead of `all`, though the results in the issue above were not significantly different between these two options), as well as the importance of using nearly equivalent arguments for h5py (the raw data chunk cache nbytes and `page_buf_size`).

# Cloud-Optimized HDF/NetCDF Checklist

Please consider the following when preparing HDF/NetCDF data for use on the cloud:

# How tos

The examples below require that the HDF5 library's command-line tools are installed on your system. These commands will also work for NetCDF-4 files. While you can check chunk size and shape with h5py, h5py is a high-level interface primarily for accessing datasets, attributes, and other basic HDF5 functionality; it does not expose lower-level file options directly.
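
For example, a short sketch of what h5py can report, per dataset: dtype, chunk shape, and compression. The file name is a placeholder.

```python
import h5py

def describe(name, obj):
    # Print chunking and compression for every dataset in the file.
    if isinstance(obj, h5py.Dataset):
        print(name, obj.dtype, "chunks:", obj.chunks, "compression:", obj.compression)

with h5py.File("infile.h5", "r") as f:
    f.visititems(describe)
```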

## Commands in brief:

## How to check for consolidated metadata

To be considered cloud-optimized, HDF5 files should be written with the paged file space management strategy. With this strategy, HDF5 aggregates metadata and raw data into fixed-size pages [@jelenak2023].

You can check the file space management strategy with the command-line `h5stat` tool; for example, `h5stat -S infile.h5` prints a summary of the file space information, including the strategy and page size.

You can use `h5repack` to reorganize the metadata [@jelenak2023]. When repacking, the `-S PAGE` option selects the paged aggregation strategy and `-G` sets the file space page size in bytes:

```
$ h5repack -S PAGE -G 4000000 infile.h5 outfile.h5
```
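
Because `h5repack` can be slow when rechunking (see the limitations below), an alternative is to copy datasets into a new paged file with h5py directly. The sketch below is a simplified, hypothetical version of that approach (the NSIDC script linked below is a fuller example); it loads each dataset fully into memory and does not copy group attributes, so it suits modest files only.

```python
import h5py

def repack_paged(src_path, dst_path, page_size=8 * 1024 * 1024, chunks=None):
    """Copy every dataset from src_path into a new paged file at dst_path."""
    with h5py.File(src_path, "r") as src, h5py.File(
        dst_path, "w", fs_strategy="page", fs_page_size=page_size
    ) as dst:
        def copy(name, obj):
            if isinstance(obj, h5py.Dataset):
                out = dst.create_dataset(
                    name,
                    data=obj[...],                 # reads the whole dataset into memory
                    chunks=chunks or obj.chunks,   # optionally rechunk on the way
                    compression=obj.compression,
                )
                for key, value in obj.attrs.items():
                    out.attrs[key] = value
        src.visititems(copy)

repack_paged("infile.h5", "outfile.h5")
```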

::: {.callout-warning}
### Library limitations

* The HDF5 library needs to be configured to use the page aggregated files. When using the HDF5 library you can set [H5Pset_page_buffer_size](https://hdfgroup.github.io/hdf5/develop/group___f_a_p_l.html#title89) and for [h5py File objects](https://docs.h5py.org/en/stable/high/file.html) you can set `page_buf_size` when instantiating the File object.
* h5repack's aggregation is fast but rechunking is slow. You may want to use the h5py library directly to repack. See an example of how to do so in NSIDC's cloud-optimized ICESat-2 repo: [optimize-atl03.py](https://github.com/nsidc/cloud-optimized-icesat2/blob/main/notebooks/optimize-atl03.py).
* The NetCDF library doesn't expose the low-level HDF5 API so one must first create the file with the NetCDF library and then repack it with h5repack or python. See: [Using the HDF5 file space strategy property Unidata/netcdf-c #2871](https://github.com/Unidata/netcdf-c/discussions/2871).

:::

To change a dataset's chunk shape, you can use `h5repack`'s layout (`-l`) option:

```
$ h5repack \
  -l /path/to/dataset:CHUNK=2000 infile.h5 outfile.h5
```

# Closing Thoughts

Many existing HDF5 and NetCDF4 collections use the default file space management strategy. While optimizing these collections would improve performance, it requires significant effort, including benchmark design and development, reprocessing and a deep understanding of HDF5 file space management and caching. When time and resources are limited, tools like [Kerchunk](../kerchunk/intro.qmd) and [VirtualiZarr](https://virtualizarr.readthedocs.io/en/latest/) offer a practical alternative. These tools don’t rechunk the data but instead consolidate metadata, resulting in notable performance improvements.
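
For instance, here is a hedged sketch of building a kerchunk reference set for a single HDF5 file on S3; the bucket and key are placeholders, and the surrounding storage options are assumptions about a publicly readable bucket.

```python
import fsspec
from kerchunk.hdf import SingleHdf5ToZarr

url = "s3://example-bucket/example.h5"  # placeholder object
with fsspec.open(url, mode="rb", anon=True) as f:
    refs = SingleHdf5ToZarr(f, url).translate()

# `refs` is a dict of chunk references. It can be saved as JSON and later used
# as the `fo` argument of an fsspec "reference" filesystem, then opened with
# the Zarr engine (e.g., via xarray) without touching the original file layout.
```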

# References

::: {#refs}
:::
Binary file modified images/why-hdf-on-cloud-is-slow.png