Commit: fixing typos

betolink committed Oct 9, 2024
1 parent 57b255a commit 3f3e439
Showing 3 changed files with 12 additions and 8 deletions.
4 changes: 4 additions & 0 deletions .gitignore
@@ -9,3 +9,7 @@ docs
__pycache__/
/site_libs/manuscript-notebook/
.ipynb_checkpoints/
*.aux
*.log
*.pdf

14 changes: 7 additions & 7 deletions paper.qmd
@@ -90,7 +90,7 @@ date: last-modified

Scientific data from NASA and other agencies are increasingly being distributed from the commercial cloud. Cloud storage enables large-scale workflows and should reduce local storage costs. It also allows the use of scalable on-demand cloud computing resources by individual scientists and the broader scientific community. However, the majority of this scientific data is stored in a format that was not designed for the cloud: the Hierarchical Data Format, or HDF.

The most recent version of the Hierarchical Data Format is HDF5, a common archival format for n-dimensional scientific data; it has been utilized to store valuable information from astrophysics to earth sciences and everything in between. As flexible and powerful as HDF5 can be, it comes with big trade-offs when it’s accessed from remote storage systems.

HDF5 is a complex file format; we can think of it as a file system using a tree-like structure with multiple data types and native data structures. Because of this complexity, the most reliable way of accessing data stored in this format is using the HDF5 C API. Regardless of access pattern, nearly all tools ultimately rely on the HDF5 C library, and this brings a couple of issues that affect the efficiency of accessing this format over the network:

@@ -124,7 +124,7 @@ As a result of community feedback and “hack weeks” organized by NSIDC and UW

We tested access times to original and different configurations of cloud-optimized HDF5 [ATL03 files](https://its-live-data.s3.amazonaws.com/index.html#test-space/cloud-experiments/h5cloud/) stored in AWS S3 buckets in region us-west-2, the region hosting NASA’s Earthdata Cloud archives. Files were accessed using Python tools commonly used by Earth scientists: h5py and Xarray[@Hoyer2017-su]. h5py is a Python wrapper around the HDF5 C API. Xarray^[`h5py` is a dependency of Xarray] is a widely used Python package for working with n-dimensional data. We also tested access times using h5coro, a Python package optimized for reading HDF5 files from S3 buckets, and kerchunk, a tool that creates an efficient lookup table for file chunks to allow performant partial reads of files.
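A minimal sketch of what this access pattern looks like in practice, assuming the bucket allows anonymous reads; the object key is a placeholder under the public test prefix and the ATL03-style group/dataset names are only for illustration:

```python
import h5py
import s3fs
import xarray as xr

fs = s3fs.S3FileSystem(anon=True)  # assumes public read access to the bucket
url = "s3://its-live-data/test-space/cloud-experiments/h5cloud/ATL03_example.h5"  # placeholder key

# h5py: every read goes through the HDF5 C library, which issues roughly one
# HTTP range request per chunk it needs.
with fs.open(url, "rb") as f:
    with h5py.File(f, mode="r") as h5:
        h_ph = h5["gt1l/heights/h_ph"][:]  # ATL03-style path, illustrative

# Xarray: same underlying access pattern through the h5netcdf backend;
# phony_dims handles datasets that lack netCDF dimension scales.
with fs.open(url, "rb") as f:
    ds = xr.open_dataset(f, engine="h5netcdf", group="gt1l/heights", phony_dims="sort").load()
```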

The test files were originally cloud optimized by “repacking” them using a relatively new feature in the HDF5 C API called “paged aggregation”. Paged aggregation does two things: first, it collects file-level metadata from datasets and stores it in dedicated metadata blocks at the front of the file; second, it forces the library to write both data and metadata using these fixed-size pages. Aggregation allows client libraries to read file metadata with only a few requests, using the page size as a fixed request size and overriding the one-request-per-chunk behavior.
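On the read side, a hedged sketch of what using the page size as a fixed request size can look like: recent h5py releases expose the HDF5 page buffer through the `page_buf_size` argument, and fsspec can be told to fetch blocks of the same size. The object key is a placeholder, and the sizes match the 8 MB pages used for repacking:

```python
import h5py
import s3fs

fs = s3fs.S3FileSystem(anon=True)
url = "s3://its-live-data/test-space/cloud-experiments/h5cloud/ATL03_paged_example.h5"  # placeholder key

# Fetch remote data in 8 MB blocks and let HDF5 cache whole pages in memory.
# The page buffer must be at least the file's page size and only applies to
# files that were written with the "page" file-space strategy.
with fs.open(url, "rb", block_size=8_000_000) as f:
    with h5py.File(f, mode="r", page_buf_size=16_000_000) as h5:
        print(list(h5.keys()))  # file-level metadata arrives in a few page-sized reads
```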

::: {#fig-2 fig-env="figure*"}

@@ -135,7 +135,7 @@ shows how file-level metadata and data get internally packed once we use paged aggregation.
:::

As we can see in [@fig-2], when we cloud optimize a file using paged aggregation there are some considerations and behaviors that we had to take into account. The first thing to observe is that
page aggregation will --as we mentioned-- consolidate the file-level metadata at the front of the file and will add information to the so-called superblock^[The HDF5 superblock is a crucial component of the HDF5 file format, acting as the starting point for accessing all data within the file. It stores important metadata such as the version of the file format, pointers to the root group, and addresses for locating different file components].
The next thing to notice is that a single page size is used across the board for both metadata and data; as of October 2024 and version 1.14 of the HDF5 library, the page size cannot dynamically adjust to the total metadata size.
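Because the page size cannot adapt to the metadata size, it is worth checking what strategy and page size an existing file actually uses before deciding how to repack it. A minimal sketch using h5py's low-level property-list API (the filename is a placeholder and the calls assume a recent h5py release):

```python
import h5py

# Inspect how a (repacked) file was written: file-space strategy and page size.
# "repacked_atl03.h5" is a placeholder filename, not one of the benchmark files.
with h5py.File("repacked_atl03.h5", mode="r") as h5:
    fcpl = h5.id.get_create_plist()  # file-creation property list
    print("file space strategy (persist, threshold):", fcpl.get_file_space_strategy())
    print("file space page size:", fcpl.get_file_space_page_size(), "bytes")
```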

::: {#fig-3 fig-env="figure*"}
@@ -146,7 +146,7 @@ shows how file-level metadata and data packing inside aggregated pages leaves unused space.

:::

This one-page-size-for-all approach simplifies how the HDF5 API reads the file (if configured to do so), but it also brings unused page space and chunk over-reads. In the case of the ICESat-2 ATL03 dataset, the data itself has been partitioned so that each granule represents a segment of the satellite orbit. Within the file, the most relevant dataset is chunked using 10,000 items per chunk; with float-32 data and a fast compression setting, the resulting chunk size is on average under 40KB, which is really small for an HTTP request, especially when we have to read the chunks sequentially. Because of these considerations, we opted to test different page sizes and increased chunk sizes. The following table describes the different configurations used in our tests.

@@ -173,7 +173,7 @@ to reproduce the results is in the attached notebooks.

shows that using paged aggregation alone is not a complete solution. This behavior is caused by over-reads of data now distributed in pages and the internals of HDF5 not knowing how to optimize
the requests. This means that if we cloud optimize alone and use the same code, in some cases we'll make access to these files even slower. A very important thing to notice here is that rechunking the file, in this case using 10X bigger chunks, results in a predictable 10X improvement in access times without any cloud optimization involved.
Having fewer chunks generates less metadata and bigger requests; in general it is recommended that chunk sizes range between 1MB and 10MB[Add citation, S3 and HDF5], and if we have enough memory and bandwidth even
bigger (Pangeo recommends up to 100MB chunks)[Add citation.]

:::
@@ -202,11 +202,11 @@ Create HDF5 files using paged aggregation by setting HDF5 library parameters:

1. File page strategy: H5F_FSPACE_STRATEGY_PAGE
2. File page size: 8000000
If repacking an existing file, h5repack can be used to alter these variables inside the file:
```bash
h5repack -S PAGE -G 8000000 input.h5 output.h5
```
3. Avoid using unlimited dimensions when creating variables, because the HDF5 API cannot support them inside buffered pages and the representation of these variables is not supported by Kerchunk.
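A minimal sketch of how these settings map onto h5py (assuming a recent h5py release that exposes the file-space strategy keywords; the dataset name, sizes, and compression settings are illustrative rather than the exact configuration of the benchmark files):

```python
import numpy as np
import h5py

# Create a paged-aggregated file: metadata and raw data are written in
# fixed-size 8 MB pages (H5F_FSPACE_STRATEGY_PAGE with a page size of 8000000).
with h5py.File(
    "cloud_optimized_example.h5",
    mode="w",
    fs_strategy="page",
    fs_page_size=8_000_000,
) as h5:
    heights = np.random.random(1_000_000).astype("float32")
    h5.create_dataset(
        "heights/h_ph",        # illustrative path, not the actual ATL03 layout
        data=heights,
        chunks=(100_000,),     # 10x the original 10,000-item chunks (~400 KB of float32)
        compression="gzip",
        compression_opts=1,    # fast compression level
        # fixed-size dimensions only -- no maxshape=(None,) here
    )
```

The same settings can instead be applied to an existing file with the `h5repack` command shown above.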

#### Reasoning

2 changes: 1 addition & 1 deletion site_libs/bootstrap/bootstrap.min.css

Large diffs are not rendered by default.
