
Commit

Deploy preview for PR 121 🛫
abarciauskas-bgse committed Dec 17, 2024
1 parent aef216b commit 18a72e9
Showing 2 changed files with 3 additions and 2 deletions.
3 changes: 2 additions & 1 deletion pr-preview/pr-121/cloud-optimized-netcdf4-hdf5/index.html
@@ -641,7 +641,8 @@ <h3 class="anchored" data-anchor-id="chunk-shape-vs-chunk-size">Chunk shape vs c
<section id="additional-chunk-shape-and-size-resources" class="level3">
<h3 class="anchored" data-anchor-id="additional-chunk-shape-and-size-resources">Additional chunk shape and size resources</h3>
<ul>
<li><a href="https://www.unidata.ucar.edu/blogs/developer/en/entry/chunking_data_choosing_shapes">Chunking Data: Choosing Shapes</a></li>
<li><a href="https://www.unidata.ucar.edu/blogs/developer/en/entry/chunking_data_choosing_shapes">Unidata Blog: “Chunking Data: Choosing Shapes”</a></li>
<li><a href="https://support.hdfgroup.org/documentation/hdf5-docs/advanced_topics/chunking_in_hdf5.html">HDF Support Site: “Chunking in HDF5”</a></li>
<li>The <a href="https://github.com/jbusecke/pangeo-forge-recipes/blob/dynamic_chunks_2/pangeo_forge_recipes/dynamic_target_chunks.py#L230-L343">dynamic_target_chunks_from_schema</a> function written by <a href="https://github.com/jbusecke">Julius Busecke</a> may help in determining a chunk shape based on a target size and dimension aspect ratio.</li>
</ul>
</section>
2 changes: 1 addition & 1 deletion pr-preview/pr-121/search.json
@@ -1385,7 +1385,7 @@
"href": "cloud-optimized-netcdf4-hdf5/index.html#chunk-size",
"title": "Cloud-Optimized HDF/NetCDF",
"section": "Chunk Size",
"text": "Chunk Size\n\nHow to determine chunk size\nThe uncompressed chunk size is calculated by multiplying the chunk dimensions by the size of the data type. For example, a 3D chunk of 10x100x100 with a float64 data type (8 bytes) results in an uncompressed chunk size of 0.8 MB.\n\n\nHow to choose a chunk size\nThere is no one-size-fits all chunk size and shape as use cases for different products vary. However, chunks should not be “too big” or “too small”.\n\n\nWhen chunks are too small:\n\nExtra metadata may increase file size.\nIt takes extra time to look up each chunk.\nMore network I/O is incurred because each chunk is stored and accessed independently (although contiguous chunks may be accessed by extending the byte range into one request).\n\n\n\nWhen chunks are too big:\n\nAn entire chunk must be read and decompressed to read even a small portion of the data.\nManaging large chunks in memory slows down processing and is more likely to exceed memory and chunk caches.\n\nA chunk size should be selected that is large enough to reduce the number of tasks that parallel schedulers have to think about (which affects overhead) but also small enough so that many of them can fit in memory at once. The Amazon S3 Best Practices says the typical size for byte-range requests is 8-16MB. However, requests for data from contiguous chunks can be merged into 1 HTTP request, so chunks could be much smaller (one recommendation is 100kb to 2mb) (Jelenak 2024).\n\n\n\n\n\n\nNote\n\n\n\nPerformance greatly depends on libraries used to access the data and how they are configured to cache data as well.\n\n\n\n\n\n\n\n\nUncompressed Chunk Size\n\n\n\nWhen designing chunk size, usually the size is for the uncompressed chunk. This is because:\n\nData variability: Because of data variability, you cannot deterministically know the size of each compressed chunk.\nMemory considerations: The uncompressed size determines how much memory must be available for reading and writing each chunk.\n\n\n\n\n\nChunk shape vs chunk size\nThe chunk size must be differentiated from the chunk shape, which is the number of values stored along each dimension in a given chunk. Recommended chunk size depends on a storage system’s (such as S3) characteristics and its interaction with the data access library.\nIn contrast, an optimal chunk shape is use case dependent. For a 3-dimensional dataset (latitude, longitude, time) with a chunk size of 1000, chunk shapes could vary, such as:\n\n10 lat x 10 lon x 10 time,\n20 lat x 50 lon x 1 time, or,\n5 lat x 5 lon x 40 time.\n\nLarger chunks in a given dimension improve read performance in that direction: (3) is best for time-series analysis, (2) for mapping, and (1) is balanced for both. Thus, chunk shape should be chosen based on how the data is expected to be used, as there are trade-offs. A useful approach is to think in terms of “aspect ratios” of chunks, adjusting relative chunk sizes to fit the expected balance of spatial versus time-series analyses (see https://github.com/jbusecke/dynamic_chunks).\n (Shiklomanov 2024)\nA best practice to help determine both chunk size and shape would be to specify some “benchmark use cases” for the data. 
With these use cases in mind, evaluate what chunk shape and size is large enough such that the computation doesn’t result in thousands of jobs and small enough that multiple chunks can be stored in-memory and a library’s buffer cache, such as HDF5’s buffer cache.\n\n\nAdditional chunk shape and size resources\n\nChunking Data: Choosing Shapes\nThe dynamic_target_chunks_from_schema function written by Julius Busecke may help in determing a chunk shape based on a target size and dimension aspect ratio.",
"text": "Chunk Size\n\nHow to determine chunk size\nThe uncompressed chunk size is calculated by multiplying the chunk dimensions by the size of the data type. For example, a 3D chunk of 10x100x100 with a float64 data type (8 bytes) results in an uncompressed chunk size of 0.8 MB.\n\n\nHow to choose a chunk size\nThere is no one-size-fits all chunk size and shape as use cases for different products vary. However, chunks should not be “too big” or “too small”.\n\n\nWhen chunks are too small:\n\nExtra metadata may increase file size.\nIt takes extra time to look up each chunk.\nMore network I/O is incurred because each chunk is stored and accessed independently (although contiguous chunks may be accessed by extending the byte range into one request).\n\n\n\nWhen chunks are too big:\n\nAn entire chunk must be read and decompressed to read even a small portion of the data.\nManaging large chunks in memory slows down processing and is more likely to exceed memory and chunk caches.\n\nA chunk size should be selected that is large enough to reduce the number of tasks that parallel schedulers have to think about (which affects overhead) but also small enough so that many of them can fit in memory at once. The Amazon S3 Best Practices says the typical size for byte-range requests is 8-16MB. However, requests for data from contiguous chunks can be merged into 1 HTTP request, so chunks could be much smaller (one recommendation is 100kb to 2mb) (Jelenak 2024).\n\n\n\n\n\n\nNote\n\n\n\nPerformance greatly depends on libraries used to access the data and how they are configured to cache data as well.\n\n\n\n\n\n\n\n\nUncompressed Chunk Size\n\n\n\nWhen designing chunk size, usually the size is for the uncompressed chunk. This is because:\n\nData variability: Because of data variability, you cannot deterministically know the size of each compressed chunk.\nMemory considerations: The uncompressed size determines how much memory must be available for reading and writing each chunk.\n\n\n\n\n\nChunk shape vs chunk size\nThe chunk size must be differentiated from the chunk shape, which is the number of values stored along each dimension in a given chunk. Recommended chunk size depends on a storage system’s (such as S3) characteristics and its interaction with the data access library.\nIn contrast, an optimal chunk shape is use case dependent. For a 3-dimensional dataset (latitude, longitude, time) with a chunk size of 1000, chunk shapes could vary, such as:\n\n10 lat x 10 lon x 10 time,\n20 lat x 50 lon x 1 time, or,\n5 lat x 5 lon x 40 time.\n\nLarger chunks in a given dimension improve read performance in that direction: (3) is best for time-series analysis, (2) for mapping, and (1) is balanced for both. Thus, chunk shape should be chosen based on how the data is expected to be used, as there are trade-offs. A useful approach is to think in terms of “aspect ratios” of chunks, adjusting relative chunk sizes to fit the expected balance of spatial versus time-series analyses (see https://github.com/jbusecke/dynamic_chunks).\n (Shiklomanov 2024)\nA best practice to help determine both chunk size and shape would be to specify some “benchmark use cases” for the data. 
With these use cases in mind, evaluate what chunk shape and size is large enough such that the computation doesn’t result in thousands of jobs and small enough that multiple chunks can be stored in-memory and a library’s buffer cache, such as HDF5’s buffer cache.\n\n\nAdditional chunk shape and size resources\n\nUnidata Blog: “Chunking Data: Choosing Shapes”\nHDF Support Site: “Chunking in HDF5”\nThe dynamic_target_chunks_from_schema function written by Julius Busecke may help in determing a chunk shape based on a target size and dimension aspect ratio.",
"crumbs": [
"Formats",
"Cloud-Optimized HDF/NetCDF",
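The chunk-size arithmetic described in the changed search.json text above (chunk dimensions multiplied by the size of the data type) can be checked with a few lines of NumPy. This is an illustrative sketch added for clarity, not part of the commit; it uses the 10x100x100 float64 example from the text.

```python
import numpy as np

# Example from the text: a 10 x 100 x 100 chunk of float64 values.
chunk_shape = (10, 100, 100)
dtype = np.dtype("float64")  # 8 bytes per value

# Uncompressed chunk size = number of values per chunk * bytes per value.
chunk_bytes = np.prod(chunk_shape) * dtype.itemsize
print(f"{chunk_bytes / 1e6:.1f} MB")  # -> 0.8 MB
```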

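Relatedly, the chunk shapes discussed in the text (e.g., 20 lat x 50 lon x 1 time, which favors mapping-style spatial reads) are fixed when a dataset is created. A minimal h5py sketch, with a hypothetical file name, dataset name, and array dimensions chosen only for illustration, might look like:

```python
import h5py

# Hypothetical file and dataset; the chunk shape follows the
# "20 lat x 50 lon x 1 time" example from the text (1000 values per chunk).
with h5py.File("example.h5", "w") as f:
    f.create_dataset(
        "temperature",
        shape=(180, 360, 365),  # lat, lon, time (illustrative dimensions)
        chunks=(20, 50, 1),     # chunk shape favoring spatial (map) reads
        dtype="float64",
        compression="gzip",
    )
```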