
CERA approach for chunking STOFS data #29

Open
AtiehAlipour-NOAA opened this issue Jun 6, 2024 · 5 comments

Comments

@AtiehAlipour-NOAA

AtiehAlipour-NOAA commented Jun 6, 2024

CERA uses code to chunk STOFS .nc files before visualization, which makes visualization more efficient. Perhaps we could apply the same code before subsetting STOFS data. The code is in a private repository, but I have access to it and permission to share it exclusively with the STOFS Subsetting Tool development team.

@ChrisBarker-NOAA
Collaborator

It's on our "future" list to look into performance and chunking, so this is great.

The challenge, IIUC, is that to rechunk the data, you need to make a copy of it -- and that can be pretty expensive.

Potentially, the goal could be for STOFS (and other OFSs!) to be re-chunked before being uploaded to the NODD (or even in the original output).

The challenge with that is that an optimum chunking strategy is different depending on the use case, so there may not be a consensus on one "best" way to chunk the data.

Also -- for an unstructured grid, the ordering of the nodes can have a big impact -- does the CERA code reorder the nodes, in addition to re-chunking?

@AtiehAlipour-NOAA
Author

I agree that copying the file might not be a good idea, but I thought that if working with the STOFS data turns out to be slow, rechunking a copy could be worth considering. I also heard in that meeting that they transpose the dimensions before chunking the data, but I couldn't confirm that from the code.
I do not think they reorder the nodes. We might find some relevant material in the JRC code: #19

@AtiehAlipour-NOAA
Author

@omkar-334
Contributor

Do we need to reorder the nodes/chunks in a particular way, or just in whatever way is most efficient?

The xarray docs suggest performing subsetting operations first and then rechunking the dataset before saving.
So I subsetted an ADCIRC dataset (which came out to about 500 MB) and then tried rechunking and saving it to disk (sketch below):

  1. ds.to_netcdf('ds.nc') - The basic method; takes ~6 min to save to disk
  2. ds.load().to_netcdf('ds.nc') - Loading the dataset into memory first (since xarray.open_dataset uses lazy loading); this also took 5-6 min
  3. ds.chunk('auto').to_netcdf('ds.nc') - Automatic chunking; this takes around 7-9 min
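
Roughly, the code for the three variants above (a minimal sketch; `stofs_subset.nc` is a placeholder for the subsetted file, and timings will vary by machine):

```python
import xarray as xr

# open the already-subsetted ADCIRC output lazily (placeholder filename)
ds = xr.open_dataset("stofs_subset.nc")

# 1. write the lazily-loaded dataset straight to disk
ds.to_netcdf("ds_basic.nc")

# 2. load everything into memory first, then write
ds.load().to_netcdf("ds_loaded.nc")

# 3. let dask pick chunk sizes automatically, then write
ds.chunk("auto").to_netcdf("ds_auto.nc")
```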

Most docs say 1 million chunks is an optimal amount of chunks, so if we had to chunk the dataset based on time, do we divide 1 million by ds.time.size to get the number of chunks?

I had looked into the CERA code and I'll try chunking with that too. I think we'll have to modify it to include all variables/attributes of the dataset though.

@ChrisBarker-NOAA
Collaborator

ChrisBarker-NOAA commented Jul 11, 2024

"1 million chunks is an optimal amount of chunks"

There is no optimum for the number of chunks -- the bigger the dataset, the more chunks.

What you want to do is chunk the resulting dataset in a way that is suitable for the intended use-case. I don't think it makes much difference to writing speed, and in any case, that's not what should be optimized for anyway.

The constant challenge with chunking is that there is not one "optimum" chunking -- it all depends on the expected access patterns, and who knows how people may want to access the data. That being said, here is my suggestion (suited to my access patterns :-)), sketched in code after this list:

  • size-one per timestep
  • size-one per depth level (makes accessing one level much faster)
  • "optimum" size along teh node dimension

As to optimum size -- the total number of chunks is irrelevant; it's the size of each chunk that matters. However, the optimum chunk size depends on things like disk cache size and the like that you can't know up front :-(

The good news is that it may not matter much, at least for accessing files on disk.

I did some experiments years ago when working out the default chunking for netCDF4 (it was very broken in its first version), and came to this conclusion:

  • very small chunks (< 125 bytes or so) are VERY BAD.
  • HUGE chunks (100s of MB) are bad.
  • anything else in between (at the time I tested 1kB -- 1MB) didn't make a huge difference.

I settled on a minimum chunk size of 1 kB or so at the time -- though computers are bigger now, so maybe a 1 MB minimum?

I found this: "The Pangeo project has been recommending a chunk size of about 100MB, which originated from the Dask Best Practices."

Maybe that's good? However, my experience was that the performance of many mid-size chunks was similar to fewer, larger ones -- e.g.:

if you need to access ten 10 MB chunks rather than one 100 MB chunk, you won't notice the difference.

The other issue here is re-ordering nodes -- if the nodes aren't well ordered (i.e. nodes that are nearby each other in space are also nearby in node number), then you'll need to access the entire domain every time anyway, making the chunk sizes less critical.

Experimentation will be needed, but I'd try 10 MB -- 100 MB chunks and see how it goes -- if performance is similar, I'd go with the smaller ones; I think that's safer.
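
For a rough sense of what those sizes mean along the node dimension (assuming float64 values and size-one chunks along time and depth, so each chunk is just a run of nodes):

```python
import numpy as np

bytes_per_value = np.dtype("float64").itemsize  # 8 bytes

def nodes_per_chunk(target_mb):
    """Approximate node-dimension chunk length for a target chunk size in MB."""
    return int(target_mb * 1024**2 / bytes_per_value)

print(nodes_per_chunk(10))   # ~1.3 million nodes for a 10 MB chunk
print(nodes_per_chunk(100))  # ~13 million nodes for a 100 MB chunk
```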
