CERA approach for chunking STOFS data #29
It's on our "future" list to look into performance and chunking, so this is great. The challenge, IIUC, is that to rechunk the data, you need to make a copy of it -- and that can be pretty expensive. Potentially, the goal could be for STOFS (and other OFSs!) to be re-chunked before being uploaded to the NODD (or even in the original output). The challenge with that is that the optimum chunking strategy differs depending on the use case, so there may not be a consensus on one "best" way to chunk the data. Also -- for an unstructured grid, the ordering of the nodes can have a big impact -- does the CERA code reorder the nodes, in addition to re-chunking?
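For the copy cost specifically, tools like `rechunker` exist to do the rechunk-and-copy out of core with a bounded memory budget. Below is a rough, hypothetical sketch of that idea -- the file, variable, and dimension names ("zeta", "node") are made up, not the actual STOFS layout, and this writes a Zarr copy rather than modifying the file in place:

```python
import xarray as xr
from rechunker import rechunk

# Hypothetical STOFS file and variable ("zeta" over a "node" dimension).
ds = xr.open_dataset("stofs_fields.nc")[["zeta"]]

plan = rechunk(
    ds,
    # One entry per variable; None leaves a coordinate unchunked.
    target_chunks={"zeta": {"time": 24, "node": 100_000}, "time": None},
    max_mem="1GB",                       # working-memory budget for the copy
    target_store="stofs_rechunked.zarr",
    temp_store="rechunk_tmp.zarr",       # scratch space: the copy is the cost
)
plan.execute()
```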
I agree that copying the file might not be a good idea, but I thought that if working with STOFS data is slow, it might be worth trying. I also heard in that meeting that they transpose the dimensions of the files before chunking the data, but I couldn't figure that out from the code.
This is also a relevant library that @SorooshMani-NOAA has shared: |
Do we need to reorder the nodes/chunks in a specific way, or just in whatever way is most efficient? The xarray docs suggest performing any subsetting operations first and then rechunking the dataset before saving (see the sketch below).
Most docs say around 1 million chunks is an optimal amount, so we could chunk the dataset based on that. I had looked into the CERA code and I'll try chunking with that too. I think we'll have to modify it to include all variables/attributes of the dataset, though.
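On that xarray point, a minimal sketch of the subset-then-rechunk pattern might look like the following (file, variable, and dimension names are assumptions, not the actual STOFS layout):

```python
import xarray as xr

# Hypothetical file and dimension names, for illustration only.
ds = xr.open_dataset("stofs_fields.nc", chunks={})

# Subset first (lazy and cheap), then rechunk the result before saving.
subset = ds.isel(time=slice(0, 24))
subset = subset.chunk({"time": 24, "node": 100_000})

# Writing to Zarr preserves the chosen chunking on disk; to_netcdf with a
# per-variable encoding={"...": {"chunksizes": ...}} would also work.
subset.to_zarr("stofs_subset.zarr", mode="w")
```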
There is no optimum for the number of chunks -- the bigger the dataset, the more chunks. What you want to do is chunk the resulting dataset in a way that is suitable for the intended use case. I don't think it makes much difference to writing speed, and in any case, that's not what should be optimized for anyway. The constant challenge with chunking is that there is no one "optimum" chunking -- it all depends on the expected access patterns, and who knows how people may want to access the data. That being said, here is my suggestion (suitable for my access patterns :-)).
As to optimum size -- the total number of chunks is irrelevant; it's the size of each chunk that matters. However, the optimum size of chunks depends on things like disk cache size that you can't know up front :-( The good news is that it may not matter much, at least for accessing files on disk. I did some experiments years ago when working out the default chunking for netCDF4 (it was very broken in its first version), and came to this conclusion:
I concluded a minimum chunk size of 1k or so at the time -- though computers are bigger now -- minimum 1MB chunks? I found this: "The Pangeo project has been recommending a chunk size of about 100MB, which originated from the Dask Best Practices." Maybe that's good? However, my experience shows that the performance of many mid-size chunks is similar to fewer larger ones -- e.g., if you need to access ten 10MB chunks rather than one 100MB chunk, you won't notice the difference. The other issue here is re-ordering nodes -- if the nodes aren't well ordered (i.e., ordered so that nodes near each other in space have nearby node numbers), then you'll need to access the entire domain every time anyway, making the chunk sizes less critical. Experimentation will be needed, but I'd try 10MB -- 100MB chunks and see how it goes -- if performance is similar, I'd go with the smaller ones; I think that's safer.
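To make the sizing arithmetic concrete, here is a small helper along those lines (the (time, node) layout and float64 dtype are assumptions):

```python
import numpy as np

def chunk_shape(n_nodes: int, target_mb: float = 100.0,
                dtype=np.float64, time_per_chunk: int = 1) -> dict:
    """Pick a node-chunk length so each (time, node) chunk is ~target_mb.

    Assumes a 2-D variable with dimensions (time, node), as in many
    unstructured-grid outputs; adjust for other layouts.
    """
    itemsize = np.dtype(dtype).itemsize
    target_bytes = target_mb * 1024**2
    nodes_per_chunk = int(target_bytes / (itemsize * time_per_chunk))
    return {"time": time_per_chunk, "node": min(nodes_per_chunk, n_nodes)}

# e.g. a float64 variable on a hypothetical 12.5M-node mesh,
# one time step per chunk:
print(chunk_shape(12_500_000, target_mb=100))
# -> {'time': 1, 'node': 12500000}  (~95 MiB per chunk)
```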
CERA uses code to chunk STOFS .nc files before visualization, which makes working with them more efficient. Perhaps we can apply the same code before subsetting STOFS data. The code is in a private repository, but I have access to it and permission to share it exclusively with the STOFS Subsetting Tool development team.
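Since that code can't be posted here, below is only a generic stand-in -- not the CERA implementation -- showing what a copy-with-rechunk of a netCDF file can look like with netCDF4-python, carrying over all variables and attributes (per the earlier note about including everything in the dataset):

```python
from netCDF4 import Dataset

def rechunk_copy(src_path, dst_path, chunks):
    """Copy a netCDF file, re-chunking every variable along the way.

    `chunks` maps dimension names to chunk lengths, e.g.
    {"time": 24, "node": 100_000}; unlisted dimensions stay whole.
    """
    with Dataset(src_path) as src, Dataset(dst_path, "w") as dst:
        dst.setncatts({a: src.getncattr(a) for a in src.ncattrs()})
        for name, dim in src.dimensions.items():
            dst.createDimension(name, None if dim.isunlimited() else len(dim))
        for name, var in src.variables.items():
            sizes = [min(chunks.get(d, len(src.dimensions[d])),
                         len(src.dimensions[d])) for d in var.dimensions]
            out = dst.createVariable(
                name, var.datatype, var.dimensions,
                chunksizes=sizes or None,  # None for scalar variables
                fill_value=getattr(var, "_FillValue", None),
            )
            # _FillValue must be set at creation time, so skip it here.
            out.setncatts({a: var.getncattr(a) for a in var.ncattrs()
                           if a != "_FillValue"})
            out[...] = var[...]  # streams the whole variable through memory
```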