
CERA approach for chunking STOFS data #29

Open
AtiehAlipour-NOAA opened this issue Jun 6, 2024 · 5 comments

Comments

@AtiehAlipour-NOAA

AtiehAlipour-NOAA commented Jun 6, 2024

CERA uses code to chunk STOFS .nc files before visualization, which makes visualization more efficient. Perhaps we could apply the same code before subsetting STOFS data. The code is in a private repository, but I have access to it and permission to share it exclusively with the STOFS Subsetting Tool development team.

@ChrisBarker-NOAA
Collaborator

It's on our "future" list to look into performance and chunking, so this is great.

The challenge, IIUC, is that to rechunk the data, you need to make a copy of it -- and that can be pretty expensive.

Potentially, the goal could be for STOFS (and other OFSs!) to be re-chunked before being uploaded to the NODD (or even in the original output).

The challenge with that is that an optimum chunking strategy is different depending on the use case, so there may not be a consensus on one "best" way to chunk the data.

Also -- for an unstructured grid, the ordering of the nodes can have a big impact -- does the CERA code reorder the nodes, in addition to re-chunking?

@AtiehAlipour-NOAA
Author

I agree that copying the file might not be a good idea, but I thought that if working with the STOFS data turns out to be slow, rechunking a copy could be worth considering. I also heard in that meeting that they transpose the dimensions before chunking the data, but I couldn't confirm that from the code.
I do not think they reorder the nodes. We might find some relevant material in the JRC code: #19

@AtiehAlipour-NOAA
Author

@omkar-334
Contributor

Do we need to reorder the nodes/chunks in a particular way, or just in whatever way is most efficient?

The xarray docs suggest performing subsetting operations first and then rechunking the dataset before saving.
So I subsetted an ADCIRC dataset (which came out to about 500 MB) and then tried rechunking and saving it to disk (sketch below):

  1. ds.to_netcdf('ds.nc') - The basic method; takes ~6 min to save to disk
  2. ds.load().to_netcdf('ds.nc') - Loading the dataset into memory first (since xarray.open_dataset uses lazy loading); this also took 5-6 min
  3. ds.chunk('auto').to_netcdf('ds.nc') - Automatic chunking; this takes around 7-9 min
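
Roughly, the code for the three variants above (a minimal sketch; `stofs_subset.nc` is a placeholder for the subsetted file, and timings will vary by machine):

```python
import xarray as xr

# open the already-subsetted ADCIRC output lazily (placeholder filename)
ds = xr.open_dataset("stofs_subset.nc")

# 1. write the lazily-loaded dataset straight to disk
ds.to_netcdf("ds_basic.nc")

# 2. load everything into memory first, then write
ds.load().to_netcdf("ds_loaded.nc")

# 3. let dask pick chunk sizes automatically, then write
ds.chunk("auto").to_netcdf("ds_auto.nc")
```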

Most docs say 1 million chunks is an optimal amount of chunks, so if we had to chunk the dataset based on time, do we divide 1 million by ds.time.size to get the number of chunks?

I had looked into the CERA code and I'll try chunking with that too. I think we'll have to modify it to include all variables/attributes of the dataset though.

@ChrisBarker-NOAA
Collaborator

ChrisBarker-NOAA commented Jul 11, 2024

"1 million chunks is an optimal amount of chunks"

There is no optimum for the number of chunks -- the bigger the dataset, the more chunks.

What you want to do is chunk the resulting dataset in a way that is suitable for the intended use-case. I don't think it makes much difference to writing speed, and in any case, that's not what should be optimized for anyway.

The constant challenge with chunking is that there is not one "optimum" chunking -- it all depends on the expected access patterns, and who knows how people may want to access the data. That being said, here is my suggestion (suited to my access patterns :-)), sketched in code after this list:

  • size-one per timestep
  • size-one per depth level (makes accessing one level much faster)
  • "optimum" size along teh node dimension

As to optimum size -- the total number of chunks is irrelevant; it's the size of each chunk that matters. However, the optimum chunk size depends on things like disk cache size and the like that you can't know up front :-(

The good news is that it may not matter much, at least for accessing files on disk.

I did some experiments years ago when working out the default chunking for netCDF4 (it was very broken in its first version), and came to this conclusion:

  • very small chunks (< 125 bytes or so) are VERY BAD.
  • HUGE chunks (100s of MB) are bad.
  • anything else in between (at the time I tested 1kB -- 1MB) didn't make a huge difference.

I settled on a minimum chunk size of 1 kB or so at the time -- though computers are bigger now, so maybe a 1 MB minimum?

I found this: "The Pangeo project has been recommending a chunk size of about 100MB, which originated from the Dask Best Practices."

Maybe that's good? However, my experience was that the performance of many mid-size chunks was similar to fewer, larger ones -- e.g.:

if you need to access ten 10 MB chunks rather than one 100 MB chunk, you won't notice the difference.

The other issue here is re-ordering nodes -- if the nodes aren't well ordered (i.e. nodes that are nearby each other in space are also nearby in node number), then you'll need to access the entire domain every time anyway, making the chunk sizes less critical.

Experimentation will be needed, but I'd try 10 MB -- 100 MB chunks and see how it goes -- if performance is similar, I'd go with the smaller ones; I think that's safer.
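
For a rough sense of what those sizes mean along the node dimension (assuming float64 values and size-one chunks along time and depth, so each chunk is just a run of nodes):

```python
import numpy as np

bytes_per_value = np.dtype("float64").itemsize  # 8 bytes

def nodes_per_chunk(target_mb):
    """Approximate node-dimension chunk length for a target chunk size in MB."""
    return int(target_mb * 1024**2 / bytes_per_value)

print(nodes_per_chunk(10))   # ~1.3 million nodes for a 10 MB chunk
print(nodes_per_chunk(100))  # ~13 million nodes for a 100 MB chunk
```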
