some comments on xarray integration #1016

rabernat · 2018-09-30T14:31:14Z

rabernat
Sep 30, 2018

Hi Parcels Folks,

Congratulations on this amazing package! It is awesome to see how far it has evolved since Erik's Lagrangian meeting at Imperial a few years ago. Both the code itself and the documentation / example sets are just beautiful.

I have been reading through the code and documentation, and I have a few comments that I want to share. Please take these comments with a giant grain of salt...they are meant purely for discussion, not as a criticism of your package, which, as I said above, is awesome and amazing.

Parcels frequently seems to make the assumption that the velocity data resides in netCDF files. This may lead to some problems down the line. There are many cases when the velocities might be in other formats. For example:

in MITgcm, we have data files in a weird MDS format
in Pangeo, we are putting our data in the cloud in zarr format

However, all of these formats can be read into xarray. As a specific example, in the Pangeo Sea Surface Height example, I can load the entire AVISO dataset from zarr format from google cloud storage in one line. I would love to be able to just call ocean parcels on this data and compute Lagrangian trajectories. That is not trivial right now.

So here are some ideas, in increasing order of difficulty / complexity

I believe it might make sense for you to accept xarray objects when creating velocity FieldSet objects, i.e. fieldset.from_xarray(ds). Looking at this code, I don't think this would be too hard. In fact, it might even already work with fieldset.from_data, since it xarray datasets provide a dictionary-like interface. This would solve some of the problems I described above (but not all). You also use xarray internally to read netcdf data, so maybe that path could be followed instead.
It would might not work with the Pangeo AVISO zarr dataset, because you also frequently coerce inputs to numpy arrays. This would trigger dask arrays to compute. It would be great to operate lazily until data is actually required for computation. This might be easy with duck typing, simply by avoiding explicit coercion to numpy arrays. Maybe the netcdf code path is lazier, but, as noted above, it only accepts netcdf files, not generic objects.
More generally, it looks like you are essentially re-implementing large portions of xarray in this package. For example FieldSet is conceptually similar to xarray.Dataset and Field is conceptually similar to xarray.DataArray. Labelled multi-dimensional arrays are an extremely common pattern in geoscience code. Many packages that once had their own implementations of these data structures have refactored to just use xarray (satpy is a great example). Your various Field.search_indices methods are very similar to xarray's indexing operations. Xarray now has multidimensional interpolation based on scipy, which you also implement in parcels. This would of course imply a major refactor of your internal structure, so this is a very presumptuous suggestion on my part. However, there could be major advantages, including
- way less code to maintain
- use xarray's indexers and interpolaters where possible
- dask integration for free

Of course, xarray does not provide all the functionality you need for working with GCM data. Operations related to grid cells are not part of xarray. I am trying (mostly without success) to rally folks to use and contribute to xgcm for this purpose. (So that is part of my ulterior motive here.)

Anyway, of course I don't expect any of these things (other than perhaps 1) to make it onto your todo list any time soon. Clearly this package is amazing and very successful in what it does. But as an xarray developer, oceanographer, and concerned citizen of the scientific python ecosystem, I just can't resist pointing out this opportunity where using xarray more could significantly reduce your development burden and lead to enhanced interoperability.

-Ryan

erikvansebille · 2018-10-03T14:43:30Z

erikvansebille
Oct 3, 2018
Maintainer

Dear Ryan,

Thank you very much for these kind words about parcels, and for your thoughtful comments. Philippe and I have discussed these, and have the following responses and questions on your three ideas

You are absolutely right that we could do a FieldSet.from_xarray(). This would be relatively simple to implement, and we would do it in the coming weeks. Note that we would likely still require a dictionary mapping the ds dimensions to our 'special' lon, lat, time and depth dimensions, as we don’t want to guess which dimension is which. Or do you see a way to automatically detect this?
For the simple FieldSet.from_xarray(), we would indeed coerce into numpy arrays. We do that mostly because of our JIT mode, where the particle interpolation is done in C. So we need to be able to pass arrays between python and C, and I’m not sure how easy that would be in lazy-loaded xarray. However, in the medium term, we’d be happy to explore
You are right that the overhead of needing to maintain the Field and FieldSet objects is a burden. We would love to be able to use xarray.Datasets and xarray.DataArrays. Again, we’re not entirely sure of the status of xarray development, but the reason we implemented our own Field and FieldSets initially was because scipy interpolation was too inefficient for us. In particular, we want local, rather than global interpolation, because we can make the assumption that a particle doesn’t move very far within a timestep. So our Field.search_indices takes the old location as input, and starts searching in the vicinity of that old location. A few years ago, scipy interpolation didn’t have this option, which is why we went down the route of C-interpolation (and hence implementing Just-In-Time compilation and our own codeconverter/compiler, which are a huge effort to maintain too!) Do you know if this is now possible?
Furthermore, as you said, we need special interpolators, especially for C grids. So we would need to think about how to implement these into xarray (or xgcm?)

We would very much welcome the opportunity to work with you and the rest of the xarray and xgcm teams to streamline Parcels and integrate it more. As you say, we citizens of the scientific python ecosystem should aim to reduce unnecessary repetition.

We’ll open a separate Issue for 1), and leave this one open for discussion on the other two points

-Erik and Philippe

0 replies

rabernat · 2018-10-04T19:59:41Z

rabernat
Oct 4, 2018
Author

You are absolutely right that we could do a FieldSet.from_xarray(). This would be relatively simple to implement, and we would do it in the coming weeks. Note that we would likely still require a dictionary mapping the ds dimensions to our 'special' lon, lat, time and depth dimensions, as we don’t want to guess which dimension is which. Or do you see a way to automatically detect this?

One way to auto-detect would be to use CF conventions and examine the standard_name attribute. There is a standard vocabulary for latitude and longitude. In CF conventions, the variable names themselves should not matter. But of course not everyone follows this. In contrast, xesmf just requires the variables to be named lon and lat and leaves this up to the user: https://xesmf.readthedocs.io/en/latest/Rectilinear_grid.html.
Anything you choose will work, as long as it is documented clearly.

2. We do that mostly because of our JIT mode, where the particle interpolation is done in C. So we need to be able to pass arrays between python and C, and I’m not sure how easy that would be in lazy-loaded xarray.

It makes total sense that that, when it comes time to actually integrate the trajectories, you need numpy arrays. But rather than just calling .load() or .values immediately on xarray objects, you could instead wait until just before you need the data, use indexing to select just what is needed at that integration step, and then call .values. In particular, you probably only need a few timesteps at a time for each integration step. (This would bring performance benefits with regular netcdf4 as well.)

3. Furthermore, as you said, we need special interpolators, especially for C grids

I totally agree that 3 is hard and probably impractical. The ideal path, given lots of developer time and excellent coordination between packages, would involve the following:

Factor out fast, "local," spherical-geometry-aware interpolators into a standalone package
Use these interpolators to implement a custom index object for xarray. This sort of thing is explicitly called out in the xarray roadmap.
Call those indexing functions from parcels

This would be awesome, because now any xarray dataset could use the cool, fast interpolators you have developed, which are currently accessible only from deep inside parcels. Not claiming this would be easy, but I think it is, in some sense, the "right" way.

In the meantime, it might be worth reviewing some of the recent changes in xarray, including

This image in particular seems to capture what one wants to do with particles

cc @shoyer and @jhamman, who might be able to weigh in on the timeframe for the needed changes to xarray indexing.

0 replies

willirath · 2018-10-30T14:12:59Z

willirath
Oct 30, 2018
Collaborator

This shows how to use dask for distributing potentially many similar parcels experiments:

https://nbviewer.jupyter.org/gist/willirath/6b5c4654ca6be3774fa76acf4a266b96

The basic idea is to wrap the parcels experiment in a function and map it to a dask bag with the parameters.

It would be very interesting to use the xarray / zarr backend and run this on distributed resources on pangeo.

My feeling (somewhat informed by a few tests though) is that there's only very few obstacles to overcome before splitting parcels particle sets and distributing them more in a more transparent way than with this relatively crude approach.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

some comments on xarray integration #1016

{{title}}

Replies: 3 comments

{{title}}

{{title}}

{{title}}

Select a reply

some comments on xarray integration #1016

rabernat Sep 30, 2018

Replies: 3 comments

erikvansebille Oct 3, 2018 Maintainer

rabernat Oct 4, 2018 Author

willirath Oct 30, 2018 Collaborator

rabernat
Sep 30, 2018

erikvansebille
Oct 3, 2018
Maintainer

rabernat
Oct 4, 2018
Author

willirath
Oct 30, 2018
Collaborator