Think about interoperability of dataframes/dataarrays #81

Peter9192 · 2022-08-17T10:15:34Z

SKlearn typically works with tabular inputs such as pandas dataframes (or NumPy arrays). One advantage of this is that you can easily chain operations/models in a pipeline, because every step takes and returns the same kind of 2-D object. We, on the other hand, usually work with xarray dataarrays, which makes it difficult to reuse existing SKlearn functionality such as pipelines to chain our operations.
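A minimal sketch of the friction described above, assuming a toy `(time, lat, lon)` DataArray (the variable name `sst`, the grid size, and the choice of `StandardScaler` plus `PCA` are made up for illustration). The pipeline only sees a plain 2-D array, so the labels have to be stripped before the call and re-attached by hand afterwards:

```python
import numpy as np
import xarray as xr
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 20 time steps on a 3 x 4 lat/lon grid.
rng = np.random.default_rng(0)
da = xr.DataArray(
    rng.normal(size=(20, 3, 4)),
    dims=("time", "lat", "lon"),
    coords={"time": np.arange(20)},
    name="sst",
)

# scikit-learn expects a 2-D (samples, features) array, so flatten space first.
stacked = da.stack(feature=("lat", "lon"))

pipeline = make_pipeline(StandardScaler(), PCA(n_components=2))

# Passing .values drops all dimension names and coordinates;
# the result is a plain NumPy array of shape (20, 2).
components = pipeline.fit_transform(stacked.values)

# Re-attach the time coordinate by hand to get back to a DataArray.
result = xr.DataArray(
    components,
    dims=("time", "component"),
    coords={"time": da["time"], "component": [0, 1]},
)
```

Every step that should feed into further xarray code needs this kind of manual unwrap/rewrap, which is the part we would like to avoid.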

A few thoughts:

We could remodel the SKLearn utilities we want to use so that they operate on dataarrays. This seems to have been done already in sklearn-xarray, but that package seems a bit inactive. Might be worth checking out.
A related question is whether we want to keep supporting dataframes at all. We already have some duplicate functions so that we can operate on both dataframes and dataarrays. Could we drop support for dataframes altogether?
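To make the first thought a bit more concrete, here is a rough sketch of what adapting a single scikit-learn transformer to dataarrays could look like. The class `DataArrayTransformer` and its `sample_dim` argument are hypothetical names invented for this example; this is not the API of sklearn-xarray or of our package, and it assumes the wrapped transformer keeps the `(samples, features)` shape unchanged:

```python
import numpy as np
import xarray as xr
from sklearn.preprocessing import StandardScaler


class DataArrayTransformer:
    """Hypothetical wrapper: apply a scikit-learn transformer along one sample dimension."""

    def __init__(self, transformer, sample_dim="time"):
        self.transformer = transformer
        self.sample_dim = sample_dim

    def _to_2d(self, da):
        # Flatten every non-sample dimension into a single "feature" axis.
        feature_dims = tuple(d for d in da.dims if d != self.sample_dim)
        return da.stack(feature=feature_dims).transpose(self.sample_dim, "feature")

    def fit(self, da):
        self.transformer.fit(self._to_2d(da).values)
        return self

    def transform(self, da):
        stacked = self._to_2d(da)
        # Assumes the output has the same shape as the input (true for scalers).
        out = stacked.copy(data=self.transformer.transform(stacked.values))
        # Restore the original dimensions and their order.
        return out.unstack("feature").transpose(*da.dims)


# Toy data with made-up coordinates.
da = xr.DataArray(
    np.random.default_rng(0).normal(loc=5.0, size=(20, 3, 4)),
    dims=("time", "lat", "lon"),
    coords={
        "time": np.arange(20),
        "lat": [10.0, 20.0, 30.0],
        "lon": [0.0, 5.0, 10.0, 15.0],
    },
)

scaled = DataArrayTransformer(StandardScaler()).fit(da).transform(da)
```

Here `scaled` is again a DataArray with the same dimensions and coordinates as `da`, so it can be passed on to other xarray-based steps. Transformers that change the number of features (for example PCA) would need extra handling that this sketch leaves out.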

geek-yang · 2022-08-19T16:07:53Z

This is a very good question. I would prefer that we only support dataarrays. The target users of this package all come from the climate community, which means they mostly deal with multi-dimensional data, and most of them probably already have quite some experience with xarray. It is not practical to handle their problems with pandas dataframes. We can drop support for dataframes and focus on xarray.

Regarding the "incompatibility" between sklearn and xarray, I googled a lot and found that the best solution currently seems to be sklearn-xarray (indeed the package you found; I did not find anything better). The package seems a bit poorly maintained for now (last PR in 2021), but it is mentioned/suggested in both the xarray and the sklearn documentation, which looks like a good sign.

I played with sklearn-xarray and it works well so far. It supports pipelines and cross-validation nicely (see also their example). We can give it a try when we need some of its features. I can also ask the maintainers about the status of the package and the possibility of contributing to it. I think some of the functionality in their package already fits our work, which can save us from reinventing the wheel.

We can further discuss this later.
