Think about interoperability of dataframes/dataarrays #81

Open
Peter9192 opened this issue Aug 17, 2022 · 1 comment

Comments

@Peter9192
Contributor

Peter9192 commented Aug 17, 2022

sklearn typically works with pandas dataframes. One advantage of this is that you can easily chain operations/models in a pipeline, because the input and output are always dataframes. We, on the other hand, usually work with xarray dataarrays. This makes it difficult to reuse existing sklearn functionality, such as pipelines, to chain our operations.
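
To make the friction concrete, here is a minimal sketch of the manual round trip this currently requires. It is not taken from our code base: the toy field, the coordinate values, and the choice of `StandardScaler` plus `PCA` are assumptions made purely for illustration.

```python
import numpy as np
import xarray as xr
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy field: 20 time steps on a 3 x 4 lat/lon grid.
rng = np.random.default_rng(0)
field = xr.DataArray(
    rng.normal(size=(20, 3, 4)),
    dims=("time", "lat", "lon"),
    coords={
        "time": np.arange(20),
        "lat": [10.0, 20.0, 30.0],
        "lon": [0.0, 5.0, 10.0, 15.0],
    },
    name="toy_field",
)

# sklearn expects a 2-D (samples, features) array, so flatten the spatial dims.
stacked = field.stack(feature=("lat", "lon"))  # dims: (time, feature)

# The pipeline only sees the bare numpy values; dims and coords are lost here.
pipeline = make_pipeline(StandardScaler(), PCA(n_components=2))
scores = pipeline.fit_transform(stacked.values)

# Re-attach the time coordinate by hand to get a dataarray back.
components = xr.DataArray(
    scores,
    dims=("time", "component"),
    coords={"time": field["time"].values, "component": [0, 1]},
    name="pc_scores",
)
```

Every step in the pipeline hands back a plain numpy array, so the stacking before and the re-labelling after have to be repeated wherever we want to use an sklearn estimator.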

A few thoughts:

  • We could remodel the sklearn utilities we want to use so that they operate on dataarrays. This seems to be done already in https://github.com/AI4S2S/s2spy/issues/new, but it seems a bit inactive. Might be worth checking out.
  • A related question is whether we want to keep supporting dataframes at all. We already have some duplicate functions so that we can operate on both dataframes and dataarrays. Could we drop support for dataframes altogether?
@geek-yang
Member

This is a very good question. I would prefer that we only support dataarrays. The target users for this package all come from the climate community, which means they deal with multi-dimensional data more often, and most of them probably already have quite some experience with xarray. It is not practical to handle their problems with pandas dataframes. We can drop support for dataframes and focus on xarray.

Regarding the "incompatibility" between sklearn and xarray: I googled a lot and found that the best solution at the moment would be sklearn-xarray (indeed the package you found; I did not find anything better). The package does seem a bit poorly maintained for now (last PR in 2021), but it is mentioned/suggested by both xarray and sklearn on their documentation pages (see this for xarray and this link for sklearn), which looks like a good sign.

I played with sklearn-xarray and it works well so far. It supports pipelines and cross-validation nicely (also a link to their example). We can give it a try when we need some of its features. I can also ask the maintainers about the status of the package and the possibility of contributing to it. I think some of the functionality in their package already fits our work, which would save us from re-inventing the wheel.

We can further discuss this later.
