Sklearn typically works with pandas dataframes. One advantage of this is that you can easily chain operations/models in a pipeline, because the input and output are always dataframes. We, on the other hand, usually work with xarray dataarrays. This makes it difficult to reuse existing sklearn functionality, such as pipelines, to chain our operations.
A few thoughts:
We could remodel the sklearn utilities we want to use so that they operate on dataarrays. This seems to have been done already in https://github.com/AI4S2S/s2spy/issues/new, but that project seems a bit inactive. Might be worth checking out.
A related issue is whether we want to keep supporting dataframes at all. We already have some duplicate functions so that we can operate on both dataframes and dataarrays. Could we drop support for dataframes altogether?
This is a very good question. I would prefer that we only support dataarrays. The target users of this package all come from the climate community, which means they mostly deal with multi-dimensional data. Most of them probably already have quite some experience with xarray, and it is not practical to tackle their problems with pandas dataframes. We can drop support for dataframes and focus on xarray.
Regarding the "incompatibility" between sklearn and xarray, I googled a lot and found that the best solution at the moment is sklearn-xarray (indeed the package you found; I did not find anything better). The package seems somewhat poorly maintained for now (last PR in 2021), but it is mentioned/suggested by both xarray and sklearn on their documentation pages (see this for xarray and this link for sklearn), which looks like a good sign.
I played with sklearn-xarray and it works well so far. It supports pipelines and cross-validation nicely (see also the link to their example). We can give it a try when we need some of its features. I can also ask the maintainers about the status of the package and the possibility of contributing to it. I think some of the functionality in their package already fits our work, which would save us from reinventing the wheel.
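For comparison, here is a rough sketch of the kind of glue we would have to write ourselves without such a package. It is not sklearn-xarray's API: `DataArrayWrapper` is a hypothetical helper made up for this example, and it assumes a 2-D dataarray with samples along the first dimension and transformers that keep the shape of their input.

```python
import numpy as np
import xarray as xr
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler


class DataArrayWrapper(BaseEstimator, TransformerMixin):
    """Hypothetical helper: run a shape-preserving sklearn transformer
    on a 2-D DataArray and return a DataArray with the same dims/coords."""

    def __init__(self, transformer):
        self.transformer = transformer

    def fit(self, X, y=None):
        # sklearn only sees the underlying NumPy array.
        self.transformer_ = clone(self.transformer).fit(X.values)
        return self

    def transform(self, X):
        # Put the transformed values back into the original container.
        return X.copy(data=self.transformer_.transform(X.values))


# Toy data: 10 time steps (samples) x 3 features.
rng = np.random.default_rng(0)
da = xr.DataArray(
    rng.normal(size=(10, 3)),
    dims=("time", "feature"),
    coords={"time": np.arange(10), "feature": ["a", "b", "c"]},
)

pipe = Pipeline(
    [
        ("standardize", DataArrayWrapper(StandardScaler())),
        ("rescale", DataArrayWrapper(MinMaxScaler())),
    ]
)
result = pipe.fit(da).transform(da)
```

Because every step takes and returns a dataarray, the steps chain in a regular sklearn `Pipeline` and the dimension names and coordinates survive. Anything that changes the shape (e.g. dimensionality reduction) or needs more than two dimensions would need extra handling, which is the part a dedicated package would save us from writing.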