
Consider extending traintest to mimic sklearn splitter classes #46

Open
Peter9192 opened this issue Aug 11, 2022 · 4 comments

Comments

@Peter9192 (Contributor)

Currently, our train-test splitting function mainly serves to show the result of the split. Though the output datasets can be used directly, doing so would require a custom workflow, e.g. for cross-validation.

In the future, it might be useful to rework it into a class, similar to sklearn's existing splitter classes. The main feature we'd add is that we're a bit more restrictive in how groups are made; e.g. we don't allow splitting up rows from the same anchor year.

Then, it'd be possible to use them in conjunction with existing cross-validation code, e.g. sklearn's cross_validate. Something like:

calendar = s2spy.Calendar(...)

ds_target = xr.open_dataset(...)
ds_features = xr.open_dataset(...)

target = s2spy.Resample(ds_target, calendar)
features = s2spy.Resample(ds_features, calendar)

traintest = s2spy.TrainTest(splitter=sklearn.model_selection.KFold(), calendar=calendar)
model = sklearn.linear_model.Lasso()

cv_results = sklearn.model_selection.cross_validate(model, features, target, cv=traintest)
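For the cv=traintest part to work, the TrainTest class only needs to expose the interface sklearn expects of a cv object: a split() method yielding train/test index pairs and a get_n_splits() method. A minimal sketch of that idea, keeping rows of one anchor year together (the class name AnchorYearSplit and the inline group handling are illustrative, not part of s2spy):

```python
# Hypothetical sketch: any object with split() and get_n_splits() can be
# passed as `cv` to sklearn.model_selection.cross_validate. Here the split
# never separates rows that share an anchor year.

class AnchorYearSplit:
    """Yield train/test index pairs, never splitting rows of one anchor year."""

    def __init__(self, n_splits=3):
        self.n_splits = n_splits

    def split(self, X, y=None, groups=None):
        # `groups` holds the anchor year of every row.
        years = sorted(set(groups))
        folds = [years[i::self.n_splits] for i in range(self.n_splits)]
        for test_years in folds:
            test = [i for i, g in enumerate(groups) if g in test_years]
            train = [i for i, g in enumerate(groups) if g not in test_years]
            yield train, test

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_splits


# Toy data: two rows per anchor year.
groups = [2015, 2015, 2016, 2016, 2017, 2017]
cv = AnchorYearSplit(n_splits=3)
splits = list(cv.split(range(6), groups=groups))
# Each test fold contains exactly the rows of one anchor year.
```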
@Peter9192 (Contributor, Author)

The current functionality of traintest is to explicitly add the created splits to a given dataframe/array. We could keep that as a separate method: s2spy.TrainTest(KFold(...)).add_labels(data).

With respect to the iterator discussed in AI4S2S/s2spy#71, we could have something like

splitter = sklearn.model_selection.KFold(...)
traintest = s2spy.TrainTest(splitter)
for train, test in traintest.iterate(X, y):
     # do stuff
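
The iterate idea above can be sketched in plain Python: one sklearn splitter produces the indices, which are then applied to every aligned dataset. The standalone iterate function here is a stand-in for the proposed TrainTest method, purely for illustration:

```python
import numpy as np
from sklearn.model_selection import KFold

# Sketch of iterate(): one splitter drives the same train/test indices
# across several datasets that share a first dimension.

def iterate(splitter, *datasets):
    n = len(datasets[0])
    for train_idx, test_idx in splitter.split(np.arange(n)):
        train = tuple(d[train_idx] for d in datasets)
        test = tuple(d[test_idx] for d in datasets)
        yield train, test


X = np.arange(12).reshape(6, 2)  # toy feature array
y = np.arange(6)                 # toy target array
for (X_train, y_train), (X_test, y_test) in iterate(KFold(n_splits=3), X, y):
    # With 6 rows and 3 folds, every split has 4 train and 2 test rows.
    assert len(X_train) == 4 and len(X_test) == 2
```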

@geek-yang (Member)

Very good suggestion. This looks quite clean and logical. To have a function like cv_results = sklearn.model_selection.cross_validate(model, features, target, cv=traintest), we need to know a bit more about the models and relevant metrics for our use cases. For now I will try to come up with something simpler (e.g. linear regression with MSE as the metric) to explore this workflow.

@geek-yang (Member) commented Aug 12, 2022

Just some thoughts based on our discussion after exploring common ML practice for time series.

From a user perspective, this is something I think is quite logical and is also consistent with my experience of using other ML packages for cross-validation and model training:
(ps: not sure about the function names, lots of improvising, feel free to drop your comments 😸.)

import s2spy.time
import s2spy.traintest
import xarray as xr
from s2spy import RGDR

# assume that I want to explore the causal relation between sea surface temperature and
# the change of Atlantic Meridional Overturning Circulation (AMOC)
# and use sst to predict AMOC

# load data
sst = xr.open_dataset("sst_field_from_2010_to_2020.nc") # daily data [time, lat, lon]
amoc = xr.open_dataset("amoc_rapid_array_obs_from_2010_to_2020.nc")  # daily data [time]

# create calendar using s2spy based on my interest of timescales
calendar = s2spy.time.AdventCalendar(anchor=(10, 15), freq="180d")
# map to data
calendar.map_to_data(sst)

# resample my data to the preferred timescales
sst_resample = s2spy.time.resample(calendar, sst)
amoc_resample = s2spy.time.resample(calendar, amoc)

######################## cross validation ###########################
# train/test splits using kfold
from sklearn.model_selection import KFold

splitter = KFold(n_splits=3)
traintest_splits = s2spy.traintest.split_groups(splitter, calendar) # here `traintest_splits` is an instance of a splitter class
# add labels to the data if the user wants to have an overview of the splits - data pairing
sst_traintest_summary = traintest_splits.add_labels(sst_resample)

# cross-validation
# we use linear regression model and mse as metrics
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

scores = []

for train_data, test_data in traintest_splits.iterate((sst_resample, amoc_resample)):
    sst_train, amoc_train = train_data
    sst_test, amoc_test = test_data
    # perform dimensionality reduction using RGDR
    rgdr = RGDR(amoc_train, eps_km=600, alpha=0.05, min_area_km2=3000**2)
    sst_clustered_train = rgdr.fit(sst_train)

    # train model
    sst_X_train = sst_clustered_train.sel(target=False)
    amoc_y_train = amoc_train.sel(target=True)
    model_ols = LinearRegression()
    model_ols.fit(sst_X_train, amoc_y_train)

    # apply clusters to test data
    sst_clustered_test = rgdr.transform(sst_test)
    sst_X_test = sst_clustered_test.sel(target=False)
    amoc_y_test = amoc_test.sel(target=True)
    # make predictions using test data
    predict_amoc = model_ols.predict(sst_X_test)
    # calculate score with mse
    scores.append(mean_squared_error(amoc_y_test, predict_amoc))
######################## cross validation ###########################
# plot scores to check the results from cross validation
import matplotlib.pyplot as plt
plt.plot(scores)

I will explain the code in the posts below (it gets a bit too long...).

@geek-yang (Member) commented Aug 12, 2022

A few concerns related to the workflow above:

  • I like the suggestion from @Peter9192 about making the train/test labels optional for the user via an add_labels function. In that case, I would suggest making the train/test split a class whose splitting is based only on the calendar and the splitter: traintest_splits = s2spy.traintest.split_groups(splitter, calendar). A good thing about this is that the splits depend only on the calendar (we expect the user to prepare their data with the calendar, so all the given variables/fields should follow it), and the splits can then be applied to all the fields the user wants to iterate over, simply as for train_data, test_data in traintest_splits.iterate((sst_resample, amoc_resample, ...)):. Otherwise, the user would need to repeat the splitting for each field (e.g. sst_splits = s2spy.traintest.split_groups(splitter, sst_resample)), which also makes iteration difficult.
  • We need to clarify the distinction between train/test and feature/label. Popular machine-learning packages (e.g. pytorch, sklearn) provide functions that split train and test data with features and labels as separate inputs (e.g. X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size=0.4, random_state=0), https://scikit-learn.org/stable/modules/cross_validation.html). This works well for classification tasks, where the data is given separately (think of an image-recognition task, where the inputs are images and labels). For time series, however, things are a bit different: there are no separate features and labels, only data points. So I would suggest we stick to the target labels and use them to derive X and y for training, which is why I wrote traintest_splits.iterate(resampled data, ...).
  • The current design seems to restrict the user to making train/test splits based only on anchor years. If the user wants splits within each year, then they effectively don't have anchor years; instead, they should use an anchor month or anchor day. Therefore, it makes more sense to use a different calendar, rather than cross the border 😉.
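
The group-based splitting in the first point above maps directly onto sklearn's GroupKFold, which already guarantees that rows sharing a group label (here: the anchor year) never appear in both train and test. A small sketch, with a made-up anchor-year array for illustration:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Illustrative data: two resampled rows per anchor year.
anchor_years = np.array([2010, 2010, 2011, 2011, 2012, 2012])
X = np.arange(12).reshape(6, 2)

# GroupKFold keeps all rows of one group (anchor year) on the same side
# of every split, which is exactly the restriction discussed above.
gkf = GroupKFold(n_splits=3)
for train_idx, test_idx in gkf.split(X, groups=anchor_years):
    assert len(set(anchor_years[test_idx])) == 1
```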

We could turn the code in the #### cross validation #### block into a cross-validator (e.g. scores = s2spy.traintest.cross_validator([data1, data2, ...], calendar, traintest_splitter, dimensionality_reduction(target_series), model, metrics)), similar to sklearn's cross-validation (https://scikit-learn.org/stable/modules/cross_validation.html) but with the flexibility to add a dimensionality-reduction module. This design could also fit into an sklearn pipeline.
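
In plain sklearn terms, the proposed cross_validator is close to putting the dimensionality-reduction step and the model into one Pipeline and handing it to cross_val_score. A hedged sketch, with PCA standing in for RGDR and random arrays standing in for the resampled sst/amoc data, purely for illustration:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, GroupKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))               # stand-in for resampled sst features
y = X[:, 0] + 0.1 * rng.normal(size=20)    # stand-in for resampled amoc target
groups = np.repeat(np.arange(10), 2)       # stand-in anchor years, 2 rows each

# The reducer sits inside the pipeline, so it is refit on the training
# folds only in each split, exactly as RGDR is refit per fold above.
pipe = make_pipeline(PCA(n_components=2), LinearRegression())
scores = cross_val_score(
    pipe, X, y,
    cv=GroupKFold(n_splits=5),
    groups=groups,
    scoring="neg_mean_squared_error",
)
```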
