
Consider extending traintest to mimic sklearn splitter classes #46

Open
Peter9192 opened this issue Aug 11, 2022 · 4 comments

Comments

@Peter9192 (Contributor)

Currently, our train-test splitting function mainly serves to show the result of the split. Though the output datasets can be used directly, doing so would require a custom workflow, e.g. for cross-validation.

In the future, it might be useful to rework it into a class, similar to sklearn's existing splitter classes. The main feature we'd add is that we're a bit more restrictive in how groups are made; e.g. we don't allow splitting up rows from the same anchor year.

Then, it'd be possible to use them in conjunction with existing cross-validation code, e.g. sklearn's cross_validate. Something like:

calendar = s2spy.Calendar(...)

ds_target = xr.open_dataset(...)
ds_features = xr.open_dataset(...)

target = s2spy.Resample(ds_target, calendar)
features = s2spy.Resample(ds_features, calendar)

traintest = s2spy.TrainTest(splitter=sklearn.model_selection.KFold(), calendar=calendar)
model = sklearn.linear_model.Lasso()

cv_results = sklearn.model_selection.cross_validate(model, features, target, cv=traintest)
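For the cv=traintest part to work, the TrainTest class only needs to expose the interface sklearn expects of a cv object: a split() method yielding train/test index pairs and a get_n_splits() method. A minimal sketch of that idea, keeping rows of one anchor year together (the class name AnchorYearSplit and the inline group handling are illustrative, not part of s2spy):

```python
# Hypothetical sketch: any object with split() and get_n_splits() can be
# passed as `cv` to sklearn.model_selection.cross_validate. Here the split
# never separates rows that share an anchor year.

class AnchorYearSplit:
    """Yield train/test index pairs, never splitting rows of one anchor year."""

    def __init__(self, n_splits=3):
        self.n_splits = n_splits

    def split(self, X, y=None, groups=None):
        # `groups` holds the anchor year of every row.
        years = sorted(set(groups))
        folds = [years[i::self.n_splits] for i in range(self.n_splits)]
        for test_years in folds:
            test = [i for i, g in enumerate(groups) if g in test_years]
            train = [i for i, g in enumerate(groups) if g not in test_years]
            yield train, test

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_splits


# Toy data: two rows per anchor year.
groups = [2015, 2015, 2016, 2016, 2017, 2017]
cv = AnchorYearSplit(n_splits=3)
splits = list(cv.split(range(6), groups=groups))
# Each test fold contains exactly the rows of one anchor year.
```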
@Peter9192 (Contributor, Author)

The current functionality of traintest is to explicitly add the created splits to a given dataframe/array. We could keep that as a separate method: s2spy.TrainTest(KFold(...)).add_labels(data).

With respect to the iterator discussed in AI4S2S/s2spy#71, we could have something like

splitter = sklearn.model_selection.KFold(...)
traintest = s2spy.TrainTest(splitter)
for train, test in traintest.iterate(X, y):
     # do stuff
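
The iterate idea above can be sketched in plain Python: one sklearn splitter produces the indices, which are then applied to every aligned dataset. The standalone iterate function here is a stand-in for the proposed TrainTest method, purely for illustration:

```python
import numpy as np
from sklearn.model_selection import KFold

# Sketch of iterate(): one splitter drives the same train/test indices
# across several datasets that share a first dimension.

def iterate(splitter, *datasets):
    n = len(datasets[0])
    for train_idx, test_idx in splitter.split(np.arange(n)):
        train = tuple(d[train_idx] for d in datasets)
        test = tuple(d[test_idx] for d in datasets)
        yield train, test


X = np.arange(12).reshape(6, 2)  # toy feature array
y = np.arange(6)                 # toy target array
for (X_train, y_train), (X_test, y_test) in iterate(KFold(n_splits=3), X, y):
    # With 6 rows and 3 folds, every split has 4 train and 2 test rows.
    assert len(X_train) == 4 and len(X_test) == 2
```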

@geek-yang (Member)

Very good suggestion. This looks quite clean and logical. To have a function like cv_results = sklearn.model_selection.cross_validate(model, features, target, cv=traintest), we need to know a bit more about the models and relevant metrics for our use cases. For now I will try to come up with something simpler (e.g. linear regression with MSE as the metric) to explore this workflow.

@geek-yang (Member) commented Aug 12, 2022

Just some thoughts based on our discussion after exploring common ML practice for time series.

From a user perspective, this is something I think is quite logical and is also consistent with my experience of using other ML packages for cross-validation and model training:
(ps: not sure about the function names, lots of improvising, feel free to drop your comments 😸.)

import s2spy.time
import s2spy.traintest
import xarray as xr
from s2spy import RGDR

# assume that I want to explore the causal relation between sea surface temperature and
# the change of Atlantic Meridional Overturning Circulation (AMOC)
# and use sst to predict AMOC

# load data
sst = xr.open_dataset("sst_field_from_2010_to_2020.nc") # daily data [time, lat, lon]
amoc = xr.open_dataset("amoc_rapid_array_obs_from_2010_to_2020.nc")  # daily data [time]

# create calendar using s2spy based on my interest of timescales
calendar = s2spy.time.AdventCalendar(anchor=(10, 15), freq="180d")
# map to data
calendar.map_to_data(sst)

# resample my data to the preferred timescales
sst_resample = s2spy.time.resample(calendar, sst)
amoc_resample = s2spy.time.resample(calendar, amoc)

######################## cross validation ###########################
# train/test splits using kfold
from sklearn.model_selection import KFold

splitter = KFold(n_splits=3)
traintest_splits = s2spy.traintest.split_groups(splitter, calendar) # here `traintest_splits` is an instance of a splitter class
# add labels to the data if the user wants to have an overview of the splits - data pairing
sst_traintest_summary = traintest_splits.add_labels(sst_resample)

# cross-validation
# we use linear regression model and mse as metrics
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

scores = []

for train_data, test_data in traintest_splits.iterate((sst_resample, amoc_resample)):
    sst_train, amoc_train = train_data
    sst_test, amoc_test = test_data
    # perform dimensionality reduction using RGDR
    rgdr = RGDR(amoc_train, eps_km=600, alpha=0.05, min_area_km2=3000**2)
    sst_clustered_train = rgdr.fit(sst_train)

    # train model
    sst_X_train = sst_clustered_train.sel(target=False)
    amoc_y_train = amoc_train.sel(target=True)
    model_ols = LinearRegression()
    model_ols.fit(sst_X_train, amoc_y_train)

    # apply clusters to test data
    sst_clustered_test = rgdr.transform(sst_test)
    sst_X_test = sst_clustered_test.sel(target=False)
    amoc_y_test = amoc_test.sel(target=True)
    # make predictions using test data
    predict_amoc = model_ols.predict(sst_X_test)
    # calculate score with mse
    scores.append(mean_squared_error(amoc_y_test, predict_amoc))
######################## cross validation ###########################
# plot scores to check the results from cross validation
import matplotlib.pyplot as plt
plt.plot(scores)

I will explain the code in the posts below (it gets a bit too long...).

@geek-yang (Member) commented Aug 12, 2022

A few concerns related to the workflow above:

  • I like the suggestion from @Peter9192 about making the train/test labels optional for the user via an add_labels function. In that case, I would suggest making the train/test split a class whose splitting is based only on the calendar and the splitter: traintest_splits = s2spy.traintest.split_groups(splitter, calendar). A good thing about this is that the splits depend only on the calendar (we expect the user to prepare their data with the calendar, so all the given variables/fields should follow it), and the splits can then be applied to all the fields the user wants to iterate over, simply as for train_data, test_data in traintest_splits.iterate((sst_resample, amoc_resample, ...)):. Otherwise, the user would need to repeat the splitting for each field (e.g. sst_splits = s2spy.traintest.split_groups(splitter, sst_resample)), which also makes iteration difficult.
  • We need to clarify the distinction between train/test and feature/label. Popular machine-learning packages (e.g. pytorch, sklearn) provide functions that split train and test data with features and labels as separate inputs (e.g. X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size=0.4, random_state=0), https://scikit-learn.org/stable/modules/cross_validation.html). This works well for classification tasks, where the data is given separately (think of an image-recognition task, where the inputs are images and labels). For time series, however, things are a bit different: there are no separate features and labels, only data points. So I would suggest we stick to the target labels and use them to derive X and y for training, which is why I wrote traintest_splits.iterate(resampled data, ...).
  • The current design seems to restrict the user to making train/test splits based only on anchor years. If the user wants splits within each year, then they effectively don't have anchor years; instead, they should use an anchor month or anchor day. Therefore, it makes more sense to use a different calendar, rather than cross the border 😉.
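
The group-based splitting in the first point above maps directly onto sklearn's GroupKFold, which already guarantees that rows sharing a group label (here: the anchor year) never appear in both train and test. A small sketch, with a made-up anchor-year array for illustration:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Illustrative data: two resampled rows per anchor year.
anchor_years = np.array([2010, 2010, 2011, 2011, 2012, 2012])
X = np.arange(12).reshape(6, 2)

# GroupKFold keeps all rows of one group (anchor year) on the same side
# of every split, which is exactly the restriction discussed above.
gkf = GroupKFold(n_splits=3)
for train_idx, test_idx in gkf.split(X, groups=anchor_years):
    assert len(set(anchor_years[test_idx])) == 1
```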

We could turn the code in the #### cross validation #### block into a cross-validator (e.g. scores = s2spy.traintest.cross_validator([data1, data2, ...], calendar, traintest_splitter, dimensionality_reduction(target_series), model, metrics)), similar to sklearn's cross-validation (https://scikit-learn.org/stable/modules/cross_validation.html) but with the flexibility to add a dimensionality-reduction module. This design could also fit into an sklearn pipeline.
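
In plain sklearn terms, the proposed cross_validator is close to putting the dimensionality-reduction step and the model into one Pipeline and handing it to cross_val_score. A hedged sketch, with PCA standing in for RGDR and random arrays standing in for the resampled sst/amoc data, purely for illustration:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, GroupKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))               # stand-in for resampled sst features
y = X[:, 0] + 0.1 * rng.normal(size=20)    # stand-in for resampled amoc target
groups = np.repeat(np.arange(10), 2)       # stand-in anchor years, 2 rows each

# The reducer sits inside the pipeline, so it is refit on the training
# folds only in each split, exactly as RGDR is refit per fold above.
pipe = make_pipeline(PCA(n_components=2), LinearRegression())
scores = cross_val_score(
    pipe, X, y,
    cv=GroupKFold(n_splits=5),
    groups=groups,
    scoring="neg_mean_squared_error",
)
```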
