
Add iterator feature to traintest #71

Closed
geek-yang opened this issue Aug 10, 2022 · 8 comments · Fixed by #74
Labels
train/test Issues relating to the train-test splitting

Comments

@geek-yang
Member

It is necessary to have an iterator in traintest.py, which enables the user to loop through the splits and perform dimensionality reduction (or machine learning) for each train/test group.

@geek-yang
Member Author

An outline of this feature/function would look like this (I think):

def splits_iter(data, dr_func=None):
    # loop through all train/test splits (pseudo code below)
    for split in splits:  # loop through splits
        clustered_data_train = dr_func.fit(train_data)  # get train data from a certain split
        clustered_data_test = dr_func.transform(test_data)  # same for test data
        # combine data
        data_splits_dr = combine(clustered_data_train, clustered_data_test)  # combine all data into a single data array
    return data_splits_dr  # note that the returned data has no lat/lon dimensions, only clustered timeseries

# assume the user wants to perform dimensionality reduction on each split
rgdr = RGDR(target_timeseries, eps_km=600, alpha=0.05, min_area_km2=3000**2)
# assume we got our resampled data array `da_splits` with train/test splits using s2spy
da_splits_dr = splits_iter(da_splits, dr_func=rgdr)

The pro is that the user can get all the clustered results for each train/test split in one package. However, this limits the flexibility for other operations (e.g. ML) that could have different workflows.

If we opt for flexibility, we could have an enumeration thingy:

for train_split, test_split in traintest.splits_iter(data):
    # user defined operations, e.g. RGDR

But then the user needs to manage the data manually.
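
For illustration, "managing the data manually" in that loop might look roughly like this (a sketch built on the hypothetical splits_iter and RGDR calls above, not an existing API):

results = []
for train_split, test_split in traintest.splits_iter(data):
    # user defined operations, e.g. RGDR per split
    rgdr = RGDR(target_timeseries, eps_km=600, alpha=0.05, min_area_km2=3000**2)
    clustered_train = rgdr.fit(train_split)       # fit clusters on the train part
    clustered_test = rgdr.transform(test_split)   # apply the same clusters to the test part
    results.append((clustered_train, clustered_test))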

Any thoughts about it @Peter9192 @BSchilperoort @semvijverberg ?

@Peter9192
Contributor

It is necessary to have an iterator in traintest.py, which enables the user to loop through the splits and perform dimensionality reduction (or machine learning) for each train/test group.

I'm not sure I understand this so far. Why is this necessary? So you can subsequently apply an ML algorithm to each train/test group?

Also, note that the function in your pseudo code only returns the result of the last iteration.

Let's take the discussion offline, shall we?

@Peter9192
Contributor

Peter9192 commented Aug 10, 2022

@geek-yang and I just had a very nice discussion, and we came up with a slightly more elaborate example. Initially, we identified three main use cases for making it easy to iterate over the train/test groups:

  1. Assessing whether the clusters identified by RGDR are good/robust
  2. Performing cross-validation to see if the scores of our ML pipeline are any good and robust
  3. Performing tuning of hyperparameters for our model

For each of these use cases, we need to loop over the train-test groups, but it is a bit of a pain to obtain individual groups from our train-test dataframe. Therefore, a generator could come in really handy. Something like (pseudo code):

def iterate(traintest_splits, data):
    for i in range(n_splits):
        if isinstance(traintest_splits, pd.DataFrame):
            train = data.where(traintest_splits[f"split_{i}"] == "train").dropna()
            test = data.where(traintest_splits[f"split_{i}"] == "test").dropna()
        else:
            # xarray
            train = data.sel(split=i, traintest="train")
            test = data.sel(split=i, traintest="test")
        yield train, test

This could then be used like so:

data = xr.open_dataset(...)
calendar = s2spy.calendar.MonthlyCalendar(...)
splitter = sklearn.model_selection.KFold(...)
traintest = s2spy.traintest.split_groups(splitter, calendar)
data = s2spy.resample(data, calendar)
RGDR = s2spy.dimensionality.RGDR(...)

### 1. Inspecting whether you get robust clusters
for train, test in iterate(traintest, data):
    # Note: test is not needed in this case
    result = RGDR.fit(train)
    RGDR.plot()
########################################

### 2. Cross-validation use case
RF = sklearn.ensemble.RandomForestRegressor(...)
pipeline = sklearn.pipeline.Pipeline([("rgdr", RGDR), ("rf", RF)])

# Calculate the score for each of the test groups
scores = []
for train, test in iterate(traintest, data):
    pipeline.fit(train)
    score = pipeline.score(test)
    scores.append(score)

# See the scores for each train/test group
pd.DataFrame(scores, columns=['score']).plot(kind='bar')
################################


### 3. Tuning hyperparameters (similar to sklearn.model_selection.GridSearchCV)
for parameters in hyper_parameters:
    scores = []
    for train, test in iterate(traintest, data):
        pipeline.set_params(**parameters)
        pipeline.fit(train)
        score = pipeline.score(test)
        scores.append(score)

    # For now just plot bar graphs for each set of hyperparameters
    pd.DataFrame(scores, columns=['score']).plot(kind='bar')
##############################################################
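
For comparison, when the data is a plain (samples, features) array and an index-based splitter can be used directly, the standard scikit-learn route for use case 3 is GridSearchCV (illustrative only; it does not yet fit our xarray-based data):

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold

param_grid = {"n_estimators": [50, 100], "max_depth": [3, 5]}
search = GridSearchCV(
    RandomForestRegressor(),
    param_grid,
    cv=KFold(n_splits=5),  # index-based splitter reused for every parameter set
)
# search.fit(X, y)  # X: (samples, features), y: (samples,)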

@BSchilperoort
Contributor

Looks good, it seems like this (again) will result in very little code/routines for us to maintain.

Just one small remark on your pseudocode, Peter: you seem to treat the splits and the data itself separately;

def iterate(traintest_splits, data):

whereas in our current implementation the train/test labels are added to the data. Do we want to keep it this way?
I do see that Yang's implementation in #74 just has:

def split_iterate(data):
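
For comparison, the labels-in-data variant could extract the groups from a coordinate on the data itself, roughly like this (a sketch, not the actual #74 implementation; the "split" dimension and "traintest" coordinate names are assumptions):

def split_iterate(data):
    # assumes the train/test labels are stored on the data itself as a
    # "traintest" coordinate along a "split" dimension
    for i in data["split"].values:
        subset = data.sel(split=i)
        train = subset.where(subset["traintest"] == "train", drop=True)
        test = subset.where(subset["traintest"] == "test", drop=True)
        yield train, test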

@Peter9192
Contributor

Peter9192 commented Aug 11, 2022

Well spotted! I typed this off the top of my head and already had some doubts about it. With respect to keeping this: I guess that's up for discussion. I opened AI4S2S/lilio#46; perhaps we can take the discussion there. Something to consider is that, currently, we would have to apply traintest to both features and labels. Not sure if that is the most elegant approach. Alternatively, we'd have to be able to call the iterator with both labels and features.
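
For concreteness, that alternative could look roughly like this (an illustrative sketch built on the iterate generator sketched above; iterate_xy is a hypothetical name):

def iterate_xy(traintest_splits, x_data, y_data):
    # yield matching train/test subsets of features (x) and labels (y),
    # reusing the same split definitions for both
    x_splits = iterate(traintest_splits, x_data)
    y_splits = iterate(traintest_splits, y_data)
    for (x_train, x_test), (y_train, y_test) in zip(x_splits, y_splits):
        yield x_train, x_test, y_train, y_test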

@semvijverberg
Member

Dear all,

I just committed my draft implementation of a traintest splitter, where I tried to address (to some extent) what has been discussed here in #71 and AI4S2S/lilio#46. I tried to keep only core functionality, so here's what I did.

  • traintest_splits is now a class (screenshot in the original comment)
  • the class has the method split_iterate(*X, y) (screenshot in the original comment; see the sketch below)
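
Since the screenshots are not reproduced here, a rough sketch of the shape such a class could take (names, dimensions, and internals are guesses, not the actual committed code):

import numpy as np

class TrainTestSplit:
    """Rough sketch only; not the committed implementation."""

    def __init__(self, splitter):
        self.splitter = splitter  # e.g. sklearn.model_selection.KFold(...)

    def split_iterate(self, *x_args, y):
        # Split all inputs along an assumed shared "anchor_year" dimension.
        n = y.sizes["anchor_year"]
        for train_idx, test_idx in self.splitter.split(np.arange(n).reshape(-1, 1)):
            x_train = [x.isel(anchor_year=train_idx) for x in x_args]
            x_test = [x.isel(anchor_year=test_idx) for x in x_args]
            yield x_train, x_test, y.isel(anchor_year=train_idx), y.isel(anchor_year=test_idx)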

Note that I allow passing a list of arguments for X. This is because there is a difference between our pipeline workflow and the pipeline workflow of scikit-learn.

E.g.,
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([('scaler', StandardScaler()), ('svc', SVC())])

pipe.fit(X_train, y_train)

Note that they have X_train as a simple np.ndarray with shape (samples, features). This is something we do not have: our X is generally a number of resampled xr.Datasets.

For us, a realistic pipeline would look like:
RF = sklearn.ensemble.RandomForestRegressor(...)
Pipeline([RGDR(y).fit(sst_precursor), RGDR(y).fit(z200_precursor), EOF.fit(OLR_precursor), 'merger_of_features', 'feature_selection', RF])
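
Purely as an illustration of how such a multi-precursor pipeline could eventually be expressed with standard scikit-learn building blocks, assuming the precursors were flattened into one (samples, features) matrix and using PCA as a stand-in for RGDR/EOF (none of this is s2spy API):

from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.pipeline import Pipeline

# Hypothetical column ranges of a flattened (samples, features) matrix,
# one block per precursor field.
sst_cols = slice(0, 100)
z200_cols = slice(100, 200)
olr_cols = slice(200, 300)

# One dimensionality-reduction step per precursor block (PCA as a stand-in
# for RGDR/EOF), merged into a single feature matrix by the ColumnTransformer.
features = ColumnTransformer([
    ("sst", PCA(n_components=5), sst_cols),
    ("z200", PCA(n_components=5), z200_cols),
    ("olr", PCA(n_components=5), olr_cols),
])

pipe = Pipeline([
    ("merger_of_features", features),
    ("feature_selection", SelectKBest(f_regression, k=10)),
    ("rf", RandomForestRegressor()),
])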

I hope I'm not going way too fast! But I also feel like we need to take some steps to get ready for the workshop. Also, it is not my intention to get the Pipeline functionality working before the workshop, but I'm just trying to think ahead.

@Peter9192
Contributor

Note that sklearn splitters internally also implement an iterator, e.g. for the shuffle split class: https://github.com/scikit-learn/scikit-learn/blob/36958fb240fbe435673a9e3c52e769f01f36bec0/sklearn/model_selection/_split.py#L1728
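
For reference, the public counterpart of that internal generator is the splitter's split method, which already yields train/test index arrays lazily:

import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.arange(20).reshape(10, 2)
splitter = ShuffleSplit(n_splits=3, test_size=0.3, random_state=0)

# split() is a generator yielding (train_indices, test_indices) per split
for train_idx, test_idx in splitter.split(X):
    print(train_idx, test_idx)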
