
Add iterator feature to traintest #71

Closed
geek-yang opened this issue Aug 10, 2022 · 8 comments · Fixed by #74
Labels
train/test Issues relating to the train-test splitting

Comments

@geek-yang
Member

It is necessary to have an iterator in traintest.py, which enables the user to loop through the splits and perform dimensionality reduction (or machine learning) for each train/test group.

@geek-yang
Member Author

An outline of this feature/function would look like this (I think):

def splits_iter(data, dr_func=None):
    # loop through all train/test splits (pseudo code below)
    for split in splits:  # loop through splits
        clustered_data_train = dr_func.fit(train_data)  # get train data from a certain split
        clustered_data_test = dr_func.transform(test_data)  # same for test data
        # combine data
        data_splits_dr = combine(clustered_data_train, clustered_data_test)  # combine all data into a single data array
    return data_splits_dr  # note that the returned data has no lat/lon dimensions, only clustered timeseries

# assume the user wants to perform dimensionality reduction on each split
rgdr = RGDR(target_timeseries, eps_km=600, alpha=0.05, min_area_km2=3000**2)
# assume we got our resampled data array `da_splits` with train/test splits using s2spy
da_splits_dr = splits_iter(da_splits, dr_func=rgdr)

The pro is that the user can get all the clustered results for each train/test split in one package. However, this limits the flexibility for other operations (e.g. ML) that could have different workflows.

If we opt for flexibility, we could have an enumeration thingy:

for train_split, test_split in traintest.splits_iter(data):
    # user defined operations, e.g. RGDR

But then the user needs to manage the data manually.
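
For illustration, "managing the data manually" in that loop might look roughly like this (a sketch built on the hypothetical splits_iter and RGDR calls above, not an existing API):

results = []
for train_split, test_split in traintest.splits_iter(data):
    # user defined operations, e.g. RGDR per split
    rgdr = RGDR(target_timeseries, eps_km=600, alpha=0.05, min_area_km2=3000**2)
    clustered_train = rgdr.fit(train_split)       # fit clusters on the train part
    clustered_test = rgdr.transform(test_split)   # apply the same clusters to the test part
    results.append((clustered_train, clustered_test))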

Any thoughts about it @Peter9192 @BSchilperoort @semvijverberg ?

@Peter9192
Contributor

It is necessary to have an iterator in traintest.py, which enables the user to loop through the splits and perform dimensionality reduction (or machine learning) for each train/test group.

I'm not sure I understand this so far. Why is this necessary? So you can subsequently apply an ML algorithm to each train/test group?

Also, note that the function in your pseudo code only returns the result of the last iteration.

Let's take the discussion offline, shall we?

@Peter9192
Contributor

Peter9192 commented Aug 10, 2022

@geek-yang and I just had a very nice discussion, and we came up with a slightly more elaborate example. Initially, we identified three main use cases for making it easy to iterate over the train/test groups:

  1. Assessing whether the clusters identified by RGDR are good/robust
  2. Performing cross-validation to see if the scores of our ML pipeline are any good and robust
  3. Performing tuning of hyperparameters for our model

For each of these use cases, we need to loop over the train-test groups, but it is a bit of a pain to obtain individual groups from our train-test dataframe. Therefore, a generator could come in really handy. Something like (pseudo code):

def iterate(traintest_splits, data):
    for i in range(n_splits):
        if isinstance(traintest_splits, pd.DataFrame):
            train = data.where(traintest_splits[f"split_{i}"] == "train").dropna()
            test = data.where(traintest_splits[f"split_{i}"] == "test").dropna()
        else:
            # xarray
            train = data.sel(split=i, traintest="train")
            test = data.sel(split=i, traintest="test")
        yield train, test

This could then be used like so:

data = xr.open_dataset(...)
calendar = s2spy.calendar.MonthlyCalendar(...)
splitter = sklearn.model_selection.KFold(...)
traintest = s2spy.traintest.split_groups(splitter, calendar)
data = s2spy.resample(data, calendar)
RGDR = s2spy.dimensionality.RGDR(...)

### 1. Inspecting whether you get robust clusters
for train, test in iterate(traintest, data):
    # Note: test is not needed in this case
    result = RGDR.fit(train)
    RGDR.plot()
########################################

### 2. Cross-validation use case
RF = sklearn.ensemble.RandomForestRegressor(...)
pipeline = sklearn.pipeline.Pipeline([("rgdr", RGDR), ("rf", RF)])

# Calculate the score for each of the test groups
scores = []
for train, test in iterate(traintest, data):
    pipeline.fit(train)
    score = pipeline.score(test)
    scores.append(score)

# See the scores for each train/test group
pd.DataFrame(scores, columns=['score']).plot(kind='bar')
################################


### 3. Tuning hyperparameters (similar to sklearn.model_selection.GridSearchCV)
for parameters in hyper_parameters:
    scores = []
    for train, test in iterate(traintest, data):
        pipeline.set_params(**parameters)
        pipeline.fit(train)
        score = pipeline.score(test)
        scores.append(score)

    # For now just plot bar graphs for each set of hyperparameters
    pd.DataFrame(scores, columns=['score']).plot(kind='bar')
##############################################################
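
For comparison, when the data is a plain (samples, features) array and an index-based splitter can be used directly, the standard scikit-learn route for use case 3 is GridSearchCV (illustrative only; it does not yet fit our xarray-based data):

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold

param_grid = {"n_estimators": [50, 100], "max_depth": [3, 5]}
search = GridSearchCV(
    RandomForestRegressor(),
    param_grid,
    cv=KFold(n_splits=5),  # index-based splitter reused for every parameter set
)
# search.fit(X, y)  # X: (samples, features), y: (samples,)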

@BSchilperoort
Contributor

Looks good, it seems like this (again) will result in very little code/routines for us to maintain.

Just one small remark on your pseudocode, Peter: you seem to treat the splits and the data itself separately;

def iterate(traintest_splits, data):

whereas in our current implementation the train/test labels are added to the data. Do we want to keep it this way?
I do see that Yang's implementation in #74 just has:

def split_iterate(data):
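
For comparison, the labels-in-data variant could extract the groups from a coordinate on the data itself, roughly like this (a sketch, not the actual #74 implementation; the "split" dimension and "traintest" coordinate names are assumptions):

def split_iterate(data):
    # assumes the train/test labels are stored on the data itself as a
    # "traintest" coordinate along a "split" dimension
    for i in data["split"].values:
        subset = data.sel(split=i)
        train = subset.where(subset["traintest"] == "train", drop=True)
        test = subset.where(subset["traintest"] == "test", drop=True)
        yield train, test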

@Peter9192
Contributor

Peter9192 commented Aug 11, 2022

Well spotted! I typed this off the top of my head and already had some doubts about it. With respect to keeping this: I guess that's up for discussion. I opened AI4S2S/lilio#46; perhaps we can take the discussion there. Something to consider is that, currently, we would have to apply traintest to both features and labels. Not sure if that is the most elegant approach. Alternatively, we'd have to be able to call the iterator with both labels and features.
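
For concreteness, that alternative could look roughly like this (an illustrative sketch built on the iterate generator sketched above; iterate_xy is a hypothetical name):

def iterate_xy(traintest_splits, x_data, y_data):
    # yield matching train/test subsets of features (x) and labels (y),
    # reusing the same split definitions for both
    x_splits = iterate(traintest_splits, x_data)
    y_splits = iterate(traintest_splits, y_data)
    for (x_train, x_test), (y_train, y_test) in zip(x_splits, y_splits):
        yield x_train, x_test, y_train, y_test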

@semvijverberg
Member

Dear all,

I just committed my draft implementation of a traintest splitter, where I tried to address (to some extent) what has been discussed here in #71 and AI4S2S/lilio#46. I tried to keep only core functionality, so here's what I did.

  • traintest_splits is now a class (screenshot in the original comment)
  • the class has the method split_iterate(*X, y) (screenshot in the original comment; see the sketch below)
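
Since the screenshots are not reproduced here, a rough sketch of the shape such a class could take (names, dimensions, and internals are guesses, not the actual committed code):

import numpy as np

class TrainTestSplit:
    """Rough sketch only; not the committed implementation."""

    def __init__(self, splitter):
        self.splitter = splitter  # e.g. sklearn.model_selection.KFold(...)

    def split_iterate(self, *x_args, y):
        # Split all inputs along an assumed shared "anchor_year" dimension.
        n = y.sizes["anchor_year"]
        for train_idx, test_idx in self.splitter.split(np.arange(n).reshape(-1, 1)):
            x_train = [x.isel(anchor_year=train_idx) for x in x_args]
            x_test = [x.isel(anchor_year=test_idx) for x in x_args]
            yield x_train, x_test, y.isel(anchor_year=train_idx), y.isel(anchor_year=test_idx)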

Note that I allow passing a list of arguments for X. This is because there is a difference between our pipeline workflow and the pipeline workflow of scikit-learn.

E.g.,
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([('scaler', StandardScaler()), ('svc', SVC())])

pipe.fit(X_train, y_train)

Note that they have X_train as a simple np.ndarray with shape (samples, features). This is something we do not have: our X is generally a number of resampled xr.Datasets.

For us, a realistic pipeline would look like:
RF = sklearn.ensemble.RandomForestRegressor(...)
Pipeline([RGDR(y).fit(sst_precursor), RGDR(y).fit(z200_precursor), EOF.fit(OLR_precursor), 'merger_of_features', 'feature_selection', RF])
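
Purely as an illustration of how such a multi-precursor pipeline could eventually be expressed with standard scikit-learn building blocks, assuming the precursors were flattened into one (samples, features) matrix and using PCA as a stand-in for RGDR/EOF (none of this is s2spy API):

from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.pipeline import Pipeline

# Hypothetical column ranges of a flattened (samples, features) matrix,
# one block per precursor field.
sst_cols = slice(0, 100)
z200_cols = slice(100, 200)
olr_cols = slice(200, 300)

# One dimensionality-reduction step per precursor block (PCA as a stand-in
# for RGDR/EOF), merged into a single feature matrix by the ColumnTransformer.
features = ColumnTransformer([
    ("sst", PCA(n_components=5), sst_cols),
    ("z200", PCA(n_components=5), z200_cols),
    ("olr", PCA(n_components=5), olr_cols),
])

pipe = Pipeline([
    ("merger_of_features", features),
    ("feature_selection", SelectKBest(f_regression, k=10)),
    ("rf", RandomForestRegressor()),
])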

I hope I'm not going way too fast! But I also feel like we need to take some steps to get ready for the workshop. Also, it is not my intention to get the Pipeline functionality working before the workshop, but I'm just trying to think ahead.

@Peter9192
Contributor

Note that sklearn splitters internally also implement an iterator, e.g. for the shuffle split class: https://github.com/scikit-learn/scikit-learn/blob/36958fb240fbe435673a9e3c52e769f01f36bec0/sklearn/model_selection/_split.py#L1728
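
For reference, the public counterpart of that internal generator is the splitter's split method, which already yields train/test index arrays lazily:

import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.arange(20).reshape(10, 2)
splitter = ShuffleSplit(n_splits=3, test_size=0.3, random_state=0)

# split() is a generator yielding (train_indices, test_indices) per split
for train_idx, test_idx in splitter.split(X):
    print(train_idx, test_idx)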
