Add iterator feature to traintest #71
An outline of this feature/function would look like (I think):

```python
def splits_iter(data, dr_func=None):
    # loop through all train/test splits (pseudo code below)
    for split in splits:  # loop through splits
        clustered_data_train = dr_func.fit(train_data)  # get train data from a certain split
        clustered_data_test = dr_func.transform(test_data)  # same for test data
        # combine data
        data_splits_dr = combine(clustered_data_train, clustered_data_test)  # combine all data into a single data array
    return data_splits_dr  # note that the returned data has no lat/lon dimensions, only clustered timeseries
```

It would be used like:

```python
# assume the user wants to perform dimensionality reduction on each split
rgdr = RGDR(target_timeseries, eps_km=600, alpha=0.05, min_area_km2=3000**2)
# assume we got our resampled data array `da_splits` with train/test splits using s2spy
da_splits_dr = splits_iter(da_splits, dr_func=rgdr)
```

The pro is that the user gets all the clustered results for each train/test split in one pack. However, this limits the flexibility for other operations (e.g. ML) which could have different workflows. If we opt for flexibility, we could instead have an enumeration:

```python
for train_split, test_split in traintest.splits_iter(data):
    # user-defined operations, e.g. RGDR
    ...
```

But then the user needs to manage the data manually. Any thoughts about it @Peter9192 @BSchilperoort @semvijverberg ?
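To make the flexible variant concrete, here is a minimal, runnable sketch of such an iterator. Everything here is illustrative: the `splits_iter` name follows the pseudocode above, but the numpy array and sklearn `KFold` splitter are stand-ins for the real resampled s2spy data.

```python
import numpy as np
from sklearn.model_selection import KFold


def splits_iter(data, splitter):
    """Yield a (train, test) pair of `data` subsets for each split (sketch)."""
    for train_idx, test_idx in splitter.split(data):
        yield data[train_idx], data[test_idx]


data = np.arange(20).reshape(10, 2)  # stand-in for resampled data
splitter = KFold(n_splits=5)

for train_split, test_split in splits_iter(data, splitter):
    # user-defined operations per split, e.g. dimensionality reduction or ML
    print(train_split.shape, test_split.shape)  # (8, 2) (2, 2)
```

The generator keeps memory use low (one split materialized at a time) while leaving the per-split workflow entirely up to the user.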
I'm not sure I understand this so far. Why is this necessary? So that you can subsequently apply an ML algorithm to each train/test group? Also, note that the function in your pseudocode only returns the result of the last iteration. Let's take the discussion offline, shall we?
@geek-yang and I just had a very nice discussion, and we came up with a slightly more elaborate example. Initially, we identified three main use cases for making it easy to iterate over the train/test groups:

1. Inspecting whether you get robust clusters
2. Cross-validation
3. Tuning hyperparameters
For each of these use cases we need to loop over the train/test groups, but it is a bit of a pain to obtain individual groups from our train/test dataframe. Therefore, a generator could come in really handy. Something like (pseudo code):

```python
def iterate(traintest_splits, data):
    for i in range(n_splits):
        if isinstance(traintest_splits, pd.DataFrame):
            train = data.where(traintest_splits[f"split_{i}"] == "train").dropna()
            test = data.where(traintest_splits[f"split_{i}"] == "test").dropna()
        else:
            # xarray: select along the split and traintest coordinates
            train = data.sel(split=i, traintest="train")
            test = data.sel(split=i, traintest="test")
        yield train, test
```

This could then be used like so:

```python
data = xr.open_dataset(...)
calendar = s2spy.calendar.MonthlyCalendar(...).show()
splitter = sklearn.model_selection.KFold(...)
traintest = s2spy.traintest.split_groups(splitter, calendar)
data = s2spy.resample(data, calendar)
RGDR = s2spy.dimensionality.RGDR(...)

### 1. Inspecting whether you get robust clusters
for train, test in iterate(traintest, data):
    # Note: `test` is not needed in this case
    result = RGDR.fit(train)
    RGDR.plot()

### 2. Cross-validation use case
RF = sklearn.ensemble.RandomForestRegressor(...)
pipeline = sklearn.pipeline.Pipeline([("rgdr", RGDR), ("rf", RF)])
# Calculate the score for each of the test groups
scores = []
for train, test in iterate(traintest, data):
    pipeline.fit(train)
    score = pipeline.score(test)
    scores.append(score)
# See the scores for each train/test group
pd.DataFrame(scores, columns=["score"]).plot(kind="bar")

### 3. Tuning hyperparameters (similar to sklearn.model_selection.grid_search)
for parameters in hyper_parameters:
    scores = []  # reset the scores for this set of hyperparameters
    for train, test in iterate(traintest, data):
        pipeline.set_params(**parameters)
        pipeline.fit(train)
        score = pipeline.score(test)
        scores.append(score)
    # For now just plot bar graphs for each set of hyperparameters
    pd.DataFrame(scores, columns=["score"]).plot(kind="bar")
```
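For reference, the pandas branch of the generator idea can be exercised end-to-end with a toy label frame. The `split_{i}` column naming follows the pseudocode; boolean indexing replaces `where(...).dropna()` since the stand-in data is a plain `pd.Series`, and all data here is made up.

```python
import pandas as pd


def iterate(traintest_splits, data, n_splits):
    """Yield (train, test) subsets of `data` per split column (sketch)."""
    for i in range(n_splits):
        labels = traintest_splits[f"split_{i}"]
        yield data[labels == "train"], data[labels == "test"]


# Toy example: 6 samples, 2 train/test splits
traintest_splits = pd.DataFrame({
    "split_0": ["train"] * 4 + ["test"] * 2,
    "split_1": ["test"] * 2 + ["train"] * 4,
})
data = pd.Series(range(6))

for train, test in iterate(traintest_splits, data, n_splits=2):
    print(len(train), len(test))  # 4 2 on each iteration
```

The real implementation would additionally handle the xarray branch, but the label-column lookup is the part that is painful to do by hand for every split.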
Looks good, it seems like this (again) will result in very little code/routines for us to maintain. Just one small remark on your pseudocode, Peter: you treat the splits and the data itself separately, as in `def iterate(traintest_splits, data):`, while in our current implementation the train/test labels are added to the data. Do we want to keep it this way, or move to something like `def split_iterate(data):`?
Well-spotted! I typed this from the top of my head and already had some doubts about it. With respect to keeping this: I guess that's up for discussion. I opened AI4S2S/lilio#46; perhaps we can take the discussion there. Something to consider is that, currently, we would have to apply traintest to both features and labels. I'm not sure that is the most elegant approach. Alternatively, we'd have to be able to call the iterator with both labels and features.
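The "call the iterator with both labels and features" alternative could be sketched as follows. The `iterate_xy` name is hypothetical and the arrays are stand-ins for the real data:

```python
import numpy as np
from sklearn.model_selection import KFold


def iterate_xy(splitter, X, y):
    """Yield (X_train, y_train, X_test, y_test) per split (hypothetical sketch)."""
    for train_idx, test_idx in splitter.split(X):
        yield X[train_idx], y[train_idx], X[test_idx], y[test_idx]


X = np.arange(12).reshape(6, 2)  # stand-in for features
y = np.arange(6)                 # stand-in for labels
for X_tr, y_tr, X_te, y_te in iterate_xy(KFold(n_splits=3), X, y):
    print(X_tr.shape, y_tr.shape, X_te.shape, y_te.shape)
```

Splitting both arrays with the same index sets keeps features and labels aligned per split, so traintest would not have to be applied twice.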
Dear all, I just committed my draft implementation of a traintest splitter, where I tried to address (to some extent) what has been discussed here in #71 and in AI4S2S/lilio#46. I tried to keep only core functionality, so here's what I did. Note that I allow passing a list of arguments for X. This is because there is a difference between our data and sklearn's. E.g., note that they have X_train as a simple np.ndarray with shape (samples, features). This is something we do not have: our X is generally a number of resampled xr.Datasets. For us, a realistic pipeline would look like:

I hope I'm not going way too fast! But I feel we need to take some steps to get ready for the workshop. It is not my intention to get the Pipeline functionality working before the workshop; I'm just trying to think ahead.
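The "list of arguments for X" idea could be sketched like this: one label array drives the splitting, and every feature array in the list is subset with the same indices. All names and data here are hypothetical, and plain numpy arrays stand in for the resampled xr.Datasets:

```python
import numpy as np
from sklearn.model_selection import KFold


def split_list_x(splitter, x_list, y):
    """Split several feature arrays consistently with one label array (sketch)."""
    for train_idx, test_idx in splitter.split(y):
        x_train = [x[train_idx] for x in x_list]
        x_test = [x[test_idx] for x in x_list]
        yield x_train, y[train_idx], x_test, y[test_idx]


y = np.arange(6)
x_list = [np.arange(6), np.arange(6) * 10]  # stand-ins for several resampled datasets
for x_train, y_train, x_test, y_test in split_list_x(KFold(n_splits=3), x_list, y):
    print(len(x_train), len(y_train), len(y_test))  # 2 datasets, 4 train, 2 test
```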
Notice that sklearn splitters internally also implement an iterator, e.g. for the ShuffleSplit class: https://github.com/scikit-learn/scikit-learn/blob/36958fb240fbe435673a9e3c52e769f01f36bec0/sklearn/model_selection/_split.py#L1728
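A quick demonstration that these sklearn splitters are consumed exactly like an iterator of index arrays (ShuffleSplit here, matching the linked source; the toy array is made up):

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.arange(10).reshape(5, 2)
ss = ShuffleSplit(n_splits=3, test_size=0.4, random_state=0)

# `split` returns a generator; each iteration yields (train_idx, test_idx)
for train_idx, test_idx in ss.split(X):
    print(len(train_idx), len(test_idx))  # 3 2 on each iteration
```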
and also the default
It is necessary to have an iterator in traintest.py, which enables the user to loop through the splits and perform dimensionality reduction (or machine learning) for each train/test group.