Re-initializing train-test split fails #45

semvijverberg · 2022-08-23T09:17:01Z

When re-initializing the train-test split, it fails because the dimension 'split' already exists.

This is easily solved by:
if 'split' not in data.dims: data.expand_dims("split")

However, it also exposes an important issue. When re-initializing the traintest split, it should give feedback when traintest split is changed. This communication is purely for clarification purposes. The user might not be expecting the traintest to change when using sklearn.model_selection.ShuffleSplit (because the user might assume the seed is fixed), but it will do that. Hence, there will need to be a test, checking whether the splits that were already present in data are identical the ones that were generated.

The text was updated successfully, but these errors were encountered:

Peter9192 · 2022-08-23T13:33:08Z

Related to #46 and AI4S2S/s2spy#71. I think eventually "reinitializing" the traintest splits is not something you should want to do. You just instantiate a new instance of a splitter class. Regarding the labels, it might be better if the splitter class (which could have a method that gives back labels aligned with the data) doesn't add dimensions to the data, but rather returns a separate data-array (which the user could then choose to append to his data).

geek-yang · 2022-08-23T13:48:27Z

Just checked the source code, even with the current implementation, it seems the function makes a copy of the data and return it to the user (see code line 57)
https://github.com/AI4S2S/s2spy/blob/d228782c7fa9a6af70cec3f58904bdba48708e56/s2spy/traintest.py#L50-L60

So the user is supposed to call this function multiple times if you want to generate train/test splits in different ways, for instance:

from sklearn.model_selection import ShuffleSplit, KFold
splitter = KFold(n_splits=3)
ds1 = s2spy.traintest.split_groups(splitter, ds)

splitter = ShuffleSplit(n_splits=3)
ds2 = s2spy.traintest.split_groups(splitter, ds)

geek-yang · 2022-08-23T13:55:34Z

But I do notice that for pandas dataframe the changes are made on the input data. We might want to fix this. But given that we would like to make add_labels a feature to be called by the user, we can change it later, I assume.

semvijverberg · 2022-08-23T14:03:22Z

Sorry this issue might be obsolete soon.

I agree that the dimension 'split' with labels (train, test ,skip) do not have be added by default.

I'm currently working on a new solution related to #46, where traintest_split is a class and adding labels is optional via add_labels.

Can we pause this discussion until I have a new draft traintest.py?

semvijverberg mentioned this issue Aug 23, 2022

reinitialization issue fixed and warning when traintest split is updated AI4S2S/s2spy#88

Closed

BSchilperoort transferred this issue from AI4S2S/s2spy Mar 8, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Re-initializing train-test split fails #45

Re-initializing train-test split fails #45

semvijverberg commented Aug 23, 2022 •

edited

Loading

Peter9192 commented Aug 23, 2022

geek-yang commented Aug 23, 2022

geek-yang commented Aug 23, 2022

semvijverberg commented Aug 23, 2022

Re-initializing train-test split fails #45

Re-initializing train-test split fails #45

Comments

semvijverberg commented Aug 23, 2022 • edited Loading

Peter9192 commented Aug 23, 2022

geek-yang commented Aug 23, 2022

geek-yang commented Aug 23, 2022

semvijverberg commented Aug 23, 2022

semvijverberg commented Aug 23, 2022 •

edited

Loading