-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Re-initializing train-test split fails #45
Comments
Related to #46 and AI4S2S/s2spy#71. I think eventually "reinitializing" the traintest splits is not something you should want to do. You just instantiate a new instance of a splitter class. Regarding the labels, it might be better if the splitter class (which could have a method that gives back labels aligned with the data) doesn't add dimensions to the data, but rather returns a separate data-array (which the user could then choose to append to his data). |
Just checked the source code, even with the current implementation, it seems the function makes a copy of the data and return it to the user (see code line 57) So the user is supposed to call this function multiple times if you want to generate train/test splits in different ways, for instance: from sklearn.model_selection import ShuffleSplit, KFold
splitter = KFold(n_splits=3)
ds1 = s2spy.traintest.split_groups(splitter, ds)
splitter = ShuffleSplit(n_splits=3)
ds2 = s2spy.traintest.split_groups(splitter, ds) |
But I do notice that for pandas dataframe the changes are made on the input data. We might want to fix this. But given that we would like to make |
Sorry this issue might be obsolete soon. I agree that the dimension 'split' with labels (train, test ,skip) do not have be added by default. I'm currently working on a new solution related to #46, where traintest_split is a class and adding labels is optional via Can we pause this discussion until I have a new draft traintest.py? |
When re-initializing the train-test split, it fails because the dimension 'split' already exists.
This is easily solved by:
if 'split' not in data.dims: data.expand_dims("split")
However, it also exposes an important issue. When re-initializing the traintest split, it should give feedback when traintest split is changed. This communication is purely for clarification purposes. The user might not be expecting the traintest to change when using
sklearn.model_selection.ShuffleSplit
(because the user might assume the seed is fixed), but it will do that. Hence, there will need to be a test, checking whether the splits that were already present in data are identical the ones that were generated.The text was updated successfully, but these errors were encountered: