Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Re-initializing train-test split fails #45

Open
semvijverberg opened this issue Aug 23, 2022 · 4 comments
Open

Re-initializing train-test split fails #45

semvijverberg opened this issue Aug 23, 2022 · 4 comments

Comments

@semvijverberg
Copy link
Member

semvijverberg commented Aug 23, 2022

When re-initializing the train-test split, it fails because the dimension 'split' already exists.
image

This is easily solved by:
if 'split' not in data.dims: data.expand_dims("split")

However, it also exposes an important issue. When re-initializing the traintest split, it should give feedback when traintest split is changed. This communication is purely for clarification purposes. The user might not be expecting the traintest to change when using sklearn.model_selection.ShuffleSplit (because the user might assume the seed is fixed), but it will do that. Hence, there will need to be a test, checking whether the splits that were already present in data are identical the ones that were generated.

@Peter9192
Copy link
Contributor

Related to #46 and AI4S2S/s2spy#71. I think eventually "reinitializing" the traintest splits is not something you should want to do. You just instantiate a new instance of a splitter class. Regarding the labels, it might be better if the splitter class (which could have a method that gives back labels aligned with the data) doesn't add dimensions to the data, but rather returns a separate data-array (which the user could then choose to append to his data).

@geek-yang
Copy link
Member

Just checked the source code, even with the current implementation, it seems the function makes a copy of the data and return it to the user (see code line 57)
https://github.com/AI4S2S/s2spy/blob/d228782c7fa9a6af70cec3f58904bdba48708e56/s2spy/traintest.py#L50-L60

So the user is supposed to call this function multiple times if you want to generate train/test splits in different ways, for instance:

from sklearn.model_selection import ShuffleSplit, KFold
splitter = KFold(n_splits=3)
ds1 = s2spy.traintest.split_groups(splitter, ds)

splitter = ShuffleSplit(n_splits=3)
ds2 = s2spy.traintest.split_groups(splitter, ds)

@geek-yang
Copy link
Member

But I do notice that for pandas dataframe the changes are made on the input data. We might want to fix this. But given that we would like to make add_labels a feature to be called by the user, we can change it later, I assume.

@semvijverberg
Copy link
Member Author

Sorry this issue might be obsolete soon.

I agree that the dimension 'split' with labels (train, test ,skip) do not have be added by default.

I'm currently working on a new solution related to #46, where traintest_split is a class and adding labels is optional via add_labels.

Can we pause this discussion until I have a new draft traintest.py?

@BSchilperoort BSchilperoort transferred this issue from AI4S2S/s2spy Mar 8, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants