Implement train/test splits iterator #74

geek-yang · 2022-08-10T14:40:24Z

In this PR we implement an iterator for train/test splits, which enables the user to easily fetch the train/test data for each split. The potential use cases are listed in issue #71.

Create generator function for train/test iteration in traintest.py
~~Support xr.Dataset and pd.DataFrame~~ Only xarray
Add examples in notebooks.
Support multiple inputs

This PR closes #71 and AI4S2S/lilio#46.

review-notebook-app · 2022-08-10T14:45:35Z

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

Peter9192 · 2022-08-24T10:11:36Z

s2spy/traintest.py

+                else:
+                    raise ValueError("Input y should be of type xr.Dataset or pd.DataFrame")
+
+            yield train_Xs, train_y, test_Xs, test_y


If I read this code correctly, this will iterate over the input data args, whereas I would expect a loop over the splits.
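To illustrate the loop order suggested here, the following is a minimal sketch in plain Python, not the s2spy implementation. Both `contiguous_folds` and `split_inputs` are hypothetical helpers, and plain lists stand in for xarray objects. The outer loop runs over the splits, and the same indices are applied to every input inside it.

```python
from typing import Iterator, List, Sequence, Tuple


def contiguous_folds(n_samples: int, n_splits: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Hypothetical stand-in for a splitter: yields (train_idx, test_idx)."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        stop = start + base + (1 if fold < extra else 0)
        test_idx = list(range(start, stop))
        train_idx = [i for i in range(n_samples) if not start <= i < stop]
        yield train_idx, test_idx
        start = stop


def split_inputs(xs: Sequence[Sequence], y: Sequence, n_splits: int):
    """Yield (train_xs, train_y, test_xs, test_y) once per split."""
    # Outer loop: one iteration per split.
    for train_idx, test_idx in contiguous_folds(len(y), n_splits):
        # Inner step: apply the same indices to every input.
        train_xs = [[x[i] for i in train_idx] for x in xs]
        test_xs = [[x[i] for i in test_idx] for x in xs]
        train_y = [y[i] for i in train_idx]
        test_y = [y[i] for i in test_idx]
        yield train_xs, train_y, test_xs, test_y


x1 = [0, 1, 2, 3, 4, 5]
x2 = [10, 11, 12, 13, 14, 15]
y = [0, 0, 1, 1, 0, 1]
splits = list(split_inputs([x1, x2], y, n_splits=3))
```

With this ordering the number of yielded tuples equals the number of splits, regardless of how many inputs are passed.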

Peter9192 · 2022-08-25T12:38:55Z

Finally, green CI! @geek-yang @semvijverberg please briefly review, as we already discussed it yesterday. Since then I've added tests, tests, tests, type annotations, and some additional sanity checks.

geek-yang · 2022-08-25T14:34:38Z

@Peter9192 Nice work! The implementation is very elegant! I have a few very small comments to add. Since this PR was opened by me, I cannot approve it. I would prefer @semvijverberg to take care of it. Well done!

geek-yang · 2022-08-25T14:41:39Z

s2spy/traintest.py

+        if x[dim].size <=1:
+            raise ValueError(
+                f"Cannot split: need at least 2 values along dimension {dim}"
+            )


I'm not sure this is needed. The splitters in sklearn will take care of this. For instance, if we only have one value and use k-fold (e.g. n_splits=3) to split it, it will complain:

    import numpy as np
    import pandas as pd
    import xarray as xr
    import s2spy.time
    import s2spy.traintest
    from sklearn.model_selection import KFold

    # create data
    n = 8
    time_index = pd.date_range("20151020", periods=n, freq="60d")
    time_coord = {"time": time_index}
    x1 = xr.DataArray(np.random.randn(n), coords=time_coord, name="precursor1")

    # resample
    calendar = s2spy.time.AdventCalendar(anchor=(10, 15), freq="180d")
    calendar.map_to_data(x1)
    x1 = s2spy.time.resample(calendar, x1)

    # cross validate
    kfold = KFold(n_splits=3)
    cv = s2spy.traintest.TrainTestSplit(kfold)
    for x1_train, x1_test in cv.split(x1):
        print("Train:", x1_train.anchor_year.values)
        print("Test:", x1_test.anchor_year.values)

    >>> ValueError: Cannot have number of splits n_splits=3 greater than the number of samples: n_samples=1.

I would suggest removing this check and letting the splitters take care of it, since each splitter gives a more informative error message of its own.

In the tests (test_kfold_too_short), this produces a much more obscure error message:

    --> 325         raise TypeError(
        326             "Singleton array %r cannot be considered a valid collection." % x
        327         )

Hence I think it is worth adding our own custom error message. However, since sklearn apparently raises different types of errors (ValueError or TypeError), rather than catching and re-raising them, we might be better off with just our own small sanity check. I adapted the message slightly.
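As a rough sketch of such an up-front sanity check: the helper `check_splittable` below is hypothetical, and a plain mapping stands in for the coordinates of an xarray object. It raises a single, predictable error type before any splitter is called.

```python
from typing import Mapping, Sequence


def check_splittable(coords: Mapping[str, Sequence], dim: str) -> None:
    """Raise ValueError when `dim` has too few values to split.

    Hypothetical helper: `coords` is a plain mapping standing in for the
    coordinates of an xarray object, not the actual s2spy code.
    """
    if len(coords[dim]) <= 1:
        raise ValueError(
            f"Cannot split: need at least 2 values along dimension {dim}"
        )


# A single anchor year: the check fails early, always with a ValueError.
message = ""
try:
    check_splittable({"anchor_year": [2015]}, "anchor_year")
except ValueError as err:
    message = str(err)
```

Callers then only need to handle one exception type, whichever splitter is plugged in afterwards.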

geek-yang · 2022-08-25T14:57:29Z

s2spy/traintest.py

-    """Splits calendar resampled data into train/test groups, based on the input key.
-    As splitter, a Splitter Class such as sklearn's KFold can be passed.
+# Mypy type aliases
+X = Union[xr.DataArray, List[xr.DataArray]]


An interesting finding: pylint complains on my machine about the single-letter variable name:
Variable name "X" doesn't conform to snake_case naming style pylint(invalid-name)
But since the name follows machine learning vocabulary and doesn't cause confusion, I think it is fine, which is presumably why sklearn does the same.

But just to be friendly to users (especially from the climate science community) who are new to the ML field, maybe we can add one line to the docstring at the top:

"In this module, X (or x) refers to the input data (features) and y refers to the target data. This is the convention widely used by the machine learning community. For more information, please check the machine learning vocabulary of scikit-learn in their code examples: https://scikit-learn.org/stable/tutorial/basic/tutorial.html"

Feel free to add it if you think it is nice to mention.

These Pylint warnings do not show up in prospector because strictness is set to medium. Setting it to high will trigger them in prospector as well.

After our discussion (we agreed to allow certain short/capitalized variable names), I looked into options for configuring pylint. Interestingly, if I add configuration options to pyproject.toml, they are picked up by pylint when run through prospector, but not by pylint run standalone. Conversely, adding them to setup.cfg yields the opposite result. With a .pylintrc file, both tools pick up the right configuration.
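For illustration only, a .pylintrc of this kind might use pylint's `good-names` option. The contents below are an assumption: the names actually allowed in this PR are not shown in the thread. The sketch parses a sample file with Python's `configparser` to show the structure.

```python
import configparser

# Hypothetical .pylintrc contents; the names allowed in the PR are not
# shown in this thread, so this list is an assumption.
PYLINTRC = """
[BASIC]
good-names=X,y,i,j,k
"""

parser = configparser.ConfigParser()
parser.read_string(PYLINTRC)

# good-names is a comma-separated list of names that are always accepted.
good_names = [name.strip() for name in parser["BASIC"]["good-names"].split(",")]
print(good_names)
```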

@geek-yang I think vscode should also pick up the new rule. Can you verify?

Thanks for the info. Lesson learned 😉. I just checked vscode and yes, it picks up the new rule. Thanks for exploring this.

Peter9192

Since Yang can't approve, I'll do it for him ;-)

sonarcloud · 2022-08-26T11:59:28Z

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities
0 Security Hotspots
2 Code Smells

89.4% Coverage
0.0% Duplication

semvijverberg

Dear @geek-yang @Peter9192,

I'm sorry, I don't have time to properly review this PR. But I'm confident you did a good job ;).

Cheers,
Sem

Yang added 2 commits August 10, 2022 16:29

create generator function for train/test iteration and support xarray

1e11a4c

add example of using split_iterate in notebook

4bad5ee

Yang added 2 commits August 10, 2022 17:50

implement iterator for pandas dataframe

6888a63

update notebook for iterator of pandas data

488a3a0

BSchilperoort mentioned this pull request Aug 11, 2022

Add iterator feature to traintest #71

Closed

Yang and others added 3 commits August 12, 2022 11:17

Merge branch 'main' into train_test_iterator

3822018

update notebooks using split iterator

9f94fab

progress with traintest iterator

e6a8ef8

semvijverberg mentioned this pull request Aug 24, 2022

reinitialization issue fixed and warning when traintest split is updated #88

Closed

Peter9192 requested changes Aug 24, 2022

Peter9192 added 11 commits August 24, 2022 15:42

Overhaul implementation of traintest splitter

9348ee9

Merge remote-tracking branch 'origin/main' into train_test_iterator

0875a2e

Change return signature and format a bit

96a336c

tests for new implementation

f3c56e3

update tutorial notebook

af654ff

fix tests

1c30184

Mypy and lint fixes

3eacecb

more lint fixes

b4a6cbd

add space

e11f5ab

Are you finally happy prospector

b9fd09e

and now

5f86497

Peter9192 marked this pull request as ready for review August 25, 2022 12:37

geek-yang commented Aug 25, 2022

geek-yang commented Aug 25, 2022

geek-yang commented Aug 25, 2022

Apply suggestions from code review

0cd3df3

Co-authored-by: Yang <[email protected]>

Peter9192 added 3 commits August 26, 2022 10:26

Rename X to XType for clarity

c51b8a3

Silence agreed-upon pylint warnings

c77252a

modify error message

d5de862

Peter9192 approved these changes Aug 26, 2022

semvijverberg reviewed Aug 30, 2022

geek-yang merged commit 1c2b047 into main Aug 31, 2022

geek-yang deleted the train_test_iterator branch August 31, 2022 10:13
