Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement train/test splits iterator #74

Merged
merged 22 commits into from
Aug 31, 2022
Merged

Implement train/test splits iterator #74

merged 22 commits into from
Aug 31, 2022

Conversation

geek-yang
Copy link
Member

@geek-yang geek-yang commented Aug 10, 2022

In this PR we implement an iterator for train/test splits, which enable the user to easily fetch train/test data from each split. The potential usecases are listed in issue #71.

  • Create generator function for train/test iteration in traintest.py
  • Support xr.Dataset and pd.DataFrame Only xarray
  • Add examples in notebooks.
  • Support multiple inputs

This PR closes #71. And AI4S2S/lilio#46

@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

s2spy/traintest.py Outdated Show resolved Hide resolved
s2spy/traintest.py Outdated Show resolved Hide resolved
s2spy/traintest.py Outdated Show resolved Hide resolved
s2spy/traintest.py Outdated Show resolved Hide resolved
else:
raise ValueError("Input y should be of type xr.Dataset or pd.DataFrame")

yield train_Xs, train_y, test_Xs, test_y
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I read this code correctly, this will iterate over the input data args, whereas I would expect a loop over the splits.

@Peter9192 Peter9192 marked this pull request as ready for review August 25, 2022 12:37
@Peter9192
Copy link
Contributor

Finally, green CI! @geek-yang @semvijverberg please briefly review as we've already discussed it yesterday. Since then I've added tests, tests, tests, type annotations, and some additional sanity checks.

@geek-yang
Copy link
Member Author

@Peter9192 Nice work! The implementation is very elegant! I have a few very small comments to add. Since this PR is opened by me, I can not approve it. I would prefer @semvijverberg to take care of it. Well done!

Comment on lines 96 to 99
if x[dim].size <=1:
raise ValueError(
f"Cannot split: need at least 2 values along dimension {dim}"
)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not so sure if this is needed. The splitters in sklearn will take care of this. For instance, if we only have one value and call k-folds method (e.g. n_splits=3) to split it, it will complain:

# create data
n = 8
time_index = pd.date_range("20151020", periods=n, freq="60d")
time_coord = {"time": time_index}
x1 = xr.DataArray(np.random.randn(n), coords=time_coord, name="precursor1")
# resample
calendar = s2spy.time.AdventCalendar(anchor=(10, 15), freq="180d")
calendar.map_to_data(x1) 
x1 = s2spy.time.resample(calendar, x1)
# cross validate
kfold = KFold(n_splits=3)
cv = s2spy.traintest.TrainTestSplit(kfold)
for x1_train, x1_test in cv.split(x1):
    print("Train:", x1_train.anchor_year.values)
    print("Test:", x1_test.anchor_year.values)

>>> ValueError: Cannot have number of splits n_splits=3 greater than the number of samples: n_samples=1.

I would suggest to remove this and let the splitters take care of it, as the error message is more informative with different splitter.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the tests (test_kfold_too_short), this produces a much more obscure error message:

--> 325             raise TypeError(
    326                 "Singleton array %r cannot be considered a valid collection." % x
    327             )

Hence I think it is worth adding our own, custom error message. However, since sklearn apparently raises different types of errors (ValueError or TypeError), rather than catching it and re-raising, we might be better of with just our own small sanity check. I adapted the message slightly.

s2spy/traintest.py Outdated Show resolved Hide resolved
s2spy/traintest.py Outdated Show resolved Hide resolved
"""Splits calendar resampled data into train/test groups, based on the input key.
As splitter, a Splitter Class such as sklearn's KFold can be passed.
# Mypy type aliases
X = Union[xr.DataArray, List[xr.DataArray]]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An interesting finding, pylint complains on my machine about single letter variable:
Variable name "X" doesn't conform to snake_case naming stylepylint(invalid-name)
But since it follows the vocabulary of machine learning and it doesn't cause confusion, I think that's why in sklearn people also does this.

But just to be friendly to the users (especially from climate science community) who is new to the ML field, maybe we can add one line in the docstrings at the top:

"In this module, X(or x) refers to training data and y refers to testing data. This is the convention widely used by the machine learning community and for more information please check the machine learning vocabulary of scikit-learn in their code examples: https://scikit-learn.org/stable/tutorial/basic/tutorial.html"

Feel free to add it if you think it is nice to mention.

Copy link
Contributor

@Peter9192 Peter9192 Aug 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These Pylint warnings do not show up in prospector because strictness is set to medium. Setting it to high will trigger them in prospector as well.

After our discussion (we agreed to allow certain short/capitalized) variable names, I looked into options for configuring pylint. Interestingly, if I add configuration options in pyproject.toml, they are picked up by pylint run through prospector, but not by pylint run as standalone. Vice versa, adding them to setup.cfg yields the opposite result. By adding .pylintrc, both tools picked up the right configuration.

@geek-yang I think vscode should also pick up the new rule. Can you verify?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the info. Lessons learned 😉 . Just checked vscode and yes, it picks up the new rule. Thanks for exploring this.

Copy link
Contributor

@Peter9192 Peter9192 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since Yang can't approve, I'll do it for him ;-)

@sonarcloud
Copy link

sonarcloud bot commented Aug 26, 2022

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 2 Code Smells

89.4% 89.4% Coverage
0.0% 0.0% Duplication

Copy link
Member

@semvijverberg semvijverberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dear @geek-yang @Peter9192,

I'm sorry, I don't have time to properly review this PR. But I'm confident you did a good job ;).

Cheers,
Sem

@geek-yang geek-yang merged commit 1c2b047 into main Aug 31, 2022
@geek-yang geek-yang deleted the train_test_iterator branch August 31, 2022 10:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add iterator feature to traintest
3 participants