-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement train/test splits iterator #74
Conversation
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
s2spy/traintest.py
Outdated
else: | ||
raise ValueError("Input y should be of type xr.Dataset or pd.DataFrame") | ||
|
||
yield train_Xs, train_y, test_Xs, test_y |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I read this code correctly, this will iterate over the input data args, whereas I would expect a loop over the splits.
Finally, green CI! @geek-yang @semvijverberg please briefly review as we've already discussed it yesterday. Since then I've added tests, tests, tests, type annotations, and some additional sanity checks. |
@Peter9192 Nice work! The implementation is very elegant! I have a few very small comments to add. Since this PR is opened by me, I can not approve it. I would prefer @semvijverberg to take care of it. Well done! |
s2spy/traintest.py
Outdated
if x[dim].size <=1: | ||
raise ValueError( | ||
f"Cannot split: need at least 2 values along dimension {dim}" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not so sure if this is needed. The splitters in sklearn will take care of this. For instance, if we only have one value and call k-folds method (e.g. n_splits=3) to split it, it will complain:
# create data
n = 8
time_index = pd.date_range("20151020", periods=n, freq="60d")
time_coord = {"time": time_index}
x1 = xr.DataArray(np.random.randn(n), coords=time_coord, name="precursor1")
# resample
calendar = s2spy.time.AdventCalendar(anchor=(10, 15), freq="180d")
calendar.map_to_data(x1)
x1 = s2spy.time.resample(calendar, x1)
# cross validate
kfold = KFold(n_splits=3)
cv = s2spy.traintest.TrainTestSplit(kfold)
for x1_train, x1_test in cv.split(x1):
print("Train:", x1_train.anchor_year.values)
print("Test:", x1_test.anchor_year.values)
>>> ValueError: Cannot have number of splits n_splits=3 greater than the number of samples: n_samples=1.
I would suggest to remove this and let the splitters take care of it, as the error message is more informative with different splitter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the tests (test_kfold_too_short
), this produces a much more obscure error message:
--> 325 raise TypeError(
326 "Singleton array %r cannot be considered a valid collection." % x
327 )
Hence I think it is worth adding our own, custom error message. However, since sklearn apparently raises different types of errors (ValueError or TypeError), rather than catching it and re-raising, we might be better of with just our own small sanity check. I adapted the message slightly.
s2spy/traintest.py
Outdated
"""Splits calendar resampled data into train/test groups, based on the input key. | ||
As splitter, a Splitter Class such as sklearn's KFold can be passed. | ||
# Mypy type aliases | ||
X = Union[xr.DataArray, List[xr.DataArray]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An interesting finding, pylint
complains on my machine about single letter variable:
Variable name "X" doesn't conform to snake_case naming stylepylint(invalid-name)
But since it follows the vocabulary of machine learning and it doesn't cause confusion, I think that's why in sklearn people also does this.
But just to be friendly to the users (especially from climate science community) who is new to the ML field, maybe we can add one line in the docstrings at the top:
"In this module, X(or x) refers to training data and y refers to testing data. This is the convention widely used by the machine learning community and for more information please check the machine learning vocabulary of scikit-learn in their code examples: https://scikit-learn.org/stable/tutorial/basic/tutorial.html"
Feel free to add it if you think it is nice to mention.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These Pylint warnings do not show up in prospector because strictness is set to medium. Setting it to high will trigger them in prospector as well.
After our discussion (we agreed to allow certain short/capitalized) variable names, I looked into options for configuring pylint. Interestingly, if I add configuration options in pyproject.toml, they are picked up by pylint run through prospector, but not by pylint run as standalone. Vice versa, adding them to setup.cfg yields the opposite result. By adding .pylintrc, both tools picked up the right configuration.
@geek-yang I think vscode should also pick up the new rule. Can you verify?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the info. Lessons learned 😉 . Just checked vscode and yes, it picks up the new rule. Thanks for exploring this.
Co-authored-by: Yang <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since Yang can't approve, I'll do it for him ;-)
Kudos, SonarCloud Quality Gate passed! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dear @geek-yang @Peter9192,
I'm sorry, I don't have time to properly review this PR. But I'm confident you did a good job ;).
Cheers,
Sem
In this PR we implement an iterator for train/test splits, which enable the user to easily fetch train/test data from each split. The potential usecases are listed in issue #71.
traintest.py
SupportOnly xarrayxr.Dataset
andpd.DataFrame
This PR closes #71. And AI4S2S/lilio#46