Missing train/test split in exercise 5 #1
Hi, thanks for looking at the exercise in more detail and asking your question. The data is split into training and testing sets; that is done by the cross-validation itself: for each fold of cross-validation, the model is fitted on the training part and scored on the held-out testing part.

So it does the same thing as the example from the scikit-learn documentation you linked.

The difference between the first and the second box in the exercise is whether the feature selection is fitted on all the data (including the test part) or only on the training part. Rewriting the exercise solution with one split instead of 5 (i.e. with `train_test_split` instead of cross-validation):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X, y = make_regression(noise=10, n_features=5000, random_state=0)

# Question: what is the issue with the code below?
X_reduced = SelectKBest(f_regression).fit_transform(X, y)
X_reduced_train, X_reduced_test, y_train, y_test = train_test_split(
    X_reduced, y, shuffle=False, test_size=0.2
)
ridge = Ridge().fit(X_reduced_train, y_train)
predictions = ridge.predict(X_reduced_test)
score = r2_score(y_test, predictions)
print("feature selection in 'preprocessing':", score)

# Now fitting the whole pipeline (SelectKBest chained with Ridge) on the
# training set only
model = make_pipeline(SelectKBest(f_regression), Ridge())
X_train, X_test, y_train, y_test = train_test_split(
    X, y, shuffle=False, test_size=0.2
)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
score_pipe = r2_score(y_test, predictions)
print("feature selection on train set:", score_pipe)
```

it prints:
As you can see, these correspond to the last (5th) split in the cross-validation (note the last value in each list). I hope this helps clarify things, but if not, don't hesitate to ask more questions.
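To make the equivalence above concrete, here is a small sketch (it uses the same synthetic data as the exercise, but the fold bookkeeping is my own, not part of the repo): with no shuffling, the 5th of 5 consecutive folds holds out exactly the same 20% of samples as `train_test_split(shuffle=False, test_size=0.2)`, so the last cross-validation score matches the single-split score.

```python
# Sketch: cross_validate splits internally; with no shuffling its 5th fold
# equals train_test_split(shuffle=False, test_size=0.2).
# Assumes the exercise's synthetic data; the comparison itself is mine.
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_validate, train_test_split
from sklearn.pipeline import make_pipeline

X, y = make_regression(noise=10, n_features=5000, random_state=0)
model = make_pipeline(SelectKBest(f_regression), Ridge())

# 5 consecutive folds (no shuffle): each fold's model is fitted on that
# fold's training part only and scored on its held-out part.
cv_scores = cross_validate(model, X, y, cv=KFold(n_splits=5))["test_score"]

# The 5th fold trains on the first 80% and tests on the last 20% -- the same
# split train_test_split produces with shuffle=False, test_size=0.2.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, shuffle=False, test_size=0.2
)
single_score = model.fit(X_train, y_train).score(X_test, y_test)

print(cv_scores[-1], single_score)  # the two values agree
```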
Hello! Thank you for the time you kindly spent writing this answer up! I started working through it myself and I have some questions and suggestions, but then I got busy with spine scans and whatnot. I'll try to get back to you this weekend, though!
The exercise (`main-edu-courses-ml/ml_model_selection_and_validation/exercises/ex_05_feature_selection_solutions.py`, lines 10 to 14 in 7d354ee) asks "what is wrong with this code?"

The solution claims to use a training set and implies that was the big problem (line 16 in 7d354ee), but it doesn't seem to do train-test splitting either (lines 27 to 30).

The solution's comments also claim that it plots scores only on training data (lines 32 to 39), but @valosekj and I don't understand how that is true. What it is actually plotting is the CV scores generated on the reduced data vs. the CV scores on the full set (but trained on the reduced set). So it is just demonstrating the effect of overfitting; that is not the same as a train/test split.
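The overfitting/leakage effect in question can be isolated with a deliberately signal-free dataset (this setup is mine, not the exercise's data): when `SelectKBest` sees all of `y` before cross-validation, the CV scores on pure noise come out spuriously high, whereas putting the selection inside the pipeline keeps them honest.

```python
# Sketch of the leakage effect on pure-noise data (my own setup, not the
# exercise's): selecting features on the full data before CV inflates scores.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.RandomState(0)
X = rng.randn(100, 5000)  # pure noise: no real relationship to y
y = rng.randn(100)

# Leaky: the 10 "best" features are chosen using all of y, including the
# samples that later serve as each fold's test set.
X_reduced = SelectKBest(f_regression).fit_transform(X, y)
leaky = cross_val_score(Ridge(), X_reduced, y)

# Leak-free: the selection is refitted on the training part of every fold.
clean = cross_val_score(make_pipeline(SelectKBest(f_regression), Ridge()), X, y)

# The leaky mean comes out higher even though there is no signal at all.
print(leaky.mean(), clean.mean())
```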
By the way, the sklearn docs show using `train_test_split` explicitly with `Pipeline` (a.k.a. `make_pipeline`). So is this solution wrong?
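For reference, the pattern from the scikit-learn docs mentioned above looks roughly like this (a sketch reusing this exercise's synthetic data; the parameter choices are mine): split first, then fit the whole pipeline on the training part only, so the feature selection never sees the test samples.

```python
# Sketch of the split-then-fit-pipeline pattern (my parameter choices):
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X, y = make_regression(noise=10, n_features=5000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = make_pipeline(SelectKBest(f_regression), Ridge())
pipe.fit(X_train, y_train)            # SelectKBest is fitted on X_train only
test_r2 = pipe.score(X_test, y_test)  # evaluated on truly held-out data
print(test_r2)
```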