Missing train/test split in exercise 5 #1

Open · kousu opened this issue Dec 10, 2022 · 2 comments

kousu commented Dec 10, 2022

# Question: what is the issue with the code below?
X_reduced = SelectKBest(f_regression).fit_transform(X, y)
scores = cross_validate(Ridge(), X_reduced, y)["test_score"]
print("feature selection in 'preprocessing':", scores)

asks "what is wrong with this code?"

The solution claims to fit on the training set only, implying that was the main problem:

# Now fitting the whole pipeline on the training set only

but doesn't seem to do train-test splitting either:

model = make_pipeline(SelectKBest(f_regression), Ridge())
scores_pipe = cross_validate(model, X, y)["test_score"]
# TODO_END
print("feature selection on train set:", scores_pipe)

The plot's labels likewise claim that the solution selects features on the training set only:

plt.boxplot(
    [scores_pipe, scores],
    vert=False,
    labels=[
        "feature selection on train set",
        "feature selection on whole data",
    ],
)

[Figure 1: boxplot comparing the two sets of cross-validation scores]

but @valosekj and I don't understand how that is true. What it's actually plotting is the CV scores generated on the reduced data vs the CV scores on the full set (but trained on the reduced set). So it's just demonstrating the effect of overfitting. That's not the same as a train/test split.

By the way, the sklearn docs show using train_test_split explicitly with Pipeline (which is what make_pipeline constructs):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
pipe = Pipeline([('scaler', StandardScaler()), ('svc', SVC())])
# The pipeline can be used as any other estimator
# and avoids leaking the test set into the train set
pipe.fit(X_train, y_train)

So is this solution wrong?

jeromedockes (Contributor) commented

Hi, thanks for looking at the exercise in more detail and asking your
question here!

The data is split into training and testing sets; that is done by
the cross_validate function.

For each fold of cross-validation, cross_validate will:

  • split the data into a train set and a test set
  • fit the estimator (in this case the Pipeline) on the train set
  • evaluate it on the test set and remember the score

Finally, it returns all the resulting scores.

So it does the same thing as the example from the scikit-learn documentation you
show (using train_test_split), but it does it 5 times (for 5 different splits)
instead of once.
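To make that concrete, here is a minimal sketch (mine, not from the exercise) of the loop cross_validate runs for the solution, assuming its defaults for a regressor: a 5-fold KFold splitter and the estimator's own score method (R² for Ridge):

from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline

X, y = make_regression(noise=10, n_features=5000, random_state=0)
model = make_pipeline(SelectKBest(f_regression), Ridge())

scores = []
for train_idx, test_idx in KFold(n_splits=5).split(X, y):
    # fit the whole pipeline -- feature selection included -- on the train set only
    model.fit(X[train_idx], y[train_idx])
    # evaluate on the held-out test set and remember the score
    scores.append(model.score(X[test_idx], y[test_idx]))
print("feature selection on train set:", scores)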

(As the train_test_split documentation notes, train_test_split is
equivalent to:

next(ShuffleSplit().split(X, y))

ShuffleSplit is the same as the default cross-validation iterator used
by cross_validate for regression problems, i.e. KFold, except that
ShuffleSplit shuffles the samples before each split.)
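As a quick sanity check (my sketch, not from the documentation), the two produce identical splits once test_size and random_state are matched:

import numpy as np
from sklearn.model_selection import ShuffleSplit, train_test_split

X, y = np.arange(20).reshape(10, 2), np.arange(10)

# one shuffled split drawn from the ShuffleSplit iterator...
train_idx, test_idx = next(ShuffleSplit(test_size=0.2, random_state=0).split(X, y))
# ...matches the split train_test_split produces with the same parameters
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

assert (X[train_idx] == X_train).all() and (X[test_idx] == X_test).all()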

The difference between the first and the second box in the exercise is
that in the first case, the feature selector has seen the whole data,
whereas in the second case, it is fitted on the training set of each
cross-validation split.

Rewriting the exercise solution with one split instead of 5 (i.e. with
train_test_split instead of cross_validate) would look like this:

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from matplotlib import pyplot as plt

X, y = make_regression(noise=10, n_features=5000, random_state=0)

# Question: what is the issue with the code below?

X_reduced = SelectKBest(f_regression).fit_transform(X, y)
X_reduced_train, X_reduced_test, y_train, y_test = train_test_split(
    X_reduced, y, shuffle=False, test_size=0.2
)
ridge = Ridge().fit(X_reduced_train, y_train)
predictions = ridge.predict(X_reduced_test)
score = r2_score(y_test, predictions)
print("feature selection in 'preprocessing':", score)

# Now fitting the whole pipeline on the training set only

# Solution: use `make_pipeline` to chain a `SelectKBest` and a `Ridge`, then
# fit and evaluate the whole pipeline, treated as a single model, on one
# train/test split.
# See: https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html
model = make_pipeline(SelectKBest(f_regression), Ridge())
X_train, X_test, y_train, y_test = train_test_split(
    X, y, shuffle=False, test_size=0.2
)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
score_pipe = r2_score(y_test, predictions)
print("feature selection on train set:", score_pipe)

It prints:

feature selection in 'preprocessing': 0.8752332046109655
feature selection on train set: 0.2836725353159354

As you can see, these correspond to the last (5th) split in the
cross_validate results. Indeed, the original solution (using
cross_validate) prints:

feature selection in 'preprocessing': [0.81169757 0.63046326 0.54143034 0.72676923 0.8752332 ]
feature selection on train set: [ 0.11577751  0.0439333  -0.27625968  0.32327364  0.28367254]

(note the last value in each list)
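
That correspondence is not a coincidence: with shuffle=False and test_size=0.2, train_test_split holds out the last 20% of the samples, which is exactly the test fold of the 5th (last) KFold split. A small sketch (mine, not from the thread) that verifies this:

import numpy as np
from sklearn.model_selection import KFold, train_test_split

X, y = np.arange(200).reshape(100, 2), np.arange(100)

# the test fold of the 5th (last) KFold split is the final 20% of the samples...
*_, (train_idx, test_idx) = KFold(n_splits=5).split(X, y)
# ...and so is the test set of an unshuffled 80/20 train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=False, test_size=0.2)

assert (X[train_idx] == X_train).all() and (X[test_idx] == X_test).all()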

I hope this helps clarify things, but if not, don't hesitate to ask more
questions here!

kousu (Author) commented Dec 14, 2022

Hello! Thank you for the time you kindly spent writing up this answer!

I started working through it for myself and I have some questions and suggestions, but then I got busy with spine scans and whatnot. I'll try to get back to you this weekend though!
