Missing train/test split in exercise 5 #1

Open · kousu opened this issue Dec 10, 2022 · 2 comments

kousu commented Dec 10, 2022

# Question: what is the issue with the code below?
X_reduced = SelectKBest(f_regression).fit_transform(X, y)
scores = cross_validate(Ridge(), X_reduced, y)["test_score"]
print("feature selection in 'preprocessing':", scores)

asks "what is wrong with this code?"

The solution claims to fit on the training set only, implying that was the main problem:

# Now fitting the whole pipeline on the training set only

but doesn't seem to do train-test splitting either:

model = make_pipeline(SelectKBest(f_regression), Ridge())
scores_pipe = cross_validate(model, X, y)["test_score"]
# TODO_END
print("feature selection on train set:", scores_pipe)

The plot's labels likewise claim that the solution selects features on the training set only:

plt.boxplot(
    [scores_pipe, scores],
    vert=False,
    labels=[
        "feature selection on train set",
        "feature selection on whole data",
    ],
)

[Figure 1: boxplot comparing the two sets of cross-validation scores]

but @valosekj and I don't understand how that is true. What it's actually plotting is the CV scores generated on the reduced data vs the CV scores on the full set (but trained on the reduced set). So it's just demonstrating the effect of overfitting. That's not the same as a train/test split.

By the way, the sklearn docs show using train_test_split explicitly with Pipeline (which is what make_pipeline constructs):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
pipe = Pipeline([('scaler', StandardScaler()), ('svc', SVC())])
# The pipeline can be used as any other estimator
# and avoids leaking the test set into the train set
pipe.fit(X_train, y_train)

So is this solution wrong?

jeromedockes (Contributor) commented

Hi, thanks for looking at the exercise in more detail and asking your
question here!

The data is split into training and testing sets; that is done by
the cross_validate function.

For each fold of cross-validation, cross_validate will:

  • split the data into a train set and a test set
  • fit the estimator (in this case the Pipeline) on the train set
  • evaluate it on the test set and remember the score

Finally, it returns all the resulting scores.

So it does the same thing as the example from the scikit-learn documentation you
show (using train_test_split), but it does it 5 times (for 5 different splits)
instead of once.
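To make that concrete, here is a minimal sketch (mine, not from the exercise) of the loop cross_validate runs for the solution, assuming its defaults for a regressor: a 5-fold KFold splitter and the estimator's own score method (R² for Ridge):

from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline

X, y = make_regression(noise=10, n_features=5000, random_state=0)
model = make_pipeline(SelectKBest(f_regression), Ridge())

scores = []
for train_idx, test_idx in KFold(n_splits=5).split(X, y):
    # fit the whole pipeline -- feature selection included -- on the train set only
    model.fit(X[train_idx], y[train_idx])
    # evaluate on the held-out test set and remember the score
    scores.append(model.score(X[test_idx], y[test_idx]))
print("feature selection on train set:", scores)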

(As the train_test_split documentation notes, train_test_split is
equivalent to:

next(ShuffleSplit().split(X, y))

ShuffleSplit is the same as the default cross-validation iterator used
by cross_validate for regression problems, i.e. KFold, except that
ShuffleSplit shuffles the samples before each split.)
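As a quick sanity check (my sketch, not from the documentation), the two produce identical splits once test_size and random_state are matched:

import numpy as np
from sklearn.model_selection import ShuffleSplit, train_test_split

X, y = np.arange(20).reshape(10, 2), np.arange(10)

# one shuffled split drawn from the ShuffleSplit iterator...
train_idx, test_idx = next(ShuffleSplit(test_size=0.2, random_state=0).split(X, y))
# ...matches the split train_test_split produces with the same parameters
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

assert (X[train_idx] == X_train).all() and (X[test_idx] == X_test).all()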

The difference between the first and the second box in the exercise is
that in the first case, the feature selector has seen the whole data,
whereas in the second case, it is fitted on the training set of each
cross-validation split.

Rewriting the exercise solution with one split instead of 5 (i.e. with
train_test_split instead of cross_validate) would look like this:

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from matplotlib import pyplot as plt

X, y = make_regression(noise=10, n_features=5000, random_state=0)

# Question: what is the issue with the code below?

X_reduced = SelectKBest(f_regression).fit_transform(X, y)
X_reduced_train, X_reduced_test, y_train, y_test = train_test_split(
    X_reduced, y, shuffle=False, test_size=0.2
)
ridge = Ridge().fit(X_reduced_train, y_train)
predictions = ridge.predict(X_reduced_test)
score = r2_score(y_test, predictions)
print("feature selection in 'preprocessing':", score)

# Now fitting the whole pipeline on the training set only

# Solution: use `make_pipeline` to chain a `SelectKBest` and a `Ridge`, then
# fit and evaluate the whole pipeline, treated as a single model, on one
# train/test split.
# See: https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html
model = make_pipeline(SelectKBest(f_regression), Ridge())
X_train, X_test, y_train, y_test = train_test_split(
    X, y, shuffle=False, test_size=0.2
)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
score_pipe = r2_score(y_test, predictions)
print("feature selection on train set:", score_pipe)

It prints:

feature selection in 'preprocessing': 0.8752332046109655
feature selection on train set: 0.2836725353159354

As you can see, these correspond to the last (5th) split in the
cross_validate results. Indeed, the original solution (using
cross_validate) prints:

feature selection in 'preprocessing': [0.81169757 0.63046326 0.54143034 0.72676923 0.8752332 ]
feature selection on train set: [ 0.11577751  0.0439333  -0.27625968  0.32327364  0.28367254]

(note the last value in each list)
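
That correspondence is not a coincidence: with shuffle=False and test_size=0.2, train_test_split holds out the last 20% of the samples, which is exactly the test fold of the 5th (last) KFold split. A small sketch (mine, not from the thread) that verifies this:

import numpy as np
from sklearn.model_selection import KFold, train_test_split

X, y = np.arange(200).reshape(100, 2), np.arange(100)

# the test fold of the 5th (last) KFold split is the final 20% of the samples...
*_, (train_idx, test_idx) = KFold(n_splits=5).split(X, y)
# ...and so is the test set of an unshuffled 80/20 train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=False, test_size=0.2)

assert (X[train_idx] == X_train).all() and (X[test_idx] == X_test).all()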

I hope this helps clarify things, but if not, don't hesitate to ask more
questions here!

kousu (Author) commented Dec 14, 2022

Hello! Thank you for the time you kindly spent writing up this answer!

I started working through it for myself and I have some questions and suggestions, but then I got busy with spine scans and whatnot. I'll try to get back to you this weekend though!
