Adds sample_weight option to estimators fit method #441

Merged: 10 commits merged into rasbt:master from kota7:fit-with-sample_weight, Sep 24, 2018

Conversation

kota7 (Contributor) commented Sep 23, 2018

Description

Adds sample_weight option to the fit method of estimators.
Aims to cover the following estimators (see the usage sketch after the list):

  • StackingClassifier
  • StackingCVClassifier
  • EnsembleVoteClassifier
  • StackingRegressor
  • StackingCVRegressor
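
A hedged usage sketch (the individual estimators are from scikit-learn; the sample_weight pass-through in fit is what this PR adds):

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.svm import SVR
from mlxtend.regressor import StackingRegressor

X = np.random.rand(40, 1)
y = np.random.rand(40)
w = np.random.random(40)  # one weight per sample

stregr = StackingRegressor(regressors=[LinearRegression(), Ridge()],
                           meta_regressor=SVR(kernel='rbf'))
stregr.fit(X, y, sample_weight=w)  # forwarded to each regressor and the meta_regressor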

Related issues or pull requests

Fixes #438

Pull Request Checklist

  • Added a note about the modification or contribution to the ./docs/sources/CHANGELOG.md file (if applicable)
  • Added appropriate unit test functions in the ./mlxtend/*/tests directories (if applicable)
  • Modified documentation in the corresponding Jupyter Notebook under mlxtend/docs/sources/ (if applicable)
  • Ran nosetests ./mlxtend -sv and made sure that all unit tests pass (for small modifications, it might be sufficient to only run the specific test file, e.g., nosetests ./mlxtend/classifier/tests/test_stacking_cv_classifier.py -sv)
  • Checked for style issues by running flake8 ./mlxtend

pep8speaks commented Sep 23, 2018

Hello @kota7! Thanks for updating the PR.

Line 176:54: W291 trailing whitespace

Line 69:80: E501 line too long (80 > 79 characters)

Comment last updated on September 24, 2018 at 03:55 UTC

@@ -111,6 +111,8 @@ def fit(self, X, y):
             n_features is the number of features.
         y : array-like, shape = [n_samples] or [n_samples, n_targets]
             Target values.
+        sample_weight : array-like, shape = [n_samples], optional
+            Sample weights.

rasbt (Owner):

Could you additionally specify that these are used by both the level-1 and meta-regressors?

kota7 (Contributor, Author):

Sure. And the meta regressor too.

rasbt (Owner):

Yeah, maybe as a docstring, something like:

Sample weights passed as sample_weights to each regressor in the regressors list as well as the meta_regressor.

The rest of the PR looks great btw! Thanks!

coveralls commented Sep 23, 2018

Coverage increased (+0.5%) to 91.556% when pulling 4a7229d on kota7:fit-with-sample_weight into 030c1b7 on rasbt:master.

kota7 (Contributor, Author) commented Sep 24, 2018

@rasbt This thread may not be the best place to talk about this, but when I updated the test script test_stacking_cv_regressor.py, I found that generating another random vector makes existing tests fail.

For example, given this existing setup:

# Some test data
np.random.seed(1)
X1 = np.sort(5 * np.random.rand(40, 1), axis=0)
X2 = np.sort(5 * np.random.rand(40, 2), axis=0)
X3 = np.zeros((40, 3))
y = np.sin(X1).ravel()
y[::5] += 3 * (0.5 - np.random.rand(8))
y2 = np.zeros((40,))

if you add another line:

w = np.random.random(40)

then the tests start to fail. This is presumably because the cross-validation generators share the global random seed with numpy.
To see this, the code below produces two different fold assignments, due to the call to np.random.random after setting the seed.

import numpy as np
from sklearn.model_selection import KFold

np.random.seed(1)
cv = KFold(2, shuffle=True)
print(list(cv.split([1,2,3,4,5,6])))

np.random.seed(1)
np.random.random(10)
cv = KFold(2, shuffle=True)
print(list(cv.split([1,2,3,4,5,6])))

As a side effect, when some test raises an exception, the global random state changes, which makes other tests fail. Because of this, the error report becomes confusing, since tests that have no problem fail too.

A possible workaround is to explicitly define a cross-validation object with a specific random state at the top of the script and keep using it in the tests, as in the sketch below. Would you like me to do this, or do you have any thoughts?
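
A minimal sketch of that workaround, reusing X1 and y from above (assuming StackingCVRegressor's cv parameter accepts a scikit-learn splitter object):

from sklearn.linear_model import Ridge, Lasso
from sklearn.svm import SVR
from sklearn.model_selection import KFold
from mlxtend.regressor import StackingCVRegressor

# Defined once at the top of the test script: with a fixed random_state,
# the fold assignment no longer depends on the global numpy RNG state.
cv = KFold(n_splits=2, shuffle=True, random_state=1)

# ... then reused inside each test:
stregr = StackingCVRegressor(regressors=[Ridge(), Lasso()],
                             meta_regressor=SVR(kernel='rbf'), cv=cv)
stregr.fit(X1, y)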

rasbt (Owner) commented Sep 24, 2018

Thanks for bringing this up. This is actually a bit of a messy unit test design. The random seed should either be set for each function individually or it should be using the random_state via scikit-learn, which is probably the better solution.

I think the reason why the StackingCVRegressor currently doesn't have a random_state parameter is because it uses the sklearn.model_selection.check_cv method, which doesn't support that either.

So, one option would be resetting the random seed in each unit test, but that might unfortunately be a lot of work, because the unit tests might then produce different results, like you mentioned.

Your suggestion

A possible workaround is to explicitly define cross validation object with specific random state at the top of script, and keep using it in tests.

is probably better because, in case we change the StackingCVRegressor in the future to have its own random_state, a fixed KFold object wouldn't require changing all the unit tests again.

rasbt (Owner) commented Sep 24, 2018

Looks great so far, thanks! Are you planning to add this for the other Stacking classes as well? You don't have to, but if you do, I would really appreciate it and wait a bit before merging.

kota7 changed the title from "[WIP] Adds sample_weight option to estimators fit method" to "Adds sample_weight option to estimators fit method" on Sep 24, 2018

kota7 (Contributor, Author) commented Sep 24, 2018

@rasbt Yeah, I made very similar edits to all five estimators. I think it is ready for your review. Thanks!

rasbt (Owner) commented Sep 24, 2018

This looks really nice overall. I am wondering, though: don't the sklearn estimators accept sample_weight=None? That could simplify a lot of the if/else statements currently in there:

if sample_weight is None:
    regr.fit(X, y)
else:
    regr.fit(X, y, sample_weight=sample_weight)

kota7 (Contributor, Author) commented Sep 24, 2018

As far as I know, Lasso, MLPClassifier, and KNeighborsClassifier do not support it, and hence raise an exception if you give sample_weight to them. This piece of the unit tests catches that:

def test_weight_unsupported_with_no_weight():
    # pass no weight to regressors with no weight support
    # should not be a problem
    lr = LinearRegression()
    svr_lin = SVR(kernel='linear')
    ridge = Ridge(random_state=1)
    svr_rbf = SVR(kernel='rbf')
    lasso = Lasso(random_state=1)
    stregr = StackingRegressor(regressors=[svr_lin, lr, ridge, lasso],
                               meta_regressor=svr_rbf)
    stregr.fit(X1, y).predict(X1)
    stregr = StackingRegressor(regressors=[svr_lin, lr, ridge],
                               meta_regressor=lasso)
    stregr.fit(X1, y).predict(X1)

That is, if you code just regr.fit(X, y, sample_weight=sample_weight), then sample_weight=None is passed to Lasso, which raises an error.
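
A hypothetical alternative that keeps the branch in one place by building the keyword arguments conditionally (just a sketch under the same constraint, not what the PR does):

# Only include the keyword when a weight is given, so estimators
# without sample_weight support never receive it.
fit_params = {} if sample_weight is None else {'sample_weight': sample_weight}
regr.fit(X, y, **fit_params)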

rasbt (Owner) commented Sep 24, 2018

Good point! Otherwise the PR seems fine and I'd be happy to merge. Or do you have any additions in mind?

kota7 (Contributor, Author) commented Sep 24, 2018

No more additions. Please merge!

rasbt (Owner) commented Sep 24, 2018

Awesome! Thanks for this PR, really appreciate it!

rasbt merged commit c55d849 into rasbt:master on Sep 24, 2018