allow grid search for classifiers/regressors params in ensemble methods (#259)
rasbt authored Oct 2, 2017
1 parent 00981a6 commit 5735e00
Showing 16 changed files with 266 additions and 22 deletions.
1 change: 1 addition & 0 deletions docs/sources/CHANGELOG.md
@@ -18,6 +18,7 @@ The CHANGELOG for the current development version is available at
- Added `evaluate.permutation_test`, a permutation test for hypothesis testing (or A/B testing) that checks whether two samples come from the same distribution; in other words, a procedure to test the null hypothesis that two groups are not significantly different (e.g., a treatment and a control group).
- Added `'leverage'` and `'conviction'` as evaluation metrics to the `frequent_patterns.association_rules` function. [#246](https://github.com/rasbt/mlxtend/pull/246) & [#247](https://github.com/rasbt/mlxtend/pull/247)
- Added a `loadings_` attribute to `PrincipalComponentAnalysis` to compute the factor loadings of the features on the principal components. [#251](https://github.com/rasbt/mlxtend/pull/251)
- Allow grid search over classifiers/regressors in ensemble and stacking estimators [#259](https://github.com/rasbt/mlxtend/pull/259)

##### Changes

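A minimal sketch of how the new `evaluate.permutation_test` mentioned in the CHANGELOG entry above might be called (the argument names are assumed from the `mlxtend.evaluate` API and the data is made up; treat this as illustrative, not authoritative):

    from mlxtend.evaluate import permutation_test

    # Two small made-up samples, e.g., a treatment and a control group.
    treatment = [28.4, 29.3, 31.2, 29.6, 30.3, 28.8, 29.2]
    control = [33.5, 30.6, 32.4, 32.5, 29.4, 30.9, 31.8]

    # Approximate (Monte Carlo) permutation test of the null hypothesis
    # that both samples come from the same distribution.
    p_value = permutation_test(treatment, control,
                               method='approximate',
                               num_rounds=10000,
                               seed=0)
    print('P value: %.4f' % p_value)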
14 changes: 14 additions & 0 deletions docs/sources/user_guide/classifier/EnsembleVoteClassifier.ipynb
@@ -459,6 +459,20 @@
"grid = grid.fit(iris.data, iris.target)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Note**\n",
"\n",
"The `EnsembleVoteClass` also enables grid search over the `clfs` argument. However, due to the current implementation of `GridSearchCV` in scikit-learn, it is not possible to search over both, differenct classifiers and classifier parameters at the same time. For instance, while the following parameter dictionary works\n",
"\n",
" params = {'randomforestclassifier__n_estimators': [1, 100],\n",
" 'clfs': [(clf1, clf1, clf1), (clf2, clf3)]}\n",
" \n",
"it will use the instance settings of `clf1`, `clf2`, and `clf3` and not overwrite it with the `'n_estimators'` settings from `'randomforestclassifier__n_estimators': [1, 100]`."
]
},
{
"cell_type": "markdown",
"metadata": {},
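To make the note in the cell above concrete, here is a self-contained sketch of a grid search over whole `clfs` lists, modeled on the new `test_classifier_gridsearch` test further down in this commit (illustrative, not part of the diff):

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import GaussianNB
    from sklearn.model_selection import GridSearchCV
    from sklearn import datasets
    from mlxtend.classifier import EnsembleVoteClassifier

    iris = datasets.load_iris()
    clf1 = LogisticRegression(random_state=1)
    clf2 = RandomForestClassifier(random_state=1)
    clf3 = GaussianNB()

    eclf = EnsembleVoteClassifier(clfs=[clf1])

    # Each candidate value for `clfs` is a complete list of base estimators;
    # GridSearchCV swaps in the whole list as one parameter setting.
    params = {'clfs': [[clf1, clf2, clf3], [clf2, clf3]]}

    grid = GridSearchCV(estimator=eclf, param_grid=params, cv=5, refit=True)
    grid.fit(iris.data, iris.target)
    print(grid.best_params_)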
14 changes: 14 additions & 0 deletions docs/sources/user_guide/classifier/StackingCVClassifier.ipynb
@@ -423,6 +423,20 @@
"print('Accuracy: %.2f' % grid.best_score_)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Note**\n",
"\n",
"The `StackingCVClassifier` also enables grid search over the `classifiers` argument. However, due to the current implementation of `GridSearchCV` in scikit-learn, it is not possible to search over both, differenct classifiers and classifier parameters at the same time. For instance, while the following parameter dictionary works\n",
"\n",
" params = {'randomforestclassifier__n_estimators': [1, 100],\n",
" 'classifiers': [(clf1, clf1, clf1), (clf2, clf3)]}\n",
" \n",
"it will use the instance settings of `clf1`, `clf2`, and `clf3` and not overwrite it with the `'n_estimators'` settings from `'randomforestclassifier__n_estimators': [1, 100]`."
]
},
{
"cell_type": "markdown",
"metadata": {},
14 changes: 14 additions & 0 deletions docs/sources/user_guide/classifier/StackingClassifier.ipynb
@@ -400,6 +400,20 @@
"print('Accuracy: %.2f' % grid.best_score_)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Note**\n",
"\n",
"The `StackingClassifier` also enables grid search over the `classifiers` argument. However, due to the current implementation of `GridSearchCV` in scikit-learn, it is not possible to search over both, differenct classifiers and classifier parameters at the same time. For instance, while the following parameter dictionary works\n",
"\n",
" params = {'randomforestclassifier__n_estimators': [1, 100],\n",
" 'classifiers': [(clf1, clf1, clf1), (clf2, clf3)]}\n",
" \n",
"it will use the instance settings of `clf1`, `clf2`, and `clf3` and not overwrite it with the `'n_estimators'` settings from `'randomforestclassifier__n_estimators': [1, 100]`."
]
},
{
"cell_type": "markdown",
"metadata": {},
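A possible workaround for the limitation described in the note above: `GridSearchCV` also accepts a *list* of parameter grids, so the classifier lineup and the per-classifier hyperparameters can be searched in separate grids rather than combined in one. A sketch of this assumed workaround (not part of this commit):

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import GaussianNB
    from sklearn.model_selection import GridSearchCV
    from sklearn import datasets
    from mlxtend.classifier import StackingClassifier

    iris = datasets.load_iris()
    clf1 = LogisticRegression(random_state=1)
    clf2 = RandomForestClassifier(random_state=1)
    clf3 = GaussianNB()

    sclf = StackingClassifier(classifiers=[clf1, clf2, clf3],
                              meta_classifier=LogisticRegression())

    # Two independent grids: the first varies the classifier lineup (each
    # clf keeps its instance settings), the second varies hyperparameters
    # of the lineup fixed at construction time.
    params = [
        {'classifiers': [[clf1, clf2, clf3], [clf2, clf3]]},
        {'randomforestclassifier__n_estimators': [10, 100]},
    ]

    grid = GridSearchCV(estimator=sclf, param_grid=params, cv=5)
    grid.fit(iris.data, iris.target)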
14 changes: 14 additions & 0 deletions docs/sources/user_guide/regressor/StackingCVRegressor.ipynb
@@ -278,6 +278,20 @@
"print('Accuracy: %.2f' % grid.best_score_)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Note**\n",
"\n",
"The `StackingCVRegressor` also enables grid search over the `regressors` argument. However, due to the current implementation of `GridSearchCV` in scikit-learn, it is not possible to search over both, differenct classifiers and classifier parameters at the same time. For instance, while the following parameter dictionary works\n",
"\n",
" params = {'randomforestregressor__n_estimators': [1, 100],\n",
" 'regressors': [(regr1, regr1, regr1), (regr2, regr3)]}\n",
" \n",
"it will use the instance settings of `regr1`, `regr2`, and `regr3` and not overwrite it with the `'n_estimators'` settings from `'randomforestregressor__n_estimators': [1, 100]`."
]
},
{
"cell_type": "markdown",
"metadata": {},
13 changes: 11 additions & 2 deletions docs/sources/user_guide/regressor/StackingRegressor.ipynb
@@ -77,7 +77,9 @@
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from mlxtend.regressor import StackingRegressor\n",
@@ -604,7 +606,14 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"In case we are planning to use a regression algorithm multiple times, all we need to do is to add an additional number suffix in the parameter grid as shown below:"
"**Note**\n",
"\n",
"The `StackingRegressor` also enables grid search over the `regressors` argument. However, due to the current implementation of `GridSearchCV` in scikit-learn, it is not possible to search over both, differenct classifiers and classifier parameters at the same time. For instance, while the following parameter dictionary works\n",
"\n",
" params = {'randomforestregressor__n_estimators': [1, 100],\n",
" 'regressors': [(regr1, regr1, regr1), (regr2, regr3)]}\n",
" \n",
"it will use the instance settings of `regr1`, `regr2`, and `regr3` and not overwrite it with the `'n_estimators'` settings from `'randomforestregressor__n_estimators': [1, 100]`."
]
},
{
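And the regressor-side equivalent of the note above, as a self-contained sketch (the toy data is made up; the `regressors` grid mirrors the new `test_regressor_gridsearch` test in this commit):

    import numpy as np
    from sklearn.linear_model import LinearRegression, Ridge
    from sklearn.svm import SVR
    from sklearn.model_selection import GridSearchCV
    from mlxtend.regressor import StackingRegressor

    # Toy 1-D regression data.
    rng = np.random.RandomState(1)
    X = np.linspace(0, 10, 50).reshape(-1, 1)
    y = np.sin(X).ravel() + rng.normal(scale=0.1, size=50)

    lr = LinearRegression()
    ridge = Ridge(random_state=1)
    svr_rbf = SVR(kernel='rbf')

    stregr = StackingRegressor(regressors=[lr], meta_regressor=svr_rbf)

    # Candidate `regressors` lists are swapped in whole by GridSearchCV;
    # duplicate estimators get numbered name suffixes internally.
    params = {'regressors': [[lr, ridge], [lr, ridge, lr]]}

    grid = GridSearchCV(estimator=stregr, param_grid=params, cv=5)
    grid.fit(X, y)
    print(grid.best_params_)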
5 changes: 1 addition & 4 deletions mlxtend/classifier/ensemble_vote.py
@@ -255,10 +255,7 @@ def get_params(self, deep=True):

for key, value in six.iteritems(super(EnsembleVoteClassifier,
self).get_params(deep=False)):
if key == 'clfs':
continue
else:
out['%s' % key] = value
out['%s' % key] = value
return out

def _predict(self, X):
5 changes: 1 addition & 4 deletions mlxtend/classifier/stacking_classification.py
@@ -141,10 +141,7 @@ def get_params(self, deep=True):

for key, value in six.iteritems(super(StackingClassifier,
self).get_params(deep=False)):
if key in ('classifiers', 'meta-classifier'):
continue
else:
out['%s' % key] = value
out['%s' % key] = value

return out

5 changes: 1 addition & 4 deletions mlxtend/classifier/stacking_cv_classification.py
@@ -245,10 +245,7 @@ def get_params(self, deep=True):

for key, value in six.iteritems(super(StackingCVClassifier,
self).get_params(deep=False)):
if key in ('classifiers', 'meta-classifier'):
continue
else:
out['%s' % key] = value
out['%s' % key] = value

return out

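The three `get_params` edits above (and the two regressor-side ones below) all make the same change: the estimator-list keys are no longer filtered out, so `GridSearchCV` can see them and set them through `set_params` when it clones the estimator. A minimal sketch of the mechanism this unlocks:

    from sklearn.naive_bayes import GaussianNB
    from sklearn.neighbors import KNeighborsClassifier
    from mlxtend.classifier import EnsembleVoteClassifier

    eclf = EnsembleVoteClassifier(clfs=[KNeighborsClassifier(n_neighbors=1)])

    # Before this commit, 'clfs' was skipped by get_params(), and scikit-learn
    # rejects any grid key that get_params() does not report. Now it appears:
    assert 'clfs' in eclf.get_params()

    # ...so the whole estimator list can be replaced, which is exactly what
    # GridSearchCV does for each candidate value of 'clfs' in the grid.
    eclf.set_params(clfs=[GaussianNB()])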
36 changes: 36 additions & 0 deletions mlxtend/classifier/tests/test_ensemble_vote_classifier.py
@@ -8,6 +8,7 @@
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
import numpy as np
from sklearn import datasets
from sklearn.model_selection import GridSearchCV
@@ -86,3 +87,38 @@ def test_EnsembleVoteClassifier_gridsearch_enumerate_names():

grid = GridSearchCV(estimator=eclf, param_grid=params, cv=5)
grid = grid.fit(iris.data, iris.target)


def test_get_params():
clf1 = KNeighborsClassifier(n_neighbors=1)
clf2 = RandomForestClassifier(random_state=1)
clf3 = GaussianNB()
eclf = EnsembleVoteClassifier(clfs=[clf1, clf2, clf3])

got = sorted(list({s.split('__')[0] for s in eclf.get_params().keys()}))
expect = ['clfs',
'gaussiannb',
'kneighborsclassifier',
'randomforestclassifier',
'refit',
'verbose',
'voting',
'weights']
assert got == expect, got


def test_classifier_gridsearch():
clf1 = KNeighborsClassifier(n_neighbors=1)
clf2 = RandomForestClassifier(random_state=1)
clf3 = GaussianNB()
eclf = EnsembleVoteClassifier(clfs=[clf1])

params = {'clfs': [[clf1, clf1, clf1], [clf2, clf3]]}

grid = GridSearchCV(estimator=eclf,
param_grid=params,
cv=5,
refit=True)
grid.fit(X, y)

assert len(grid.best_params_['clfs']) == 2
41 changes: 41 additions & 0 deletions mlxtend/classifier/tests/test_stacking_classifier.py
@@ -241,3 +241,44 @@ def test_use_features_in_secondary_predict_proba():
y_pred = sclf.predict_proba(X[idx])[:, 0]
expect = np.array([0.911, 0.829, 0.885])
np.testing.assert_almost_equal(y_pred, expect, 3)


def test_get_params():
clf1 = KNeighborsClassifier(n_neighbors=1)
clf2 = RandomForestClassifier(random_state=1)
clf3 = GaussianNB()
lr = LogisticRegression()
sclf = StackingClassifier(classifiers=[clf1, clf2, clf3],
meta_classifier=lr)

got = sorted(list({s.split('__')[0] for s in sclf.get_params().keys()}))
expect = ['average_probas',
'classifiers',
'gaussiannb',
'kneighborsclassifier',
'meta-logisticregression',
'meta_classifier',
'randomforestclassifier',
'use_features_in_secondary',
'use_probas',
'verbose']
assert got == expect, got


def test_classifier_gridsearch():
clf1 = KNeighborsClassifier(n_neighbors=1)
clf2 = RandomForestClassifier(random_state=1)
clf3 = GaussianNB()
lr = LogisticRegression()
sclf = StackingClassifier(classifiers=[clf1, clf2, clf3],
meta_classifier=lr)

params = {'classifiers': [[clf1, clf1, clf1], [clf2, clf3]]}

grid = GridSearchCV(estimator=sclf,
param_grid=params,
cv=5,
refit=True)
grid.fit(X, y)

assert len(grid.best_params_['classifiers']) == 2
44 changes: 44 additions & 0 deletions mlxtend/classifier/tests/test_stacking_cv_classifier.py
@@ -11,6 +11,7 @@
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
import numpy as np
from sklearn import datasets
from mlxtend.utils import assert_raises
@@ -246,3 +247,46 @@ def test_pandas():
sclf.fit(X_df, iris.target)
except KeyError as e:
assert 'are NumPy arrays. If X and y are pandas DataFrames' in str(e)


def test_get_params():
clf1 = KNeighborsClassifier(n_neighbors=1)
clf2 = RandomForestClassifier(random_state=1)
clf3 = GaussianNB()
lr = LogisticRegression()
sclf = StackingCVClassifier(classifiers=[clf1, clf2, clf3],
meta_classifier=lr)

got = sorted(list({s.split('__')[0] for s in sclf.get_params().keys()}))
expect = ['classifiers',
'cv',
'gaussiannb',
'kneighborsclassifier',
'meta-logisticregression',
'meta_classifier',
'randomforestclassifier',
'shuffle',
'stratify',
'use_features_in_secondary',
'use_probas',
'verbose']
assert got == expect, got


def test_classifier_gridsearch():
clf1 = KNeighborsClassifier(n_neighbors=1)
clf2 = RandomForestClassifier(random_state=1)
clf3 = GaussianNB()
lr = LogisticRegression()
sclf = StackingCVClassifier(classifiers=[clf1],
meta_classifier=lr)

params = {'classifiers': [[clf1], [clf1, clf2, clf3]]}

grid = GridSearchCV(estimator=sclf,
param_grid=params,
cv=5,
refit=True)
grid.fit(X, y)

assert len(grid.best_params_['classifiers']) == 3
5 changes: 1 addition & 4 deletions mlxtend/regressor/stacking_cv_regression.py
@@ -182,9 +182,6 @@ def get_params(self, deep=True):

for key, value in six.iteritems(super(StackingCVRegressor,
self).get_params(deep=False)):
if key in ('regressors', 'meta-regressor'):
continue
else:
out['%s' % key] = value
out['%s' % key] = value

return out
5 changes: 1 addition & 4 deletions mlxtend/regressor/stacking_regression.py
@@ -131,10 +131,7 @@ def get_params(self, deep=True):

for key, value in six.iteritems(super(StackingRegressor,
self).get_params(deep=False)):
if key in ('regressors', 'meta-regressor'):
continue
else:
out['%s' % key] = value
out['%s' % key] = value

return out

37 changes: 37 additions & 0 deletions mlxtend/regressor/tests/test_cv_stacking_regression.py
@@ -103,3 +103,40 @@ def test_gridsearch_numerate_regr():
grid = grid.fit(X1, y)
got = round(grid.best_score_, 1)
assert got >= 0.1 and got <= 0.2, '%f is wrong' % got


def test_get_params():
lr = LinearRegression()
svr_rbf = SVR(kernel='rbf')
ridge = Ridge(random_state=1)
stregr = StackingCVRegressor(regressors=[ridge, lr],
meta_regressor=svr_rbf)

got = sorted(list({s.split('__')[0] for s in stregr.get_params().keys()}))
expect = ['cv',
'linearregression',
'meta-svr',
'meta_regressor',
'regressors',
'ridge',
'shuffle',
'use_features_in_secondary']
assert got == expect, got


def test_regressor_gridsearch():
lr = LinearRegression()
svr_rbf = SVR(kernel='rbf')
ridge = Ridge(random_state=1)
stregr = StackingCVRegressor(regressors=[lr],
meta_regressor=svr_rbf)

params = {'regressors': [[ridge, lr], [lr, ridge, lr]]}

grid = GridSearchCV(estimator=stregr,
param_grid=params,
cv=5,
refit=True)
grid.fit(X1, y)

assert len(grid.best_params_['regressors']) == 3