GridSearch with pipelines of dataframes #24

Open
mratsim opened this issue Dec 3, 2016 · 3 comments

mratsim commented Dec 3, 2016

Hello again Cédric,

Following your help with transformers, I am now trying to use GridSearchCV to optimize the hyperparameters of a RandomForest.

I have a pipeline with many transformers that works well for cross-validation and for actual prediction. However, I get a MethodError when I use it in a GridSearchCV. It looks like an extra argument of type ScikitLearn.Skcore.ParameterGrid is being passed in my setup:

pipe = Pipelines.Pipeline([ # This works fine for cross-validation, fitting, and predicting
    ("extract_deck", PP_DeckTransformer()),
    ... # A list of 15 transformers
    ("featurize", mapper), # This is a DataFrameMapper that converts to Array
    ("forest", RandomForestClassifier(ntrees=200)) # Hyperparameters: nsubfeatures, partialsampling, maxdepth
])

X_train = train
Y_train = convert(Array, train[:Survived])

# #Cross Validation - check model accuracy -- This is working fine
# crossval = round(cross_val_score(pipe, X_train, Y_train, cv =10), 2)
# print("\n",crossval,"\n")
# print(mean(crossval))

# GridSearch
grid = Dict(:ntrees => 10:30:240,
            :nsubfeatures => 0:1:13,
            :partialsampling => 0.2:0.1:1.0,
            :maxdepth => -1:2:13
)

gridsearch = GridSearchCV(pipe, grid)
fit!(gridsearch, X_train, Y_train)
println("Best hyper-parameters: $(gridsearch.best_params_)")

The error I get is:

ERROR: LoadError: MethodError: no method matching _fit!(::ScikitLearn.Skcore.GridSearchCV, ::DataFrames.DataFrame, ::Array{Int64,1}, ::ScikitLearn.Skcore.ParameterGrid)
Closest candidates are:
  _fit!(::ScikitLearn.Skcore.BaseSearchCV, !Matched::AbstractArray{T,N}, ::Any, ::Any) at /Users/<user>/.julia/v0.5/ScikitLearn/src/grid_search.jl:254
 in fit!(::ScikitLearn.Skcore.GridSearchCV, ::DataFrames.DataFrame, ::Array{Int64,1}) at /Users/<user>/.julia/v0.5/ScikitLearn/src/grid_search.jl:526
 in include_from_node1(::String) at ./loading.jl:488
 in include_from_node1(::String) at /usr/local/Cellar/julia/0.5.0/lib/julia/sys.dylib:?
 in process_options(::Base.JLOptions) at ./client.jl:262
 in _start() at ./client.jl:318
 in _start() at /usr/local/Cellar/julia/0.5.0/lib/julia/sys.dylib:?
while loading /Users/<path>/Kaggle-001-Julia-MagicalForest.jl, in expression starting on line 538

So the call being made is _fit!(::ScikitLearn.Skcore.GridSearchCV, ::DataFrames.DataFrame, ::Array{Int64,1}, ::ScikitLearn.Skcore.ParameterGrid), but the only available method expects an AbstractArray rather than a DataFrame as its second argument. I expected the DataFrame to have been converted by the DataFrameMapper inside the pipeline.

If needed, the full code is here: https://github.com/mratsim/MachineLearning_Kaggle/blob/9c07a64a981a6512e021ae01623212a278fd05d1/Kaggle%20-%20001%20-%20Titanic%20Survivors/Kaggle-001-Julia-MagicalForest.jl#L530

Owner

cstjean commented Dec 4, 2016

Hi, thank you for filing an issue about this. That's definitely a bug. I don't think DataFrames have ever been tested as input to grid search. I just removed the AbstractArray type constraint. Could you please try it again after running Pkg.checkout("ScikitLearn")?

I'll have more time to look into it tomorrow.
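
To illustrate why removing the AbstractArray type matters, here is a minimal sketch in plain Julia 1.x. It is not the ScikitLearn.jl source: `ToyTable`, `strict_fit`, and `relaxed_fit` are hypothetical names made up for this example. `ToyTable` stands in for a DataFrame (which is not an AbstractArray), and the two functions stand in for the constrained and unconstrained versions of a fit method.

```julia
# Hypothetical stand-ins for illustration only; not ScikitLearn.jl code.
struct ToyTable                      # plays the role of a DataFrame
    columns::Dict{Symbol,Vector{Float64}}
end

# Constrained signature: dispatch only succeeds for AbstractArray inputs.
strict_fit(X::AbstractArray, y) = :fitted

# Unconstrained signature: any input type reaches the function body.
relaxed_fit(X, y) = :fitted

table = ToyTable(Dict(:a => randn(5)))
y = rand(0:1, 5)

# Capture the error instead of letting it abort the script.
strict_error = try
    strict_fit(table, y)
    nothing
catch err
    err
end

println(strict_error isa MethodError)   # the table never reaches the body
println(relaxed_fit(table, y))          # the unconstrained method accepts it
```

The point of the sketch is that the MethodError is raised at dispatch time, before any step of the pipeline (such as a DataFrameMapper) gets a chance to convert the input.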

Owner

cstjean commented Dec 4, 2016

Pull requests are welcome.

Owner

cstjean commented Feb 8, 2017

It looks like this isn't possible with scikit-learn in Python either. See scikit-learn-contrib/sklearn-pandas#61. Some proposed solutions in scikit-learn-contrib/sklearn-pandas#62 and scikit-learn-contrib/sklearn-pandas#64.

The primary challenge is to implement get_params/set_params for DataFrameMapper. Here's the code I used to test it:

using DataFrames: DataFrame
using ScikitLearn
using ScikitLearn.GridSearch: GridSearchCV
@sk_import ensemble: RandomForestClassifier
@sk_import preprocessing: StandardScaler

X_train = DataFrame(Any[randn(100), randn(100)], [:a, :b])
Y_train = rand(0:1, 100)

mapper = DataFrameMapper([([:a, :b], StandardScaler())])
pipe = Pipelines.Pipeline([ 
    ("featurize", mapper), 
    ("forest", RandomForestClassifier(n_estimators=200))
    ])

# GridSearch
grid = Dict(:forest__n_estimators => 10:30:240)

gridsearch = GridSearchCV(pipe, grid)
fit!(gridsearch, X_train, Y_train)
println("Best hyper-parameters: $(gridsearch.best_params_)")
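
As a rough idea of what get_params/set_params support for a mapper could involve, here is a self-contained sketch in plain Julia 1.x. Everything in it is an assumption made for illustration: `ToyScaler`, `ToyMapper`, `toy_get_params`, and `toy_set_params!` are hypothetical names, not the ScikitLearn.jl API or the DataFrameMapper implementation. It only shows the `step__param` naming convention (as in `forest__n_estimators` above) applied to nested transformers.

```julia
# Hypothetical toy types; not the ScikitLearn.jl DataFrameMapper.
mutable struct ToyScaler
    with_mean::Bool
    with_std::Bool
end

mutable struct ToyMapper
    # Each entry pairs a step name with the transformer it applies.
    steps::Vector{Pair{String,ToyScaler}}
end

# Flat parameters of a leaf transformer.
toy_get_params(s::ToyScaler) =
    Dict{Symbol,Any}(:with_mean => s.with_mean, :with_std => s.with_std)

# Nested parameters, exposed under names of the form `<step>__<param>`.
function toy_get_params(m::ToyMapper)
    params = Dict{Symbol,Any}()
    for (name, step) in m.steps
        for (key, value) in toy_get_params(step)
            params[Symbol(name, "__", key)] = value
        end
    end
    return params
end

# Set fields of a leaf transformer from keyword arguments.
function toy_set_params!(s::ToyScaler; params...)
    for (key, value) in params
        setfield!(s, key, value)
    end
    return s
end

# Route each `<step>__<param>` keyword to the matching step.
function toy_set_params!(m::ToyMapper; params...)
    for (key, value) in params
        step_name, param_name = split(String(key), "__"; limit=2)
        idx = findfirst(p -> first(p) == step_name, m.steps)
        idx === nothing && throw(ArgumentError("unknown step: $step_name"))
        toy_set_params!(last(m.steps[idx]); Symbol(param_name) => value)
    end
    return m
end

mapper = ToyMapper(["scale" => ToyScaler(true, true)])
toy_set_params!(mapper; scale__with_mean=false)
println(toy_get_params(mapper))
```

A real implementation would also need to handle cloning and arbitrary transformer types, which this sketch leaves out.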
