added a data frame mapper which uses just Pipeline and FeatureUnion #62
base: master
Conversation
Hi @chanansh! I integrated a modified version of your
mapping_to_pipeline doesn't seem to check for column_name uniqueness, which leads to strange behavior if the same column is used twice. Alternatively, it seems this functionality could be added to the current implementation of DataFrameMapper if it were modified to accept step names, similarly to sklearn.pipeline.Pipeline. Unfortunately I don't see an easy way to do that. Maybe check the length of the mapping tuples and, if no name is provided, automatically generate one such as "col_trans_0". This shouldn't break backwards compatibility.
Good idea! I believe we can generate feature names from the selected columns, as is done in the PR now, and allow the user to provide a custom name as a third element of the feature tuple. For example:
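The original code snippet from this comment appears to have been lost; the following is a minimal, self-contained sketch of the naming scheme being discussed (auto-generated names from selected columns, an optional custom name as a third tuple element, and a uniqueness check). The function name and the exact name-generation rule are illustrative assumptions, not the actual sklearn-pandas implementation:

```python
def build_feature_names(features):
    """Return one name per feature tuple, raising if names collide.

    Each feature is (columns, transformer) or (columns, transformer, name).
    """
    names = []
    for feature in features:
        if len(feature) == 3:
            # Custom name supplied as the third tuple element.
            name = feature[2]
        else:
            # Auto-generate a name from the selected column(s).
            columns = feature[0]
            name = "_".join(columns) if isinstance(columns, list) else columns
        names.append(name)
    duplicates = sorted({n for n in names if names.count(n) > 1})
    if duplicates:
        raise ValueError("Duplicate feature names: %s" % duplicates)
    return names

# Example (None stands in for actual transformer instances):
print(build_feature_names([
    ("age", None),                         # auto-generated: 'age'
    (["height", "weight"], None),          # auto-generated: 'height_weight'
    (["height", "weight"], None, "body"),  # custom name: 'body'
]))
# → ['age', 'height_weight', 'body']
```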
Feature names (both custom and auto-generated) can be checked during init to ensure they are unique, raising an exception otherwise. Does that sound reasonable?
Yeah, that looks like a good approach. I can work on this, but I'm very busy these days, so it might be a few weeks until I have time to start.
Don't worry, I'm already working on this myself. I will upload the updated code this evening.
Will this be implemented in master any time soon?
Hi @havardl. We worked on this about a year ago, but honestly I don't think it adds much value in terms of features compared to the amount of flexibility lost and the fact that it would break all previously pickled mappers (it's backwards incompatible). I believe it would be better to try to implement the feature of setting parameters mentioned in the original post of this PR.
I tried to use the DataFrameMapper but had problems setting the parameters of internal models, and with consistency with scikit-learn's other wrapping mechanisms. I found that one can just FeatureUnion a bunch of pipelines, where each pipeline has a front-end column-selector transformer (ColumnSelectTransformer) followed by the requested list of transformers. The resulting pipeline also has named steps, so one can trace back what each parameter means when doing get_params(deep=True). I hope someone will find this useful.
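The approach described above can be sketched as follows. ColumnSelectTransformer here is an illustrative implementation (the PR's exact code is not shown in this thread), but Pipeline, FeatureUnion, and get_params(deep=True) are standard scikit-learn:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler


class ColumnSelectTransformer(BaseEstimator, TransformerMixin):
    """Select a subset of DataFrame columns as a 2-D array."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.columns].values


# One pipeline per column group, each led by a column selector,
# all combined with FeatureUnion. Every step is named.
union = FeatureUnion([
    ('scaled_a', Pipeline([
        ('select', ColumnSelectTransformer(['a'])),
        ('scale', StandardScaler()),
    ])),
    ('raw_b', Pipeline([
        ('select', ColumnSelectTransformer(['b'])),
    ])),
])

df = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [10.0, 20.0, 30.0]})
result = union.fit_transform(df)
print(result.shape)  # (3, 2): scaled 'a' next to raw 'b'

# Because every step is named, parameters can be traced and set, e.g.:
# union.set_params(scaled_a__scale__with_mean=False)
```

The named steps are what make get_params(deep=True) useful here: keys like 'scaled_a__scale__with_mean' point unambiguously at one transformer's parameter.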