Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

added a data frame mapper which uses just Pipeline and FeatureUnion #62

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open

added a data frame mapper which uses just Pipeline and FeatureUnion #62

wants to merge 1 commit into from

Conversation

chanansh
Copy link

@chanansh chanansh commented Aug 2, 2016

I tried to use the DataFrameMapper but had problems with setting parameters of internal models and consistency with other wrapping methods of scikit-learn. I found that one can just FeatureUnion a bunch of pipeline where each pipeline has a front-end column selector Transformer ColumnSelectTransformer followed by the requested list of transformer. The resultant pipeline has also names so one can track back what each parameter means when doing get_params(deep=True).

I hope someone will find this useful.

@dukebody
Copy link
Collaborator

dukebody commented Aug 4, 2016

Hi @chanansh !

I integrated a modified version of your FeatureUnion-based pipeline into the current version of DataFrameMapper in https://github.com/paulgb/sklearn-pandas/tree/feature_union_pipe. Let me know what you think.

@vzaretsk
Copy link
Contributor

vzaretsk commented Aug 5, 2016

mapping_to_pipeline doesn't seem to check for column_name uniqueness, which leads to strange behavior if the same column is used twice. Alternatively, it seems this functionality could be added to the current implementation of DataFrameMapper if it was modified to accept step names, similarly to sklearn.pipeline.Pipeline. Unfortunately I don't see an easy way to do that. Maybe check the length of the mapping tuples and if no name is provided, automatically generate one such as "col_trans_0". This shouldn't break backwards compatibility.
@chanansh @dukebody

@dukebody
Copy link
Collaborator

dukebody commented Aug 25, 2016

Good idea! I believe we can generate feature names from the selected columns, like it's done in the PR now, and allow the user to provide a custom name as a third argument to the feature tuple.

For example:

mapper = DataFrameMapper([
    (['height'], StandardScaler()),
    (['height'], None, 'unmodified_height'),
])

Feature names (both custom and auto-generated) can be checked during init to ensure they are unique, and raise an exception otherwise.

Does that sound reasonable?

@vzaretsk
Copy link
Contributor

Yeah, that looks like a good approach. I can work on this, but I'm very busy these days so it might be a few weeks until I have time to start.

@dukebody
Copy link
Collaborator

Don't worry, I'm already working on this myself. Will upload updated code this evening.

@havardl
Copy link

havardl commented Aug 22, 2017

Will this be implemented in the master any time soon?

@dukebody
Copy link
Collaborator

Hi @havardl . We worked on this like one year ago, but honestly I don't think it adds so much value in terms of features compared to the amount of flexibility lost and the fact that it would break all previous pickled mappers (it's backwards incompatible).

I believe it would be better to try to implement the feature of setting parameters as mentioned in the original post of this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants