sklearn handles text labels differently than ml_dask on OneHotEncoding #964

Open
brian-methodical opened this issue Feb 8, 2023 · 1 comment

Comments

@brian-methodical
Contributor

brian-methodical commented Feb 8, 2023

Describe the issue:
dask-ml's OneHotEncoder sets categories_ to a NumPy array with dtype='<U1' when explicitly given an array of text categories such as [["a", "b", "c"]], whereas sklearn sets categories_ to dtype object.
Which is the correct behavior?

  1. Either cast categories_ to dtype=object in dask-ml's OneHotEncoder when categories are set explicitly (that is, not 'auto' [1]),
  2. or fix the test so that it ignores the categories_ dtype mismatch, or explicitly set the dtype as an argument in the tests.
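
To make the two options concrete, here is a rough NumPy-only sketch. It does not touch dask-ml or sklearn internals; `as_object_categories` and `same_category_values` are hypothetical helper names made up for illustration, not functions from either library.

```python
import numpy as np


def as_object_categories(categories):
    """Option 1 (hypothetical helper): store each explicit category list as an object array."""
    return [np.asarray(cats, dtype=object) for cats in categories]


def same_category_values(left, right):
    """Option 2 (hypothetical helper): compare category values while ignoring dtype."""
    return len(left) == len(right) and all(
        np.array_equal(np.asarray(a, dtype=object), np.asarray(b, dtype=object))
        for a, b in zip(left, right)
    )


explicit = [["a", "b", "c"]]

# What NumPy infers by default for short strings: fixed-width unicode.
unicode_cats = [np.array(cats) for cats in explicit]
# What option 1 would store instead.
object_cats = as_object_categories(explicit)

# A dtype-sensitive check fails, a value-only check passes.
dtypes_match = unicode_cats[0].dtype == object_cats[0].dtype
values_match = same_category_values(unicode_cats, object_cats)
```

Option 1 changes what the encoder stores; option 2 only changes what the test asserts. Which one is appropriate depends on whether downstream code relies on the dtype of categories_.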

[1] Please note that this issue occurs only when categories are set explicitly, not when categories is set to 'auto'.

When categories is set to 'auto', categories_ is a numpy.float64 array, which is a third behavior. However, in that case we match what sklearn does.

Minimal Complete Verifiable Example:

see test_basic_array
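
The test itself is not reproduced here. As a stand-in, this NumPy-only sketch shows the kind of dtype mismatch described above, assuming the categories are the three single-character labels from the description. It only illustrates how the two dtypes arise for the same values, not how either library builds categories_.

```python
import numpy as np

labels = ["a", "b", "c"]

# NumPy infers a fixed-width unicode dtype for a list of short strings.
inferred = np.array(labels)
# Requesting dtype=object keeps the elements as generic Python objects.
as_object = np.array(labels, dtype=object)

print(inferred.dtype)   # <U1
print(as_object.dtype)  # object

# The values are the same once both sides are viewed as objects...
values_equal = np.array_equal(inferred.astype(object), as_object)
# ...but a comparison that also checks dtype sees two different arrays.
dtype_equal = inferred.dtype == as_object.dtype
```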

Anything else we need to know?:

This is illustrated in the failing test

The sklearn documentation (found here) shows dtype as object, whereas dask-ml shows dtype as <U1 (found here).

Environment:

  • Dask version: dask-ml-3.8 conda env dask 2023.1.1
  • Python version: 3.8, 3.9, 3.10
  • Operating System: ubuntu-latest
  • Install method (conda, pip, source): conda
brian-methodical added a commit to brian-methodical/dask-ml that referenced this issue Feb 9, 2023
brian-methodical added a commit to brian-methodical/dask-ml that referenced this issue Feb 10, 2023
mmccarty pushed a commit that referenced this issue Feb 10, 2023
* text matrix

* splitting the string creates the expected input to FeatureHasher #964

* FeatureHasher issue #963

* addressing categories_ type mismatch when auto by explicitly setting dtype on test data to object #964

* reverted to just ubuntu for time saving
@narnia24

hey can i work on this?
