sklearn handles text labels differently than ml_dask on OneHotEncoding #964

brian-methodical · 2023-02-08T23:01:23Z

Describe the issue:
When categories are given explicitly as an array of text, e.g. [["a", "b", "c"]], dask-ml's OneHotEncoder sets categories_ to a numpy array with dtype='<U1', whereas sklearn sets categories_ to dtype object.
Which is the correct behavior?

either cast categories_ to dtype=object in dask-ml's OneHotEncoder when categories are set explicitly (i.e. 'auto' [1] is off)
or fix the test so it ignores the categories_ dtype mismatch, or explicitly set the dtype on the test data

[1] Please note this issue occurs only when categories are set explicitly, not when categories is set to 'auto'.

When set to 'auto', categories_ is a numpy.float64 array, which is a third behavior. However, in that case we match what sklearn does.
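The two options above can be sketched with plain numpy, without either library. This only illustrates the dtype difference and the two ways of reconciling it; the variable names are hypothetical and this is not the actual dask-ml code or test.

```python
import numpy as np

explicit = ["a", "b", "c"]

# What numpy infers by default for short strings (fixed-width unicode).
as_unicode = np.array(explicit)
# What an explicit object cast produces.
as_object = np.array(explicit, dtype=object)

# The values are identical, only the dtypes differ.
values_match = np.array_equal(as_unicode, as_object)
dtypes_match = as_unicode.dtype == as_object.dtype

# Option 1: cast the categories to object so the dtypes agree.
cast = as_unicode.astype(object)

# Option 2: compare values only in the test and ignore the dtype.
# assert_array_equal checks shape and values, not dtype.
np.testing.assert_array_equal(as_unicode, as_object)

print(as_unicode.dtype, as_object.dtype, values_match, dtypes_match)
```

Option 1 changes what the estimator stores, while option 2 only relaxes what the test asserts, so the choice depends on whether matching sklearn's stored dtype is considered part of the contract.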

Minimal Complete Verifiable Example:

See test_basic_array.

Anything else we need to know?:

This is illustrated in the failing test.

The sklearn documentation (found here) shows dtype as object, whereas the dask-ml documentation (found here) shows dtype as <U1.

Environment:

Dask version: dask-ml-3.8 conda env dask 2023.1.1
Python version: 3.8, 3.9, 3.10
Operating System: ubuntu-latest
Install method (conda, pip, source): conda


* text matrix
* spliting the string creates the expected input to FeatureHasher #964
* FeatureHasher issue #963
* addressing catagories_ type mismatch when auto by explicitly setting dtype on test data to object #964
* reverted to just ubuntu for time saving

narnia24 · 2024-10-11T10:25:40Z

hey can i work on this?

brian-methodical added a commit to brian-methodical/dask-ml that referenced this issue Feb 9, 2023

spliting the string creates the expected input to FeatureHasher dask#964

c59fc98

brian-methodical added a commit to brian-methodical/dask-ml that referenced this issue Feb 10, 2023

addressing catagories_ type mismatch when auto by explicitly setting dtype on test data to object dask#964

3f4517d

brian-methodical mentioned this issue Feb 10, 2023

Fix tests #965

Merged
