
[ENH] Adding binning capabilities to decision trees #23

Open
adam2392 opened this issue Jun 14, 2022 · 6 comments

Comments


adam2392 commented Jun 14, 2022

Describe the workflow you want to enable

I am part of the @neurodata team. In gradient-boosted trees, binning features has yielded large efficiency gains with little loss in performance. This capability should not be limited to gradient-boosted trees; it should be available in all decision trees [1].

By including binning in decision trees, we would enable substantial speedups for trees operating on high-dimensional data (in both feature count and sample size). This would be an additional tradeoff that users can opt into. The intuition behind binning for decision trees is exactly the same as for gradient-boosted trees.
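For intuition, the effect can be sketched with scikit-learn's existing KBinsDiscretizer as a rough stand-in for the internal BinMapper (this is an illustration, not the proposed implementation): pre-bin the features, then fit an ordinary decision tree on the bin codes.

```python
from sklearn.datasets import make_classification
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Quantile-bin every feature into at most 255 ordinal codes, mirroring
# what the histogram gradient-boosting BinMapper does internally.
binner = KBinsDiscretizer(n_bins=255, encode="ordinal", strategy="quantile")
X_binned = binner.fit_transform(X)

# The tree now only sees (at most) 255 distinct values per feature,
# which bounds the number of candidate split thresholds per node.
tree = DecisionTreeClassifier(random_state=0).fit(X_binned, y)
```

With the proposed change this binning would happen inside the estimator rather than as a separate preprocessing step.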

Describe your proposed solution

We propose introducing binning to the decision tree classifier and regressor.

An initial PR is proposed here: #24 (review)
However, many of the files appear to have been copied wholesale, and it is not clear that this is necessary. Perhaps we can explore consolidating the _binning.py/pyx files with the current versions under ensemble/_hist_gradient_boosting/*.

Changes to the Cython codebase

TBD

Changes to the Python API

The following two parameters would be added to the DecisionTreeClassifier and Regressor:

hist_binning=False,
max_bins=255

where the default number of bins follows that of the HistGradientBoosting estimators.
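Since hist_binning and max_bins do not exist yet, here is a minimal pure-Python sketch of the proposed semantics (class name and internals are hypothetical), including the key requirement that the fitted bin edges be reused at predict time:

```python
from sklearn.datasets import make_classification
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.tree import DecisionTreeClassifier

class BinnedDecisionTreeClassifier:
    """Hypothetical sketch of the proposed parameters (not the actual
    implementation): when hist_binning=True, features are quantile-binned
    at fit time, and the same bin edges are reused at predict time so the
    two phases stay consistent."""

    def __init__(self, hist_binning=False, max_bins=255):
        self.hist_binning = hist_binning
        self.max_bins = max_bins

    def fit(self, X, y):
        if self.hist_binning:
            self._binner = KBinsDiscretizer(
                n_bins=self.max_bins, encode="ordinal", strategy="quantile"
            )
            X = self._binner.fit_transform(X)
        self._tree = DecisionTreeClassifier(random_state=0).fit(X, y)
        return self

    def predict(self, X):
        if self.hist_binning:
            # Reuse the fitted bin edges -- binning must match fit time.
            X = self._binner.transform(X)
        return self._tree.predict(X)

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
clf = BinnedDecisionTreeClassifier(hist_binning=True, max_bins=255).fit(X, y)
pred = clf.predict(X)
```

The real implementation would do the binning inside the tree's fit/predict paths rather than wrapping the estimator.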

Additional context

These changes can also trivially be applied to Oblique Trees.

References:
[1] https://papers.nips.cc/paper/6907-lightgbm-a-highly-efficient-gradient-boosting-decision-tree

@adam2392
Collaborator Author

@jovo and @jshinm here's a draft Issue for the binning problem for decision trees.

I am going through @p-teng's original proposed changes. Currently there is too much copy-pasting from the existing "binning" files to make them work for decision trees. The sklearn devs will prefer a short, focused change, so basically we need to determine whether we can import and lightly modify the existing binning code so that decision trees can be substituted in as well.

@p-teng if you have any comments from when you tried it, please let me know, e.g. whether there were any technical reasons you needed to copy the existing binning files from ensemble/_hist_gradient/* into the ensemble/* folder. Thanks!

CC: @PSSF23 in case you're interested.


PSSF23 commented Jun 17, 2022

As mentioned in my comment in #22, this branch should be more helpful if you want to compare changes Philip made: https://github.com/PSSF23/scikit-learn-stream/tree/hist

Overall, the work does not look complete to me.


adam2392 commented Jun 21, 2022

Unfortunately, Philip's changes in #22 are still very much incomplete. There was no need to copy certain code. Moreover, the binning capability needs to be consistent between the fit and predict phases. Finally, the code itself does not work or compile; it is simply a skeleton of what needs to be done.

To eventually make things work, I think this can be as simple as adding a few lines of code to each Python function to let the trees bin the data, possibly refactoring that out into a private function. I've begun doing this in #25. If @jshinm is interested in this kind of scientific software work and in getting experience developing OSS, this is a good opportunity, since the changes needed here are considerably smaller than those for Oblique Forests.

We'll see what jovo thinks. This will probably require a bit more work and testing, so I don't know if I have the time to dedicate to it.


jshinm commented Jun 23, 2022

@adam2392 Thank you for your insights, Adam! This is definitely an interesting project, and some guidance on your end will certainly keep the ball rolling 😄

@adam2392
Collaborator Author

Another note from the sklearn devs:

in order to add the binning capabilities to the trees, we would most likely need to enable it for all trees, meaning RandomForests and ExtraTrees as well. Then we would want to show that there is a relevant performance/efficiency tradeoff.

^ this seems like quite a bit of work, so I think the ideal first step is a simple sketch of what the code needs to look like, followed by discussion with the sklearn devs. Ideally someone from NDD can do this, because it's a straightforward coding project imo (assuming you know some Cython and how to structure Python files).
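A minimal benchmark skeleton for demonstrating that tradeoff might look as follows. Note that pre-binning outside the estimator (via KBinsDiscretizer, standing in for the internal BinMapper) only approximates the benefit, since the exact splitter still sorts the data; the real gains would come from a histogram-based splitter.

```python
import time
from sklearn.datasets import make_classification
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=20_000, n_features=50, random_state=0)

t0 = time.perf_counter()
raw = DecisionTreeClassifier(random_state=0).fit(X, y)
t_raw = time.perf_counter() - t0

# Pre-binning collapses each feature to <= 255 unique values, so the
# exact splitter evaluates far fewer candidate thresholds per node.
X_binned = KBinsDiscretizer(n_bins=255, encode="ordinal").fit_transform(X)
t0 = time.perf_counter()
binned = DecisionTreeClassifier(random_state=0).fit(X_binned, y)
t_binned = time.perf_counter() - t0

print(f"raw: {t_raw:.2f}s  binned: {t_binned:.2f}s")
```

A real benchmark for the sklearn devs would sweep dataset sizes and also report accuracy on held-out data.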

@adam2392
Collaborator Author

It seems that predict has different implementations depending on the DTYPE:

https://github.com/scikit-learn/scikit-learn/blob/75ae699281549bf455f0767fa65e78268c8fd704/sklearn/ensemble/_hist_gradient_boosting/_predictor.pyx#L16

The issue is that https://github.com/scikit-learn/scikit-learn/blob/b75b5aaddef89931d372ab6a22759eaf8385e6b1/sklearn/ensemble/_hist_gradient_boosting/binning.py#L276-L279 also shows that the BinMapper converts the data to uint8 to conserve RAM. Supporting that directly would require generalizing the tree Splitter to accept uint8 DTYPEs, which is not trivial.

I think for now, to enable binning at this scale, we can simply sacrifice RAM: create a binned copy of the dataset as np.float64, which obviously uses more memory than necessary.
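The RAM tradeoff can be illustrated with a toy reimplementation of quantile binning (this is not sklearn's actual BinMapper code): the uint8 codes take one eighth the memory of the float64 copy that the current Splitter would need.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((10_000, 100))

# 254 interior quantile edges per feature give bin codes in 0..254,
# which fit in a uint8 -- this mirrors what BinMapper produces.
edges = np.quantile(X, np.linspace(0, 1, 256)[1:-1], axis=0)

codes = np.empty(X.shape, dtype=np.uint8)
for j in range(X.shape[1]):
    codes[:, j] = np.searchsorted(edges[:, j], X[:, j])

# The stopgap proposed above: a float64 copy of the binned data that the
# existing Splitter can consume, at 8x the memory of the uint8 codes.
X_binned64 = codes.astype(np.float64)

print(codes.nbytes, X_binned64.nbytes)
```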

A path to getting around this is adding a fused type definition for places like this: https://github.com/scikit-learn/scikit-learn/blob/b75b5aaddef89931d372ab6a22759eaf8385e6b1/sklearn/tree/_splitter.pyx#L708

This way, the class can work for both DTYPE_t (i.e. np.float64) and e.g. uint8.

@adam2392 adam2392 mentioned this issue Mar 28, 2023
adam2392 added a commit that referenced this issue Mar 28, 2023
#### Reference Issues/PRs
Closes: #25 
Closes: #23 


#### What does this implement/fix? Explain your changes.
Adds preliminary support for binning features. The downside is that we need to
bin again at predict time.

I've documented how we can get around this in the README, but it will involve some
heavy-duty Cython coding.

- Removed Oblique tree models and migrated them to scikit-tree
- Fixed up CI so that unnecessary workflows are not run on the fork
- Updated the documentation with the current limitations of binning

---------

Signed-off-by: Adam Li <[email protected]>