
WIP: Add projection split criteria (see issue #4) #10

Merged
merged 20 commits into from
Dec 20, 2019

Conversation

morgsmss7

What does this implement/fix? Explain your changes.

Fixes Issue #4. See also #2.

Implemented the following split criteria:

  • Axis projection: one predictor is chosen at random to calculate MSE
  • Oblique projection: outputs are projected onto a sparse line before splitting with MSE

Note: these changes required bringing the random state into the RegressionCriterion class.
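In plain NumPy, the two criteria can be sketched roughly as follows. This is an illustrative sketch only: the real implementation lives in the Cython RegressionCriterion classes, and the particular sparse-projection distribution used here is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
# Multi-output targets, matching the toy data used in the tests
Y = np.array([[3., 3.], [3., 3.], [4., 4.], [7., 7.], [8., 8.]])

def mse(y):
    # plain (unweighted) mean squared error around the mean
    return float(np.mean((y - y.mean()) ** 2))

# Axis projection: choose one output column at random, score it with MSE.
axis = rng.integers(Y.shape[1])
axis_impurity = mse(Y[:, axis])

# Oblique projection: project the outputs onto a random sparse vector
# first, then score the projected values with MSE.
proj = rng.choice([-1., 0., 1.], size=Y.shape[1], p=[0.25, 0.5, 0.25])
oblique_impurity = mse(Y @ proj)

print(axis_impurity, oblique_impurity)
```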

Added basic tests for the new split criteria in sklearn/tree/tests/test_tree.py:

  • test_axis_proj: compares mse node impurity and child impurity calculations for a single predictor to the associated values for axis projection on a multi-output toy dataset

  • test_oblique_proj: compares mse node impurity and child impurity calculations for a single predictor or two predictors to the associated values for axis projection on a multi-output toy dataset

Demo Notebook

(in progress)

Any other comments? (Especially seeking help/input for the following)

  • Please note that the oblique projection test does not pass at the moment as I need to find a way to use try/except with assert statements.
  • In creating tests, I also found it difficult to work with the random state, so I am suspicious of my addition of this parameter.
  • I calculated MSE by hand for both tests, but did not end up using these calculations because they did not even match for the standard mse split criterion. I suspect it is due to rounding error, but I'm not sure.
  • I could not figure out what rf.tree_.value.flat was, so I was unable to test this effectively. I commented out tests I tried.
  • @v715 is working on finding out where the existing splitters fail. I am currently running an additional simulation that I expect to see the new split criteria excel at. If this works as expected, I will add this test to this pull request.
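On the first bullet: rather than wrapping assert_allclose in a bare try/except (which can swallow unrelated failures), one option is a small helper that passes when any of several expected arrays matches. This is a hypothetical sketch, not existing sklearn test machinery:

```python
import numpy as np
from numpy.testing import assert_allclose

def assert_allclose_any(actual, *expected_options):
    """Pass if `actual` is allclose to ANY of the expected arrays."""
    failures = []
    for expected in expected_options:
        try:
            assert_allclose(actual, expected)
            return
        except AssertionError as exc:
            failures.append(str(exc))
    raise AssertionError("no expected option matched:\n" + "\n".join(failures))

# e.g. accept either the MSE impurities or twice their values:
impurity = np.array([3.0, 4.0, 0.0])
assert_allclose_any(impurity, np.array([1.5, 2.0, 0.0]) * 2, np.array([1.5, 2.0, 0.0]))
```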

@@ -24,8 +24,11 @@ from libc.math cimport fabs

import numpy as np
cimport numpy as np
import random #added by morgan

"#added by morgan" Are these necessary?

Author

No, they aren't. Thanks for reminding me. I was trying to keep track of lines I added in case issues arose later. I'll be sure to remove them.


no problem! I know the sklearn folks will probably have you remove them anyway down the line. But a good way to keep track of what you've added is to check the PR's "files changed" tab, as well as to use "git diff".

@adam2392 adam2392 left a comment

Currently pulling down sklearn and installing dependencies to run sklearn's original tests plus your tests. Quick question: why isn't Travis CI running on your PR branches? Possibly due to the main sklearn repo build failing? Will edit down below as soon as I get it working; had something come up.

  • Please note that the oblique projection test does not pass at the moment as I need to find a way to use try/except with assert statements.
  • In creating tests, I also found it difficult to work with the random state, so I am suspicious of my addition of this parameter.
  • I calculated MSE by hand for both tests, but did not end up using these calculations because they did not even match for the standard mse split criterion. I suspect it is due to rounding error, but I'm not sure.
  • I could not figure out what rf.tree_.value.flat was, so I was unable to test this effectively. I commented out tests I tried.

sklearn/tree/_criterion.pyx Outdated Show resolved Hide resolved
sklearn/tree/_criterion.pyx Outdated Show resolved Hide resolved
@morgsmss7
Author

Hmm, not sure why that is happening. I haven't tried plain make; I've mainly been using make in, and that finishes without any issues. I always have to run this file before I can use the modified sklearn. Just be careful with the exports: I don't append to them, I replace them, because that was the only way I could get it to work.

export CC=/usr/bin/clang
export CXX=/usr/bin/clang++
export CPPFLAGS="-Xpreprocessor -fopenmp"
export CFLAGS="-I/usr/local/opt/libomp/include"
export CXXFLAGS="-I/usr/local/opt/libomp/include"
export LDFLAGS="-Wl,-rpath,/usr/local/opt/libomp/lib -L/usr/local/opt/libomp/lib -lomp"

make clean
pip install --verbose --editable .
make in

@adam2392 adam2392 left a comment

Hmm. Not sure why that is happening...

Yep, got it working. All you need, I think, is to rerun "pip install --verbose --editable ." Ref: https://scikit-learn.org/stable/developers/contributing.html#how-to-contribute

Regarding your tests: It looked like the following failed:

  1. test_boston:
# Check consistency on dataset boston house prices.
    
        for (name, Tree), criterion in product(REG_TREES.items(), REG_CRITERIONS):
            reg = Tree(criterion=criterion, random_state=0)
            reg.fit(boston.data, boston.target)
            score = mean_squared_error(boston.target, reg.predict(boston.data))
>           assert score < 1, (
                "Failed with {0}, criterion = {1} and score = {2}"
                "".format(name, criterion, score))
E           AssertionError: Failed with DecisionTreeRegressor, criterion = oblique and score = 84.41955615616554
E           assert 84.41955615616554 < 1

  2. test_oblique_proj:
    def test_oblique_proj():
        """Check oblique projection criterion produces correct results on small toy dataset:
    
        -----------------------
        | X | y1  y2  | weight |
        -----------------------
        | 3 |  3   3  |  0.1   |
        | 5 |  3   3  |  0.3   |
        | 8 |  4   4  |  1.0   |
        | 3 |  7   7  |  0.6   |
        | 5 |  8   8  |  0.3   |
        -----------------------
        |sum wt:|  2.3         |
        -----------------------
    
        Mean1 = 5
        Mean_tot = 5
    
        For all the samples, we can get the total error by summing:
        (Mean1 - y1)^2 * weight or (Mean_tot - y)^2 * weight
    
        I.e., error1      = (5 - 3)^2 * 0.1)
                          + (5 - 3)^2 * 0.3)
                          + (5 - 4)^2 * 1.0)
                          + (5 - 7)^2 * 0.6)
                          + (5 - 8)^2 * 0.3)
                          = 0.4 + 1.2 + 1.0 + 2.4 + 2.7
                          = 7.7
              error_tot   = 15.4
    
        Impurity = error / total weight
                 = 7.7 / 2.3
                 = 3.3478260869565
                 or
                 = 15.4 / 2.3
                 = 6.6956521739130
                 -----------------
    
        From this root node, the next best split is between X values of 5 and 8.
        Thus, we have left and right child nodes:
    
        LEFT                        RIGHT
        -----------------------     -----------------------
        | X | y1  y2  | weight |    | X | y1  y2  | weight |
        -----------------------     -----------------------
        | 3 |  3   3  |  0.1   |    | 8 |  4   4  |  1.0   |
        | 3 |  7   7  |  0.6   |    -----------------------
        | 5 |  3   3  |  0.3   |    |sum wt:|  1.0         |
        | 5 |  8   8  |  0.3   |    -----------------------
        -----------------------
        |sum wt:|  1.3         |
        -----------------------
    
        (5.0625 + 3.0625 + 5.0625 + 7.5625) / 4  + 0 = 5.1875
        4 + 4.667 = 8.667
    
        Impurity is found in the same way:
        Left node Mean1 = Mean2 = 5.25
            error1  = ((5.25 - 3)^2 * 0.1)
                    + ((5.25 - 7)^2 * 0.6)
                    + ((5.25 - 3)^2 * 0.3)
                    + ((5.25 - 8)^2 * 0.3)
                    = 6.13125
          error_tot = 12.2625
    
        Left Impurity = Total error / total weight
                = 6.13125 / 1.3
                = 4.716346153846154
                or
                = 12.2625 / 1.3
                = 9.43269231
                -------------------
    
        Likewise for Right node:
        Right node Mean1 = Mean2 = 4
        Total error = ((4 - 4)^2 * 1.0)
                    = 0
    
        Right Impurity = Total error / total weight
                = 0 / 1.0
                = 0.0
                ------
        """
        #y=[[3,3], [3,3], [4,4], [7,7], [8,8]]
        dt_axis = DecisionTreeRegressor(random_state=3, criterion="oblique",
                                       max_leaf_nodes=2)
        dt_mse = DecisionTreeRegressor(random_state=3, criterion="mse",
                                       max_leaf_nodes=2)
    
        # Test axis projection where sample weights are non-uniform (as illustrated above):
        dt_axis.fit(X=[[3], [5], [8], [3], [5]], y=[3, 3, 4, 7, 8],
                   sample_weight=[0.1, 0.3, 1.0, 0.6, 0.3])
        dt_mse.fit(X=[[3], [5], [8], [3], [5]], y=[3, 3, 4, 7, 8],
                   sample_weight=[0.1, 0.3, 1.0, 0.6, 0.3])
        try:
            assert_allclose(dt_axis.tree_.impurity, dt_mse.tree_.impurity*2)
        except:
>           assert_allclose(dt_axis.tree_.impurity, dt_mse.tree_.impurity)
E           AssertionError: 
E           Not equal to tolerance rtol=1e-07, atol=0
E           
E           Mismatch: 33.3%
E           Max absolute difference: 4.15384615
E           Max relative difference: 1.
E            x: array([3.330813, 0.      , 0.      ])
E            y: array([3.330813, 4.153846, 0.      ])

Sorry, I had to copy/paste the actual results, since there's no CI build to reference. I have a few comments:

  1. It seems that the tree impurities match on two of the three indices, but not the second. I don't think the initial "try statement" will help?
  2. I don't see the test_boston mentioned in your comments, so does it fail for you too?

Apologies if some of these have been addressed before in your presentations; I am being explicit on the PR, though, to keep track of this. I'm going to take a look at the source code to try to answer a few of your other questions and concerns (to the best of my ability).
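On point 1, a quick way to see which entries of the two impurity arrays actually differ, using the x and y values from the assert_allclose output above:

```python
import numpy as np

actual = np.array([3.330813, 0.0, 0.0])         # x from the assert_allclose output
expected = np.array([3.330813, 4.153846, 0.0])  # y from the assert_allclose output

# Elementwise comparison with the same tolerance assert_allclose used
mismatch = ~np.isclose(actual, expected, rtol=1e-7)
print(np.nonzero(mismatch)[0])  # only index 1 differs (the 33.3% mismatch)
```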

@morgsmss7
Author

@adam2392 I apologize for the delay. My laptop died, causing me to lose a lot of last week's work. I'm using a loaner right now, but I won't have it much longer, as it is supposed to be my work laptop and I need to wipe it to put a new OS on it.

Anyway, I think I pretty much replicated what I had.
Here are the main changes I made:

  1. Added pred_weights to the SplitRecord struct in order to share predictor weights across axis and oblique projection impurity calculations for a single node. (Previously, new weights were generated between calculating parent and child impurities, but that was not exactly what Vivek had in mind, so I implemented the shared weights)
  2. Changed both axis and oblique impurity calculations to match the formula:
    sum((y - mean)^2)
    I also changed test_axis_proj and test_oblique_proj to reflect this change.
  3. Moved projection tests to their own file (test_proj_criteria.py)
    These tests can be run using python test_proj_criteria.py. If "axis passed!" and "oblique passed!" are printed, both tests have passed without errors. Occasionally, I have been getting segfaults after these print statements, but I attribute this to something other than the tests themselves because it happens after all lines are executed.
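Using the toy data from the test docstring, the weighted form of the sum((y - mean)^2) formula can be checked by hand like this. This is a standalone sketch, not the Cython code, and it uses the unweighted mean of y (matching the docstring's Mean1 = 5), which is an assumption:

```python
import numpy as np

def weighted_mse_impurity(y, w):
    """sum(w * (y - mean)^2) / sum(w), using the unweighted mean of y."""
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    return float(np.sum(w * (y - y.mean()) ** 2) / w.sum())

y = [3, 3, 4, 7, 8]
w = [0.1, 0.3, 1.0, 0.6, 0.3]
print(weighted_mse_impurity(y, w))  # 7.7 / 2.3, the root impurity from the docstring
```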

@morgsmss7 morgsmss7 requested a review from adam2392 December 13, 2019 01:29
@morgsmss7
Author

Because running Vivek's simulations resulted in many errors, I have decided to revert to unshared weights (a previous implementation that ran without errors). I have, however, updated the tests in test_tree.py and changed how MSE is calculated in the axis and oblique split criteria to reflect the by-hand calculations. I took out extraneous comments and cleaned up the documentation. This commit should be good to go; I'll still be hunting for excess newlines, comments, etc., but other than that, this should be good.

@morgsmss7
Author

I'm not sure exactly the significance of these failed tests. They don't seem to pertain to anything I changed in this PR. Can a TA help me understand this a little better? Do I need to find a way to fix these before merging? @adam2392 @j1c @bdpedigo

All pytests pass on my machine (which lines up with most of the macOS CI jobs passing). The failed tests seem to have something to do with missing data, but I can't actually decipher which file causes the failure.

@morgsmss7
Author

Actually, it looks like these tests were already failing in the latest commit to master.

(screenshot: the same tests failing on the latest commit to master)

@j1c
Member

j1c commented Dec 15, 2019

I'm not sure exactly the significance of these failed tests. They don't seem to pertain to anything I changed in this PR. Can a TA help me understand this a little better? Do I need to find a way to fix these before merging? @adam2392 @j1c @bdpedigo

All pytests pass on my machine (understandably because they work on most of the MacOS tests). The failed tests seem to have something to do with missing data. I can't actually decipher what file causes the failure though.

No, you don't have to fix the failing builds. I just wanted to make sure the new tests ran on some builds.

@adam2392 adam2392 changed the title Add projection split criteria (see issue #4) WIP: Add projection split criteria (see issue #4) Dec 16, 2019
@morgsmss7
Author

@adam2392 Is there something specific I should do so this is no longer a WIP?

@adam2392

@adam2392 Is there something specific I should do so this is no longer a WIP?

lol no, sorry. This is just a tag to remind us to review it.

I've skimmed it and it looks mostly fine rn.

@j1c j1c merged commit b5a21d0 into master Dec 20, 2019