
WIP: Add projection split criteria (see issue #4) #10

Merged
merged 20 commits into from
Dec 20, 2019

Conversation

morgsmss7

What does this implement/fix? Explain your changes.

Fixes Issue #4. See also #2.

Implemented the following split criteria:

  • Axis projection: one predictor is chosen at random to calculate MSE
  • Oblique projection: outputs are projected onto a sparse line before splitting with MSE

Note: these changes required bringing the random state into the RegressionCriterion class.
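In plain NumPy, the two criteria can be sketched roughly as follows. This is an illustrative sketch only: the real implementation lives in the Cython RegressionCriterion classes, and the particular sparse-projection distribution used here is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
# Multi-output targets, matching the toy data used in the tests
Y = np.array([[3., 3.], [3., 3.], [4., 4.], [7., 7.], [8., 8.]])

def mse(y):
    # plain (unweighted) mean squared error around the mean
    return float(np.mean((y - y.mean()) ** 2))

# Axis projection: choose one output column at random, score it with MSE.
axis = rng.integers(Y.shape[1])
axis_impurity = mse(Y[:, axis])

# Oblique projection: project the outputs onto a random sparse vector
# first, then score the projected values with MSE.
proj = rng.choice([-1., 0., 1.], size=Y.shape[1], p=[0.25, 0.5, 0.25])
oblique_impurity = mse(Y @ proj)

print(axis_impurity, oblique_impurity)
```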

Added basic tests for the new split criteria in sklearn/tree/tests/test_tree.py:

  • test_axis_proj: compares mse node impurity and child impurity calculations for a single predictor to the associated values for axis projection on a multi-output toy dataset

  • test_oblique_proj: compares mse node impurity and child impurity calculations for a single predictor or two predictors to the associated values for axis projection on a multi-output toy dataset

Demo Notebook

(in progress)

Any other comments? (Especially seeking help/input for the following)

  • Please note that the oblique projection test does not pass at the moment as I need to find a way to use try/except with assert statements.
  • In creating tests, I also found it difficult to work with the random state, so I am suspicious of my addition of this parameter.
  • I calculated MSE by hand for both tests, but did not end up using these calculations because they did not even match for the standard mse split criterion. I suspect it is due to rounding error, but I'm not sure.
  • I could not figure out what rf.tree_.value.flat was, so I was unable to test this effectively. I commented out tests I tried.
  • @v715 is working on finding out where the existing splitters fail. I am currently running an additional simulation that I expect to see the new split criteria excel at. If this works as expected, I will add this test to this pull request.
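On the first bullet: rather than wrapping assert_allclose in a bare try/except (which can swallow unrelated failures), one option is a small helper that passes when any of several expected arrays matches. This is a hypothetical sketch, not existing sklearn test machinery:

```python
import numpy as np
from numpy.testing import assert_allclose

def assert_allclose_any(actual, *expected_options):
    """Pass if `actual` is allclose to ANY of the expected arrays."""
    failures = []
    for expected in expected_options:
        try:
            assert_allclose(actual, expected)
            return
        except AssertionError as exc:
            failures.append(str(exc))
    raise AssertionError("no expected option matched:\n" + "\n".join(failures))

# e.g. accept either the MSE impurities or twice their values:
impurity = np.array([3.0, 4.0, 0.0])
assert_allclose_any(impurity, np.array([1.5, 2.0, 0.0]) * 2, np.array([1.5, 2.0, 0.0]))
```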

@@ -24,8 +24,11 @@ from libc.math cimport fabs

import numpy as np
cimport numpy as np
import random #added by morgan

"#added by morgan" Are these necessary?

Author

No, they aren't. Thanks for reminding me. I was trying to keep track of lines I added in case issues arose later. I'll be sure to remove them.


no problem! I know the sklearn folks will probably have you remove them anyway down the line. But a good way to keep track of what you've added is to check the PR's "files changed" tab, as well as to use "git diff".

@adam2392 adam2392 left a comment

Currently pulling down sklearn and installing dependencies to run sklearn's original tests plus your tests. Quick question: why isn't Travis CI running on your PR branches? Possibly due to the main sklearn repo build failing? Will edit down below as soon as I get it working; had something come up.

  • Please note that the oblique projection test does not pass at the moment as I need to find a way to use try/except with assert statements.
  • In creating tests, I also found it difficult to work with the random state, so I am suspicious of my addition of this parameter.
  • I calculated MSE by hand for both tests, but did not end up using these calculations because they did not even match for the standard mse split criterion. I suspect it is due to rounding error, but I'm not sure.
  • I could not figure out what rf.tree_.value.flat was, so I was unable to test this effectively. I commented out tests I tried.

sklearn/tree/_criterion.pyx Outdated Show resolved Hide resolved
sklearn/tree/_criterion.pyx Outdated Show resolved Hide resolved
@morgsmss7
Author

Hmm, not sure why that is happening. I haven't tried plain make; I've mainly been using make in, and that finishes without any issues. I always have to run this file before I can use the modified sklearn. Just be careful with the exports: I don't append to them, I replace them, because that was the only way I could get it to work.

export CC=/usr/bin/clang
export CXX=/usr/bin/clang++
export CPPFLAGS="-Xpreprocessor -fopenmp"
export CFLAGS="-I/usr/local/opt/libomp/include"
export CXXFLAGS="-I/usr/local/opt/libomp/include"
export LDFLAGS="-Wl,-rpath,/usr/local/opt/libomp/lib -L/usr/local/opt/libomp/lib -lomp"

make clean
pip install --verbose --editable .
make in

@adam2392 adam2392 left a comment

Hmm. Not sure why that is happening...

Yep, got it working. All you need, I think, is to rerun "pip install --verbose --editable ." Ref: https://scikit-learn.org/stable/developers/contributing.html#how-to-contribute

Regarding your tests: It looked like the following failed:

  1. test_boston:
# Check consistency on dataset boston house prices.
    
        for (name, Tree), criterion in product(REG_TREES.items(), REG_CRITERIONS):
            reg = Tree(criterion=criterion, random_state=0)
            reg.fit(boston.data, boston.target)
            score = mean_squared_error(boston.target, reg.predict(boston.data))
>           assert score < 1, (
                "Failed with {0}, criterion = {1} and score = {2}"
                "".format(name, criterion, score))
E           AssertionError: Failed with DecisionTreeRegressor, criterion = oblique and score = 84.41955615616554
E           assert 84.41955615616554 < 1

  2. test_oblique_proj:
    def test_oblique_proj():
        """Check oblique projection criterion produces correct results on small toy dataset:
    
        -----------------------
        | X | y1  y2  | weight |
        -----------------------
        | 3 |  3   3  |  0.1   |
        | 5 |  3   3  |  0.3   |
        | 8 |  4   4  |  1.0   |
        | 3 |  7   7  |  0.6   |
        | 5 |  8   8  |  0.3   |
        -----------------------
        |sum wt:|  2.3         |
        -----------------------
    
        Mean1 = 5
        Mean_tot = 5
    
        For all the samples, we can get the total error by summing:
        (Mean1 - y1)^2 * weight or (Mean_tot - y)^2 * weight
    
        I.e., error1      = (5 - 3)^2 * 0.1)
                          + (5 - 3)^2 * 0.3)
                          + (5 - 4)^2 * 1.0)
                          + (5 - 7)^2 * 0.6)
                          + (5 - 8)^2 * 0.3)
                          = 0.4 + 1.2 + 1.0 + 2.4 + 2.7
                          = 7.7
              error_tot   = 15.4
    
        Impurity = error / total weight
                 = 7.7 / 2.3
                 = 3.3478260869565
                 or
                 = 15.4 / 2.3
                 = 6.6956521739130
                 -----------------
    
        From this root node, the next best split is between X values of 5 and 8.
        Thus, we have left and right child nodes:
    
        LEFT                        RIGHT
        -----------------------     -----------------------
        | X | y1  y2  | weight |    | X | y1  y2  | weight |
        -----------------------     -----------------------
        | 3 |  3   3  |  0.1   |    | 8 |  4   4  |  1.0   |
        | 3 |  7   7  |  0.6   |    -----------------------
        | 5 |  3   3  |  0.3   |    |sum wt:|  1.0         |
        | 5 |  8   8  |  0.3   |    -----------------------
        -----------------------
        |sum wt:|  1.3         |
        -----------------------
    
        (5.0625 + 3.0625 + 5.0625 + 7.5625) / 4  + 0 = 5.1875
        4 + 4.667 = 8.667
    
        Impurity is found in the same way:
        Left node Mean1 = Mean2 = 5.25
            error1  = ((5.25 - 3)^2 * 0.1)
                    + ((5.25 - 7)^2 * 0.6)
                    + ((5.25 - 3)^2 * 0.3)
                    + ((5.25 - 8)^2 * 0.3)
                    = 6.13125
          error_tot = 12.2625
    
        Left Impurity = Total error / total weight
                = 6.13125 / 1.3
                = 4.716346153846154
                or
                = 12.2625 / 1.3
                = 9.43269231
                -------------------
    
        Likewise for Right node:
        Right node Mean1 = Mean2 = 4
        Total error = ((4 - 4)^2 * 1.0)
                    = 0
    
        Right Impurity = Total error / total weight
                = 0 / 1.0
                = 0.0
                ------
        """
        #y=[[3,3], [3,3], [4,4], [7,7], [8,8]]
        dt_axis = DecisionTreeRegressor(random_state=3, criterion="oblique",
                                       max_leaf_nodes=2)
        dt_mse = DecisionTreeRegressor(random_state=3, criterion="mse",
                                       max_leaf_nodes=2)
    
        # Test axis projection where sample weights are non-uniform (as illustrated above):
        dt_axis.fit(X=[[3], [5], [8], [3], [5]], y=[3, 3, 4, 7, 8],
                   sample_weight=[0.1, 0.3, 1.0, 0.6, 0.3])
        dt_mse.fit(X=[[3], [5], [8], [3], [5]], y=[3, 3, 4, 7, 8],
                   sample_weight=[0.1, 0.3, 1.0, 0.6, 0.3])
        try:
            assert_allclose(dt_axis.tree_.impurity, dt_mse.tree_.impurity*2)
        except:
>           assert_allclose(dt_axis.tree_.impurity, dt_mse.tree_.impurity)
E           AssertionError: 
E           Not equal to tolerance rtol=1e-07, atol=0
E           
E           Mismatch: 33.3%
E           Max absolute difference: 4.15384615
E           Max relative difference: 1.
E            x: array([3.330813, 0.      , 0.      ])
E            y: array([3.330813, 4.153846, 0.      ])

Sorry, I had to copy/paste the actual results, since there's no CI build to reference. I have a few comments:

  1. It seems that the tree impurities match on two of the three indices, but not the second. I don't think the initial "try statement" will help?
  2. I don't see the test_boston mentioned in your comments, so does it fail for you too?

Apologies if some of these have been addressed before in your presentations; I am being explicit on the PR, though, to keep track of this. I'm going to take a look at the source code to try to answer a few of your other questions and concerns (to the best of my ability).
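On point 1, a quick way to see which entries of the two impurity arrays actually differ, using the x and y values from the assert_allclose output above:

```python
import numpy as np

actual = np.array([3.330813, 0.0, 0.0])         # x from the assert_allclose output
expected = np.array([3.330813, 4.153846, 0.0])  # y from the assert_allclose output

# Elementwise comparison with the same tolerance assert_allclose used
mismatch = ~np.isclose(actual, expected, rtol=1e-7)
print(np.nonzero(mismatch)[0])  # only index 1 differs (the 33.3% mismatch)
```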

@morgsmss7
Author

@adam2392 I apologize for the delay. My laptop died, causing me to lose a lot of last week's work. I'm using a loaner right now, but I won't have it much longer, as it is supposed to be my work laptop and I need to wipe it to put a new OS on it.

Anyway, I think I pretty much replicated what I had.
Here are the main changes I made:

  1. Added pred_weights to the SplitRecord struct in order to share predictor weights across axis and oblique projection impurity calculations for a single node. (Previously, new weights were generated between calculating parent and child impurities, but that was not exactly what Vivek had in mind, so I implemented the shared weights)
  2. Changed both axis and oblique impurity calculations to match the formula:
    sum((y - mean)^2)
    I also changed test_axis_proj and test_oblique_proj to reflect this change.
  3. Moved projection tests to their own file (test_proj_criteria.py)
    These tests can be run using python test_proj_criteria.py. If "axis passed!" and "oblique passed!" are printed, both tests have passed without errors. Occasionally, I have been getting segfaults after these print statements, but I attribute this to something other than the tests themselves because it happens after all lines are executed.
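Using the toy data from the test docstring, the weighted form of the sum((y - mean)^2) formula can be checked by hand like this. This is a standalone sketch, not the Cython code, and it uses the unweighted mean of y (matching the docstring's Mean1 = 5), which is an assumption:

```python
import numpy as np

def weighted_mse_impurity(y, w):
    """sum(w * (y - mean)^2) / sum(w), using the unweighted mean of y."""
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    return float(np.sum(w * (y - y.mean()) ** 2) / w.sum())

y = [3, 3, 4, 7, 8]
w = [0.1, 0.3, 1.0, 0.6, 0.3]
print(weighted_mse_impurity(y, w))  # 7.7 / 2.3, the root impurity from the docstring
```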

@morgsmss7 morgsmss7 requested a review from adam2392 December 13, 2019 01:29
@morgsmss7
Author

Because running Vivek's simulations resulted in many errors, I have decided to revert to unshared weights (a previous implementation that ran without errors). I have, however, updated the tests in test_tree.py and changed how MSE is calculated in the axis and oblique split criteria to reflect the by-hand calculations. I took out extraneous comments and cleaned up the documentation. This commit should be good to go; I'll still be hunting for excess newlines, comments, etc., but other than that, this should be good.

@morgsmss7
Author

I'm not sure exactly the significance of these failed tests. They don't seem to pertain to anything I changed in this PR. Can a TA help me understand this a little better? Do I need to find a way to fix these before merging? @adam2392 @j1c @bdpedigo

All pytests pass on my machine (which lines up with most of the macOS CI jobs passing). The failed tests seem to have something to do with missing data, but I can't actually decipher which file causes the failure.

@morgsmss7
Author

Actually, it looks like these tests were already failing in the latest commit to master.

(screenshot: the same tests failing on the latest commit to master)

@j1c
Member

j1c commented Dec 15, 2019

I'm not sure exactly the significance of these failed tests. They don't seem to pertain to anything I changed in this PR. Can a TA help me understand this a little better? Do I need to find a way to fix these before merging? @adam2392 @j1c @bdpedigo

All pytests pass on my machine (understandably because they work on most of the MacOS tests). The failed tests seem to have something to do with missing data. I can't actually decipher what file causes the failure though.

No, you don't have to fix the failing builds. I just wanted to make sure the new tests ran on some builds.

@adam2392 adam2392 changed the title Add projection split criteria (see issue #4) WIP: Add projection split criteria (see issue #4) Dec 16, 2019
@morgsmss7
Author

@adam2392 Is there something specific I should do so this is no longer a WIP?

@adam2392

@adam2392 Is there something specific I should do so this is no longer a WIP?

lol no, sorry. This is just a tag to remind us to review it.

I've skimmed it and it looks mostly fine rn.

@j1c j1c merged commit b5a21d0 into master Dec 20, 2019