WIP: Add projection split criteria (see issue #4) #10
Conversation
… test_forest.py pass all tests. Wrote axis projection test to compare against past results
sklearn/tree/_criterion.pyx (Outdated)

```diff
@@ -24,8 +24,11 @@ from libc.math cimport fabs
 import numpy as np
 cimport numpy as np
+import random #added by morgan
```
"#added by morgan" Are these necessary?
No, they aren't. Thanks for reminding me. I was trying to keep track of lines I added in case issues arose later. I'll be sure to remove them.
No problem! I know the sklearn folks will probably have you remove them anyway down the line. But a good way to keep track of what you've added is to check the PR's "Files changed" tab, as well as to use `git diff`.
Currently pulling down sklearn and installing dependencies to run the original sklearn tests plus your tests. Quick question: why isn't Travis CI running on your PR branches? Possibly because the main sklearn repo build is failing? I'll edit below as soon as I get it working. Had something come up.
- Please note that the oblique projection test does not pass at the moment, as I need to find a way to combine try/except with assert statements (see the first sketch after this list).
- In creating the tests, I also found it difficult to work with the random state, so I am suspicious of my addition of this parameter.
- I calculated the MSE by hand for both tests, but did not end up using these calculations because they did not match even for the standard mse split criterion. I suspect rounding error, but I'm not sure.
- I could not figure out what rf.tree_.value.flat was, so I was unable to test it effectively. I commented out the tests I tried (see the second sketch after this list).
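Two editor's sketches on the points above, not part of the PR itself. Both assume scikit-learn is built from this branch so that criterion="oblique" exists, and both reuse the toy data from the tests quoted further down. First, the try/except-around-assert problem: a single assert over an either/or condition avoids it entirely.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = [[3], [5], [8], [3], [5]]
y = [3, 3, 4, 7, 8]
w = [0.1, 0.3, 1.0, 0.6, 0.3]

# criterion="oblique" is only available with this branch installed.
dt_axis = DecisionTreeRegressor(random_state=3, criterion="oblique",
                                max_leaf_nodes=2).fit(X, y, sample_weight=w)
dt_mse = DecisionTreeRegressor(random_state=3, criterion="mse",
                               max_leaf_nodes=2).fit(X, y, sample_weight=w)

# Pass if the oblique impurities match mse either exactly or up to a
# factor of 2. One assert, no try/except needed.
ok = (np.allclose(dt_axis.tree_.impurity, dt_mse.tree_.impurity) or
      np.allclose(dt_axis.tree_.impurity, 2 * dt_mse.tree_.impurity))
assert ok, "impurity matches neither mse nor 2 * mse"
```

Second, on rf.tree_.value.flat: in scikit-learn, tree_.value is an array of shape (node_count, n_outputs, max_n_classes); for a regressor the last axis has length 1 and each entry is the mean of y over the samples reaching that node. `.flat` is simply numpy's 1-D iterator over that array. Continuing the sketch:

```python
print(dt_mse.tree_.value.shape)       # (node_count, n_outputs, 1) for regression
print(list(dt_mse.tree_.value.flat))  # per-node mean target values, flattened
```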
Hmm. Not sure why that is happening. I haven't tried
> Hmm. Not sure why that is happening...
Yep, got it working. I think all you need is to rerun `pip install --verbose --editable .`. Ref: https://scikit-learn.org/stable/developers/contributing.html#how-to-contribute
Regarding your tests: It looked like the following failed:
- test_boston:

```
    # Check consistency on dataset boston house prices.
    for (name, Tree), criterion in product(REG_TREES.items(), REG_CRITERIONS):
        reg = Tree(criterion=criterion, random_state=0)
        reg.fit(boston.data, boston.target)
        score = mean_squared_error(boston.target, reg.predict(boston.data))
>       assert score < 1, (
            "Failed with {0}, criterion = {1} and score = {2}"
            "".format(name, criterion, score))
E       AssertionError: Failed with DecisionTreeRegressor, criterion = oblique and score = 84.41955615616554
E       assert 84.41955615616554 < 1
```
- test_oblique_proj:

```
def test_oblique_proj():
    """Check oblique projection criterion produces correct results on a small toy dataset:

    -----------------------
    | X | y1 y2 | weight |
    -----------------------
    | 3 | 3  3  |   0.1  |
    | 5 | 3  3  |   0.3  |
    | 8 | 4  4  |   1.0  |
    | 3 | 7  7  |   0.6  |
    | 5 | 8  8  |   0.3  |
    -----------------------
    |sum wt:    |   2.3  |
    -----------------------

    Mean1 = 5
    Mean_tot = 5

    For all the samples, we can get the total error by summing:
    (Mean1 - y1)^2 * weight   or   (Mean_tot - y)^2 * weight

    I.e., error1 = ((5 - 3)^2 * 0.1)
                 + ((5 - 3)^2 * 0.3)
                 + ((5 - 4)^2 * 1.0)
                 + ((5 - 7)^2 * 0.6)
                 + ((5 - 8)^2 * 0.3)
                 = 0.4 + 1.2 + 1.0 + 2.4 + 2.7
                 = 7.7
    error_tot = 15.4

    Impurity = error / total weight
             = 7.7 / 2.3
             = 3.3478260869565
    or
             = 15.4 / 2.3
             = 6.6956521739130
    -----------------

    From this root node, the next best split is between X values of 5 and 8.
    Thus, we have left and right child nodes:

    LEFT                              RIGHT
    -----------------------           -----------------------
    | X | y1 y2 | weight |            | X | y1 y2 | weight |
    -----------------------           -----------------------
    | 3 | 3  3  |   0.1  |            | 8 | 4  4  |   1.0  |
    | 3 | 7  7  |   0.6  |            -----------------------
    | 5 | 3  3  |   0.3  |            |sum wt:    |   1.0  |
    | 5 | 8  8  |   0.3  |            -----------------------
    -----------------------
    |sum wt:    |   1.3  |
    -----------------------

    (5.0625 + 3.0625 + 5.0625 + 7.5625) / 4 + 0 = 5.1875
    4 + 4.667 = 8.667

    Impurity is found in the same way:
    Left node Mean1 = Mean2 = 5.25
    error1 = ((5.25 - 3)^2 * 0.1)
           + ((5.25 - 7)^2 * 0.6)
           + ((5.25 - 3)^2 * 0.3)
           + ((5.25 - 8)^2 * 0.3)
           = 6.13125
    error_tot = 12.2625

    Left Impurity = Total error / total weight
                  = 6.13125 / 1.3
                  = 4.716346153846154
    or
                  = 12.2625 / 1.3
                  = 9.43269231
    -------------------

    Likewise for Right node:
    Right node Mean1 = Mean2 = 4
    Total error = ((4 - 4)^2 * 1.0)
                = 0
    Right Impurity = Total error / total weight
                   = 0 / 1.0
                   = 0.0
    ------
    """
    #y=[[3,3], [3,3], [4,4], [7,7], [8,8]]
    dt_axis = DecisionTreeRegressor(random_state=3, criterion="oblique",
                                    max_leaf_nodes=2)
    dt_mse = DecisionTreeRegressor(random_state=3, criterion="mse",
                                   max_leaf_nodes=2)

    # Test axis projection where sample weights are non-uniform (as illustrated above):
    dt_axis.fit(X=[[3], [5], [8], [3], [5]], y=[3, 3, 4, 7, 8],
                sample_weight=[0.1, 0.3, 1.0, 0.6, 0.3])
    dt_mse.fit(X=[[3], [5], [8], [3], [5]], y=[3, 3, 4, 7, 8],
               sample_weight=[0.1, 0.3, 1.0, 0.6, 0.3])

    try:
        assert_allclose(dt_axis.tree_.impurity, dt_mse.tree_.impurity*2)
    except:
>       assert_allclose(dt_axis.tree_.impurity, dt_mse.tree_.impurity)
E       AssertionError:
E       Not equal to tolerance rtol=1e-07, atol=0
E
E       Mismatch: 33.3%
E       Max absolute difference: 4.15384615
E       Max relative difference: 1.
E        x: array([3.330813, 0.      , 0.      ])
E        y: array([3.330813, 4.153846, 0.      ])
```
Sorry, I had to copy/paste the actual results, since there is no CI build to reference. I have a few comments:
- It looks like the tree impurities match on two of the three indices but not the second one (index 1). I don't think the initial try statement will help here.
- I don't see test_boston mentioned in your comments, so does it fail for you too?

Apologies if some of these have been addressed before in your presentations; I'm being explicit on the PR, though, to keep track of this. I'm going to take a look at the source code to try to answer a few of your other questions and concerns (to the best of my ability).
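(Editor's aside, not part of the original thread.) Two of the numbers above can be sanity-checked with a few lines of numpy. The test_boston score of 84.4196 appears to be the variance of boston.target, i.e. the MSE of always predicting the global mean, which would suggest the oblique criterion produced no useful splits there. And the root impurity 3.330813 printed by the failing test matches a calculation centered on the weighted mean of y, while the docstring's hand calculation (3.3478...) centers on the unweighted mean; that difference, rather than rounding error, may be why the hand calculations did not match even for the plain mse criterion. A minimal check (load_boston was still available in scikit-learn at the time of this PR):

```python
import numpy as np
from sklearn.datasets import load_boston

# 1) MSE of predicting the global mean everywhere equals the population
#    variance of the target; compare with the 84.41955615616554 above.
boston = load_boston()
print(np.mean((boston.target - boston.target.mean()) ** 2))

# 2) Root-node impurity of the toy data: centering on the weighted mean
#    reproduces the 3.330813 reported by the failing test, while the
#    docstring's unweighted mean gives 3.3478...
y = np.array([3.0, 3.0, 4.0, 7.0, 8.0])
w = np.array([0.1, 0.3, 1.0, 0.6, 0.3])
print(np.sum(w * (y - y.mean()) ** 2) / w.sum())  # 3.3478... (docstring)
m_w = np.sum(w * y) / w.sum()
print(np.sum(w * (y - m_w) ** 2) / w.sum())       # 3.330813... (reported)
```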
@adam2392 I apologize for the delay. My laptop died, causing me to lose a lot of my work from last week. I'm using a loaner right now, but I won't have it much longer, as it is supposed to be my work laptop and I need to wipe it to put a new OS on it. Anyway, I think I have pretty much replicated what I had.
Because running Vivek's simulations resulted in many errors, I have decided to revert to unshared weights (a previous implementation that ran without errors). I have, however, updated the tests in
I'm not sure of the significance of these failed tests; they don't seem to pertain to anything I changed in this PR. Can a TA help me understand this a little better? Do I need to find a way to fix these before merging? @adam2392 @j1c @bdpedigo All pytests pass on my machine (understandably, since the macOS builds mostly pass). The failed tests seem to have something to do with missing data. I can't actually tell which file causes the failure, though.
No, you don't have to fix the failing builds. I just wanted to make sure the new tests ran on some builds.
@adam2392 Is there something specific I should do so this is no longer a WIP?
lol no, sorry, this is just to tag it so it reminds us to review it. I've skimmed it and it looks mostly fine right now.
What does this implement/fix? Explain your changes.
Fixes Issue #4. See also #2.
Implemented the following split criteria:
- axis projection
- oblique projection

Note: these required bringing the random state into the RegressionCriterion class.
Added basic tests for the new split criteria in sklearn/tree/tests/test_tree.py:
- test_axis_proj: compares mse node impurity and child impurity calculations for a single predictor to the associated values for axis projection on a multi-output toy dataset
- test_oblique_proj: compares mse node impurity and child impurity calculations for a single predictor or two predictors to the associated values for oblique projection on a multi-output toy dataset
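For reference, a minimal usage sketch; the "oblique" criterion string comes from the tests above, and this assumes the branch is built and installed:

```python
from sklearn.tree import DecisionTreeRegressor

# Only works with this branch of the fork installed.
reg = DecisionTreeRegressor(criterion="oblique", random_state=0)
reg.fit([[3], [5], [8], [3], [5]], [3, 3, 4, 7, 8])
print(reg.tree_.impurity)
```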
Demo Notebook
(in progress)
Any other comments? (Especially seeking help/input for the following)