-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement Shared Weights + Tests #33
base: master
Are you sure you want to change the base?
Conversation
@@ -65,14 +65,11 @@ cdef class Criterion: | |||
cdef double impurity_improvement(self, double impurity) nogil | |||
cdef double proxy_impurity_improvement(self) nogil | |||
|
|||
<<<<<<< HEAD | |||
cdef double node_impurity2(self, double* pred_weights) | |||
cdef void children_impurity2(self, double* impurity_left, | |||
double* impurity_right, double* pred_weights) | |||
cdef double proxy_impurity_improvement2(self, double* pred_weights) nogil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@morgsmss7 is this used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noted a few things about the tests. @jheiko1 Mostly they were good. My main comments would be to split everything into different tests, recalculate MSE for the tests with new y values, and look over some of the logic for the random state tests. Looks good otherwise :). I also added some comments for myself. Don't worry about those.
sklearn/tree/tests/test_tree.py
Outdated
----------------------- | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jheiko1 make sure you have charts for each of the tests like the ones above and include your impurity calculations in the comments as well.
sklearn/tree/tests/test_tree.py
Outdated
Likewise for Right node: | ||
Right node Mean1 = Mean2 = 4 | ||
Total error = ((4 - 4)^2 * 1.0) | ||
= 0 | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jheiko1 add a separate test with a descriptive name for each thing you are testing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also for the tree fit at line 1841, I think this is incorrect because you are going to need a try-except block. Since there is only 1 split, the algorithm is going to randomly choose either the first output (3,3,4,7,8) OR the second output (2,4,3,6,7) to split on, so the impurity can either be the same as line 1852 if the first output is chosen or it could be what you calculated here. Just make sure you write out your calculation. (I edited this after looking at it more)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove line 1845
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, for some reason it won't let me comment on some of the lines, but at line 1855, I might suggest increasing to range 30 or something just because it might be stronger.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for lines 1870 and 1875, I think you need to set a variable to np.random.randint(1,100,(5,7)) and then use the same y for both dt_axis_3 and dt_axis_4 because otherwise it will definitely be different because they are using different data. If it doesn't pass, I would increase the number or outputs from 7 to like 20.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
after line 1878, I think you also need to break the for loop because just because the assert runs doesn't mean it won't continue trying until i = 100. The way it is right now, whether the test passes is contingent on what happens at random state 100.
Likewise for Right node: | ||
Right node Mean1 = Mean2 = 4 | ||
Total error = ((4 - 4)^2 * 1.0) | ||
= 0 | ||
|
||
Right Impurity = Total error / total weight | ||
= 0 / 1.0 | ||
= 0.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jheiko1 line 1987 needs to be different. You'll need to calculate what happens when both y1 and y2 are chosen here. You also need to add another nested try-except block or two because a few things could happen: just y1 can be chosen with weight 1 or -1, just y2 can be chosen with weight 1 or -1, both can be chosen with weights 1, both can be chosen with weights -1, or both can be chosen y1 with weight 1 and y2 with weight -1 or y2 with weight 1 and y1 with weight -1. You are going to need a try except for each case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
line 2007, same as previous comment about axis_proj. use the same y for both.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also line 1992, increase to a larger number to make the argument stronger
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
line 2009: same comment as axis. I think you need a break statement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo in line 2014. should be oblique not axis (that was probably me)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo in 2025. should be oblique not MAE (that was probably also me)
@@ -1657,3 +1440,4 @@ def __init__(self, | |||
min_impurity_split=min_impurity_split, | |||
random_state=random_state, | |||
ccp_alpha=ccp_alpha) | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lots of unnecessary changes throughout and in this file. I'll fix these. @jheiko1
The shared weights portion of the PR should be ready for review now! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Started with a few areas where the algorithm speed could be improved. Since you are using cython, I highly recommend using MemoryView syntax for numpy arrays whenever are doing basic math operations
|
||
cdef DOUBLE_t w = 1.0 | ||
for p in range(start, pos): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
next 2 loops can be merged into one
|
||
cdef DOUBLE_t w = 1.0 | ||
|
||
for p in range(start, end): | ||
i = samples[p] | ||
if sample_weight != NULL: | ||
w = sample_weight[i] | ||
y_ik = self.y[i, k] | ||
sq_sum_total += w * y_ik * y_ik | ||
for k in range(self.n_outputs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would be faster to use matrix manipulation instead of loop. Something like np.sum(w * self.y * self.y * pred_weights)
|
||
impurity = sq_sum_total / self.weighted_n_node_samples | ||
impurity -= (sum_total[k] / self.weighted_n_node_samples)**2.0 | ||
for k in range(self.n_outputs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here, faster with list comprehension do something like impurity -= np.sum([sum_total[k] * pred_weights[k] / self.weighted_n_node_samples for k in range(self.n_outputs)
proxy_impurity_right += sum_right[k] * sum_right[k] | ||
|
||
|
||
for k in range(self.n_outputs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
list comprehension or broadcasting
impurity_left[0] = sq_sum_left / self.weighted_n_left | ||
impurity_right[0] = sq_sum_right / self.weighted_n_right | ||
|
||
impurity_left[0] -= (sum_left[k] / self.weighted_n_left) ** 2.0 | ||
impurity_right[0] -= (sum_right[k] / self.weighted_n_right) ** 2.0 | ||
for k in range(self.n_outputs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
list comprehension thing here again
cdef SIZE_t i | ||
cdef SIZE_t p | ||
cdef SIZE_t k | ||
cdef UINT32_t rand_r_state |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why was this code removed?
cdef double* sum_left = self.sum_left | ||
cdef double* sum_right = self.sum_right | ||
|
||
cdef SIZE_t k | ||
cdef double proxy_impurity_left = 0.0 | ||
cdef double proxy_impurity_right = 0.0 | ||
|
||
cdef UINT32_t rand_r_state |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same about this code block
@@ -1642,10 +1553,11 @@ cdef class ObliqueProjection(RegressionCriterion): | |||
|
|||
if sample_weight != NULL: | |||
w = sample_weight[i] | |||
|
|||
for k in range(self.n_outputs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
again, same list comment
@j1c @bdpedigo I've looked over the checks that aren't passing, and I'm not sure what the problem is really. All of the tests that fail in the checks pass on my machine (which I guess is understandable, but I can't really fix it if I can't recreate it) and a lot of them seem pretty unrelated to the changes we made. I know this was a reach goal for the first sprint this semester, so it doesn't really matter in terms of grades, but I'm wondering how you guys think I should proceed. |
I'm initiating this PR because for some reason, the test_tree.py file will not run on @jheiko1's machine. She will make comments for her tests and assign reviewers below.
Fixes #28
@jheiko1 and I have been working on the same branch. I have updated the criteria to save the outputs used in impurity calculations. @jheiko1 wrote tests for the new split criteria.
As of right now, the shared weights code is not ready for review, but the tests are.