The API of the current distribution class passes per-observation vectors for the target (adY), the current ensemble prediction (adF), the tree's adjustment (adFadj), the gradient (adZ), the weights (adW), and the offset (adOffset). Each iteration first calls ComputeWorkingResponse to compute the gradient, and the result is then passed on to FitBestConstant.
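For orientation, here is a simplified, illustrative shape of that interface using the array names above. The real gbm headers carry additional parameters, so treat this as a sketch rather than the actual declarations:

```cpp
// Simplified, illustrative shape of the current per-distribution interface
// (the real gbm methods take more arguments than shown here).
class CDistribution {
public:
    virtual ~CDistribution() {}

    // Fills adZ with the gradient of the distribution's loss, evaluated at
    // the current ensemble prediction adF (plus adOffset), per observation.
    virtual void ComputeWorkingResponse(const double *adY, const double *adOffset,
                                        const double *adF, double *adZ,
                                        const double *adW, unsigned long nTrain) = 0;

    // Replaces each terminal node's prediction with a distribution-specific
    // constant; today each distribution recomputes its Hessian diagonal here.
    virtual void FitBestConstant(const double *adY, const double *adOffset,
                                 const double *adF, const double *adZ,
                                 const double *adW, const long *aiNodeAssign,
                                 unsigned long nTrain, unsigned long cTermNodes,
                                 double *adNodePred) = 0;
};
```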
Although FitBestConstant is implemented separately in every distribution, the implementations are very similar each time: a numerator array accumulates the sum of gradients per terminal node, a denominator array accumulates the corresponding diagonal entries of the Hessian (which are computed inside FitBestConstant itself), and the node's predicted constant is the ratio of the two.
Proposal: If we changed the interfaces of ComputeWorkingResponse and FitBestConstant to include the Hessian as well, it might be possible to share a single implementation of FitBestConstant across distributions (see the sketch below). Moreover, this would make it easier to offer options that use the Hessian differently, or not at all.
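A minimal sketch of what such a shared routine could look like, assuming the distribution fills per-observation arrays adZ (gradient) and adH (Hessian diagonal) before the constant-fitting step; the function and parameter names beyond those mentioned in the issue are hypothetical:

```cpp
#include <vector>

// Hypothetical shared FitBestConstant: each distribution would only supply
// adZ (gradient) and adH (Hessian diagonal); the per-node Newton step itself
// is identical for every distribution.
void FitBestConstantShared(const double *adZ,          // per-observation gradient
                           const double *adH,          // per-observation Hessian diagonal
                           const double *adW,          // observation weights
                           const long   *aiNodeAssign, // terminal node per observation
                           unsigned long nTrain,
                           unsigned long cTermNodes,
                           double       *adNodePred)   // output: constant per node
{
    std::vector<double> adNum(cTermNodes, 0.0);  // sum of weighted gradients
    std::vector<double> adDen(cTermNodes, 0.0);  // sum of weighted Hessian diagonals

    for (unsigned long i = 0; i < nTrain; i++) {
        const long k = aiNodeAssign[i];
        adNum[k] += adW[i] * adZ[i];
        adDen[k] += adW[i] * adH[i];
    }
    for (unsigned long k = 0; k < cTermNodes; k++) {
        // Newton step: gradient sum divided by Hessian sum.
        adNodePred[k] = (adDen[k] > 0.0) ? adNum[k] / adDen[k] : 0.0;
    }
}
```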
While the Newton step converges to a good solution quickly, the final model can sometimes be better when fitted on gradients alone (or, as a compromise, with the gradient steps limited/capped). Small Hessian values can easily lead to overfitting. I realize such a cap is already implemented for the Bernoulli distribution, but we could make the procedure generally applicable to all distributions, or give the user an option to use only gradients (e.g. for the initial trees) ... Thoughts?
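To make that option concrete, the shared routine above could take a couple of switches, roughly along these lines. The option names and the exact form of the cap are illustrative only (this does not reproduce the Bernoulli cap currently in gbm; it simply floors the denominator, which is one way to keep small Hessians from inflating the node constant):

```cpp
#include <algorithm>
#include <vector>

// Illustrative variation on the shared Newton step:
//  - useHessian = false falls back to a plain gradient update,
//  - dMinHessian puts a floor under the denominator so nearly-flat
//    Hessians cannot blow up the node prediction (overfitting guard).
void FitBestConstantFlexible(const double *adZ, const double *adH,
                             const double *adW, const long *aiNodeAssign,
                             unsigned long nTrain, unsigned long cTermNodes,
                             double *adNodePred,
                             bool useHessian, double dMinHessian)
{
    std::vector<double> adNum(cTermNodes, 0.0);
    std::vector<double> adDen(cTermNodes, 0.0);

    for (unsigned long i = 0; i < nTrain; i++) {
        const long k = aiNodeAssign[i];
        adNum[k] += adW[i] * adZ[i];
        // Gradient-only mode treats every observation's curvature as 1,
        // so the node constant reduces to a weighted mean gradient.
        adDen[k] += adW[i] * (useHessian ? adH[i] : 1.0);
    }
    for (unsigned long k = 0; k < cTermNodes; k++) {
        const double den = std::max(adDen[k], dMinHessian);
        adNodePred[k] = (den > 0.0) ? adNum[k] / den : 0.0;
    }
}
```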