Optimal intercept initialization for simple objectives #10298
Conversation
Thank you for working on this! I will look into the changes.
Force-pushed from eaa8453 to 70ac561.
Hi @david-cortes, I have added sample mean and weighted sample mean functions. However, I removed the new inv link zero function: since we are moving away from the binary model and will be dropping the capability of saving it in this release, I don't think it's necessary to add new behavior to the codebase to work around it.
We will work on this in a different PR. At the moment, the custom objective is quite primitive and there is a lot of work we need to do to make it closer to the built-in objectives in terms of functionality.
It's gradient boosting; we can always stack more models on top instead of making one model perfect.
Please let me know if you want me to take over from here.
@trivialfis Thanks for looking into this. Since the binary format will be dropped, it would be better to do this more correctly then, by making the changes deeper:
So yes, please take over. By the way, while you're at it, I understand that the removal of the old binary format should also pave the way for vector-valued intercepts - in that case, once those get added, it would also be nice to arrange the intercepts for the multinomial logistic objective in such a way that they sum to zero, like GLMNET does.
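For reference, a sketch of the sum-to-zero arrangement being suggested (my own notation, not code from this PR): softmax is invariant to adding the same constant to every class score, so per-class intercepts can be centred without changing the predicted probabilities:

$$
b_k \;=\; \log \bar{p}_k \;-\; \frac{1}{K}\sum_{j=1}^{K} \log \bar{p}_j,
\qquad
\sum_{k=1}^{K} b_k = 0,
$$

where $\bar{p}_k$ is the observed proportion of class $k$. The centring shift is exactly the kind of sum-to-zero convention mentioned above.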
These are excellent suggestions.
We will remove the capability of saving it in this release; the next release will remove the loading part. This can take some time.
To implement this, we will have to add a new parameter, probably called
This one is more difficult: currently, all parameters are passed through strings. Using JSON might help, but floating-point serialization requires matching the encoder and decoder, otherwise there will be precision loss. I will take over the PR; please let me know if there are other things I can help with in the R package. Hoping to include it in the next release (2.2). We still have plenty of time.
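To illustrate the encoder/decoder matching mentioned above (a generic C++ sketch, not XGBoost's actual parameter-handling code): writing a float with max_digits10 significant digits is enough for an exact round trip through a string.

```cpp
#include <cassert>
#include <cstdio>
#include <iomanip>
#include <limits>
#include <sstream>
#include <string>

int main() {
  float original = 0.1f;  // not exactly representable in binary floating point
  // Encode with max_digits10 (9 for float) so the decimal string uniquely
  // identifies the underlying bit pattern.
  std::ostringstream os;
  os << std::setprecision(std::numeric_limits<float>::max_digits10) << original;
  std::string encoded = os.str();
  // Decode and check that the round trip is lossless.
  float decoded = std::stof(encoded);
  assert(decoded == original);
  std::printf("encoded as \"%s\", round trip exact\n", encoded.c_str());
  return 0;
}
```

With fewer digits (e.g. the default 6), the decoded value can differ in the last bits, which is the precision loss referred to above.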
Hmm, I ran into an issue with the weighted sample mean. If the weights do not strictly form a probability simplex summing to 1, we can generate an invalid mean value, like a mean greater than 1 for logistic regression.
I'm not sure what kind of impact it'd have, but there's always the option of doing a more precise sequential calculation, or of using compensated sums. Although, from a look at the code, could it perhaps be that some matrix is in the wrong memory layout? I see it uses a matrix view, but it doesn't validate that the matrix is column-major.
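A minimal sketch of the kind of sequential calculation meant above (illustrative only; not the snippet originally referenced, and the function name is mine):

```cpp
#include <cstddef>
#include <vector>

// Incremental weighted mean: m_k = m_{k-1} + (w_k / W_k) * (y_k - m_{k-1}),
// where W_k is the running sum of weights. Every intermediate value is a
// convex combination of the inputs seen so far, so it stays within
// [min(y), max(y)] regardless of how the weights are scaled.
double SequentialWeightedMean(std::vector<double> const &y, std::vector<double> const &w) {
  double mean = 0.0;
  double weight_sum = 0.0;
  for (std::size_t i = 0; i < y.size(); ++i) {
    if (w[i] <= 0.0) continue;  // skip zero (or invalid negative) weights
    weight_sum += w[i];
    mean += (w[i] / weight_sum) * (y[i] - mean);
  }
  return mean;
}
```

A compensated (Kahan-style) accumulation of the numerator and denominator would have a similar effect while keeping the computation a plain reduction.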
It violates the range of logistic regression.
I will add a restriction to it once the weight issue is resolved. The matrix view supports different memory layout types; we can dispatch on them.
But that is not an issue with the computation: if the data contains only examples of one class, then the optimal solution for a regularized method like XGBoost's is to make the intercept infinite and not use any feature for generating scores. If a user wishes to use XGBoost as a one-class classifier, the correct behavior would be to manually set the intercept to zero. I guess the most logical course of action in such cases would be to throw an error explaining that the response is constant.
If the weights are non-negative and the floating-point computations were done with infinite precision, then it shouldn't be possible to arrive at a value greater than the max or smaller than the min of the inputs that go into that mean. But I can see it going wrong with fp32 when the number of rows is large, whether the weights sum to 1 or not. Some other ideas:
Are you able to provide an example where the result is outside the range?
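In symbols, the range claim above (a standard fact, stated here only for reference): with non-negative weights the weighted mean is a convex combination of the inputs, so in exact arithmetic

$$
\min_i y_i \;\le\; \frac{\sum_i w_i y_i}{\sum_i w_i} \;\le\; \max_i y_i,
\qquad w_i \ge 0,\ \sum_i w_i > 0 .
$$

Any result outside $[0, 1]$ for a 0/1 response therefore has to come from finite-precision accumulation rather than from the weights not summing to 1.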
Makes sense.
Thank you for sharing; I will look into them. I spent a bit of time today proving that the sample mean is the minimizer. Will continue to work on the PR next week.
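A sketch of that argument for the logistic case (my own derivation, not the notes added in the PR): with a constant raw score $b$ and labels $y_i \in \{0, 1\}$, the weighted objective and its derivative are

$$
L(b) = \sum_i w_i \bigl[\log\bigl(1 + e^{b}\bigr) - y_i b\bigr],
\qquad
L'(b) = \sum_i w_i \bigl[\sigma(b) - y_i\bigr],
$$

so $L'(b^{*}) = 0$ gives $\sigma(b^{*}) = \sum_i w_i y_i / \sum_i w_i$: the inverse link of the optimal intercept is the weighted sample mean of the response, and convexity of $L$ makes this the global minimizer. The same first-order argument goes through for the Poisson, gamma and Tweedie deviances with the log link.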
Added notes about why the reduction is written this way. TODO for myself:
Force-pushed from d03db14 to 7b03f3b.
@david-cortes @hcho3 The PR is completed now, please help review when you have a spare moment.
LGTM 👍
ref #9899
This PR modifies the intercept initialization for simple objectives (logistic, poisson, gamma, tweedie) to use their closed-form optimal solutions (as in: the number that minimizes the objective function) instead of a non-optimal one-step Newton estimate.
For these objectives, the optimal intercept corresponds simply to the link function applied to the mean of the response variable. Since base_score already undergoes this transformation, the PR here just changes the calculation to the mean of the response variable in those cases. For multi-target versions of these objectives, it sets the intercepts to zero instead, as otherwise applying a common intercept might not make much sense for the given problem.
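As a concrete illustration with made-up numbers (the objective names are the built-in ones, the figures are hypothetical): the stored base_score is the raw response mean, and the implied raw-score intercept is the link function applied to it.

$$
\texttt{binary:logistic:}\quad \bar{y} = 0.3 \;\Rightarrow\; \texttt{base\_score} = 0.3,\quad b^{*} = \operatorname{logit}(0.3) \approx -0.847;
$$

$$
\texttt{count:poisson:}\quad \bar{y} = 4.2 \;\Rightarrow\; \texttt{base\_score} = 4.2,\quad b^{*} = \log(4.2) \approx 1.435 .
$$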
Note that there's still room for improvement:
Note 1: I wasn't sure how to calculate a weighted sample mean here (not familiar with GPU computing and the 'devices' logic). It would be helpful to have a WeightedMean function under stats, if possible, to use when there are sample weights.
Note 2: The compiler checks here don't like turning a linalg::Tensor<T, 2> into a linalg::Tensor<T, 1> by reinterpret_cast. I'm also not sure what the right way to do it without a data copy would be.
Note 3: I wasn't sure where to add tests for the changes here. For example, it would be ideal to test that binary:logistic and binary:logitraw produce the same raw scores, but I'm not sure where the right place for such a test is.