
consequences of logloss as loss measure for binary classification when optimal classification threshold != 0.5 #10

mikoontz opened this issue Dec 6, 2022 · 4 comments



mikoontz commented Dec 6, 2022

Hello CPI team,

This is such a fantastic package and I'm so happy that I found it (and the suite of papers your team has written about state-of-the-art random forests for inference).

This may be a question that points to a broader theoretical question, but the {cpi} package is how I came to it, so I'm starting here...

tl;dr
What are the implications of using log loss as the loss function when the tuned classification threshold is not at 0.5? I'm sure someone has thought/written about this, but I'm having trouble finding those papers!

Example
A model prediction of 0.6 for an observation in the positive class seems like it should carry different information about loss depending on whether the optimal classification threshold for the model is 0.5 or 0.2 (for instance). In both cases, a prediction of 0.6 correctly classifies the observation as belonging to the positive class. But if the tuned classification threshold is 0.5, a prediction of 0.6 is barely over that threshold, whereas with a tuned threshold of 0.2 the prediction clears the threshold by quite a bit. Yet the log loss would be equal in each case. Naively, I would have expected a loss function to show more loss for a prediction of 0.6 when the classification threshold is 0.5, and less loss when the classification threshold is 0.2.
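To make that concrete: for a positive-class observation with a predicted probability of 0.6, the log loss is

$$-\log(0.6) \approx 0.51$$

whether the tuned threshold is 0.5 or 0.2, because log loss only depends on the true label and the predicted probability, never on the threshold.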

Possible solution?
Do we need to rescale the model's probability predictions, prior to calculating log loss, to account for a tuned classification threshold that isn't 0.5? That is, predictions below the classification threshold get rescaled to [0, 0.5] and predictions above it get rescaled to [0.5, 1]? A sketch of what I mean is below.
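Something like this piecewise-linear rescaling is what I have in mind (just a sketch of the idea, not tested against the {cpi} internals; `rescale_prob` and `log_loss` are my own made-up helper names):

```r
# Sketch: map predicted probabilities so that the tuned threshold lands at 0.5.
# Predictions in [0, t] are mapped onto [0, 0.5], and predictions in [t, 1]
# onto [0.5, 1], before computing log loss.
rescale_prob <- function(p, threshold) {
  ifelse(p <= threshold,
         p / (2 * threshold),                            # [0, t] -> [0, 0.5]
         0.5 + (p - threshold) / (2 * (1 - threshold)))  # [t, 1] -> [0.5, 1]
}

log_loss <- function(y, p, eps = 1e-15) {
  p <- pmin(pmax(p, eps), 1 - eps)  # guard against log(0)
  -(y * log(p) + (1 - y) * log(1 - p))
}

# e.g. a prediction of 0.6 for a positive-class observation, tuned threshold 0.2:
log_loss(1, rescale_prob(0.6, 0.2))  # ~0.29, smaller than log_loss(1, 0.6) ~0.51
```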

Eventual goal
I have a highly imbalanced binary classification problem with multicollinearity among the features, mostly continuous features (just one categorical feature with five levels), and a desire to better understand which features are important and how (i.e., the shape of their relationship to the target).

What I've tried
I've played around with modifying the {cpi} package to calculate loss at an aggregated scale (i.e., per test data set) rather than at a per-observation scale, using measures more robust to class imbalance (the Matthews correlation coefficient). The actual implementation of that modification to {cpi} is here. In that case, I relied on the repeated spatial cross-validation for "significance" of CPI for each feature, since the implemented statistical tests rely on having CPI on a per-observation scale (before taking the mean to report a per-feature CPI value). But this strikes me as perhaps overly conservative, so I'm revisiting the default CPI loss functions.


dswatson commented Dec 6, 2022

Hi Michael,

Thanks for using the package! Glad to hear you're getting some mileage out of it.

My inclination on this is to say that log loss is just not the right loss function for this problem. The theoretical motivation for log loss is to maximize the data likelihood under your model parameters. But if you've set the decision threshold to something other than 0.5, you are doing something else. Possible solutions include, as you suggest, transforming your output. If this mapping were controlled by some parameter $\theta$, then the ML estimate of $\theta$ should produce probabilities > 0.5 for cases where $Y=1$, at least in the limit of infinite data.
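For reference, minimizing the mean log loss

$$\frac{1}{n}\sum_{i=1}^{n} -\left[ y_i \log \hat{p}_i + (1 - y_i) \log(1 - \hat{p}_i) \right]$$

is the same as maximizing the Bernoulli log-likelihood of the observed labels, and nothing in that objective knows about a downstream decision threshold.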

Alternatively, you could try another loss function better suited to imbalanced classification. A weighted accuracy score may work, where weights reflect the relative cost of false positives and false negatives. Area under the precision-recall curve is another option, although in this case you would need to do some more work under the hood since AUPR is not defined for individual samples. CPI can still work here, as we note in the paper – it will just need to be calculated over batches instead of over samples. I'm not sure how easy that is to do within the package parameters...?
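Just to sketch the weighted idea (placeholder names and costs, nothing from {cpi} itself), a per-observation cost could look like:

```r
# Per-observation misclassification cost: correct predictions cost 0,
# errors are weighted by the relative cost of false negatives vs. false positives.
weighted_loss <- function(truth, response, positive = "1",
                          cost_fn = 5, cost_fp = 1) {
  ifelse(truth == response, 0,
         ifelse(truth == positive, cost_fn, cost_fp))
}
```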

Best of luck on this project! If you end up implementing a useful solution, please feel free to make a pull request :)

Best,
David


mikoontz commented Dec 6, 2022

This is so helpful; thanks very much!

Sounds like my options are:

  1. transform the probability model output and still use log loss (will require some modification of {cpi})
  2. use a weighted accuracy loss function (will require some research into how to implement this, then some modification of {cpi})
  3. use a batch-level CPI score and rely on cross-validation for inferring feature significance (already implemented in my forked version of {cpi}, but could be cleaned up)

Happy to submit PRs for whatever route I take, but I definitely defer to you all for how they might ultimately get implemented. Maybe the PRs could at least serve as a first, rough draft. : )

More on 3 (links in my first comment): I have already implemented a batch-level loss function, the Matthews correlation coefficient (MCC), in my fork of {cpi}. Based on work by Chicco et al. (2021) and Poisot (2022), this seemed to me to be the most appropriate loss function for an imbalanced, binary classification problem.
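The batch-level measure is just the usual MCC computed from a fold's confusion matrix, roughly something like this (a standalone sketch, not the exact code in my fork):

```r
# Matthews correlation coefficient for one batch of predictions.
# Returned as 1 - MCC so that, like the other loss measures, lower is better.
mcc_loss <- function(truth, response, positive) {
  tp <- sum(truth == positive & response == positive)
  tn <- sum(truth != positive & response != positive)
  fp <- sum(truth != positive & response == positive)
  fn <- sum(truth == positive & response != positive)
  denom <- sqrt(tp + fp) * sqrt(tp + fn) * sqrt(tn + fp) * sqrt(tn + fn)
  mcc <- if (denom == 0) 0 else (tp * tn - fp * fn) / denom
  1 - mcc
}
```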

For posterity: when I actually use the modified function that calculates MCC as a batch-level CPI in my workflow, I ignore the statistical tests of feature importance currently implemented in {cpi} (which are based on sample-level CPI scores) and use spatial cross-validation to infer that a feature with a median CPI above 0 across spatial folds is "significant". Specifically, I set test = "fisher" and B = 1 so that little computation time is spent on statistical tests and I avoid errors that would come up by, for instance, conducting a t-test on a single value.


dswatson commented Dec 6, 2022

Sounds like a plan!

For the record, you can do statistical testing with batch-level CPI. There is still a $\Delta$ variable that tracks the difference in risk between original and knockoff data; it just has $k$ entries instead of $n$, where $k$ is the number of folds/batches. If you do enough splits – and these may be partitions, as in $k$-fold CV (optionally repeated), or overlapping subsamples, as in Monte Carlo estimation – then you'll have enough entries to conduct a $t$-test, permutation test, etc.
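In other words, once you have one $\Delta$ value per batch, the testing step is just (with made-up numbers purely for illustration):

```r
# Hypothetical batch-level Delta values, one per fold (k = 10; values invented).
delta <- c(0.012, 0.031, -0.004, 0.020, 0.015, 0.027, 0.008, -0.001, 0.019, 0.023)

t.test(delta, alternative = "greater")       # one-sided t-test of CPI > 0
wilcox.test(delta, alternative = "greater")  # distribution-free alternative
```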


mikoontz commented Dec 8, 2022

Good to know that the statistical testing approach is still legit even when the "samples" are batches!

I've been digging pretty deep into the package, but I don't see how the Δ variable (as far as I can tell, that's the dif object in the cpi.R script) can have k entries. The first thing the compute_loss() function does is collapse all of the batch-separated truth, response, and prob data into a single vector each. For instance, with a repeated_cv resampling scheme of 10 iterations and 10 folds, the result of predict_learner() is a 100-element list, with each element containing the truth, response, and prob data for all of the observations in that batch. But the separation of batches isn't preserved, so I don't see how we can end up with 100 CPI values when using a batch-level measure like MCC.

The current workaround (and what I describe in the worked example for the PR I made) is to just use the test_data and to set up the resampling splits ahead of time.

One approach might be to force all output from predict_learner() to be a list, even if it's just a one-element list, then lapply across the list items to get a loss value (or vector) for each item, and finally do.call(c, loss_list) to get a single vector of loss. That would allow the batch-scale loss functions to work as you describe, but it would change the statistics of the observation-level approach you've already implemented.
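Roughly what I have in mind (stand-in names like batch_loss and loss_fun, not the actual {cpi} internals):

```r
# Sketch: keep predictions separated by batch and compute one loss value per batch,
# so a batch-level measure like MCC yields a k-length Delta vector.
batch_loss <- function(preds, loss_fun) {
  # if a single prediction set was passed, wrap it in a one-element list
  if (!is.null(preds$truth)) preds <- list(preds)
  loss_list <- lapply(preds, function(p) loss_fun(p$truth, p$response))
  do.call(c, loss_list)  # one loss entry per batch
}
```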

Right now, the different iterations and the different folds are treated equally as samples in the statistical test. Is that desired behavior? It seems like two loss vectors from two different folds are closer to i.i.d. than two loss vectors from two different iterations (in the case of repeated cross-validation).
