
consequences of logloss as loss measure for binary classification when optimal classification threshold != 0.5 #10

mikoontz opened this issue Dec 6, 2022 · 4 comments



mikoontz commented Dec 6, 2022

Hello CPI team,

This is such a fantastic package and I'm so happy that I found it (and the suite of papers your team has written about state-of-the-art random forests for inference).

This may be a question that points to a broader theoretical question, but the {cpi} package is how I came to it, so I'm starting here...

tl;dr
What are the implications of using log loss as the loss function when the tuned classification threshold is not at 0.5? I'm sure someone has thought/written about this, but I'm having trouble finding those papers!

Example
A model prediction of 0.6 for an observation in the positive class seems like it should carry different information about loss depending on whether the optimal classification threshold for the model is 0.5 or 0.2 (for instance). In both cases, a prediction of 0.6 correctly classifies the observation as belonging to the positive class. But if the tuned classification threshold is 0.5, a prediction of 0.6 is barely over that threshold, whereas with a tuned threshold of 0.2 the prediction clears the threshold by quite a bit. Yet the log loss would be equal in each case. Naively, I would have expected a loss function to show more loss for a prediction of 0.6 when the classification threshold is 0.5, and less loss when the classification threshold is 0.2.
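To make that concrete: for a positive-class observation with a predicted probability of 0.6, the log loss is

$$-\log(0.6) \approx 0.51$$

whether the tuned threshold is 0.5 or 0.2, because log loss only depends on the true label and the predicted probability, never on the threshold.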

Possible solution?
Do we need to rescale the model's probability predictions, prior to calculating log loss, to account for a tuned classification threshold that isn't 0.5? That is, predictions below the classification threshold get rescaled to [0, 0.5] and predictions above it get rescaled to [0.5, 1]? A sketch of what I mean is below.
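Something like this piecewise-linear rescaling is what I have in mind (just a sketch of the idea, not tested against the {cpi} internals; `rescale_prob` and `log_loss` are my own made-up helper names):

```r
# Sketch: map predicted probabilities so that the tuned threshold lands at 0.5.
# Predictions in [0, t] are mapped onto [0, 0.5], and predictions in [t, 1]
# onto [0.5, 1], before computing log loss.
rescale_prob <- function(p, threshold) {
  ifelse(p <= threshold,
         p / (2 * threshold),                            # [0, t] -> [0, 0.5]
         0.5 + (p - threshold) / (2 * (1 - threshold)))  # [t, 1] -> [0.5, 1]
}

log_loss <- function(y, p, eps = 1e-15) {
  p <- pmin(pmax(p, eps), 1 - eps)  # guard against log(0)
  -(y * log(p) + (1 - y) * log(1 - p))
}

# e.g. a prediction of 0.6 for a positive-class observation, tuned threshold 0.2:
log_loss(1, rescale_prob(0.6, 0.2))  # ~0.29, smaller than log_loss(1, 0.6) ~0.51
```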

Eventual goal
I have a highly imbalanced binary classification problem with multicollinearity among the features, mostly continuous features (just one categorical feature with five levels), and a desire to better understand which features are important and how (i.e., the shape of their relationship to the target).

What I've tried
I've played around with modifying the {cpi} package to calculate loss at an aggregated scale (i.e., per test data set) rather than at a per-observation scale, using measures more robust to class imbalance (the Matthews correlation coefficient). The actual implementation of that modification to {cpi} is here. In that case, I relied on the repeated spatial cross-validation for "significance" of CPI for each feature, since the implemented statistical tests rely on having CPI on a per-observation scale (before taking the mean to report a per-feature CPI value). But this strikes me as perhaps overly conservative, so I'm revisiting the default CPI loss functions.


dswatson commented Dec 6, 2022

Hi Michael,

Thanks for using the package! Glad to hear you're getting some mileage out of it.

My inclination on this is to say that log loss is just not the right loss function for this problem. The theoretical motivation for log loss is to maximize the data likelihood under your model parameters. But if you've set the decision threshold to something other than 0.5, you are doing something else. Possible solutions include, as you suggest, transforming your output. If this mapping were controlled by some parameter $\theta$, then the ML estimate of $\theta$ should produce probabilities > 0.5 for cases where $Y=1$, at least in the limit of infinite data.
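For reference, minimizing the mean log loss

$$\frac{1}{n}\sum_{i=1}^{n} -\left[ y_i \log \hat{p}_i + (1 - y_i) \log(1 - \hat{p}_i) \right]$$

is the same as maximizing the Bernoulli log-likelihood of the observed labels, and nothing in that objective knows about a downstream decision threshold.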

Alternatively, you could try another loss function better suited to imbalanced classification. A weighted accuracy score may work, where weights reflect the relative cost of false positives and false negatives. Area under the precision-recall curve is another option, although in this case you would need to do some more work under the hood since AUPR is not defined for individual samples. CPI can still work here, as we note in the paper – it will just need to be calculated over batches instead of over samples. I'm not sure how easy that is to do within the package parameters...?
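Just to sketch the weighted idea (placeholder names and costs, nothing from {cpi} itself), a per-observation cost could look like:

```r
# Per-observation misclassification cost: correct predictions cost 0,
# errors are weighted by the relative cost of false negatives vs. false positives.
weighted_loss <- function(truth, response, positive = "1",
                          cost_fn = 5, cost_fp = 1) {
  ifelse(truth == response, 0,
         ifelse(truth == positive, cost_fn, cost_fp))
}
```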

Best of luck on this project! If you end up implementing a useful solution, please feel free to make a pull request :)

Best,
David


mikoontz commented Dec 6, 2022

This is so helpful; thanks very much!

Sounds like my options are:

  1. transform the probability model output and still use log loss (will require some modification of {cpi})
  2. use a weighted accuracy loss function (will require some research into how to implement this, then some modification of {cpi})
  3. use a batch-level CPI score and rely on cross-validation for inferring feature significance (already implemented in my forked version of {cpi}, but could be cleaned up)

Happy to submit PRs for whatever route I take, but I definitely defer to you all for how they might ultimately get implemented. Maybe the PRs could at least serve as a first, rough draft. : )

More on 3 (links in my first comment): I have already implemented a batch-level loss function, the Matthews correlation coefficient (MCC), in my fork of {cpi}. Based on work by Chicco et al. (2021) and Poisot (2022), this seemed to me to be the most appropriate loss function for an imbalanced, binary classification problem.
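The batch-level measure is just the usual MCC computed from a fold's confusion matrix, roughly something like this (a standalone sketch, not the exact code in my fork):

```r
# Matthews correlation coefficient for one batch of predictions.
# Returned as 1 - MCC so that, like the other loss measures, lower is better.
mcc_loss <- function(truth, response, positive) {
  tp <- sum(truth == positive & response == positive)
  tn <- sum(truth != positive & response != positive)
  fp <- sum(truth != positive & response == positive)
  fn <- sum(truth == positive & response != positive)
  denom <- sqrt(tp + fp) * sqrt(tp + fn) * sqrt(tn + fp) * sqrt(tn + fn)
  mcc <- if (denom == 0) 0 else (tp * tn - fp * fn) / denom
  1 - mcc
}
```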

For posterity: when I actually use the modified function that calculates MCC as a batch-level CPI in my workflow, I ignore the statistical tests of feature importance currently implemented in {cpi} (which are based on sample-level CPI scores) and use spatial cross-validation to infer that a feature with a median CPI above 0 across spatial folds is "significant". Specifically, I set test = "fisher" and B = 1 so that little computation time is spent on statistical tests and I avoid errors that would come up by, for instance, conducting a t-test on a single value.


dswatson commented Dec 6, 2022

Sounds like a plan!

For the record, you can do statistical testing with batch-level CPI. There is still a $\Delta$ variable that tracks the difference in risk between original and knockoff data; it just has $k$ entries instead of $n$, where $k$ is the number of folds/batches. If you do enough splits – and these may be partitions, as in $k$-fold CV (optionally repeated), or overlapping subsamples, as in Monte Carlo estimation – then you'll have enough entries to conduct a $t$-test, permutation test, etc.
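In other words, once you have one $\Delta$ value per batch, the testing step is just (with made-up numbers purely for illustration):

```r
# Hypothetical batch-level Delta values, one per fold (k = 10; values invented).
delta <- c(0.012, 0.031, -0.004, 0.020, 0.015, 0.027, 0.008, -0.001, 0.019, 0.023)

t.test(delta, alternative = "greater")       # one-sided t-test of CPI > 0
wilcox.test(delta, alternative = "greater")  # distribution-free alternative
```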


mikoontz commented Dec 8, 2022

Good to know that the statistical testing approach is still legit even when the "samples" are batches!

I've been digging pretty deep into the package, but I don't see how the Δ variable (as far as I can tell, that's the dif object in the cpi.R script) can have k entries. The first thing the compute_loss() function does is collapse all of the batch-separated truth, response, and prob data into a single vector each. For instance, with a repeated_cv resampling scheme of 10 iterations and 10 folds, the result of predict_learner() is a 100-element list, with each element containing the truth, response, and prob data for all of the observations in that batch. But the separation of batches isn't preserved, so I don't see how we can end up with 100 CPI values when using a batch-level measure like MCC.

The current workaround (and what I describe in the worked example for the PR I made) is to just use the test_data and to set up the resampling splits ahead of time.

One approach might be to force all output from predict_learner() to be a list, even if it's just a one-element list, then lapply across the list items to get a loss value (or vector) for each item, and finally do.call(c, loss_list) to get a single vector of loss. That would allow the batch-scale loss functions to work as you describe, but it would change the statistics of the observation-level approach you've already implemented.
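Roughly what I have in mind (stand-in names like batch_loss and loss_fun, not the actual {cpi} internals):

```r
# Sketch: keep predictions separated by batch and compute one loss value per batch,
# so a batch-level measure like MCC yields a k-length Delta vector.
batch_loss <- function(preds, loss_fun) {
  # if a single prediction set was passed, wrap it in a one-element list
  if (!is.null(preds$truth)) preds <- list(preds)
  loss_list <- lapply(preds, function(p) loss_fun(p$truth, p$response))
  do.call(c, loss_list)  # one loss entry per batch
}
```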

Right now, the different iterations and the different folds are treated equally as samples in the statistical test. Is that desired behavior? It seems like two loss vectors from two different folds are closer to i.i.d. than two loss vectors from two different iterations (in the case of repeated cross-validation).
