consequences of logloss as loss measure for binary classification when optimal classification threshold != 0.5 #10
Comments
Hi Michael,

Thanks for using the package! Glad to hear you're getting some mileage out of it.

My inclination on this is to say that log loss is just not the right loss function for this problem. The theoretical motivation for log loss is to maximize the data likelihood under your model parameters. But if you've set the decision threshold to something other than 0.5, you are doing something else.

Possible solutions include, as you suggest, transforming your output. If this mapping were controlled by some parameter

Alternatively, you could try another loss function better suited to imbalanced classification. A weighted accuracy score may work, where weights reflect the relative cost of false positives and false negatives. Area under the precision-recall curve is another option, although in this case you would need to do some more work under the hood, since AUPR is not defined for individual samples. CPI can still work here, as we note in the paper – it will just need to be calculated over batches instead of over samples. I'm not sure how easy that is to do within the package parameters...?

Best of luck on this project! If you end up implementing a useful solution, please feel free to make a pull request :)

Best,
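For concreteness, here is a minimal sketch of what a cost-weighted per-sample loss along those lines could look like (in R; the function name and the `fp_cost`/`fn_cost` weights are hypothetical illustrations, not {cpi} arguments):

```r
# Sketch of a cost-weighted misclassification loss, evaluated per sample.
# fp_cost and fn_cost express the relative cost of false positives and
# false negatives; pick them to reflect the application.
weighted_error <- function(truth, prob, threshold = 0.5,
                           fp_cost = 1, fn_cost = 5) {
  pred <- ifelse(prob >= threshold, 1, 0)
  # 0 for a correct call, fp_cost for a false positive, fn_cost for a false negative
  ifelse(pred == truth, 0,
         ifelse(pred == 1 & truth == 0, fp_cost, fn_cost))
}

# Per-sample losses for a vector of predicted probabilities, threshold at 0.2
weighted_error(truth = c(1, 0, 1, 0), prob = c(0.6, 0.3, 0.1, 0.7),
               threshold = 0.2)
#> [1] 0 1 5 1
```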
This is so helpful; thanks very much! Sounds like my decision is:
Happy to submit PRs for whatever route I take, but I definitely defer to you all for how they might ultimately get implemented. Maybe the PRs could at least serve as a first, rough draft. : )

More on 3 (links in my first comment): I have already implemented a batch-level loss function -- the Matthews correlation coefficient (MCC) -- in my fork of {cpi}. Based on work by Chicco et al. (2021) and Poisot (2022), this seemed to me to be the most appropriate loss function for an imbalanced, binary classification problem.

For posterity: when I actually use the modified function that calculates MCC as a batch-level CPI in my workflow, I ignore the statistical tests of feature importance currently implemented in {cpi} (which are based on sample-level CPI scores) and instead use spatial cross-validation to infer that a feature with a median CPI above 0 across spatial folds is "significant". Specifically, I set
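For readers who want the gist without opening the fork, here is a minimal sketch of a batch-level MCC and one way it could feed a batch-level CPI (an illustration only, not the exact code in the fork):

```r
# Batch-level Matthews correlation coefficient from hard class predictions.
mcc <- function(truth, pred) {
  tp <- sum(pred == 1 & truth == 1)
  tn <- sum(pred == 0 & truth == 0)
  fp <- sum(pred == 1 & truth == 0)
  fn <- sum(pred == 0 & truth == 1)
  denom <- sqrt(tp + fp) * sqrt(tp + fn) * sqrt(tn + fp) * sqrt(tn + fn)
  if (denom == 0) return(0)  # conventional value when any margin is empty
  (tp * tn - fp * fn) / denom
}

# One way to form a batch-level CPI for a feature: the increase in (1 - MCC)
# on the same test batch when the feature is replaced by its knockoff.
# cpi_batch <- (1 - mcc(truth, pred_knockoff)) - (1 - mcc(truth, pred_original))
```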
Sounds like a plan! For the record, you can do statistical testing with batch-level CPI. There is still a
That's good to know that the statistical testing approach is still legit even if the "samples" are from batches!

I've been digging pretty deep into the package, but I don't see how the Δ variable (as far as I can tell, that's the

The current workaround (and what I describe in the worked example for the PR I made) is to just use the

One approach might be to force all output from

Right now, the different iterations and the different folds are treated equally with respect to being samples in the statistical test -- is that desired behavior? It seems like two sets of loss vectors from two different folds are more i.i.d. compared to two sets of loss vectors from two different iterations (in the case of repeated cross-validation).
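As a rough illustration of the kind of test this points toward, assuming one batch-level CPI value per (spatial) CV fold for a given feature (the vector below is made up, and none of this is {cpi} internals):

```r
# Hypothetical batch-level CPI values, one per spatial CV fold, for one feature.
cpi_per_fold <- c(0.08, 0.12, -0.01, 0.05, 0.09)

# One-sided t-test of whether the mean CPI across folds exceeds zero.
t.test(cpi_per_fold, alternative = "greater", mu = 0)

# Distribution-free alternative when the number of folds is small.
wilcox.test(cpi_per_fold, alternative = "greater", mu = 0)
```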
Hello CPI team,
This is such a fantastic package and I'm so happy that I found it (and the suite of papers your team has written about state-of-the-art random forests for inference).
This may point to a broader theoretical question, but the {cpi} package is how I came to it, so I'm starting here...
tl;dr
What are the implications of using log loss as the loss function when the tuned classification threshold is not at 0.5? I'm sure someone has thought/written about this, but I'm having trouble finding those papers!
Example
A model prediction of 0.6 for an observation in the positive class seems like it should carry different information about loss depending on whether the optimal classification threshold for the model is 0.5 or 0.2 (for instance). In both cases, a model prediction of 0.6 would correctly classify the observation as belonging to the positive class. But if the tuned classification threshold is 0.5, a prediction of 0.6 is barely over that threshold, whereas with a tuned threshold of 0.2 the prediction is quite a bit over the threshold. Yet the log loss would be equal in each case. Naively, I would have expected a loss function to show more loss for a prediction of 0.6 when the classification threshold is 0.5, and less loss when the classification threshold is 0.2.
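To make the arithmetic explicit (a small added illustration): log loss depends only on the predicted probability and the true class, so the tuned threshold never enters the calculation.

```r
# Log loss for a positive-class observation predicted at 0.6.
# No threshold appears in the formula, so the value is the same
# whether the tuned classification threshold is 0.5 or 0.2.
truth <- 1
prob  <- 0.6
-(truth * log(prob) + (1 - truth) * log(1 - prob))
#> [1] 0.5108256
```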
Possible solution?
Do we need to rescale the model's probability predictions to take into account a tuned classification threshold that isn't at 0.5 prior to calculating log loss? That is, all predictions below the classification threshold get rescaled to [0, 0.5] and all predictions above the classification threshold get rescaled to [0.5, 1]?
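One way that rescaling could be written (a sketch of the idea in the question, not anything implemented in {cpi}): a piecewise-linear map that sends [0, threshold] to [0, 0.5] and [threshold, 1] to [0.5, 1] before log loss is computed.

```r
# Piecewise-linear rescaling of predicted probabilities around a tuned
# threshold, so that the threshold itself maps to 0.5. Sketch only.
rescale_prob <- function(prob, threshold) {
  ifelse(prob <= threshold,
         0.5 * prob / threshold,                            # [0, t] -> [0, 0.5]
         0.5 + 0.5 * (prob - threshold) / (1 - threshold))  # [t, 1] -> [0.5, 1]
}

# With a tuned threshold of 0.2, a prediction of 0.6 maps to 0.75, so its
# log loss for a positive observation drops from ~0.51 to ~0.29.
rescale_prob(0.6, threshold = 0.2)
-log(rescale_prob(0.6, threshold = 0.2))
```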
Eventual goal
I have a highly imbalanced binary classification problem with multicollinearity in the features, mostly continuous features (just 1 categorical feature with only 5 levels), and a desire to better understand which features are important and how (i.e., the shape of their relationship to the target).
What I've tried
I've played around with modifying the {cpi} package to calculate loss at an aggregated scale (i.e., per test data set) rather than a per-observation scale, using measures more robust to class imbalance (the Matthews correlation coefficient). The actual implementation of that modification to {cpi} is here. In that case, I relied on the repeated spatial cross-validation for "significance" of CPI for each feature, since the implemented statistical tests rely on having CPI on a per-observation scale (before taking the mean to report a per-feature CPI value). But this strikes me as perhaps being overly conservative, so I'm revisiting using the default CPI loss functions.