
What segmentation values to put when both GT and predictions are empty? #15

Open
naga-karthik opened this issue Oct 28, 2024 · 1 comment


@naga-karthik

Currently, when both the GT mask and the prediction mask are empty, we treat this as a special case and set, e.g., the Dice score to 1 (as here). This is reasonable to some extent: if the prediction is empty whenever the GT is empty, the model may have learned correctly.

But this is not always the case. Take the following example: there are 3 images in the test set, 2 of them have no lesions and 1 has a lesion. We compare two models, A and B. Model A is not good and predicts 3 empty masks. Model B is better than A and predicts 1 empty mask and 2 non-empty masks. Because we set Dice=1.0 when both the GT and prediction lesion masks are empty, a higher overall Dice score for model A does not imply that it segments better than model B. In other words, a bad model that always predicts empty masks can score a higher mean Dice than a model that has actually learned something.
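To make the averaging problem concrete, here is a minimal sketch (the toy masks and the `dice` helper are illustrative, not the repository's actual code) showing that model A's all-empty predictions yield a higher mean Dice than model B's:

```python
import numpy as np

def dice(gt, pred, empty_value=1.0):
    """Dice score; returns `empty_value` when both masks are empty
    (the special case discussed above)."""
    gt, pred = np.asarray(gt, bool), np.asarray(pred, bool)
    if not gt.any() and not pred.any():
        return empty_value
    return 2.0 * np.logical_and(gt, pred).sum() / (gt.sum() + pred.sum())

# Toy test set: 2 images with no lesion, 1 image with a lesion.
gts = [np.zeros(4), np.zeros(4), np.array([0, 1, 1, 0])]

# Model A always predicts empty masks; model B finds part of the
# lesion but also produces one false-positive mask.
preds_A = [np.zeros(4), np.zeros(4), np.zeros(4)]
preds_B = [np.zeros(4), np.array([0, 0, 1, 0]), np.array([0, 1, 0, 0])]

mean_A = np.mean([dice(g, p) for g, p in zip(gts, preds_A)])
mean_B = np.mean([dice(g, p) for g, p in zip(gts, preds_B)])
# mean_A ≈ 0.667 > mean_B ≈ 0.556, even though A never finds the lesion
```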

The issue is that there is no clear consensus on how to handle segmentation metrics when both GT and predictions are empty. The Anima toolbox is not helpful here because it skips subjects with empty GT masks, i.e., when running the Anima evaluation, the overall metrics are averaged only over subjects with non-empty lesion masks (which is not correct, as it biases the segmentation metrics upward by ignoring the effect of potential false positives).

Opening this issue just as a note/documentation that this is an open problem and that users should be aware of it when evaluating their models. Tagging @valosekj who is also aware of this issue.

@valosekj

Good points!

> Take the following example: there are 3 images in the test set, 2 of them have no lesions and 1 has a lesion. We compare two models, A and B. Model A is not good and predicts 3 empty masks. Model B is better than A and predicts 1 empty mask and 2 non-empty masks. Because we set Dice=1.0 when both the GT and prediction lesion masks are empty, a higher overall Dice score for model A does not imply that it segments better than model B.

Maybe Dice is not the best metric to report in this case. What about Average Precision (AP)? See Figs. 50 and 80 in the MetricsReloaded preprint.
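For reference, lesion-wise AP can be sketched as below, assuming each predicted lesion candidate carries a confidence score and has already been matched against GT lesions (the matching step and the names `is_tp`, `n_gt` are illustrative assumptions, not MetricsReloaded's implementation):

```python
def average_precision(is_tp, scores, n_gt):
    """Average Precision over ranked lesion candidates.

    is_tp[i]  -- whether candidate i matches a GT lesion
    scores[i] -- model confidence for candidate i
    n_gt      -- total number of GT lesions (undetected lesions
                 therefore lower AP, unlike per-image Dice)
    AP = mean of the precision values at each true-positive hit.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    ap = 0.0
    for i in order:
        if is_tp[i]:
            tp += 1
            ap += tp / (tp + fp)  # precision at this recall step
        else:
            fp += 1
    return ap / n_gt if n_gt else 0.0

# 2 GT lesions; 3 candidates, the middle one is a false positive.
ap = average_precision([True, False, True], [0.9, 0.8, 0.6], n_gt=2)
# ap = (1/1 + 2/3) / 2 ≈ 0.833
```

Because AP is computed over detected lesions rather than averaged per image, a model that predicts nothing gets no credit for empty images, which avoids the model-A-vs-model-B problem above.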
