Annotated corpus map: provide corrected instead of uncorrected p-value in Scores output #1077

Open
wvdvegte opened this issue Jul 22, 2024 · 1 comment

Comments

@wvdvegte

Is your feature request related to a problem? Please describe.
This relates to #997. There is no way (at least not in Orange) to get a table with the keywords per cluster; they can only be read off the graphical representation in the Annotated Corpus Map. Sometimes you'd like to present the characteristic keywords per cluster in a table, for instance together with other information about the cluster that is easy to obtain with Group By > Cluster (e.g., the number of documents in each cluster).
The help information says "FDR Threshold sets the threshold for selecting a keyword as a cluster's keyword", but only the uncorrected p-values per word are available in the Scores output.

Describe the solution you'd like
Replace the uncorrected p-values with the corrected ones. Or even better, provide an output that gives the keywords per cluster right away (e.g., a table with the columns Cluster, Keyword 1, Keyword 2, ... Keyword 5). Even with the corrected p-values, I don't see a way of getting such a table, especially if the number of clusters isn't kept constant (e.g., when varying the number of clusters to see the effects).
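
To make the requested table concrete, here is a minimal sketch in plain Python. It assumes the per-word scores are available as `(cluster, word, fdr)` tuples; that layout, the function name `keywords_per_cluster`, and the default threshold are all hypothetical and are not the widget's actual output format. Every cluster that occurs in the input gets a row, so the table adapts when the number of clusters changes.

```python
from collections import defaultdict


def keywords_per_cluster(rows, fdr_threshold=0.05, top_n=5):
    """Build one table row per cluster: [cluster, keyword 1, ..., keyword top_n].

    rows: iterable of (cluster, word, fdr) tuples. This layout is an
    assumption made for illustration, not Orange's Scores output.
    """
    by_cluster = defaultdict(list)
    for cluster, word, fdr in rows:
        bucket = by_cluster[cluster]  # registers the cluster even if no word passes
        if fdr <= fdr_threshold:
            bucket.append((fdr, word))

    table = []
    for cluster in sorted(by_cluster):
        # Lowest FDR first; ties are broken alphabetically by word.
        words = [word for _, word in sorted(by_cluster[cluster])[:top_n]]
        # Pad with empty strings so every row has the same number of columns.
        table.append([cluster] + words + [""] * (top_n - len(words)))
    return table
```

Calling it with a handful of rows and `top_n=3` gives one fixed-width row per cluster, with empty cells where fewer than three words pass the threshold.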

Describe alternatives you've considered
None are known to me.

@wvdvegte changed the title from "Annotated corpus map: provide correctedinstead of uncorrected p-value in Scores output" to "Annotated corpus map: provide corrected instead of uncorrected p-value in Scores output" on Jul 22, 2024
@wvdvegte

I now realize that the p-values in the Scores output actually are the corrected ones, a.k.a. FDRs. But it remains confusing: why refer to FDR in the menu where the threshold is set, but simply call them p-value in the Scores output? I suggest using the same term (either 'FDR' or 'corrected p-value') wherever the same number is meant.
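
For readers unsure why 'FDR' and 'corrected p-value' can name the same number, the following is a rough sketch of one common correction, the Benjamini-Hochberg procedure. Which correction the widget actually applies is not stated in this thread, so treating it as Benjamini-Hochberg is an assumption, and the function name is made up for illustration.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the input order.

    Assumption: this is shown only as a typical FDR correction; it is not
    confirmed to be what the Annotated Corpus Map computes.
    """
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, scaling by m / rank and taking a
    # running minimum so the adjusted values stay monotone in the raw ones.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Comparing an adjusted value against the FDR threshold is equivalent to the usual step-up selection rule, which is why a single label for both the threshold and the output column would be less confusing.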
