Calculate some similarity measures #43
We can also consider just picking one simple measure (like Jaccard Index) and programming this functionality into the wiki. Then anything more complicated will be outside the scope of the wiki. Good to discuss with Ike. |
When "counting" shared and non-shared taxa, what is the "rule" to use when
some taxa are not at the same taxonomic level?
On Thu, Nov 19, 2020 at 10:43 AM Levi Waldron wrote:
We can also consider just picking one simple measure (like Jaccard Index)
and programming this functionality into the wiki. Then anything more
complicated will be outside the scope of the wiki. Good to discuss with Ike.
|
Yes. This is tricky to store with Semantic MediaWiki but we'll dream something up.
That would be way better! And way cooler. |
That is a good question. The little bit of bug set enrichment analysis I've seen just ignores taxonomy, which is obviously not correct but could still be useful in this context if we don't try to attach a p-value, or if we limit it to a single taxonomic rank. I can't think of anything better that would be straightforward. The alternatives that come to mind are unweighted UniFrac distance (https://en.wikipedia.org/wiki/UniFrac), which measures phylogenetic distance between two microbial communities, and Ancestral State Reconstruction to compare mixed taxonomic levels. It leaves me thinking that for the basic purpose of showing similar signatures, which are mostly either genus-level or species-level, Jaccard might be good enough. We'll end up with species-level signatures (WMS) always being dissimilar to genus-level signatures (16S), but I'm not sure right now what we could do about that. |
Sounds to me like an "argument to the function". So far, we have only considered exact matches, in the sense that two taxa have the same NCBI ID. Of course, your similarity measure calculation could allow for, e.g., going up or down 1, 2, ... levels of the taxonomy to declare overlap. |
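To make the "argument to the function" idea concrete, here is a minimal Python sketch. It is only one possible reading of "going up levels to declare overlap": each signature is expanded with its ancestors before the Jaccard index is taken. The `PARENT` map, the taxon IDs in it, and the `levels_up` parameter name are all hypothetical placeholders, not real NCBI IDs or anything from an existing package.

```python
from typing import Dict, Set

# Hypothetical child -> parent map (toy IDs, NOT real NCBI taxonomy IDs).
PARENT: Dict[str, str] = {
    "sp_A1": "genus_A",
    "sp_A2": "genus_A",
    "sp_B1": "genus_B",
    "genus_A": "family_X",
    "genus_B": "family_X",
}


def expand(taxa: Set[str], levels_up: int, parent: Dict[str, str]) -> Set[str]:
    """Return the taxa plus their ancestors up to `levels_up` steps away."""
    expanded = set(taxa)
    frontier = set(taxa)
    for _ in range(levels_up):
        # Move every taxon in the frontier one step up, if it has a parent.
        frontier = {parent[t] for t in frontier if t in parent}
        expanded |= frontier
    return expanded


def jaccard(a: Set[str], b: Set[str], levels_up: int = 0,
            parent: Dict[str, str] = PARENT) -> float:
    """Jaccard index; levels_up=0 means exact ID matches only."""
    a_exp = expand(a, levels_up, parent)
    b_exp = expand(b, levels_up, parent)
    union = a_exp | b_exp
    if not union:
        return 0.0  # convention chosen here for two empty signatures
    return len(a_exp & b_exp) / len(union)


species_sig = {"sp_A1", "sp_B1"}   # e.g. a species-level (WMS) signature
genus_sig = {"genus_A"}            # e.g. a genus-level (16S) signature
print(jaccard(species_sig, genus_sig))               # exact matching
print(jaccard(species_sig, genus_sig, levels_up=1))  # allow one level up
```

With exact matching the two toy signatures share nothing; allowing one level up lets the species-level signature meet the genus-level one at `genus_A`. Whether the ancestors should count toward the union like this is a design choice that would need discussion.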
Let's close this for now just to make space for priority issues. First priority now will be transferring curation over to bugsigdb.org. |
We could open this issue again. The essentials have been taken care of, and this would be a great enhancement. I've tested simple Jaccard Index and it seems to produce pretty intuitive groupings. It's rather heavy to calculate it for all pairwise combinations of signatures so they would have to be pre-computed, and then computed only for signatures that are added or changed. I liked it better than other simple alternatives like intersection length over minimum length, or simple intersection. But it should be designed to allow for supporting other similarity measures in the future (for example, genus-level only Jaccard index). |
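A rough Python sketch of what pre-computing and then updating only changed signatures could look like, with the measure passed in so that others can be added later. The signature IDs, taxon IDs, and function names below are made up for illustration; this is not the code I used for testing.

```python
from itertools import combinations
from typing import Callable, Dict, Set, Tuple

Measure = Callable[[Set[str], Set[str]], float]


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Intersection size over union size."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def overlap_coefficient(a: Set[str], b: Set[str]) -> float:
    """Intersection length over minimum length."""
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0


def pair_key(x: str, y: str) -> Tuple[str, str]:
    """Order the two IDs so each unordered pair is stored once."""
    return (x, y) if x < y else (y, x)


def precompute(signatures: Dict[str, Set[str]],
               measure: Measure) -> Dict[Tuple[str, str], float]:
    """Compute the measure for all pairwise combinations (the heavy step)."""
    return {
        pair_key(x, y): measure(signatures[x], signatures[y])
        for x, y in combinations(sorted(signatures), 2)
    }


def update(cache: Dict[Tuple[str, str], float],
           signatures: Dict[str, Set[str]],
           changed_id: str, measure: Measure) -> None:
    """Recompute only the pairs involving an added or changed signature."""
    for other in signatures:
        if other != changed_id:
            cache[pair_key(changed_id, other)] = measure(
                signatures[changed_id], signatures[other])


# Hypothetical signatures keyed by made-up IDs, holding made-up taxon IDs.
sigs = {"s1": {"1", "2", "3"}, "s2": {"2", "3", "4"}, "s3": {"9"}}
cache = precompute(sigs, jaccard)

sigs["s4"] = {"1", "2", "3"}          # a newly added signature
update(cache, sigs, "s4", jaccard)    # 3 new pairs, not all 6 recomputed
```

Swapping `jaccard` for `overlap_coefficient`, or for a genus-level-only variant, would not change `precompute` or `update`.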
@tosfos @lwaldron : how do we proceed with this? Are we proceeding along these lines:
I am noting in this context that I recently played around with the concept of semantic similarity. Here we treat the NCBI taxonomy as an ontology and calculate pairwise similarity between signatures or groups of signatures, e.g. as implemented in the ontologySimilarity package. |
Since this is an enhancement, it should be lower priority than getting the home page and all basic functionality in place. Eventually after the other key functionality is implemented, we may want to discuss whether semantic similarity is practical to implement, because it is elegant and probably will provide more relevant similarities than Jaccard for mixed-taxonomy signatures. |
We would like to be able to store relevant similarity measures between signatures, but are not sure what measures we will want in the future. These will be updated regularly in the future as new signatures are added, through the wiki API if this is possible. For now we should have:
@lgeistlinger would you create a file of similarity indices based on the bugsigdb.org dump? I think we want to leave open the possibility of adding new similarity measures in the future. @tosfos would this be a good format?
signature1 signature2 jaccard number
These will be used to link to "similar" other signatures from the signature pages.
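For discussion, a small Python sketch of writing and reading such a file, reading the proposed layout as three tab-separated columns (`signature1`, `signature2`, `jaccard`) where the last column holds a number. The tab delimiter, the four-decimal rounding, the signature IDs, and the function names are all assumptions on my part.

```python
import csv
import io
from typing import Iterable, List, TextIO, Tuple


def write_similarity_tsv(rows: Iterable[Tuple[str, str, float]],
                         handle: TextIO) -> None:
    """Write one line per signature pair with its Jaccard index."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["signature1", "signature2", "jaccard"])
    for sig1, sig2, value in rows:
        writer.writerow([sig1, sig2, f"{value:.4f}"])


def read_similar(handle: TextIO, signature_id: str,
                 min_similarity: float = 0.0) -> List[Tuple[str, float]]:
    """Return (other signature, score) pairs for one signature, best first."""
    hits = []
    for row in csv.DictReader(handle, delimiter="\t"):
        pair = (row["signature1"], row["signature2"])
        score = float(row["jaccard"])
        if signature_id in pair and score >= min_similarity:
            other = pair[1] if pair[0] == signature_id else pair[0]
            hits.append((other, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Made-up signature IDs and scores, only to show the round trip.
rows = [("sigA", "sigB", 0.5), ("sigA", "sigC", 0.1), ("sigB", "sigC", 0.25)]
buffer = io.StringIO()
write_similarity_tsv(rows, buffer)
buffer.seek(0)
similar_to_a = read_similar(buffer, "sigA", min_similarity=0.3)
```

Adding another measure later would then just mean adding another column next to `jaccard`.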