Calculate some similarity measures #43

Open
lwaldron opened this issue Nov 19, 2020 · 9 comments
Labels
enhancement New feature or request

Comments

@lwaldron
Member

We would like to be able to store relevant similarity measures between signatures, but are not sure what measures we will want in the future. These will be updated regularly in the future as new signatures are added, through the wiki API if this is possible. For now we should have:

  1. Jaccard Index
  2. Number of overlapping taxa

@lgeistlinger would you create a file of similarity indices based on the bugsigdb.org dump? I think we want to leave open the possibility of adding new similarity measures in the future. @tosfos would this be a good format?

signature1 signature2 jaccard number

These will be used to link to "similar" other signatures from the signature pages.
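
As a rough illustration of the proposed format, here is a minimal Python sketch that computes both measures for every pair of signatures and prints one row per pair. The signature IDs and taxon identifiers are invented placeholders, and the function names are hypothetical; a real run would read the taxa from the bugsigdb.org dump.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index: intersection size over union size (0.0 if both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_rows(signatures):
    """Yield (signature1, signature2, jaccard, number) for each unordered pair."""
    for id1, id2 in combinations(sorted(signatures), 2):
        taxa1, taxa2 = signatures[id1], signatures[id2]
        yield id1, id2, jaccard(taxa1, taxa2), len(taxa1 & taxa2)


# Hypothetical signatures: IDs and taxon identifiers are made up.
signatures = {
    "sigA": {"t1", "t2", "t3"},
    "sigB": {"t2", "t3", "t4"},
    "sigC": {"t5"},
}

rows = list(similarity_rows(signatures))
print("signature1\tsignature2\tjaccard\tnumber")
for row in rows:
    print("\t".join(str(x) for x in row))
```

Keeping one column per measure means a new measure could later be added as an extra column without changing the existing ones.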

@lwaldron lwaldron added the enhancement New feature or request label Nov 19, 2020
@lwaldron
Member Author

We can also consider just picking one simple measure (like Jaccard Index) and programming this functionality into the wiki. Then anything more complicated will be outside the scope of the wiki. Good to discuss with Ike.

@seandavi
Collaborator

seandavi commented Nov 19, 2020 via email

@tosfos
Collaborator

tosfos commented Nov 20, 2020

> @tosfos would this be a good format?
> signature1 signature2 jaccard number

Yes. This is tricky to store with Semantic MediaWiki but we'll dream something up.

> We can also consider just picking one simple measure (like Jaccard Index) and programming this functionality into the wiki. Then anything more complicated will be outside the scope of the wiki. Good to discuss with Ike.

That would be way better! And way cooler.

@lwaldron
Member Author

> When "counting" shared and non-shared taxa, what is the "rule" to use when some taxa are not at the same taxonomic level? (@seandavi)

That is a good question. The little bit of bug set enrichment analysis I've seen either just ignores taxonomy or limits itself to a single taxonomic rank. Ignoring taxonomy is obviously not correct, but it could be useful in this context anyway if we don't try to attach a p-value. I can't think of anything better that would be straightforward; I'm thinking of things like unweighted UniFrac distance (https://en.wikipedia.org/wiki/UniFrac), which measures phylogenetic distance between two microbial communities, and Ancestral State Reconstruction to compare mixed taxonomic levels. It leaves me thinking that, for the basic purpose of showing similar signatures, which are mostly either genus- or species-level, Jaccard might be good enough. We'll end up with species-level signatures (WMS) always being dissimilar to genus-level signatures (16S), but I'm not sure right now what we could do about that.

@lgeistlinger
Collaborator

> When "counting" shared and non-shared taxa, what is the "rule" to use when some taxa are not at the same taxonomic level?

Sounds to me like an "argument to the function". So far, we have only considered exact matches, in the sense that two taxa have the same NCBI ID. Of course, the similarity measure calculation could allow for, e.g., going up or down 1, 2, ... levels of the taxonomy to declare overlap.
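
To make the "argument to the function" idea concrete, here is a small Python sketch in which an optional `rank` argument maps every taxon up to that rank before the Jaccard index is computed. The taxonomy below is a toy example with invented IDs (not real NCBI IDs), and `roll_up` and `jaccard_at_rank` are hypothetical names used only for illustration.

```python
# Toy taxonomy: child -> parent. All IDs are invented placeholders,
# not real NCBI taxonomy IDs.
PARENT = {
    "sp1": "gen1",
    "sp2": "gen1",
    "sp3": "gen2",
    "gen1": "fam1",
    "gen2": "fam1",
    "fam1": None,
}
RANK = {
    "sp1": "species", "sp2": "species", "sp3": "species",
    "gen1": "genus", "gen2": "genus",
    "fam1": "family",
}


def roll_up(taxon, rank):
    """Walk up the lineage until a node of the requested rank is found.

    Returns None if the taxon already sits above that rank.
    """
    node = taxon
    while node is not None:
        if RANK[node] == rank:
            return node
        node = PARENT[node]
    return None


def jaccard_at_rank(a, b, rank=None):
    """Jaccard index on exact IDs, or on taxa rolled up to `rank` if given."""
    if rank is not None:
        a = {roll_up(t, rank) for t in a} - {None}
        b = {roll_up(t, rank) for t in b} - {None}
    union = a | b
    return len(a & b) / len(union) if union else 0.0


species_level = {"sp1", "sp3"}  # e.g. a species-level signature
genus_level = {"gen1"}          # e.g. a genus-level signature

exact = jaccard_at_rank(species_level, genus_level)
at_genus = jaccard_at_rank(species_level, genus_level, rank="genus")
print(exact, at_genus)
```

With exact matching the two toy signatures share nothing; rolled up to genus they share one of two genera. Taxa above the requested rank are simply dropped in this sketch, which is one possible rule among several.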

@lwaldron
Member Author

Let's close this for now just to make space for priority issues. First priority now will be transferring curation over to bugsigdb.org.

@lwaldron
Member Author

lwaldron commented Jul 7, 2021

We could open this issue again. The essentials have been taken care of, and this would be a great enhancement. I've tested the simple Jaccard Index and it seems to produce pretty intuitive groupings. It's rather heavy to calculate for all pairwise combinations of signatures, so the values would have to be pre-computed and then recomputed only for signatures that are added or changed. I liked it better than other simple alternatives like intersection length over minimum length, or simple intersection. But it should be designed to allow for supporting other similarity measures in the future (for example, a genus-level-only Jaccard index).
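
A minimal Python sketch of that design, under assumptions: measures live in a small registry so new ones can be added later, and the cache is refreshed only for pairs that involve an added or changed signature. The names `MEASURES` and `update_cache`, and all signature and taxon IDs, are hypothetical; removal of deleted signatures is not handled here.

```python
def jaccard(a, b):
    """Intersection size over union size."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def overlap_coefficient(a, b):
    """Intersection size over the size of the smaller set."""
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0


def intersection_size(a, b):
    """Number of shared taxa."""
    return len(a & b)


# Registry of measures; adding a new measure means adding one entry here.
MEASURES = {
    "jaccard": jaccard,
    "overlap": overlap_coefficient,
    "intersection": intersection_size,
}


def update_cache(cache, signatures, changed):
    """Recompute only the pairs that involve an added or changed signature.

    `cache` maps a sorted pair (id1, id2) to {measure name: value}.
    """
    for sig in changed:
        for other in signatures:
            if other == sig:
                continue
            key = tuple(sorted((sig, other)))
            cache[key] = {
                name: fn(signatures[sig], signatures[other])
                for name, fn in MEASURES.items()
            }
    return cache


# Hypothetical data: initial full computation, then one new signature.
signatures = {"sigA": {"t1", "t2", "t3"}, "sigB": {"t2", "t3"}}
cache = update_cache({}, signatures, changed=list(signatures))

signatures["sigC"] = {"t3", "t9"}
cache = update_cache(cache, signatures, changed=["sigC"])
print(cache[("sigA", "sigC")])
```

Adding one signature to a collection of n touches n pairs rather than all n(n-1)/2, which is the point of computing only for signatures that are added or changed.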

@lwaldron lwaldron reopened this Jul 9, 2021
@lwaldron lwaldron added the priority necessary for early utility label Jul 9, 2021
@lgeistlinger
Collaborator

@tosfos @lwaldron : how do we proceed with this? Are we proceeding along these lines:

> We can also consider just picking one simple measure (like Jaccard Index) and programming this functionality into the wiki. Then anything more complicated will be outside the scope of the wiki. Good to discuss with Ike.
>
> That would be way better! And way cooler.

I am noting in this context that I recently played around with the concept of semantic similarity. Here we treat the NCBI taxonomy as an ontology and calculate pairwise similarity between signatures or groups of signatures, as implemented, e.g., in the ontologySimilarity package.
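
To give a feel for the idea, here is a small Python sketch of one simple ancestor-based similarity. It is not the method or API of the ontologySimilarity package; it assumes a toy taxonomy with invented IDs, scores two taxa by the overlap of their ancestor sets, and combines taxa into a signature-level score with a best-match average. All function names are hypothetical.

```python
# Toy taxonomy: child -> parent. IDs are invented placeholders,
# not real NCBI taxonomy IDs.
PARENT = {
    "sp1": "gen1",
    "sp2": "gen1",
    "sp3": "gen2",
    "gen1": "fam1",
    "gen2": "fam1",
    "fam1": None,
}


def ancestors(taxon):
    """Return the taxon together with all of its ancestors."""
    out = set()
    node = taxon
    while node is not None:
        out.add(node)
        node = PARENT[node]
    return out


def term_similarity(t1, t2):
    """Overlap of the two ancestor sets (Jaccard on lineages)."""
    a1, a2 = ancestors(t1), ancestors(t2)
    return len(a1 & a2) / len(a1 | a2)


def signature_similarity(sig1, sig2):
    """Best-match average: each taxon is paired with its closest taxon
    in the other signature, and the best scores are averaged."""
    if not sig1 or not sig2:
        return 0.0
    best1 = [max(term_similarity(a, b) for b in sig2) for a in sig1]
    best2 = [max(term_similarity(a, b) for a in sig1) for b in sig2]
    return (sum(best1) + sum(best2)) / (len(best1) + len(best2))


mixed = signature_similarity({"sp1"}, {"gen1"})   # species vs its own genus
distant = signature_similarity({"sp1"}, {"sp3"})  # species in different genera
print(mixed, distant)
```

In this toy example a species and its parent genus score well above two species from different genera, whereas an exact-match Jaccard index would give 0 in both cases.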

@lgeistlinger lgeistlinger pinned this issue Sep 13, 2021
@lwaldron
Member Author

Since this is an enhancement, it should be lower priority than getting the home page and all basic functionality in place. Eventually after the other key functionality is implemented, we may want to discuss whether semantic similarity is practical to implement, because it is elegant and probably will provide more relevant similarities than Jaccard for mixed-taxonomy signatures.

@lwaldron lwaldron removed the priority necessary for early utility label Sep 13, 2021
@lgeistlinger lgeistlinger unpinned this issue Sep 13, 2021

4 participants