Annotate all transcripts #195

Open
joaoe opened this issue Oct 4, 2016 · 4 comments

Comments

joaoe commented Oct 4, 2016

One of the differences between VEP and varcode is that VEP is happy to annotate ALL transcripts in its database, including pseudogenes. Varcode limits itself to transcripts for which Transcript.is_protein_coding is True. Also see openvax/pyensembl#169.
As such, in the case of VEP, it's up to the developer/user to decide which transcripts are useful.

I'd like there to be a way to tell varcode which biotypes should be accepted. One possibility would be an optional callback ('transcript_filter') passed to predict_variant_effects() that returns True if a transcript should be annotated and False otherwise. The developer/user would then implement his/her own filtering logic, perhaps even filtering transcripts by ID. That way, uninteresting transcripts can be skipped (saving time and CPU cycles) and non-coding transcripts of interest can still be returned.
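
A minimal sketch of the callback idea, not varcode's actual code. The `Transcript` dataclass is a stand-in for a pyensembl-style transcript, and `select_transcripts` and `default_transcript_filter` are hypothetical names; only the 'transcript_filter' name comes from the proposal above. The transcript IDs are made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass(frozen=True)
class Transcript:
    """Stand-in for a pyensembl-style transcript (hypothetical)."""
    transcript_id: str
    biotype: str


def default_transcript_filter(transcript: Transcript) -> bool:
    # Mirrors the behavior described above: only protein_coding passes.
    return transcript.biotype == "protein_coding"


def select_transcripts(
    transcripts: Iterable[Transcript],
    transcript_filter: Optional[Callable[[Transcript], bool]] = None,
) -> List[Transcript]:
    """Return the transcripts the caller wants annotated (hypothetical helper)."""
    if transcript_filter is None:
        # No callback given: fall back to the protein-coding-only default.
        transcript_filter = default_transcript_filter
    return [t for t in transcripts if transcript_filter(t)]


# Made-up transcript IDs, for illustration only.
transcripts = [
    Transcript("ENST_A", "protein_coding"),
    Transcript("ENST_B", "processed_pseudogene"),
    Transcript("ENST_C", "IG_V_gene"),
]

# Default behavior is unchanged.
default_ids = [t.transcript_id for t in select_transcripts(transcripts)]

# A user-supplied callback can filter by ID instead of by biotype.
wanted = {"ENST_B", "ENST_C"}
by_id = [
    t.transcript_id
    for t in select_transcripts(transcripts, lambda t: t.transcript_id in wanted)
]
```

Because the filter runs before any effect prediction, rejected transcripts cost nothing beyond the callback call itself.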

Another challenge is that VEP also annotates incomplete transcripts. Supporting this might be a bit more laborious, though. Perhaps something for a different task.

Thoughts?

Contributor

iskandr commented Oct 5, 2016

What kinds of annotations are you interested in for non-coding transcripts? We've been fairly narrowly focused on coding effects so I don't know what you can say about a non-coding transcript.

Author

joaoe commented Oct 5, 2016

I'm interested in comparing tools and getting them to behave as similarly as possible. I'm not that interested in non-coding transcripts myself, though other people might be. But by checking for biotype = "protein_coding" you're skipping a bunch of coding biotypes. If there were a generic API for people to pick whichever transcripts they want, I guess varcode would become more useful.

Contributor

iskandr commented Oct 5, 2016

If a transcript is already annotated as triggering NMD due to an early stop codon, is it useful to predict some other effect in its protein sequence (e.g. single amino acid substitution)? It might be but I can't currently think of the use-case.

I can try adding a parameter for a set of biotypes on which we perform predictions but it's not clear to me that those predictions will always be meaningful.
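
A rough sketch of what such a biotype parameter could look like, assuming a default that keeps today's protein-coding-only behavior. `transcripts_to_predict`, `DEFAULT_BIOTYPES` and the `Transcript` tuple are hypothetical names for illustration, not part of varcode or pyensembl.

```python
from typing import FrozenSet, Iterable, List, NamedTuple


class Transcript(NamedTuple):
    """Stand-in for a pyensembl-style transcript (hypothetical)."""
    transcript_id: str
    biotype: str


# Assumed default: keep the current protein-coding-only behavior.
DEFAULT_BIOTYPES: FrozenSet[str] = frozenset({"protein_coding"})


def transcripts_to_predict(
    transcripts: Iterable[Transcript],
    biotypes: Iterable[str] = DEFAULT_BIOTYPES,
) -> List[Transcript]:
    """Keep only transcripts whose biotype is in the accepted set."""
    allowed = frozenset(biotypes)
    return [t for t in transcripts if t.biotype in allowed]


# Made-up transcript IDs, for illustration only.
transcripts = [
    Transcript("ENST_A", "protein_coding"),
    Transcript("ENST_B", "nonsense_mediated_decay"),
    Transcript("ENST_C", "lincRNA"),
]

default_kept = [t.transcript_id for t in transcripts_to_predict(transcripts)]

# Opting in to NMD transcripts; whether the resulting predictions are
# meaningful is exactly the open question raised above.
with_nmd = [
    t.transcript_id
    for t in transcripts_to_predict(
        transcripts, biotypes={"protein_coding", "nonsense_mediated_decay"}
    )
]
```

A plain set is less flexible than a callback (it cannot filter by transcript ID), but it is easier to document and validate.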

Author

joaoe commented Oct 5, 2016

Everything you said is quite valid, but that's not the issue. The issue is just letting the user pick and choose which transcripts/biotypes he/she wants, while keeping the current behavior as the default. For example, right now IG* and TR* transcripts are ignored too.

For instance https://github.com/joaoe/varcode/commit/fe02769f199f9e6c6d2a6e8075786cd2a19d2f89
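
One way to express "IG* and TR*" while keeping the current default is glob-style matching on biotype names. The sketch below is an illustration only and is not taken from the linked commit; `biotype_accepted` and `DEFAULT_BIOTYPE_PATTERNS` are hypothetical names, and the biotype strings are assumed to follow Ensembl naming such as `IG_V_gene` and `TR_C_gene`.

```python
from fnmatch import fnmatchcase
from typing import Sequence

# Assumed default: only protein_coding is accepted, as today.
DEFAULT_BIOTYPE_PATTERNS = ("protein_coding",)


def biotype_accepted(
    biotype: str,
    patterns: Sequence[str] = DEFAULT_BIOTYPE_PATTERNS,
) -> bool:
    """True if the biotype matches any glob pattern (case-sensitive)."""
    return any(fnmatchcase(biotype, pattern) for pattern in patterns)


biotypes = ["protein_coding", "IG_V_gene", "IG_V_pseudogene", "TR_C_gene", "lincRNA"]

# With no patterns given, behavior stays protein-coding-only.
default_kept = [b for b in biotypes if biotype_accepted(b)]

# "IG_*_gene" and "TR_*_gene" pick up the immunoglobulin and T-cell
# receptor gene segments but leave out the matching *_pseudogene biotypes.
extended = ("protein_coding", "IG_*_gene", "TR_*_gene")
extended_kept = [b for b in biotypes if biotype_accepted(b, extended)]
```

A looser pattern such as `IG_*` would also match `IG_V_pseudogene`, so the pattern choice decides whether pseudogenes come along.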
