Enable quantification using StringTie AND a custom Ensembl genome #1074
Comments
@mplescher could you please provide a reproducible example of this? I've quickly written up a solution, but I wanted to replicate your issue before asking to merge, and I find that I can't. For example, in the test profile, I intercepted the stringtie process and changed the first entry to a gene without a transcript_id attribute, and that seemed to be fine.
Probably related to/similar to #1102? For an example GTF file, see this thread in the rnaseq-Slack channel, and this gffread issue for some background information.
Well, this is a stringtie error, so possibly not directly related, but yes, maybe it's empty transcript_id attributes rather than missing ones that are the issue.
Yes, that is what I was thinking. Multiple tools struggle with empty strings in the GTF attributes, because they violate the original format specification. Yet NCBI and Ensembl now release GTF files of this kind. This is why I linked the gffread issue above, where Geo Pertea states:
Based on this, I do not expect a fix for Hence, I think, we must remove those lines or at least include a check, because everyone trying to use an up-to-date reference transcriptome in GTF format will likely download an invalid file.
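To make the "include a check" idea concrete, here is a minimal Python sketch that counts GTF records whose transcript_id is either missing or an empty string, grouped by feature type. The function name `count_bad_transcript_ids` is hypothetical, and this is only an illustration of such a check, not what the pipeline or #1107 actually does.

```python
import re
from collections import Counter

# Matches transcript_id "<value>" in the attribute column; the value may be empty.
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]*)"')


def count_bad_transcript_ids(lines):
    """Count records per (feature, problem) whose transcript_id is missing or empty.

    Hypothetical helper for illustration; `lines` is any iterable of GTF lines.
    """
    bad = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a well-formed 9-column GTF record
        match = TRANSCRIPT_ID.search(fields[8])
        if match is None:
            bad[(fields[2], "missing")] += 1
        elif match.group(1) == "":
            bad[(fields[2], "empty")] += 1
    return bad
```

Running it over an open GTF file handle would show whether only gene records are affected or whether other feature types also carry empty transcript_id values.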
OK, I think this is now addressed in #1107. I've also added a line to the GTF in the test data to make doubly sure I'm right.
Description of feature
Hi.
I am using the nf-core rnaseq pipeline, version 3.12.0.
Since you pointed out that the transcriptome and GTF files in iGenomes are vastly out of date here, I am using a custom Ensembl genome, version 110.
I tried out these two:
I would like to use StringTie for transcript assembly and quantification, but ran into this bug. It seems that all gene lines in the Ensembl GTF lack the transcript_id attribute that StringTie requires. Since StringTie only needs the transcript annotation anyway, simply removing all gene lines from the GTF file solves the problem.
e.g. run:
Would it be possible to check for Ensembl genomes automatically and (temporarily) remove the gene lines if necessary?
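For readers who want to try the workaround, the following Python sketch keeps header lines and only those records that carry a non-empty transcript_id, which drops the gene lines described above. It is an illustration under assumptions, not the command from the original post and not the fix in #1107; the names `keep_transcript_records` and `filter_gtf` are hypothetical.

```python
import re

# A record is kept only if its attribute column has a non-empty transcript_id.
HAS_TRANSCRIPT_ID = re.compile(r'transcript_id "[^"]+"')


def keep_transcript_records(lines):
    """Yield header lines and records with a non-empty transcript_id (hypothetical helper)."""
    for line in lines:
        if line.startswith("#"):
            yield line  # preserve header/comment lines unchanged
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 9 and HAS_TRANSCRIPT_ID.search(fields[8]):
            yield line


def filter_gtf(src_path, dst_path):
    """Write a filtered copy of the GTF at src_path to dst_path."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.writelines(keep_transcript_records(src))
```

Writing to a separate file leaves the original annotation untouched, so the unfiltered GTF stays available for steps that need the gene records.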
Many thanks.