support transcripts with >1 CDS? #3
Hi Janet, The error message is a little bit misleading: it should say that In the Canonical Gene example shown in the GFF3 specs, transcript When we started working on TxDb objects back in 2010, our primary goal was to support UCSC and Ensembl annotations, so we designed a db schema that assumes a one-to-one relationship between coding transcripts and CDSs/proteins. So when you import the Canonical Gene example, only the first CDS on transcript EDEN.3 is imported:
The one-to-one relationship between coding transcripts and CDSs/proteins is a very early assumption that is at the root of the design/behavior of many functions in GenomicFeatures/txdbmaker, so it is something that would be very difficult to change, and also very disruptive.

I knew that the GFF3 specs did support a one-to-many relationship between coding transcripts and CDSs/proteins, and that's a feature that I found surprising when I ran into it, because I was not aware of any real-world situation where this would actually happen, and because AFAIK none of the hundreds of thousands of GFF3 files provided by UCSC and Ensembl seemed to use this capability.

Also I don't know how UCSC or Ensembl handle this situation, or how frequently it is observed in biology. Maybe they just use 2 distinct transcript ids to represent such an event? In your case it means that the GFF3 file would need to be modified, but I don't know how hard that would be. If this is a viral genome, then there's no exon, and each CDS can be represented by a single line (no CDS parts), so the file should be much simpler than in the general situation where the gene/transcript/exon/cds hierarchy can be complex and hard to figure out. Maybe

H.
hi Herve,

thanks - this is useful insight into the back end. I think a good ad hoc solution in my case will be to make duplicate coding transcripts to achieve that one-to-one relationship. If that can be done in

I think over the next few years having >1 ORF per transcript will become more common. There's good evidence that a lot of additional ORFs exist: I think the jury is still out on how many of them have important function, but it is clear that at least some of them are important. This review article covers the biology nicely, and this preprint has more in-depth analysis on how important it may (or may not) be in the human genome. There is also discussion in that preprint about how to represent these additional ORFs in databases: I haven't looked to see where they're at with that, but it sounds like it's an active area at EBI.

Viruses can have introns! Not often, but they do sometimes. I'm working with HSV-1 sequences - the 'canonical' annotation has just a handful of introns. A more recent and comprehensive [annotation](https://www.nature.com/articles/s41467-020-15992-5) based on deep genomics+proteomics has a lot more introns (and probably some noise).

The gff files I have been working with are frankly a mess, and what I see there is unlikely to be generalizable. The annotations I'm trying to use seem to be only available in Genbank format - here. NCBI's website has a way to export that to gff3 (menu options 'send to' - 'file' - 'complete record' - 'gff3'). I can already see that the exported gff3 has inconsistencies: some of the CDSs have mRNA parents, some have a gene as parent instead. I've also played a bit with the

Probably best not to attempt to solve this particular case - I'm trying to keep my eyes on the big picture for this project (which is actually to use VEP with the exported gff3 files after I try to fix some of these inconsistencies).

thanks again,

Janet
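For anyone landing here with a similar file, the two problems mentioned above (transcripts carrying more than one CDS, and CDSs whose parent is a gene rather than an mRNA) can be checked before import. The following is only a rough sketch in plain Python, not part of txdbmaker or GenomicFeatures; the function names `parse_attributes` and `audit_cds_parents` are made up for illustration, and it assumes a tab-separated GFF3 where CDS rows carry `ID` and `Parent` attributes.

```python
from collections import defaultdict


def parse_attributes(field):
    """Turn a GFF3 column-9 string such as 'ID=cds1;Parent=tx1' into a dict."""
    attrs = {}
    for pair in field.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def audit_cds_parents(gff_lines):
    """Hypothetical helper: report parents with >1 distinct CDS ID and
    CDS parents whose feature type is not 'mRNA'."""
    feature_type = {}           # feature ID -> feature type (column 3)
    cds_ids = defaultdict(set)  # parent ID -> distinct CDS IDs under it
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a feature row
        attrs = parse_attributes(cols[8])
        if "ID" in attrs:
            feature_type[attrs["ID"]] = cols[2]
        if cols[2] == "CDS" and "Parent" in attrs:
            # Rows sharing one CDS ID are parts of the same CDS,
            # so they are counted once per parent.
            for parent in attrs["Parent"].split(","):
                cds_ids[parent].add(attrs.get("ID", "(no ID)"))
    multi_cds = {p: sorted(ids) for p, ids in cds_ids.items() if len(ids) > 1}
    non_mrna = {p: feature_type.get(p, "(unknown)")
                for p in cds_ids if feature_type.get(p) != "mRNA"}
    return multi_cds, non_mrna
```

Calling `audit_cds_parents(open("my.gff3"))` (file name is a placeholder) returns two dicts: parents with several CDS IDs, and CDS parents that are not mRNA features. Rows with fewer or more than nine columns are ignored, so a badly malformed export would need a stricter parser.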
Regarding my own messed up gff3 file (I'm sure I will google my own question 5 years from now and want to know how I solved it) - turns out I am not the first person to have trouble with NCBI gff files. This post is old (and this), but I am seeing similar issues even now.

For the annotations I care about, I think I have managed to create a new gff3 file that makes much more sense, by extracting co-ordinates from an Excel file (horror!) of a published supplementary dataset. Now that I have full control over how that gff3 gets constructed, I make sure there's one transcript per CDS, and I can now import using

Back to my original suggestion, about supporting >1 CDS per transcript (or perhaps your thought about building in a way to duplicate transcripts, to allow for this without changing the schema): I do think this still could be useful in future, but is not something I need at the moment. Perhaps it makes sense to wait and see how Ensembl/UCSC end up representing the upstream ORFs described in that preprint.

thanks, Herve!
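The "duplicate the transcript so each CDS gets its own" idea discussed above could be sketched roughly as follows. This is not an existing txdbmaker feature and not how the file in this thread was actually built; it is a minimal illustration in plain Python under several assumptions: the function name `split_transcripts_by_cds` and the `<transcript>.<cds>` naming scheme are invented here, every CDS row has an `ID`, and shared children such as exons are kept as a single row with a comma-separated multi-valued `Parent` (which GFF3 permits). Whether the result imports cleanly with a given tool would still need checking.

```python
from collections import defaultdict


def parse_attributes(field):
    """Turn a GFF3 column-9 string into an ordered dict of key -> value."""
    attrs = {}
    for pair in field.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def format_row(cols, attrs):
    """Rebuild a GFF3 row from its first 8 columns and an attribute dict."""
    col9 = ";".join(f"{k}={v}" for k, v in attrs.items())
    return "\t".join(cols[:8] + [col9])


def split_transcripts_by_cds(gff_lines, transcript_types=("mRNA", "transcript")):
    """Hypothetical helper: give each CDS ID its own copy of its transcript."""
    rows = []
    for line in gff_lines:
        line = line.rstrip("\n")
        cols = line.split("\t")
        if line.startswith("#") or len(cols) != 9:
            rows.append((line, None, None))  # passed through unchanged
        else:
            rows.append((line, cols, parse_attributes(cols[8])))

    # Pass 1: collect transcript IDs and the distinct CDS IDs under each parent.
    cds_by_parent = defaultdict(list)
    transcript_ids = set()
    for _, cols, attrs in rows:
        if cols is None:
            continue
        if cols[2] in transcript_types and "ID" in attrs:
            transcript_ids.add(attrs["ID"])
        if cols[2] == "CDS" and "ID" in attrs and "Parent" in attrs:
            for parent in attrs["Parent"].split(","):
                if attrs["ID"] not in cds_by_parent[parent]:
                    cds_by_parent[parent].append(attrs["ID"])

    # Transcripts with >1 CDS: map each CDS ID to a new transcript ID.
    copies = {tx: {cds: f"{tx}.{cds}" for cds in ids}
              for tx, ids in cds_by_parent.items()
              if tx in transcript_ids and len(ids) > 1}

    # Pass 2: emit duplicated transcripts and re-point their children.
    out = []
    for line, cols, attrs in rows:
        if cols is None:
            out.append(line)
        elif cols[2] in transcript_types and attrs.get("ID") in copies:
            for new_id in copies[attrs["ID"]].values():
                out.append(format_row(cols, {**attrs, "ID": new_id}))
        elif "Parent" in attrs and any(p in copies for p in attrs["Parent"].split(",")):
            new_parents = []
            for p in attrs["Parent"].split(","):
                if p not in copies:
                    new_parents.append(p)
                elif cols[2] == "CDS" and attrs.get("ID") in copies[p]:
                    new_parents.append(copies[p][attrs["ID"]])  # own copy only
                else:
                    new_parents.extend(copies[p].values())      # e.g. exons: all copies
            out.append(format_row(cols, {**attrs, "Parent": ",".join(new_parents)}))
        else:
            out.append(line)
    return out
```

Transcripts with zero or one CDS are left untouched, so the function should be a no-op on files that already satisfy the one-to-one assumption.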
hi there,
I am working on a viral genome, where the annotations (based on ribosome profiling) fairly often include >1 CDS per exon. Using `makeTxDbFromGFF` means that some of these CDSs get dropped with a warning: "The following transcripts have exons that contain more than one CDS (only the first CDS was kept for each exon)". I'd like to be able to keep all the CDSs.

I'm not sure if this website is what determines 'official' gff3 specs, but it suggests that >1 CDS per exon should be allowed:
https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md
Any chance that makeTxDbFromGFF could support >1 CDS per exon? Not sure how tricky that would be. Example data below, from the website linked above.
Thanks!
Janet
Here's the example gff file from that page:
And here's the warning - we lose the edenprotein.4 annotation