Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support transcripts with >1 CDS? #3

Open
jayoung opened this issue Jan 28, 2022 · 3 comments
Open

support transcripts with >1 CDS? #3

jayoung opened this issue Jan 28, 2022 · 3 comments

Comments

@jayoung
Copy link

jayoung commented Jan 28, 2022

hi there,

I am working on a viral genome, where the annotations (based on ribosome profiling) fairly often include >1 CDS per exon. Using makeTxDbFromGFF means that some of these CDSs get dropped with a warning The following transcripts have exons that contain more than one CDS (only the first CDS was kept for each exon). I'd like to be able to keep all the CDSs.

I'm not sure if this website is what determines 'official' gff3 specs, but it suggests that >1 CDS per exon should be allowed:
https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md

Any change that makeTxDbFromGFF could support >1 CDS per exon? Not sure how tricky that would be. Example data below, from the website linked above.

Thanks!

Janet

Here's the example gff file from that page:

##gff-version 3
##sequence-region ctg123 1 1497228
ctg123	.	gene	1000	9000	.	+	.	ID=gene00001;Name=EDEN
ctg123	.	TF_binding_site	1000	1012	.	+	.	ID=tfbs00001;Parent=gene00001
ctg123	.	mRNA	1050	9000	.	+	.	ID=mRNA00001;Parent=gene00001;Name=EDEN.1
ctg123	.	mRNA	1050	9000	.	+	.	ID=mRNA00002;Parent=gene00001;Name=EDEN.2
ctg123	.	mRNA	1300	9000	.	+	.	ID=mRNA00003;Parent=gene00001;Name=EDEN.3
ctg123	.	exon	1300	1500	.	+	.	ID=exon00001;Parent=mRNA00003
ctg123	.	exon	1050	1500	.	+	.	ID=exon00002;Parent=mRNA00001,mRNA00002
ctg123	.	exon	3000	3902	.	+	.	ID=exon00003;Parent=mRNA00001,mRNA00003
ctg123	.	exon	5000	5500	.	+	.	ID=exon00004;Parent=mRNA00001,mRNA00002,mRNA00003
ctg123	.	exon	7000	9000	.	+	.	ID=exon00005;Parent=mRNA00001,mRNA00002,mRNA00003
ctg123	.	CDS	1201	1500	.	+	0	ID=cds00001;Parent=mRNA00001;Name=edenprotein.1
ctg123	.	CDS	3000	3902	.	+	0	ID=cds00001;Parent=mRNA00001;Name=edenprotein.1
ctg123	.	CDS	5000	5500	.	+	0	ID=cds00001;Parent=mRNA00001;Name=edenprotein.1
ctg123	.	CDS	7000	7600	.	+	0	ID=cds00001;Parent=mRNA00001;Name=edenprotein.1
ctg123	.	CDS	1201	1500	.	+	0	ID=cds00002;Parent=mRNA00002;Name=edenprotein.2
ctg123	.	CDS	5000	5500	.	+	0	ID=cds00002;Parent=mRNA00002;Name=edenprotein.2
ctg123	.	CDS	7000	7600	.	+	0	ID=cds00002;Parent=mRNA00002;Name=edenprotein.2
ctg123	.	CDS	3301	3902	.	+	0	ID=cds00003;Parent=mRNA00003;Name=edenprotein.3
ctg123	.	CDS	5000	5500	.	+	1	ID=cds00003;Parent=mRNA00003;Name=edenprotein.3
ctg123	.	CDS	7000	7600	.	+	1	ID=cds00003;Parent=mRNA00003;Name=edenprotein.3
ctg123	.	CDS	3391	3902	.	+	0	ID=cds00004;Parent=mRNA00003;Name=edenprotein.4
ctg123	.	CDS	5000	5500	.	+	1	ID=cds00004;Parent=mRNA00003;Name=edenprotein.4
ctg123	.	CDS	7000	7600	.	+	1	ID=cds00004;Parent=mRNA00003;Name=edenprotein.4

And here's the warning - we lose the edenprotein.4 annotation

library(txdbmaker)
example_txdb <- makeTxDbFromGFF(file="~/Desktop/example.gff3")
# Import genomic features from the file as a GRanges object ... OK
# Prepare the 'metadata' data frame ... OK
# Make the TxDb object ... OK
# Warning message:
#     In .find_exon_cds(exons, cds) :
#     The following transcripts have exons that contain more than one CDS (only the
#     first CDS was kept for each exon): mRNA00003

cds(example_txdb, use.names=TRUE)
# GRanges object with 10 ranges and 1 metadata column:
#               seqnames    ranges strand |    cds_id
#                   <Rle> <IRanges>  <Rle> | <integer>
#  edenprotein.1   ctg123 1201-1500      + |         1
#  edenprotein.2   ctg123 1201-1500      + |         2
#  edenprotein.1   ctg123 3000-3902      + |         3
#  edenprotein.3   ctg123 3301-3902      + |         4
#  edenprotein.1   ctg123 5000-5500      + |         5
#  edenprotein.2   ctg123 5000-5500      + |         6
#  edenprotein.3   ctg123 5000-5500      + |         7
#  edenprotein.1   ctg123 7000-7600      + |         8
#  edenprotein.2   ctg123 7000-7600      + |         9
#  edenprotein.3   ctg123 7000-7600      + |        10
@hpages
Copy link
Contributor

hpages commented Jan 29, 2022

Hi Janet,

The error message is a little bit misleading: it should say that makeTxDbFromGFF() does not support coding-transcripts that encode more than one protein.

In the Canonical Gene example shown in the GFF3 specs, transcript EDEN.3 encodes 2 proteins (edenprotein.3 and edenprotein.4), via 2 CDSs (cds00003 and cds00004). makeTxDbFromGFF(), and the underlying db schema used by TxDb objects, cannot handle this. Note that AFAIK the 2 major annotation providers UCSC and Ensembl don't handle this either. For example transcript tables at UCSC like knownGene or ncbiRefSeq use the cdsStart and cdsEnd fields to store a single CDS start and end value per transcript. IIRC the MySQL db used at Ensembl does something similar.

When we started working on TxDb objects back in 2010, our primary goal was to support UCSC and Ensembl annotations, so we designed a db schema that assumes a one-to-one relationship between coding transcripts and CDSs/proteins. So when you import the Canonical Gene example, only the first CDS on transcript EDEN.3 is imported:

library(txdbmaker)

EDEN <- system.file("extdata", "GFF3_files", "TheCanonicalGene_v1.gff3", package="txdbmaker")

txdb <- makeTxDbFromGFF(EDEN)
# Import genomic features from the file as a GRanges object ... OK
# Prepare the 'metadata' data frame ... OK
# Make the TxDb object ... OK
# Warning message:
# In .find_exon_cds(exons, cds) :
#   The following transcripts have exons that contain more than one CDS
#  (only the first CDS was kept for each exon): mRNA00003

cdsBy(txdb, by="tx", use.names=TRUE)
# GRangesList object of length 3:
# $EDEN.1
# GRanges object with 4 ranges and 3 metadata columns:
#       seqnames    ranges strand |    cds_id      cds_name exon_rank
#          <Rle> <IRanges>  <Rle> | <integer>   <character> <integer>
#   [1]   ctg123 1201-1500      + |         1 edenprotein.1         1
#   [2]   ctg123 3000-3902      + |         3 edenprotein.1         2
#   [3]   ctg123 5000-5500      + |         5 edenprotein.1         3
#   [4]   ctg123 7000-7600      + |         8 edenprotein.1         4
#   -------
#   seqinfo: 1 sequence from an unspecified genome; no seqlengths
# 
# $EDEN.2
# GRanges object with 3 ranges and 3 metadata columns:
#       seqnames    ranges strand |    cds_id      cds_name exon_rank
#          <Rle> <IRanges>  <Rle> | <integer>   <character> <integer>
#   [1]   ctg123 1201-1500      + |         2 edenprotein.2         1
#   [2]   ctg123 5000-5500      + |         6 edenprotein.2         2
#   [3]   ctg123 7000-7600      + |         9 edenprotein.2         3
#   -------
#   seqinfo: 1 sequence from an unspecified genome; no seqlengths
# 
# $EDEN.3
# GRanges object with 3 ranges and 3 metadata columns:
#       seqnames    ranges strand |    cds_id      cds_name exon_rank
#          <Rle> <IRanges>  <Rle> | <integer>   <character> <integer>
#   [1]   ctg123 3301-3902      + |         4 edenprotein.3         2
#   [2]   ctg123 5000-5500      + |         7 edenprotein.3         3
#   [3]   ctg123 7000-7600      + |        10 edenprotein.3         4
#   -------
#   seqinfo: 1 sequence from an unspecified genome; no seqlengths

The one-to-one relationship between coding transcripts and CDSs/proteins is a very early assumption that is at the root of the design/behavior of many functions in GenomicFeatures/txdbmaker, so is something that would be very difficult to change, and also very disruptive.

I knew that the GFF3 specs did support a one-to-many relationship between coding transcripts and CDSs/proteins, and that's a feature that I found surprising when I ran into it, because I was not aware of any real-world situation where this would actually happen, and because AFAIK none of the hundreds of thousands of GFF3 files provided by UCSC and Ensembl seemed to use this capability. Also makeTxDbFromGFF() has been used on hundreds or thousands of GFF3 files from FlyBase, WormBase, and other annotations providers, for many years, and AFAIK nobody has reported seeing that. So congratulations for finding and reporting the first case of such file 😉! Out of curiosity, may I know where is this file coming from, if that's something you're willing to share?

I don't know how UCSC or Ensembl handle this situation, or how frequently it is observed in biology. Maybe they just use 2 distinct transcript ids to represent such event? In your case it means that the GFF3 file would need to be modified but I don't know how hard that would be. If this is a viral genome, then there's no exon, and each CDS can be represented by a single line (no CDS parts), so the file should be much simpler than in the general situation where the gene/transcript/exon/cds hierarchy can be complex and hard to figure out. Maybe makeTxDbFromGFF() should be able to do something this like e.g. by importing 2 transcripts instead of one and assigning the corresponding protein IDs to them, or by adding the .1 and .2 suffixes to the original transcript ID? I'm open to suggestions.

H.

@jayoung
Copy link
Author

jayoung commented Feb 1, 2022

hi Herve,

thanks - this is useful insight into the back end. I think a good ad hoc solution in my case will be to make duplicate coding transcripts to achieve that one-to-one relationship. If that can be done in makeTxDbFromGFF, great, if not I can probably handle it myself. I like the .1/.2 suffix idea.

I think over the next few years having >1 ORF per transcript will become more common. There's good evidence that a lot of additional ORFs exist: I think the jury is still out on how many of them have important function, but it is clear that at least some of them are important. This review article covers the biology nicely, and this preprint has more in-depth analysis on how important it may (or may not be) in the human genome. There is also discussion in that preprint about how to represent these additional ORFs in databases: I haven't looked to see where they're at with that, but it sounds like it's an active area at EBI.

Viruses can have introns! Not often, but they do sometimes. I'm working with HSV-1 sequences - the 'canonical' annotation has just a handful of introns. A more recent and comprehensive [annotation[(https://www.nature.com/articles/s41467-020-15992-5) based on deep genomics+proteomics has a lot more introns (and probably some noise).

The gffs I have been working with are frankly a mess, and what I see there is unlikely to be generalizable. The annotations I'm trying to use seem to be only available in Genbank format - here. NCBI's website has a way to export that to gff3 (menu options 'send to' - 'file' - 'complete record' - 'gff3'). I can already see that exported gff3 has inconsistencies: some of CDSs have mRNA parents, some have a gene as parent instead.

I've also played a bit with the genbankr package and its readGenBank() and makeTxDbFromGenBank() functions, but again I think the annotations are not organized in a helpful way - I get an error about duplicated transcript IDs from makeTxDbFromGenBank().

Probably best not to attempt to solve this particular case - I'm trying to keep my eyes on the big picture for this project (which is actually to use VEP with the exported gff3 files after I try to fix some of these inconsistencies).

thanks again,

Janet

@jayoung
Copy link
Author

jayoung commented Feb 2, 2022

Regarding my own messed up gff3 file (I'm sure I will google my own question 5 years from now and want to know how I solved it) - turns out I am not the first person to have trouble with NCBI gff files. This post is old (and this), but I am seeing similar issues even now.

For the annotations I care about, I think I have managed to create a new gff3 file that makes much more sense, by extracting co-ordinates from an Excel file (horror!) of a published supplementary dataset.

Now that I have full control over how that gff3 gets constructed, I make sure there's one transcript per CDS, and I can now import using makeTxDbFromGFF just fine.

Back to my original suggestion, about supporting >1 CDS per transcript (or perhaps your thought about building in a way to duplicate transcripts, to allow for this without changing the schema): I do think this still could be useful in future, but is not something I need at the moment. Perhaps it makes sense to wait and see how Ensembl/UCSC end up representing the upstream ORFs described in that preprint.

thanks, Herve!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants