Gene Name attributes that start with gensp. #187

Open
sammyjava opened this issue Dec 9, 2023 · 8 comments

@sammyjava
Contributor

We talked about this already, but I'd like to take action and update the Datastore where appropriate.

There are a number of gene_models_main GFFs that have the Name attribute starting with gensp. This seems non-conformant to me, in the sense that Name is meant to be what a gene is called in the source material and the gensp prefixes tend to be an LIS thing.

Here's the list with an example of a GFF line for each case. I'd like @StevenCannon-USDA to confirm that the gene Name attributes should, in fact, contain the gensp prefix in these cases or, when not, to update the GFFs (outsourcing that to me is fine). Also, @adf-ncgr may have some arcane reasons for including the gensp prefix in certain cases. (Name uniqueness does not qualify as a reason, in my opinion, but he may have some JBrowse-related or other reasons for doing so.)

 starts_with | count 
-------------+-------
 cajca.      | 40071
 cicar.      | 58526
 glyma.      | 96036
 glyso.      | 102507
 lupal.      | 38258
 lupan.      | 33072
 medtr.      | 517176
 tripr.      | 39948
 vigun.      | 31948
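
A tally like the one above can also be approximated directly from the GFF files. The following is a minimal sketch, not the query that produced the table: the function name `count_gensp_names` is hypothetical, and it assumes that a gensp prefix is always five lowercase letters followed by a dot, which is a heuristic and could also match an unrelated five-letter prefix.

```shell
# Hypothetical helper: tally gene Name attributes that begin with a
# five-letter lowercase prefix plus a dot (assumed to be the gensp form).
# Reads uncompressed GFF3 on stdin; prints "prefix<TAB>count" lines.
count_gensp_names() {
  LC_ALL=C awk -F'\t' '
    $3 == "gene" {
      # Prepend ";" so that Name is found wherever it sits in column 9.
      attrs = ";" $9
      if (match(attrs, /;Name=[a-z][a-z][a-z][a-z][a-z]\./)) {
        # ";Name=" is six characters, so the prefix starts at RSTART + 6.
        count[substr(attrs, RSTART + 6, 6)]++
      }
    }
    END { for (p in count) print p "\t" count[p] }
  ' | LC_ALL=C sort
}
```

It could be run per file, for example `zcat cajca.ICPL87119.gnm1.ann1.Y27M.gene_models_main.gff3.gz | count_gensp_names`.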

cajca.ICPL87119.gnm1.ann1.Y27M.gene_models_main.gff3.gz

cajca.ICPL87119.gnm1.Cc01	GLEAN	gene	13892	14559	0.659822	+	.	ID=cajca.ICPL87119.gnm1.ann1.C.cajan_19181;Name=cajca.C.cajan_19181;evid_id=C.cajan_GLEAN_10029733;Dbxref=Gene3D:G3DSA:1.10.10.60,InterPro:IPR001005,InterPro:IPR006447,InterPro:IPR009057,InterPro:IPR017930,JCVI_TIGRFAMS:TIGR01557,PANTHER:PTHR12802,PANTHER:PTHR12802:SF23,Pfam:PF00249,Prosite:PS51294,SMART:SM00717,Superfamily:SSF46689;Ontology_term=GO:0003677,GO:0003682;Note=MYB transcription factor MYB114 isoform X2 [Glycine max]%3B IPR009057 (Homeodomain-like)%3B GO:0003677 (DNA binding)%2C GO:0003682 (chromatin binding)

cicar.CDCFrontier.gnm1.ann1.nRhs.gene_models_main.gff3.gz

cicar.CDCFrontier.gnm1.C11095950	GLEAN	gene	138	470	0.999968	+	.	ID=cicar.CDCFrontier.gnm1.ann1.Ca_28062;Name=cicar.CDCFrontier.Ca_28062;evid_id=GAR_10000002;Note=SAUR-like auxin-responsive protein family%3B IPR003676 (Auxin-induced protein%2C ARG7);Dbxref=InterPro:IPR003676,PANTHER:PTHR31374,PANTHER:PTHR31374:SF0,Pfam:PF02519;

cicar.ICC4958.gnm2.ann1.LCVX.gene_models_main.gff3.gz

cicar.ICC4958.gnm2.Ca1	cicar.ICC4958.gnm2.ann1	gene	6359	6790	.	+	.	ID=cicar.ICC4958.gnm2.ann1.Ca_00001;Name=cicar.ICC4958.Ca_00001;Dbxref=Gene3D:G3DSA:3.40.50.720,InterPro:IPR009036,InterPro:IPR016040,PANTHER:PTHR10953,PANTHER:PTHR10953:SF29,Superfamily:SSF69572;Note=NEDD8-activating enzyme E1 regulatory subunit-like protein%3B IPR016040 (NAD(P)-binding domain)

Glycine/max/annotations/Lee.gnm1.ann1.6NZV/glyma.Lee.gnm1.ann1.6NZV.gene_models_main.gff3.gz

glyma.Lee.gnm1.Gm01	phytozomev13	gene	37775	37993	.	+	.	ID=glyma.Lee.gnm1.ann1.GlymaLee.01G000100;Name=glyma.Lee.gnm1.ann1.GlymaLee.01G000100;Note=Unknown protein

Glycine/max/annotations/Wm82_ISU01.gnm2.ann1.FGFB/glyma.Wm82_ISU01.gnm2.ann1.FGFB.gene_models_main.gff3.gz

glyma.Wm82_ISU01.gnm2.Gm01	phytozomev13	gene	78503	103594	.	-	.	ID=glyma.Wm82_ISU01.gnm2.ann1.GmISU01.01G000050;Name=glyma.Wm82_ISU01.gnm2.ann1.GmISU01.01G000050;Dbxref=Gene3D:G3DSA:3.30.390.10,Gene3D:G3DSA:3.40.50.970,Prosite:PS51257,Superfamily:SSF52518;Note=protein PHYLLO%2C chloroplastic-like isoform X5 [Glycine max]

Glycine/soja/annotations/PI483463.gnm1.ann1.3Q3Q/glyso.PI483463.gnm1.ann1.3Q3Q.gene_models_main.gff3.gz

glyso.PI483463.gnm1.Gs01	phytozomev13	gene	42343	43123	.	-	.	ID=glyso.PI483463.gnm1.ann1.GlysoPI483463.01G000100;Name=glyso.PI483463.gnm1.ann1.GlysoPI483463.01G000100;Dbxref=Gene3D:G3DSA:3.30.390.10;Note=protein PHYLLO%2C chloroplastic-like isoform X4 [Glycine max]

Glycine/soja/annotations/W05.gnm1.ann1.T47J/glyso.W05.gnm1.ann1.T47J.gene_models_main.gff3.gz

glyso.W05.gnm1.Chr01	maker	gene	60339	60901	.	-	.	ID=glyso.W05.gnm1.ann1.Glysoja.01G000001;Name=glyso.W05.gnm1.ann1.Glysoja.01G000001;Dbxref=Gene3D:G3DSA:3.30.390.10;Note=protein PHYLLO%2C chloroplastic-like isoform X1 [Glycine max]

Lupinus/albus/annotations/Amiga.gnm1.ann1.3GKS/lupal.Amiga.gnm1.ann1.3GKS.gene_models_main.gff3.gz

lupal.Amiga.gnm1.Lalb_Chr00c01	EuGene	gene	40143	40433	.	+	.	ID=lupal.Amiga.gnm1.ann1.gene:Lalb_Chr00c01g0403611;Name=lupal.Lalb_Chr00c01g0403611;locus_tag=Lalb_Chr00c01g0403611;Dbxref=PANTHER:PTHR11439;Note=Retrotransposon protein%2C putative%2C unclassified n%3D1 Tax%3DOryza sativa subsp. japonica RepID%3DQ10SZ0_ORYSJ

Lupinus/angustifolius/annotations/Tanjil.gnm1.ann1.nnV9/lupan.Tanjil.gnm1.ann1.nnV9.gene_models_main.gff3.gz

lupan.Tanjil.gnm1.NLL-01	lupan.Tanjil.gnm1.ann1.nnV9	gene	603	4044	0.696	+	.	ID=lupan.Tanjil.gnm1.ann1.Lup027320;Name=lupan.Lup027320;source_id=Lupinus_GLEAN_10030675;identical_support_id=CUFF72.441.1;Dbxref=Gene3D:G3DSA:1.20.1250.20,InterPro:IPR001917,InterPro:IPR003663,InterPro:IPR005828,InterPro:IPR005829,InterPro:IPR016196,InterPro:IPR020846,JCVI_TIGRFAMS:TIGR00879,PANTHER:PTHR24063,PANTHER:PTHR24063:SF171,PRINTS:PR00171,Pfam:PF00083,Prosite:PS00216,Prosite:PS00217,Prosite:PS00599,Prosite:PS50850,Superfamily:SSF103473;Ontology_term=GO:0005215,GO:0006810,GO:0008152,GO:0016020,GO:0016021,GO:0016740,GO:0022857,GO:0022891,GO:0055085;Note=Membrane transporter D1 n%3D3 Tax%3DAndropogoneae RepID%3DB6U4Q3_MAIZE%3B IPR001917 (Aminotransferase%2C class-II%2C pyridoxal-phosphate binding site)%2C IPR005828 (General substrate transporter)%2C IPR016196 (Major facilitator superfamily domain%2C general substrate transporter)%3B GO:0005215 (transporter activity)%2C GO:0006810 (transport)%2C GO:0008152 (metabolic process)%2C GO:0016020 (membrane)%2C GO:0016021 (integral component of membrane)%2C GO:0016740 (transferase activity)%2C GO:0022857 (transmembrane transporter activity)%2C GO:0022891 (substrate-specific transmembrane transporter activity)%2C GO:0055085 (transmembrane transport)

Medicago/truncatula/annotations/HM056.gnm1.ann1.CHP6/medtr.HM056.gnm1.ann1.CHP6.gene_models_main.gff3.gz
Medicago/truncatula/annotations/HM058.gnm1.ann1.LXPZ/medtr.HM058.gnm1.ann1.LXPZ.gene_models_main.gff3.gz
Medicago/truncatula/annotations/HM060.gnm1.ann1.H41P/medtr.HM060.gnm1.ann1.H41P.gene_models_main.gff3.gz
Medicago/truncatula/annotations/HM095.gnm1.ann1.55W4/medtr.HM095.gnm1.ann1.55W4.gene_models_main.gff3.gz
Medicago/truncatula/annotations/HM125.gnm1.ann1.KY5W/medtr.HM125.gnm1.ann1.KY5W.gene_models_main.gff3.gz
Medicago/truncatula/annotations/HM129.gnm1.ann1.7FTD/medtr.HM129.gnm1.ann1.7FTD.gene_models_main.gff3.gz
Medicago/truncatula/annotations/HM185.gnm1.ann1.GB3D/medtr.HM185.gnm1.ann1.GB3D.gene_models_main.gff3.gz
Medicago/truncatula/annotations/HM324.gnm1.ann1.SQH2/medtr.HM324.gnm1.ann1.SQH2.gene_models_main.gff3.gz

medtr.HM324.gnm1.scaffold_0	.	gene	5673	6194	.	+	.	ID=medtr.HM324.gnm1.ann1.g1;Name=medtr.HM324.g1;Dbxref=InterPro:IPR010259,Pfam:PF05922;Ontology_term=GO:0004252,GO:0042802,GO:0043086;Note=subtilisin-like protease-like isoform X7 [Glycine max]%3B IPR010259 (Proteinase inhibitor I9)%3B GO:0004252 (serine-type endopeptidase activity)%2C GO:0042802 (identical protein binding)%2C GO:0043086 (negative regulation of catalytic activity)

Trifolium/pratense/annotations/MilvusB.gnm2.ann1.DFgp/tripr.MilvusB.gnm2.ann1.DFgp.gene_models_main.gff3.gz

tripr.MilvusB.gnm2.Tp1	ensembl	gene	1135	2485	.	-	.	ID=tripr.MilvusB.gnm2.ann1.gene2499;Name=tripr.gene2499;Note=F1F0-ATPase inhibitor protein%252C putative%253B IPR007648 (ATPase inhibitor%252C IATP%252C mitochondria)%253B GO:0004857 (enzyme inhibitor activity)%252C GO:0005739 (mitochondrion)%252C GO:0045980 (negative regulation of nucleotide metabolic process)%253B*-**%253B AT5G04750.1;Dbxref=Coils:Coil,InterPro:IPR007648,Pfam:PF04568;Ontology_term=GO:0004857,GO:0005739,GO:0045980

Vigna/unguiculata/annotations/IT97K-499-35.gnm1.ann2.FD7K/vigun.IT97K-499-35.gnm1.ann2.FD7K.gene_models_main.gff3.gz

vigun.IT97K-499-35.gnm1.Vu01	phytozomev13	gene	1951	3899	.	+	.	ID=vigun.IT97K-499-35.gnm1.ann2.Vigun01g000100;Name=vigun.IT97K-499-35.Vigun01g000100;ancestorIdentifier=Vigun01g000100.v1.1;Dbxref=InterPro:IPR011108,PANTHER:PTHR11203,PANTHER:PTHR11203:SF8,Pfam:PF07521,Superfamily:SSF56281;Note=cleavage and polyadenylation specificity factor 73 kDa subunit-II%3B IPR011108 (RNA-metabolising metallo-beta-lactamase)
@StevenCannon-USDA

StevenCannon-USDA commented Dec 11, 2023

Here are my opinions about those cases:

cajca.C.cajan_19181 ==> C.cajan_19181
cicar.CDCFrontier.Ca_28062 ==> Ca_28062
cicar.ICC4958.Ca_00001 ==> Ca_00001
glyma.Lee.gnm1.ann1.GlymaLee.01G000100 ==> GlymaLee.01G000100
glyma.Wm82_ISU01.gnm2.ann1.GmISU01.01G000050 ==> GmISU01.01G000050  *[1]
glyso.PI483463.gnm1.ann1.GlysoPI483463.01G000100 ==> GlysoPI483463.01G000100
glyso.W05.gnm1.ann1.Glysoja.01G000001 ==> Glysoja.01G000001
lupal.Lalb_Chr00c01g0403611 ==> Lalb_Chr00c01g0403611
lupan.Lup027320 ==> Lup027320
medtr.HM324.g1 ==> g1     *[2]
tripr.gene2499 ==> gene2499
vigun.IT97K-499-35.Vigun01g000100 ==> Vigun01g000100

Notes:
*[1]
Wm82_ISU01 will probably go away soon, to be replaced by an annotation called
Wm82.gnm6.ann1, with names looking like glyma.01G00100 (derived from Wm82.gnm4.ann4 names when possible).

*[2]
For the Medicago genes, here is a full list of the forms:

# For each annotation file, print the ID and Name of the first gene record
for file in */*gene_models_main.gff3.gz; do
  zcat "$file" | awk -v FS="\t" '$3 == "gene" {print $9}' | head -1 |
  perl -pe 's/ID=([^;]+);Note=[^;]+;Name=([^;]+);.+/$1\t$2/' |
  perl -pe 's/ID=([^;]+);Name=([^;]+);.+/$1\t$2/';
done
medtr.A17_HM341.gnm4.ann2.Medtr1g004930	Medtr1g004930
medtr.A17.gnm5.ann1_6.MtrunA17CPg0492171	MtrunA17CPg0492171
medtr.HM004.gnm1.ann1.g1	HM004.g1
medtr.HM010.gnm1.ann1.g1	HM010.g1
medtr.HM022.gnm1.ann1.g1	HM022.g1
medtr.HM023.gnm1.ann1.g1	HM023.g1
medtr.HM034.gnm1.ann1.g1	HM034.g1
medtr.HM050.gnm1.ann1.g1	HM050.g1
medtr.HM056.gnm1.ann1.g1	medtr.HM056.g1
medtr.HM058.gnm1.ann1.g1	medtr.HM058.g1
medtr.HM060.gnm1.ann1.g1	medtr.HM060.g1
medtr.HM095.gnm1.ann1.g1	medtr.HM095.g1
medtr.HM125.gnm1.ann1.h3436.02	medtr.HM125.h3436.02
medtr.HM129.gnm1.ann1.g1	medtr.HM129.g1
medtr.HM185.gnm1.ann1.g1	medtr.HM185.g1
medtr.HM324.gnm1.ann1.g1	medtr.HM324.g1
medtr.R108_HM340.gnm1.ann1.BZG31_000s000010	BZG31_000s000010
medtr.R108.gnmHiC_1.ann1.MtrunR108HiC_000001	MtrunR108HiC000001

For these, I'd like a second opinion from @adf-ncgr and Joann if appropriate.
Regularity says the genes in the Zhou ... Young set should have the form "g#".
But I feel a little squeamish about this. These genomes were all released and described as a group, and the assemblies and accessions are referred to in the main paper (Zhou, Silverstein et al., 2017) by the five-character HM### string. That said, I don't see particular instances in the paper where particular genes
are discussed by name, so I don't feel I can make a strong argument for going beyond "g#".
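
The loop in note [2] relies on two perl substitutions to cover the two attribute orders seen in these files (Name directly after ID, or after Note). As a hedged alternative, the sketch below splits column 9 on ";" and picks out ID and Name wherever they occur. The function name `first_gene_id_name` is hypothetical, and the sketch assumes conformant GFF3 in which any literal ";" inside a value is percent-encoded.

```shell
# Hypothetical helper: print "ID<TAB>Name" for the first gene record on
# stdin, regardless of the order of the attributes in column 9.
first_gene_id_name() {
  awk -F'\t' '
    $3 == "gene" {
      id = ""; name = ""
      n = split($9, attr, ";")
      for (i = 1; i <= n; i++) {
        if (attr[i] ~ /^ID=/)   id   = substr(attr[i], 4)
        if (attr[i] ~ /^Name=/) name = substr(attr[i], 6)
      }
      print id "\t" name
      exit   # only the first gene record is needed
    }
  '
}
```

Inside the same loop, this would be called as `zcat "$file" | first_gene_id_name`.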

@sammyjava
Contributor Author

Thanks for the opinions, @StevenCannon-USDA. I'll put these into a to-do checkbox list here and start updating them after giving @adf-ncgr and @joannmudge a chance to object. As for the Zhou, Silverstein, et al. genomes, I agree that the Name attribute should be just the final piece (g1) for regularity, but we do sacrifice regularity at times for Higher Reasons.

  • cajca.C.cajan_19181 ==> C.cajan_19181
  • cicar.CDCFrontier.Ca_28062 ==> Ca_28062
  • cicar.ICC4958.Ca_00001 ==> Ca_00001
  • glyma.Lee.gnm1.ann1.GlymaLee.01G000100 ==> GlymaLee.01G000100
  • glyma.Wm82_ISU01.gnm2.ann1.GmISU01.01G000050 ==> GmISU01.01G000050 *[1]
  • glyso.PI483463.gnm1.ann1.GlysoPI483463.01G000100 ==> GlysoPI483463.01G000100
  • glyso.W05.gnm1.ann1.Glysoja.01G000001 ==> Glysoja.01G000001
  • lupal.Lalb_Chr00c01g0403611 ==> Lalb_Chr00c01g0403611
  • lupan.Lup027320 ==> Lup027320
  • tripr.gene2499 ==> gene2499
  • vigun.IT97K-499-35.Vigun01g000100 ==> Vigun01g000100
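
The renames in the checklist above follow one pattern: the Name loses whichever leading pieces it shares with the first four dot-separated tokens of the ID (gensp, accession, gnm, ann). The sketch below illustrates that pattern only; it is not the procedure actually used to update the Datastore. The function name `strip_gensp_prefix` is hypothetical, and it assumes that the accession token contains no dots. Names that do not start with the gensp token are left untouched.

```shell
# Hypothetical helper: strip the leading ID tokens from a Name.
#   $1 = full ID, e.g. cicar.CDCFrontier.gnm1.ann1.Ca_28062
#   $2 = current Name, e.g. cicar.CDCFrontier.Ca_28062
strip_gensp_prefix() {
  awk -v id="$1" -v name="$2" 'BEGIN {
    n = split(id, tok, ".")
    # Only act on Names that start with the gensp token of the ID.
    if (n >= 4 && index(name, tok[1] ".") == 1) {
      # Walk gensp, accession, gnm, ann in order; drop each one that leads.
      for (i = 1; i <= 4; i++) {
        lead = tok[i] "."
        if (index(name, lead) == 1) name = substr(name, length(lead) + 1)
      }
    }
    print name
  }'
}
```

Applied to a Medicago record such as ID `medtr.HM324.gnm1.ann1.g1` with Name `medtr.HM324.g1`, this rule would go all the way down to the bare "g#" form, which is the case still under discussion in this thread, so those files would need to be handled separately.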

@adf-ncgr
Contributor

OK, I'm in agreement with most of these, but I think the original Names for tripr were actually like "Tp57577_TGAC_v2_gene10066", not just "gene10066". Should we use that instead, as Phytozome and Ensembl seem to do? I'm still a little unclear about what the principle is here (originalism or aesthetics), though we once tried to pin it down here: legumeinfo/datastore-specifications#44

I would personally vote to keep medtr.HMxxx.g1 (or at least HMxxx.g1), which is seemingly no more problematic than having GlymaLee as part of a name; it just happens to also be identical with part of our full yuck system. But if we think g1 is better for any given Medicago accession, I think that implies that strict Name originalism is the principle here, no matter how bad we think the names are, meaning we should be stuck with Tp57577_TGAC_v2_gene10066.

But whatever we decide, let's take it as an opportunity to resolve the open questions in legumeinfo/datastore-specifications#44

@StevenCannon-USDA

I think we're close to convergence, and are down to the point of splitting hairs - which I guess is unavoidable.
Here's the spec as it stands:
https://github.com/legumeinfo/datastore-specifications/tree/main/Genus/species/annotations

And a key clause:

Where available in the original annotations, the names should come from those annotation files, with the possible exception of stripping type identifiers (e.g. "gene:"), or shortening exceptionally cumbersome auto-generated strings or lengthy prefixes added in the original annotation form if those prefixes do not contribute to the uniqueness of the names within the annotation file. Such exceptions will need to be considered on a case-by-case basis.

I would say that "exceptionally cumbersome strings ... if those prefixes do not contribute to the uniqueness of the names within the annotation file" is a fair description of Tp57577_TGAC_v2_gene10066. I mean: the Trifolium team has encoded Genus (T), species (p), accession (57577 I think), sequencing center (TGAC), and assembly version v2. I think this is a worthy case for an exception (shortening it to "gene10066"). But I won't fight anyone over it. If Sam is implementing, I say: go ahead and do what you think is right, and we'll be prepared to be delighted.

@adf-ncgr
Contributor

Thanks @StevenCannon-USDA, sounds like that clause is indeed the final refuge for the hair-splitters! I am in favor of shortening where there is substantial overlap with what full yuck is accomplishing. I think this would mean that we'd allow:
Name=Lcu.2RBY.1g010820 ID=lencu.CDC_Redberry.gnm2.ann1.1g010820
if we need to keep the IDs below some max length limit imposed by certain tools (e.g. BLAST)? Name here is "original". Or would we require that Name be 1g010820 if we invoked the "lengthy prefixes" clause on this one?

@StevenCannon-USDA

@adf-ncgr - yeah, I think I'd leave Lcu.2RBY.1g010820 (which means changing the ID in that case).

... When you're running an Airbnb and some guests insist on bringing all their own furniture.

@sammyjava
Contributor Author

This issue (see its title) is about the gensp prefixes, which it appears we all agree should be dropped. A protocol for how we populate the Name attribute otherwise is certainly a Good Thing. I don't see any argument for keeping the gensp prefix here, so I'll yank those from the appropriate places, and we can move the discussion of Names in general back to legumeinfo/datastore-specifications#44. I'll keep this issue open just so I can hit my checkboxes.

@sammyjava
Contributor Author

And yes, in the few cases where Name is full-yuck, I'll de-yuckify it down to the non-yuck portion. (example: glyma.Lee.gnm1.ann1.GlymaLee.01G000100 ==> GlymaLee.01G000100).
