Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New genome+annotations for Arachis stenosperma (GenBank annotation) #189

Open
10 of 12 tasks
StevenCannon-USDA opened this issue Jan 17, 2024 · 24 comments
Open
10 of 12 tasks
Assignees

Comments

@StevenCannon-USDA
Copy link

StevenCannon-USDA commented Jan 17, 2024

Main steps for adding new genome and annotation collections

Genus/species/collection names:

What are the collection types and names? Example:

  • Arachis/stenosperma/annotations/V10309.gnm1.ann2.CZRZ

  • Arachis/stenosperma/genomes/V10309.gnm1.PFL2

  • Add collection(s) to the Data Store (annex)

  • Validate the README(s)

  • Update about_this_collection.yml

  • Calculate AHRD functional annotations

  • Calculate gene family assignments (.gfa)

  • Add to pan-gene set

  • Load relevant mine

  • Add BLAST targets

  • Incorporate into GCV

  • Update the jekyll collections listing

  • Update browser configs

  • run BUSCO

@adf-ncgr
Copy link
Contributor

@StevenCannon-USDA are you sure about calling these gnm1.ann2 annotations? I guess @jd-campbell had started a gnm1.ann1 but since it was never made public (that I know of) it seems a little weird to start counting at 2. But then again it wouldn't be the first time versioning got weird, so just let me know if you think this best and I'll go with it.

@StevenCannon-USDA
Copy link
Author

@adf-ncgr - You are right. Thanks for catching this. I'll fix this upstream (probably not until this evening though).

@StevenCannon-USDA
Copy link
Author

StevenCannon-USDA commented Jan 18, 2024

Looking into this more closely:
@jd-campbell did an annotation on this assembly in 2021-2022. I believe we left it at the point of having BRAKER and Direct Information Hi- and Low confidence prediction, but with lack of clarity about how to pick a trusted set. So, this annotation was effectively shelved.

At that time (~2021), we had reserved to keys to use for stenosperma: PFL2 for gnm1 and CZRZ for gnm1.ann1.
However, because ann1 was never finalized, we never really used key CZRZ.

At this point, we could ...
(1) rename the GenBank annotation as gnm1.ann1
(2) use the GenBank annotation as currently prepared, as gnm1.ann2
... And further, could either mint a new annotation key for gnm1.ann2, or repurpose CZRZ.

Of those options, I am leaning toward the option of least resistance - which is to use gnm1.ann2.CZRZ for the GenBank annotation; and just live with the fact that there is likely never to be an ann1. This would be similar to the situation for Glycine max Wm82, where we have gnm1, gnm2, gnm4, gnm6.

I don't feel strongly about this. Recalculating everything by replaying the notes would only take 15 minutes.

I guess I'll offer the proposal above (the first annotation on gnm1 will be called ann2) in the spirit of a Request for Objections. Seeking opinions/votes particularly from @sdash-github, @jd-campbell, @adf-ncgr

@adf-ncgr
Copy link
Contributor

I don't object to 2), but my mild preference would be for 1). No preference on reusing the key or not. It's too bad that soybean didn't hop from gnm4 to gnm8, that would have been a nice numerical sequence (and the number of Wm82* genomes seems to grow exponentially no matter what!)
thanks for following up

@StevenCannon-USDA
Copy link
Author

I'd say the mild preference (for 1) wins. I'll recalculate the collections now.

@StevenCannon-USDA
Copy link
Author

OK: option 1 has been implemented. The annotations collection is now Arachis/stenosperma/V10309.gnm1.ann1 (again in the annex).

For the record: replaying the data-preparation notes took half an hour - so the curation job is getting faster (though of course preparing the config file and modifying the notes for a new data set will still be a significant job -- maybe half a day or so, barring data curveballs).

@adf-ncgr
Copy link
Contributor

Thanks Steven- will get it into the AHRD queue with the others

@adf-ncgr
Copy link
Contributor

OK, the AHRD/BUSCO/GFA stuff is done on this. Should we go ahead and move from annex to public? Some of the following steps will probably prefer if it's located at the final resting place (e.g. the browser config updates which will want to embed stable URLs in the config file). I don't think the primary datastore contents will be updated much if any following those steps, so seems like a logical time to get it over there, but let me know your thoughts. Same will apply to the other couple of genomes which I've got at more or less the same level of analysis now. Happy to do the moving honors if you agree that is a sensible point at which to do so, @StevenCannon-USDA

Also, one thing possibly worth discussing further on this particular dataset (and other future NCBI-sourced annotations), it looks like you've decided to go ahead with the idea of coining transcript/protein ids by tacking on .digit suffixes to the gene ids as is common practice outside of NCBI. I don't see any major issue with that, but I do think that the featid_map file ought to record the mapping from what they used (ie accession numbers) to what we're using, ie not:
LOC130969383.1 arast.V10309.gnm1.ann1.LOC130969383.1
as currently in the file but instead:
rna-XM_057895071.1 arast.V10309.gnm1.ann1.LOC130969383.1
or maybe just
XM_057895071.1 arast.V10309.gnm1.ann1.LOC130969383.1
since if we were linking to NCBI it would probably be the accession number without the rna- prefix that would be used. Thoughts?

@StevenCannon-USDA
Copy link
Author

StevenCannon-USDA commented Jan 19, 2024

Thanks for the quick work.
Moving them to public: sure. Feel free to do so - or do delegate back to me if you prefer.

About linking the LOC IDs back to the transcript accession IDs: I agree. I just wasn't up to it at the time. Dealing with the GenBank GFFs left me feeling beat up. I was slow to realize that not only do the mRNA IDs have no string-based correspondence with their parent gene IDs, but that different splice variants within a gene also have no string-based correspondence with one another:

cat genomic.gff | awk '$3~/gene|mRNA/ && $9~/LOC130944731/ {print $3,$9}' | cut -f1-2 -d';'
gene ID=gene-LOC130944731;Dbxref=GeneID:130944731
mRNA ID=rna-XM_057873233.1;Parent=gene-LOC130944731
mRNA ID=rna-XM_057873225.1;Parent=gene-LOC130944731

The (new) script that "fixes" this is rename_gff_mRNA_IDs.pl

To do next: I guess to have it also report the mapping ...

cat genomic.gff | awk '$3~/mRNA/ && $9~/LOC130944731/ {print $9}' | 
  perl -pe 's/ID=([^;]+);Parent=([^;]+);.+/$1\t$2/'
rna-XM_057873233.1	gene-LOC130944731
rna-XM_057873225.1	gene-LOC130944731

... and then teach ds_souschef to use that.

@adf-ncgr
Copy link
Contributor

Hi again- quick note about Arachis/stenosperma/genomes/V10309.gnm1.PFL2 for @StevenCannon-USDA; looks like there are some minor differences between what's already in the v2 area of the datastore and what's in the annex. For example the one in v2 has chromosome names chr01-chr10 whereas the one in the annex has Chr1-Chr10 (all with full yuck of course). Since the gff file matches the new version in the annex, I assume we want to replace the older one with the new one but maybe you could take a quick look and ensure that the differences are all intentional before moving ahead? thanks!

@StevenCannon-USDA
Copy link
Author

Good to check.
I do think we should use the new collection. These sequences come from GenBank. They should be the same as the older ones (the chromosome and scaffold sizes are indeed the same), but I think the principle is right. And the GFF feature correspondence matters of course.

@adf-ncgr
Copy link
Contributor

Axctually, I just remembered that changing the chromosome names will have at least one implication, namely that some genome alignments we did previously using that genome against duranensis will need to be redone. Not a big deal in and of itself, and I can't think of other things outside of BLAST that would use the un-annotated genomes. But it looks like the NCBI sequence report just refers to them as 1-10 so presumably adding Chr or chr%02d is a subjective choice on our end. I kind of like using zero-padding on genomes with >=10 chromosomes, and it looks like every other Arachis genome has settled on the chr%02d form, so if you don't object or have other reasons maybe I will sed them back to that (including the gff). Let me know if you disagree!

@StevenCannon-USDA
Copy link
Author

No objections. Better to do the patch than to have to re-run the alignments.

@adf-ncgr
Copy link
Contributor

Great, thanks for tolerating my obsession with trivial matters of consistency- I've made the changes for this and will proceed with the remaining tasks downstream.

@jd-campbell
Copy link

@adf-ncgr and @StevenCannon-USDA
I agree with labeling the NCBI annotation as ann1.
The arast annotation I created was a rough draft.

@adf-ncgr
Copy link
Contributor

@StevenCannon-USDA, getting an odd error (though claiming to be a non-error!) in the JBrowse2 in some regions:
NonError: "231: multi-line feature \"arast.V10309.gnm1.ann1.TRNAK-CUU\" has inconsistent types: \"tRNA\", \"gene\""
This seems to be due to the gene and tRNA feature that is its child having the same ID value. I think the handling of ncRNA genes can be a bit fiddly, and there may be reasons to segregate them out into different files (not that simply doing so would fully resolve this issue, or is a necessary condition of solving it). Let me know if I should try to formulate an RFO about this or file something as an issue against ds_souschef.

@StevenCannon-USDA
Copy link
Author

I think I would handle this as part of the pre-processing work - near line 90 in this file:
https://github.com/legumeinfo/datastore-specifications/blob/main/PROTOCOLS/ds_souschef_prep_examples/NCBI_GenBank/notes_arast.V10309.gnm1.ann1.sh

@adf-ncgr
Copy link
Contributor

OK, but what exactly are you proposing to do- just re-ID the tRNA features so that there is no collision with the gene IDs?
One concern I have with leaving ncRNA genes in the gene_models_main file is that unless they are tagged in some way that allows them to be filtered out easily (e.g. the biotype attributes that I think are used in the original genbank gff but have gotten stripped out in these versions), then they will end up as "orphans" in the GCV, since by definition they will not match a gene family (at least in the current scheme of things where we use proteins for gene families).

@StevenCannon-USDA
Copy link
Author

For the reasons you give, I think I would remove them (probably into another file).

@StevenCannon-USDA
Copy link
Author

I can tackle this if you would like -- but I probably won't get to it until next week.

@adf-ncgr
Copy link
Contributor

I'm happy to do it as post-processing on this one if you want to tackle making it part of the pre-processing in future.

@StevenCannon-USDA
Copy link
Author

I won't object. (Always happy to make you happy :-) )

adf-ncgr added a commit to legumeinfo/datastore-metadata that referenced this issue Feb 16, 2024
@adf-ncgr
Copy link
Contributor

OK, I've done the basic partitioning, although the separated non-coding file still has what JBrowse2 considers a "NonError". So if we offer that as a separate browser track, it will probably still need to get fixed. But I'm going to quit while I'm ahead...

@StevenCannon-USDA
Copy link
Author

Great. Thank you, and a good weekend to you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants