New genome and annotations for Chamaecrista fasciculata (two haplotypes) #208

Open
10 of 13 tasks
StevenCannon-USDA opened this issue Jun 28, 2024 · 7 comments

@StevenCannon-USDA

StevenCannon-USDA commented Jun 28, 2024

Main steps for adding new genome and annotation collections

Genus/species/collection names:

Haplotype 1:

  • Chamaecrista/fasciculata/genomes/ISC494698.gnm1.8Q19
  • Chamaecrista/fasciculata/annotations/ISC494698.gnm1.ann1.G7XW

Haplotype 2:

  • Chamaecrista/fasciculata/genomes/ISC494698.gnm1_hap2.G6BY
  • Chamaecrista/fasciculata/annotations/ISC494698.gnm1_hap2.ann1.WXZF

  • Add collection(s) to the Data Store, including commits to datastore-metadata

  • Validate the README(s)

  • Update about_this_collection.yml

  • Calculate AHRD functional annotations

  • Calculate gene family assignments (.gfa)

  • [N/A] Add to pan-gene set

  • Load relevant mine

  • Add BLAST targets

  • Incorporate into GCV

  • Update the jekyll collections listing

  • Update browser configs

  • Run BUSCO

  • Update DSCensor

  • Add LINKOUTS to datastore, refresh linkout service

@StevenCannon-USDA
Author

This one is back in play, following our discussion about handling haplotype-resolved assemblies.

@adf-ncgr
Contributor

@StevenCannon-USDA I should have the AHRDs for these two completed soon and will then move them from the annex to the main datastore. My preference would be to move them both there since it seems like it would make sense to include them both in at least some (if not all) downstream systems. I wanted to confirm with you, though, since I think you originally planned to leave secondary haplotypes in the annex. One very minor note: the procedure you're using for upstream processing seems to produce uncompressed GFF3 for the gene_models_main files, even though they have the .gz suffix. It's not really a problem, since we have to add the AHRD annotations and redo compression/indexing anyway, but it is a bit confusing when gunzip complains...

@StevenCannon-USDA
Author

"move them both there since it seems like it would make sense to include them both in at least some (if not all) downstream systems."

I agree now that moving them both to the main Data Store is best.

Thanks for the alert about the uncompressed GFF3s. I suspect that was due to some additional manual steps I took when the automated compression failed, I think because of an interrupted session.

@adf-ncgr
Contributor

OK, the data-content tasks (AHRD/BUSCO/gfa) should be complete, and I've moved the folders into the main datastore. Downstream steps will proceed as time permits, but if there are any you consider higher priority than others, let me know.

Regarding the compression, it definitely was an issue on both haplotypes, and I feel like I've seen it before, but I'm not 100% sure about that. In any case, if I see it again I'll let you know.

@StevenCannon-USDA
Author

OK, thank you.

I'll also investigate the compression issue -- at least next time I run the process.
The script responsible should be
/usr/local/www/data/datastore-specifications/scripts/compress_and_index.sh
and the code in question is:

for file in $filepath/*.f?a $filepath/*.gff3 $filepath/*tsv $filepath/*bed; do
  if test -f $file; then
    echo "Compressing $file"
    bgzip -l9 $file &
  fi
done
wait
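
One thing worth noting about the loop above: a bare `wait` returns 0 regardless of how the background jobs exited, so a failed or interrupted `bgzip` would go unreported. The following is only a sketch of a variant that waits on each job individually, not the script as it exists. The function name `compress_all` and the `COMPRESS_CMD` override are hypothetical, introduced for illustration.

```shell
#!/bin/sh
# Sketch only: a variant of the loop that records each background job's PID
# so a failed or interrupted compression is reported instead of being lost.
# COMPRESS_CMD is a hypothetical override; it defaults to the bgzip -l9 call
# quoted above.
COMPRESS_CMD=${COMPRESS_CMD:-"bgzip -l9"}

compress_all() {
  filepath=$1
  pids=""
  status=0
  for file in "$filepath"/*.f?a "$filepath"/*.gff3 "$filepath"/*tsv "$filepath"/*bed; do
    if test -f "$file"; then
      echo "Compressing $file"
      # Intentionally unquoted so the command and its options split into words.
      $COMPRESS_CMD "$file" &
      pids="$pids $!"
    fi
  done
  # A bare "wait" always returns 0; waiting on each PID exposes failures.
  for pid in $pids; do
    wait "$pid" || { echo "Compression job $pid failed" >&2; status=1; }
  done
  return $status
}

# Example (path is a placeholder):
# compress_all /path/to/collection || echo "at least one file was not compressed"
```

Quoting `"$file"` and `"$filepath"` also protects against paths that contain spaces, which the original loop does not.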

@adf-ncgr
Contributor

Well, that looks pretty straightforward. But now that I think about it some more, I don't think an interrupted session would explain the observed behavior, which is as if the original file were simply renamed with a .gz suffix. Is it possible that there's something else that just names it with a gz extension (in which case the code above wouldn't even see it there)?
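
The symptom described here (a plain-text file that merely carries a .gz suffix) can be detected before files move downstream. Below is a minimal sketch, assuming `gzip` is available; the function name `find_fake_gz` is hypothetical. It relies on `gzip -t`, which also accepts bgzip output because BGZF is gzip-compatible.

```shell
#!/bin/sh
# Hypothetical helper: list files in a directory that carry a .gz suffix
# but fail gzip's integrity test (for example, plain text renamed to .gz).
find_fake_gz() {
  dir=$1
  for file in "$dir"/*.gz; do
    # Skip the unexpanded pattern when no .gz files exist.
    [ -f "$file" ] || continue
    if ! gzip -t "$file" 2>/dev/null; then
      echo "$file"
    fi
  done
}

# Example (path is a placeholder):
# find_fake_gz /path/to/collection
```

An empty result means every .gz file in the directory passed the integrity test.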

@StevenCannon-USDA
Author

"Is it possible that there's something else that just names it with a gz extension"

Helpful suggestion/clue. You are right.
Here's the source of the problem. In the ds_souschef.pl configs chafa.ISC494698.gnm1_hap2.ann1.yml and chafa.ISC494698.gnm1.ann1.yml, the "to" suffix was given as gene_models_main.gff3.gz, but it should have been just gene_models_main.gff3, since the output is not gzipped by ds_souschef.pl.

  - 
    from: gene_strip.gff3.gz
    to: gene_models_main.gff3.gz
    description: "Gene models - main"

I'll plan to add checks for this in ds_souschef.pl once I've finished some other tasks.
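
Until such a check exists in ds_souschef.pl itself, something like the following could flag the mistake in a config before a run. This is a rough sketch, not the planned implementation: the function name `find_gz_targets` is hypothetical, and it assumes that no "to" target should end in .gz because ds_souschef.pl writes uncompressed output. It is a line-based match, so it would miss quoted values or other YAML layouts.

```shell
#!/bin/sh
# Hypothetical pre-flight check: print (with line numbers) the "to:" entries
# in a ds_souschef.pl config whose target name ends in .gz.
# Assumption: ds_souschef.pl does not gzip its output, so any such entry
# is suspect.
find_gz_targets() {
  grep -nE '^[[:space:]]*to:[[:space:]]*[^[:space:]]+\.gz[[:space:]]*$' "$1"
}

# Example:
# find_gz_targets chafa.ISC494698.gnm1.ann1.yml
```

Because it wraps `grep`, the function exits non-zero when no suspect entries are found, so it can be used directly in an `if` test.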

2 participants