Deposit VCF Zarr version of mafft aligned Viridian sequences and metadata to Figshare #262

Open
jeromekelleher opened this issue Dec 16, 2024 · 3 comments

Comments

@jeromekelleher
Owner

Sc2ts now uses VCF Zarr (preprint) as its input format and has methods for ingesting data in FASTA and TSV format to create a Zarr zipfile. The whole thing comes to about 300 MiB, so it's totally feasible to just deposit it on Figshare. (While this is a bit larger than the compressed FASTAs that Viridian ships, it's a lot more accessible: it gives fast access to the data along both the sample and variant axes, and keeps all the metadata in the same place.)
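
To make the "both axes" point a bit more concrete, here is a toy sketch in plain Python (it does not use the zarr API). It assumes the genotype matrix is stored as a grid of rectangular chunks and counts how many chunks a read has to touch. The matrix and chunk sizes are made-up numbers for illustration, not the real Viridian or sc2ts values.

```python
from math import ceil


def chunks_touched(chunk_shape, variants, samples):
    """Return the (row, col) indices of the chunks overlapping a request.

    `variants` and `samples` are non-empty, step-1 range objects over
    the two axes of a (variants x samples) matrix.
    """
    v_chunk, s_chunk = chunk_shape
    rows = range(variants.start // v_chunk, (variants.stop - 1) // v_chunk + 1)
    cols = range(samples.start // s_chunk, (samples.stop - 1) // s_chunk + 1)
    return {(r, c) for r in rows for c in cols}


# Hypothetical sizes, chosen only for illustration.
N_VARIANTS, N_SAMPLES = 30_000, 1_000_000
CHUNKS = (1_000, 10_000)  # (variants per chunk, samples per chunk)

total_chunks = ceil(N_VARIANTS / CHUNKS[0]) * ceil(N_SAMPLES / CHUNKS[1])

# All samples at one variant: a single row of chunks.
one_variant = chunks_touched(CHUNKS, range(500, 501), range(0, N_SAMPLES))

# All variants for one sample: a single column of chunks.
one_sample = chunks_touched(CHUNKS, range(0, N_VARIANTS), range(42, 43))

print(total_chunks, len(one_variant), len(one_sample))
```

With these made-up sizes, a read along either axis touches only a small fraction of the chunks, whereas a per-sample FASTA layout would need every record scanned to answer a per-variant query.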

@iqbal-lab would you be OK with us repackaging the Viridian data like this? It would make sc2ts much more reproducible, as the user could just download the full dataset in one go and start working immediately. It would also be helpful for me, as I would like to write a case study about the data in the VCF Zarr paper (a whole pandemic's worth of data in one file that can be accessed by variant or sample in milliseconds is pretty useful, in my book!).

Things we need:

  1. Fully documented pipeline for mafft alignment (@szhan can you comment here?)
  2. Some "description" fields to accompany the metadata would be very helpful, as it's not entirely obvious what some of the metadata fields mean or where they came from.
@szhan
Collaborator

szhan commented Dec 16, 2024

Notes about the MAFFT alignments are in #210.

@iqbal-lab

@martinghunt and I are fine with this, our data is all open and released. In terms of metadata, is there any chance you could dump a list of the fields you are asking about? I think it is all from the ENA, apart from date_tree, which is the result of an algorithm comparing dates with other sources (GenBank, GISAID). Not sure where the ENA metadata is defined.
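
A rough sketch of how such a field list could be dumped, with provenance attached and undocumented fields flagged. Only `date_tree` and its origin come from this thread; the other field names, the "ENA" default (taken from the "I think it is all from the ENA" remark above) and both helper functions are hypothetical and only illustrate the idea.

```python
def provenance_table(field_names, overrides, default_source="ENA"):
    """Build one row per metadata field with its source and description.

    Fields without an entry in `overrides` get the default source and an
    empty description, so they can be flagged for follow-up.
    """
    rows = []
    for name in sorted(field_names):
        info = overrides.get(name, {})
        rows.append(
            {
                "field": name,
                "source": info.get("source", default_source),
                "description": info.get("description", ""),
            }
        )
    return rows


def missing_descriptions(rows):
    """Return the names of fields that still lack a description."""
    return [row["field"] for row in rows if not row["description"]]


# Only `date_tree` is taken from the discussion; the rest are placeholders.
OVERRIDES = {
    "date_tree": {
        "source": "Viridian (derived)",
        "description": "Result of an algorithm comparing dates with "
        "other sources (GenBank, GISAID).",
    },
}
FIELDS = ["date_tree", "example_field_a", "example_field_b"]  # hypothetical

rows = provenance_table(FIELDS, OVERRIDES)
for row in rows:
    print(f"{row['field']}\t{row['source']}\t{row['description'] or 'TODO'}")

todo = missing_descriptions(rows)
```

The `todo` list is then the set of fields for which a description still needs to be written or looked up.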

@jeromekelleher
Owner Author

> @martinghunt and I are fine with this, our data is all open and released.

Fantastic. We'll be very careful to correctly attribute the origins and make it clear we're just repackaging. I'll ping you both when there's a figshare link to look at to make sure everyone is happy.
