Decouple data curation and upload #162

Open
joverlee521 opened this issue Sep 11, 2024 · 4 comments

@joverlee521
Contributor

Additional context in Slack

Part of the data curation happens during vdb/upload and tdb/upload, which makes it difficult to debug curation issues and hard to share the curation steps with external groups.

Potential solutions

  1. Detangle data curation and data upload within fauna.
  2. Start brand new ingest workflows for curation where the results are then optionally uploaded to fauna.
@joverlee521
Contributor Author

(2) was also brought up separately by @jameshadfield and +1'd by @trvrb in Slack, so that's the direction we should take here!

@jameshadfield
Member

Having worked in fauna for the first time in a few years, this decoupling would be very welcome. For the work I was doing in avian flu (no titers!) I'd propose (3):

  1. Use fauna to mirror GISAID (indexing on isolate_id and accession), i.e. fauna contains no curation at all. We then have ingest pipelines which start by downloading from fauna, curate the data, and then either use it directly or upload to S3.
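
A rough sketch of the split proposed above, assuming curation is a pure function over already-downloaded records and upload is an optional callback. Apart from `isolate_id` and `accession`, every name here (`curate`, `run_ingest`, the `strain` field, the whitespace and de-duplication rules) is hypothetical and stands in for whatever the real ingest workflow would do:

```python
def curate(raw_records):
    """Curate records without touching any database or network.

    Hypothetical rules for illustration only: strip whitespace from string
    fields and keep the last record seen for each (isolate_id, accession).
    """
    by_key = {}
    for raw in raw_records:
        # Build a new dict so the raw (mirrored) record is never modified.
        record = {
            field: value.strip() if isinstance(value, str) else value
            for field, value in raw.items()
        }
        by_key[(record["isolate_id"], record["accession"])] = record
    return list(by_key.values())


def run_ingest(raw_records, upload=None):
    """Curate first; upload only if a callback is supplied."""
    curated = curate(raw_records)
    if upload is not None:
        upload(curated)  # e.g. a function that writes to S3 (assumption)
    return curated
```

Because `curate` has no side effects, it can be run and debugged locally on a downloaded batch, or shared with external groups, without any access to the upload target.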

@j23414
Contributor

j23414 commented Dec 19, 2024

About time! 🥳

@joverlee521
Contributor Author

  1. Use fauna to mirror GISAID (indexing on isolate_id and accession), i.e. fauna contains no curation at all. We then have ingest pipelines which start by downloading from fauna, curate the data, and then either use it directly or upload to S3.

+1 for having a copy of the raw data from GISAID. I chatted with @tsibley about this today and we talked about the option to cut fauna/rethinkdb out of the process completely:

Weekly updates would include uploading the downloaded GISAID metadata Excel file and sequence FASTA to the private S3 bucket in some standard file structure. The ingest workflow would process only the latest files on S3 and update a curated cache. However, it would also have the option to re-ingest all of the previously downloaded data if we change the curation scripts.

Note that this still does not solve the out-of-date issue with the cache since we don't have an easy way to filter by updated date in GISAID.
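
To make the "latest files only, with an option to re-ingest everything" idea concrete, here is a minimal sketch of the file selection step. It assumes one possible layout, `gisaid/<YYYY-MM-DD>/<filename>`, which is purely hypothetical since no file structure has been decided; the function names and the `reingest_all` flag are also made up for illustration. It works on a plain list of object keys, so it makes no assumptions about how the bucket is listed:

```python
from collections import defaultdict


def group_by_batch(keys, prefix="gisaid/"):
    """Group keys by their dated batch folder (hypothetical layout)."""
    batches = defaultdict(list)
    for key in keys:
        if not key.startswith(prefix):
            continue  # ignore anything outside the assumed prefix
        batch, sep, filename = key[len(prefix):].partition("/")
        if sep and filename:
            batches[batch].append(key)
    return dict(batches)


def select_keys(keys, reingest_all=False, prefix="gisaid/"):
    """Return keys from the latest batch, or from every batch on re-ingest."""
    batches = group_by_batch(keys, prefix)
    if not batches:
        return []
    # ISO dates (YYYY-MM-DD) sort chronologically as plain strings.
    chosen = sorted(batches) if reingest_all else [max(batches)]
    return [key for batch in chosen for key in sorted(batches[batch])]
```

A normal weekly run would call `select_keys(keys)`; after a change to the curation scripts, `select_keys(keys, reingest_all=True)` would replay every stored download in date order.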
