Decouple data curation and upload #162
Additional context in Slack

A part of the data curation occurs during vdb/upload and tdb/upload, making it difficult to debug data curation issues and hard to share data curation steps with external groups.

Potential solutions

Comments

(2) was also separately brought up by @jameshadfield and +1'd by @trvrb in Slack, so that's the direction we should take here!

Having worked in fauna for the first time in a few years, this decoupling would be much welcome. For the work I was doing in avian flu (no titers!), I'd propose (3):

About time! 🥳

+1 for having a copy of the raw data from GISAID. I chatted with @tsibley about this today and we talked about the option to cut fauna/rethinkdb out of the process completely: weekly updates include uploading the downloaded GISAID metadata Excel + sequence FASTA to the private S3 bucket in some standard file structure. The ingest workflow would process only the latest files on S3 and update a curated cache. However, the ingest workflow would have the option to re-ingest all of the previously downloaded data if we were to make changes to the curation scripts. Note that this still does not solve the out-of-date issue with the cache, since we don't have an easy way to filter by updated date in GISAID.
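To make the "process only the latest files, with optional full re-ingest" idea concrete, here is a minimal sketch of how an ingest step could select its inputs from S3. The bucket name, the date-stamped key layout (`raw/gisaid/YYYY-MM-DD/…`), and the file names are hypothetical placeholders rather than any agreed-upon convention, and the sketch assumes boto3 with read access to the private bucket:

```python
"""Sketch: select raw GISAID files to ingest from S3.

Assumes a hypothetical layout like

    s3://<bucket>/raw/gisaid/YYYY-MM-DD/metadata.xls
    s3://<bucket>/raw/gisaid/YYYY-MM-DD/sequences.fasta

Bucket, prefix, and file names are illustrative only.
"""
import boto3

BUCKET = "nextstrain-private-data"   # hypothetical bucket name
PREFIX = "raw/gisaid/"               # hypothetical key prefix for weekly drops


def list_batches(s3):
    """Return the date-stamped batch prefixes under PREFIX, oldest first."""
    paginator = s3.get_paginator("list_objects_v2")
    batches = set()
    for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX, Delimiter="/"):
        for common in page.get("CommonPrefixes", []):
            batches.add(common["Prefix"])    # e.g. "raw/gisaid/2023-01-09/"
    return sorted(batches)                   # ISO dates sort chronologically


def keys_to_ingest(s3, reingest_all=False):
    """Latest weekly drop by default; every drop after curation-script changes."""
    batches = list_batches(s3)
    selected = batches if reingest_all else batches[-1:]
    paginator = s3.get_paginator("list_objects_v2")
    keys = []
    for batch in selected:
        for page in paginator.paginate(Bucket=BUCKET, Prefix=batch):
            keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys


if __name__ == "__main__":
    s3 = boto3.client("s3")
    for key in keys_to_ingest(s3):
        print(key)   # a real workflow would download these and run curation
```

One nice property of date-stamped prefixes is that each weekly drop stays immutable, so re-ingesting after a change to the curation scripts is just a matter of iterating over every prefix instead of only the last one.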