Decouple data curation and upload #162
Additional context in Slack

A part of the data curation occurs during vdb/upload and tdb/upload, making it difficult to debug data curation issues and hard to share data curation steps with external groups.

Potential solutions

Comments

(2) was also separately brought up by @jameshadfield and +1'd by @trvrb in Slack, so that's the direction we should take here!

Having worked in fauna for the first time in a few years, this decoupling would be much welcome. For the work I was doing in avian flu (no titers!), I'd propose (3):

About time! 🥳

+1 for having a copy of the raw data from GISAID. I chatted with @tsibley about this today and we talked about the option to cut fauna/rethinkdb out of the process completely: weekly updates include uploading the downloaded GISAID metadata Excel + sequence FASTA to the private S3 bucket in some standard file structure. The ingest workflow would process only the latest files on S3 and update a curated cache. However, the ingest workflow would have the option to re-ingest all of the previously downloaded data if we were to make changes to the curation scripts. Note that this still does not solve the out-of-date issue with the cache, since we don't have an easy way to filter by updated date in GISAID.
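To make the "process only the latest files, with optional full re-ingest" idea concrete, here is a minimal sketch of how an ingest step could select its inputs from S3. The bucket name, the date-stamped key layout (`raw/gisaid/YYYY-MM-DD/…`), and the file names are hypothetical placeholders rather than any agreed-upon convention, and the sketch assumes boto3 with read access to the private bucket:

```python
"""Sketch: select raw GISAID files to ingest from S3.

Assumes a hypothetical layout like

    s3://<bucket>/raw/gisaid/YYYY-MM-DD/metadata.xls
    s3://<bucket>/raw/gisaid/YYYY-MM-DD/sequences.fasta

Bucket, prefix, and file names are illustrative only.
"""
import boto3

BUCKET = "nextstrain-private-data"   # hypothetical bucket name
PREFIX = "raw/gisaid/"               # hypothetical key prefix for weekly drops


def list_batches(s3):
    """Return the date-stamped batch prefixes under PREFIX, oldest first."""
    paginator = s3.get_paginator("list_objects_v2")
    batches = set()
    for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX, Delimiter="/"):
        for common in page.get("CommonPrefixes", []):
            batches.add(common["Prefix"])    # e.g. "raw/gisaid/2023-01-09/"
    return sorted(batches)                   # ISO dates sort chronologically


def keys_to_ingest(s3, reingest_all=False):
    """Latest weekly drop by default; every drop after curation-script changes."""
    batches = list_batches(s3)
    selected = batches if reingest_all else batches[-1:]
    paginator = s3.get_paginator("list_objects_v2")
    keys = []
    for batch in selected:
        for page in paginator.paginate(Bucket=BUCKET, Prefix=batch):
            keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys


if __name__ == "__main__":
    s3 = boto3.client("s3")
    for key in keys_to_ingest(s3):
        print(key)   # a real workflow would download these and run curation
```

One nice property of date-stamped prefixes is that each weekly drop stays immutable, so re-ingesting after a change to the curation scripts is just a matter of iterating over every prefix instead of only the last one.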