Jira | Dataset | Source | Scheduled / One-time |
---|---|---|---|
COV-24 | Johns Hopkins Data | here | Scheduled |
COV-12 | IDPH County-level data | here, here and here | Scheduled |
COV-79 | IDPH Zipcode data | here | Scheduled |
COV-273 | IDPH Facility data | here (JSON) | Scheduled |
COV-1014 | IDPH Regional ICU Capacity | here (JSON) | Scheduled |
COV-1019 | IDPH Hospital Utilization | here (JSON) | Scheduled |
COV-720 | IDPH Vaccine | here | Scheduled |
COV-925 | IDPH Vaccine to S3 | IDPH Vaccine data | Scheduled |
COV-18 | nCOV2019 | here | One-time |
COV-97 | DS4C | Kaggle | One-time |
COV-126 | DSCI | Kaggle | One-time |
COV-172 | DSFSI | here | One-time |
COV-192 | OWID2 | here | Scheduled |
COV-237 | Chicago Neighborhoods Data | here (JSON) | Scheduled |
COV-220 | COXRAY | Kaggle | One-time |
COV-422 | SSR | Controlled data | One-time (for now) |
COV-450 | VAC-TRACKER | here | Scheduled |
COV-453 | CHESTX-RAY8 | here | One-time |
COV-521 | ATLAS | here | One-time |
COV-465 | NCBI-FILE | bucket | Scheduled |
COV-482 | NCBI-MANIFEST | bucket | Scheduled |
COV-465 | NCBI | bucket | Scheduled |
COV-532 | COM-MOBILITY | here | Scheduled |
To deploy the daily/weekly ETLs, use the following setup in the adminVM crontab:
crontab -e
And add the following:
USER=<username with submission access>
S3_BUCKET=<name of bucket to upload data to>
# Sample format for a new job is as follows. Please create a new job for an ETL in the same format and make sure its execution does not overlap with any other jobs (to avoid causing overload).
0 6 * * * (if [ -f $HOME/cloud-automation/files/scripts/covid19-etl-job.sh ]; then JOB_NAME=jhu bash $HOME/cloud-automation/files/scripts/covid19-etl-job.sh; else echo "no covid19-etl-job.sh"; fi) > $HOME/covid19-etl-jhu-cronjob.log 2>&1
Note: The time in adminVM is in UTC.
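Because the schedule fields are interpreted in UTC, it can help to convert the desired local run time before writing a crontab entry. The following is a minimal sketch, not part of covid19-tools: the helper name local_to_utc_cron is hypothetical, and it assumes you supply the local UTC offset yourself (the offset changes with daylight saving time, so the result may need adjusting twice a year).

```python
from datetime import datetime, timedelta, timezone


def local_to_utc_cron(hour, minute, utc_offset_hours):
    """Return the 'minute hour * * *' cron fields, in UTC, for a daily local time.

    utc_offset_hours is the local offset from UTC (for example -6 for US
    Central Standard Time). The date used below is arbitrary: only the
    time of day matters for a daily schedule.
    """
    local_tz = timezone(timedelta(hours=utc_offset_hours))
    local = datetime(2021, 1, 1, hour, minute, tzinfo=local_tz)
    utc = local.astimezone(timezone.utc)
    return f"{utc.minute} {utc.hour} * * *"


# 01:00 at UTC-6 corresponds to 07:00 UTC.
print(local_to_utc_cron(1, 0, -6))  # 0 7 * * *
```

The returned string replaces the first five fields of the sample job line; the command part stays unchanged.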
This is a local-only ETL: it requires the data to be available locally.
Before running the ETL, download the data, which is available here and requires a Kaggle account.
The content of the archive should go into the folder ./data (this can be changed via COXRAY_DATA_PATH in coxray.py and coxray_file.py), resulting in the following structure:
covid19-tools
...
├── data
│ ├── annotations
│ │ └── ...
│ ├── images
│ │ └── ...
│ └── metadata.csv
...
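Before starting the ETL, it may be worth confirming that the extracted archive matches this layout. The following is a minimal sketch: the helper name missing_coxray_entries is hypothetical, and the default "./data" simply mirrors the documented default location rather than importing COXRAY_DATA_PATH from coxray.py.

```python
from pathlib import Path


def missing_coxray_entries(data_path="./data"):
    """Return the expected top-level entries that are absent under data_path."""
    root = Path(data_path)
    expected_dirs = ["annotations", "images"]
    expected_files = ["metadata.csv"]
    # Directories and files are checked separately so that a file named
    # "images", for example, is still reported as missing.
    missing = [d for d in expected_dirs if not (root / d).is_dir()]
    missing += [f for f in expected_files if not (root / f).is_file()]
    return missing


if __name__ == "__main__":
    missing = missing_coxray_entries()
    if missing:
        print("Missing from ./data:", ", ".join(missing))
    else:
        print("Data folder layout looks complete.")
```

The check only covers the top-level entries shown in the tree; it does not validate the contents of annotations or images.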
The ETL consists of two parts: COXRAY_FILE for file upload and COXRAY for metadata submission.
COXRAY_FILE should run first; it uploads the files.
COXRAY should run after COXRAY_FILE; it creates the clinical data and links it to the files in indexd.
This is a local-only ETL: it requires the data to be available locally.
Before running the ETL, download the data, which is available here.
The repository should be cloned into the folder ./data (this can be changed via CHESTXRAY8_DATA_PATH in chestxray8.py), resulting in the following structure:
covid19-tools
...
├── data
│ ├── COVID-19
│ │ ├── X-Ray Image DataSet
│ │ │ ├── No_findings
│ │ │ ├── Pneumonia
│ │ │ └── Pneumonia
...
There are 3 NCBI ETL processes:
- NCBI_MANIFEST: Index virus sequence object data in indexd.
- NCBI_FILE: Split the clinical metadata into multiple files by accession numbers, and index the files in indexd.
- NCBI: Submit NCBI clinical data to the graph by creating metadata records for the files indexed by NCBI_FILE.
While either NCBI_MANIFEST or NCBI_FILE can run first, NCBI needs to run last because it needs the indexd information from the other two. It is common for object data to become available before the associated metadata, so the NCBI_MANIFEST job might index files that we don't have metadata for yet. In that case, the files are not linked to the graph.
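The ordering constraint above can be expressed as a small dependency check. This is a sketch only: the DEPENDS_ON mapping and the run_order helper are hypothetical and are not part of covid19-tools. They encode just the rule stated above, that NCBI runs after both NCBI_MANIFEST and NCBI_FILE. It assumes Python 3.9 or later for the standard-library graphlib module.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each job lists the jobs that must finish first.
DEPENDS_ON = {
    "NCBI_MANIFEST": set(),
    "NCBI_FILE": set(),
    "NCBI": {"NCBI_MANIFEST", "NCBI_FILE"},
}


def run_order(depends_on=DEPENDS_ON):
    """Return one valid execution order that respects the dependencies."""
    return list(TopologicalSorter(depends_on).static_order())


order = run_order()
# NCBI always comes last; the relative order of the other two is not fixed.
print(order)
```

When scheduling the three jobs in crontab, the same rule means giving NCBI a start time late enough that the other two jobs have finished.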
The input data for NCBI_MANIFEST is available in the public bucket sra-pub-sars-cov2.
The input data for NCBI and NCBI_FILE is available in the public bucket sra-pub-sars-cov2-metadata-us-east-1, with the following structure:
covid19-tools
...
├── sra-pub-sars-cov2-metadata-us-east-1
│ ├── contigs
│ │ └── contigs.json
│ ├── pipetide
│ │ └── pipetide.json
...
Deployment: The NCBI ETL needs a Google Cloud setup to access the BigQuery public table. For Gen3, put the credential in Gen3Secrets/g3auto/covid19-etl/default.json
Notes:
- An accession number is expected to be in the format [SDE]RR\d+: SRR for data submitted to NCBI, ERR for EMBL-EBI (European Molecular Biology Laboratory), and DRR for DDBJ (DNA Data Bank of Japan).
- The NCBI_MANIFEST ETL uses the last_submission_identifier field of the project node to keep track of the last submission datetime. That prevents the ETL from checking and re-indexing files that were already indexed.
- A virus sequence run taxonomy record without a matching submitter id in virus sequence is linked to CMC only; otherwise it is linked to both CMC and virus sequence.
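The accession number format described in the first note can be checked with a regular expression. The following is a minimal sketch: the helper name accession_source is hypothetical, and the sample accession values are made up for illustration.

```python
import re

# Pattern from the note above: one of S, D, or E, then "RR", then digits.
ACCESSION_RE = re.compile(r"[SDE]RR\d+")

# Submitting archive for each prefix, as listed in the note above.
SOURCES = {"SRR": "NCBI", "ERR": "EMBL-EBI", "DRR": "DDBJ"}


def accession_source(accession):
    """Return the submitting archive for a valid accession number, else None."""
    # fullmatch rejects strings with extra leading or trailing characters.
    if ACCESSION_RE.fullmatch(accession) is None:
        return None
    return SOURCES[accession[:3]]


print(accession_source("SRR12345678"))  # NCBI
print(accession_source("XRR1"))         # None
```

Using fullmatch rather than search is a deliberate choice here: it treats values such as file names that merely contain an accession number as invalid.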