Steps for updating the TIGA dataset from sources.
- R 3.6+; readr, data.table, igraph, muStat, RMySQL (Webapp: shiny, DT, shinyBS, shinysky, plotly)
- Python 3.7+; pandas, BioClients
- Java 8+; Jena, IU_IDSL_JENA
- Download latest files from the NHGRI-EBI GWAS Catalog. See FTP site for latest and all releases. Required files:
- gwas-catalog-studies_ontology-annotated.tsv
- gwas-catalog-associations_ontology-annotated.tsv
- Run commands in Go_TIGA_Workflow.sh, as described here.
- Download from Experimental Factor Ontology (EFO): * efo.owl
- Clean studies: * gwascat_gwas.R
- Clean, separate OR_or_beta into oddsratio, beta columns: * gwascat_assn.R
- Convert EFO OWL to TSV: * java -jar iu_idsl_jena-0.0.1-SNAPSHOT-jar-with-dependencies.jar
- From EFO TSV create GraphML: * efo_graph.R
- Clean traits: * gwascat_trait.R
- MAPPED GENES: Separate mapped into up-/down-stream. * snp2gene_mapped.pl
- Get iCite RCRs for studies via PMIDs: * python3 -m BioClients.icite.Client get_stats
- Get Ensembl annotations for mapped genes via EnsemblIds: * python3 -m BioClients.ensembl.Client get_info
- Get IDG TCRD gene annotations: * python3 -m BioClients.idg.tcrd.Client listTargets
- Run commands in Go_gwascat_DbCreate.sh, building the MySQL db. Writes file gwas_counts.tsv.
- Pre-process and filter. Studies, genes and traits may be removed due to insufficient evidence, with reasons recorded. * tiga_gt_prepfilter.R
- Provenance for gene-trait pairs (STUDY_ACCESSION, PUBMEDID). * tiga_gt_provenance.R
- Generate variables, statistics, evidence features for gene-trait pairs. * tiga_gt_variables.R
- Score and rank gene-trait pairs based on selected variables. * tiga_gt_stats.R
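As an illustration of the OR_or_beta separation step in the workflow above, here is a minimal Python sketch. It is not the logic of gwascat_assn.R. It assumes, following the GWAS Catalog file documentation, that beta coefficients carry a unit and an increase/decrease direction in the 95% CI text, while odds ratios do not. The function name, its arguments and the missing-value tokens are hypothetical.

```python
import re

# Assumption: a CI text mentioning "increase" or "decrease" marks a beta coefficient.
_BETA_HINT = re.compile(r"\b(increase|decrease)\b", re.IGNORECASE)

# Assumption: tokens treated as "no value reported".
_MISSING = {"", "NA", "NR"}


def split_or_beta(value, ci_text):
    """Return (oddsratio, beta) for one association row.

    Exactly one element is a float; both are None when the value is
    missing or not numeric.
    """
    if value is None or str(value).strip() in _MISSING:
        return (None, None)
    try:
        number = float(value)
    except ValueError:
        return (None, None)
    if _BETA_HINT.search(ci_text or ""):
        return (None, number)
    return (number, None)
```

For example, `split_or_beta("0.08", "[0.05-0.11] unit increase")` puts the value in the beta slot, while a bare interval such as `"[1.1-1.4]"` puts it in the oddsratio slot.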
- TIGA web app requires files:
- gwascat_gwas.tsv
- filtered_genes.tsv
- filtered_studies.tsv
- filtered_traits.tsv
- gt_provenance.tsv.gz
- gt_stats.tsv.gz
- efo_graph.graphml.gz
- gwascat_release.txt
- efo_release.txt
- tcrd_info.tsv
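Before deploying the web app, it can help to confirm that every file in the list above is present. The following is a minimal sketch, not part of the TIGA code base; the function name and the idea of a single data directory are assumptions.

```python
from pathlib import Path

# File names copied from the list of files required by the TIGA web app.
REQUIRED_FILES = [
    "gwascat_gwas.tsv",
    "filtered_genes.tsv",
    "filtered_studies.tsv",
    "filtered_traits.tsv",
    "gt_provenance.tsv.gz",
    "gt_stats.tsv.gz",
    "efo_graph.graphml.gz",
    "gwascat_release.txt",
    "efo_release.txt",
    "tcrd_info.tsv",
]


def missing_webapp_files(data_dir):
    """Return the required file names not found in data_dir, in list order."""
    data_dir = Path(data_dir)
    return [name for name in REQUIRED_FILES if not (data_dir / name).is_file()]
```

An empty return value means all required files were found in the given directory.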
- TIGA download files should be copied to the TIGA Download Directory for automated access.
- Split comma separated fields, convert to UTF-8 characters.
- Gene-trait association variables:
- N_study: studies supporting gene-trait association
- N_snp: SNPs involved with gene-trait association
- N_snpw (*): SNPs involved with gene-trait association weighted by genomic distance
- RCRAS (*): RCR Aggregated Score
- pValue (*): max SNP pValues
- OR: median(OR), where OR = odds ratio
- N_beta: count of supporting beta values
- geneNtrait: total traits associated with gene
- traitNgene: total genes associated with trait
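The count-based variables above can be illustrated with a short Python sketch. It is a sketch under stated assumptions, not the code in tiga_gt_variables.R: the input record keys (gene, trait, study, snp, oddsratio, beta) are hypothetical, and N_snpw and RCRAS are left out because their formulas are not given here.

```python
from collections import defaultdict
from statistics import median


def gene_trait_variables(assns):
    """Compute count-based variables per (gene, trait) pair.

    assns: iterable of dicts with hypothetical keys gene, trait, study,
    snp, oddsratio, beta (oddsratio and beta may be None).
    """
    studies = defaultdict(set)
    snps = defaultdict(set)
    odds = defaultdict(list)
    n_beta = defaultdict(int)
    traits_of_gene = defaultdict(set)
    genes_of_trait = defaultdict(set)

    for a in assns:
        key = (a["gene"], a["trait"])
        studies[key].add(a["study"])
        snps[key].add(a["snp"])
        if a.get("oddsratio") is not None:
            odds[key].append(a["oddsratio"])
        if a.get("beta") is not None:
            n_beta[key] += 1
        traits_of_gene[a["gene"]].add(a["trait"])
        genes_of_trait[a["trait"]].add(a["gene"])

    return {
        key: {
            "N_study": len(studies[key]),      # distinct supporting studies
            "N_snp": len(snps[key]),           # distinct SNPs involved
            "OR": median(odds[key]) if odds[key] else None,
            "N_beta": n_beta[key],             # count of supporting beta values
            "geneNtrait": len(traits_of_gene[key[0]]),
            "traitNgene": len(genes_of_trait[key[1]]),
        }
        for key in studies
    }
```

Sets are used so that a study or SNP reported in several association rows is counted once per gene-trait pair.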
- Gene-trait scores and ranks:
- meanRank: meanRank based on variables selected(*) by benchmark validation.
- meanRankScore: 100 - Percentile(meanRank)
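The meanRank and meanRankScore definitions can be sketched in Python as follows. This is an illustration, not the implementation in tiga_gt_stats.R. It uses the three variables marked (*), N_snpw, RCRAS and pValue, and makes several assumptions: larger N_snpw and RCRAS rank better, smaller pValue ranks better, ties share an average rank, pairs are ranked across the whole input rather than per trait, and Percentile(meanRank) is the percentage of pairs whose meanRank is less than or equal to the given one.

```python
from statistics import mean


def average_ranks(values, descending=False):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=descending)
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def mean_rank_scores(rows,
                     higher_is_better=("N_snpw", "RCRAS"),
                     lower_is_better=("pValue",)):
    """Return (meanRank list, meanRankScore list) for a list of dict rows."""
    per_variable = []
    for name in higher_is_better:
        per_variable.append(average_ranks([r[name] for r in rows], descending=True))
    for name in lower_is_better:
        per_variable.append(average_ranks([r[name] for r in rows]))

    n = len(rows)
    mean_ranks = [mean(col[i] for col in per_variable) for i in range(n)]
    # Assumed percentile definition: percent of pairs with meanRank <= this one.
    scores = [100.0 - 100.0 * sum(m <= mr for m in mean_ranks) / n
              for mr in mean_ranks]
    return mean_ranks, scores
```

Under these assumptions the best-ranked pair gets the highest meanRankScore and the worst-ranked pair gets 0.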
- MySQL database intended for transition toward IDG TCRD integration (currently not required for TIGA app).