TIGA Workflow

Steps for updating the TIGA dataset from sources.

Dependencies

  • R 3.6+; readr, data.table, igraph, muStat, RMySQL (Webapp: shiny, DT, shinyBS, shinysky, plotly)
  • Python 3.7+; pandas, BioClients
  • Java 8+; Jena, IU_IDSL_JENA

Steps

  1. Download latest files from the NHGRI-EBI GWAS Catalog. See FTP site for latest and all releases. Required files:
    • gwas-catalog-studies_ontology-annotated.tsv
    • gwas-catalog-associations_ontology-annotated.tsv
  2. Run the commands in Go_TIGA_Workflow.sh, as described in the following steps.
  3. Download from Experimental Factor Ontology (EFO):
    • efo.owl
  4. Clean studies:
    • gwascat_gwas.R
  5. Clean associations and separate OR_or_beta into oddsratio and beta columns:
    • gwascat_assn.R
  6. Convert EFO OWL to TSV:
    • java -jar iu_idsl_jena-0.0.1-SNAPSHOT-jar-with-dependencies.jar
  7. From EFO TSV create GraphML:
    • efo_graph.R
  8. Clean traits:
    • gwascat_trait.R
  9. Mapped genes: separate into upstream and downstream:
    • snp2gene_mapped.pl
  10. Get iCite RCRs for studies via PMIDs:
    • python3 -m BioClients.icite.Client get_stats
  11. Get Ensembl annotations for mapped genes via Ensembl IDs:
    • python3 -m BioClients.ensembl.Client get_info
  12. Get IDG TCRD gene annotations:
    • python3 -m BioClients.idg.tcrd.Client listTargets
  13. Run the commands in Go_gwascat_DbCreate.sh to build the MySQL database. This writes the file gwas_counts.tsv.
  14. Pre-process and filter. Studies, genes and traits may be removed due to insufficient evidence, with reasons recorded:
    • tiga_gt_prepfilter.R
  15. Provenance for gene-trait pairs (STUDY_ACCESSION, PUBMEDID):
    • tiga_gt_provenance.R
  16. Generate variables, statistics, and evidence features for gene-trait pairs:
    • tiga_gt_variables.R
  17. Score and rank gene-trait pairs based on selected variables:
    • tiga_gt_stats.R
  18. TIGA web app requires files:
    1. gwascat_gwas.tsv
    2. filtered_genes.tsv
    3. filtered_studies.tsv
    4. filtered_traits.tsv
    5. gt_provenance.tsv.gz
    6. gt_stats.tsv.gz
    7. efo_graph.graphml.gz
    8. gwascat_release.txt
    9. efo_release.txt
    10. tcrd_info.tsv
  19. TIGA download files should be copied to the TIGA Download Directory for automated access.
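
As an illustration of step 5, the sketch below separates a combined OR_or_beta value into oddsratio and beta. It is not the logic of gwascat_assn.R. It assumes a beta can be recognized by the words "increase" or "decrease" in the accompanying confidence-interval text, and the function name `split_or_beta` and the sample rows are hypothetical.

```python
import re

# Assumption: a beta coefficient is flagged by "increase" or "decrease" in the
# confidence-interval text; any other numeric value is treated as an odds ratio.
BETA_HINT = re.compile(r"\b(increase|decrease)\b", re.IGNORECASE)


def split_or_beta(or_or_beta, ci_text):
    """Return (oddsratio, beta) for one association row; the unused slot is None."""
    try:
        value = float(or_or_beta)
    except (TypeError, ValueError):
        return (None, None)  # missing or non-numeric effect size
    if BETA_HINT.search(ci_text or ""):
        return (None, value)
    return (value, None)


# Hypothetical rows: (OR_or_beta, confidence-interval text)
rows = [
    ("1.25", "[1.10-1.42]"),
    ("0.08", "[0.05-0.11] unit increase"),
    ("", ""),
]
for value, ci in rows:
    print(split_or_beta(value, ci))
```

Keeping the two outputs in separate columns lets later steps summarize odds ratios and count beta values independently, as the OR and N_beta variables in the Notes require.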
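
Before deploying the web app, it can help to confirm that every file listed in step 18 is present. The following is a minimal sketch using only the Python standard library. The function name `missing_files` and the idea of a single data directory are assumptions for illustration, not part of the TIGA scripts.

```python
from pathlib import Path

# File names copied from step 18 of this workflow.
REQUIRED_FILES = [
    "gwascat_gwas.tsv",
    "filtered_genes.tsv",
    "filtered_studies.tsv",
    "filtered_traits.tsv",
    "gt_provenance.tsv.gz",
    "gt_stats.tsv.gz",
    "efo_graph.graphml.gz",
    "gwascat_release.txt",
    "efo_release.txt",
    "tcrd_info.tsv",
]


def missing_files(data_dir):
    """Return the required file names that are not regular files in data_dir."""
    base = Path(data_dir)
    return [name for name in REQUIRED_FILES if not (base / name).is_file()]


# Report on the current directory (hypothetical location of the app data).
print(missing_files("."))
```

An empty list means all ten files were found.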

Notes

  • Comma-separated fields are split, and text is converted to UTF-8.
  • Gene-trait association variables:
    • N_study: studies supporting gene-trait association
    • N_snp: SNPs involved with gene-trait association
    • N_snpw(*): SNPs involved with gene-trait association weighted by genomic distance
    • RCRAS(*): RCR Aggregated Score
    • pValue(*): max SNP pValues
    • OR: median(OR), where OR = odds ratio
    • N_beta: count of supporting beta values
    • geneNtrait: total traits associated with gene
    • traitNgene: total genes associated with trait
  • Gene-trait scores and ranks:
    • meanRank: mean of the ranks of the variables selected(*) by benchmark validation.
    • meanRankScore: 100 - Percentile(meanRank)
  • The MySQL database is intended for the transition toward IDG TCRD integration (currently not required for the TIGA app).
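
The scoring in the notes above can be sketched as follows, using the three starred variables (N_snpw, pValue, RCRAS). This is an illustration, not the code in tiga_gt_stats.R. Several choices are assumptions: larger N_snpw and RCRAS and smaller pValue rank better, ties share the average rank, ranking is done over all pairs supplied, and Percentile(meanRank) is taken as the percentage of pairs with a strictly lower meanRank. The function names and sample values are hypothetical.

```python
def ranks(values, descending):
    """Rank values (1 = best); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=descending)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            result[order[k]] = shared
        i = j + 1
    return result


def mean_rank_scores(pairs):
    """Return (meanRank, meanRankScore) lists for a list of gene-trait dicts."""
    per_variable = [
        ranks([p["N_snpw"] for p in pairs], descending=True),   # assumed: larger is better
        ranks([p["pValue"] for p in pairs], descending=False),  # assumed: smaller is better
        ranks([p["RCRAS"] for p in pairs], descending=True),    # assumed: larger is better
    ]
    mean_rank = [sum(col) / len(col) for col in zip(*per_variable)]
    n = len(mean_rank)
    # Assumed percentile: share of pairs with a strictly lower (better) meanRank.
    score = [100 - 100 * sum(other < mr for other in mean_rank) / n for mr in mean_rank]
    return mean_rank, score


# Hypothetical gene-trait pairs.
pairs = [
    {"N_snpw": 3.0, "pValue": 1e-12, "RCRAS": 5.0},
    {"N_snpw": 1.5, "pValue": 1e-9, "RCRAS": 2.0},
    {"N_snpw": 0.5, "pValue": 1e-8, "RCRAS": 2.0},
]
mean_rank, score = mean_rank_scores(pairs)
for mr, s in zip(mean_rank, score):
    print(round(mr, 3), round(s, 2))
```

Under these assumptions the best-ranked pair receives a meanRankScore of 100, and scores decrease as meanRank grows.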