datacommonsorg · spiekos · Aug 18, 2023 · Nov 22, 2023 · Nov 22, 2023 · Nov 22, 2023
diff --git a/scripts/biomedical/diseasesAtJensenLab/README.md b/scripts/biomedical/diseasesAtJensenLab/README.md
@@ -0,0 +1,120 @@
+# Importing the Diseases at Jensen Lab
+
+## Table of Contents
+1. [About the Dataset](#about-the-dataset)
+   1. [Download Data](#download-data)
+   2. [Overview](#overview)
+   3. [Notes and Caveats](#notes-and-caveats)
+   4. [dcid Generation](#dcid-generation)
+   5. [License](#license)
+   6. [Dataset Documentation and Relevant Links](#dataset-documentation-and-relevant-links)
+2. [About the import](#about-the-import)
+   1. [Artifacts](#artifacts)
+      1. [Scripts](#scripts)
+      2. [tMCF Files](#tmcf-files)
+   2. [Import Procedure](#import-procedure)
+   3. [Tests](#tests)
+
+## About the Dataset
+
+[The DISEASES at Jensen Lab](https://diseases.jensenlab.org/About) is a "weekly updated web resource that integrates evidence on disease-gene associations from automatic text mining, manually curated literature, cancer mutation data, and genome-wide association studies." The data has evidence with confidence scores that facilitate comparison of the different types and sources of evidence.
+
+### Download Data
+
+The Diseases database can be downloaded from their official website found [here](https://diseases.jensenlab.org/Downloads). We downloaded and cleaned the full versions of the following files:
+
+- Text mining channel
+- Knowledge channel
+- Experiments channel
+
+### Overview
+
+"The files contain all links in the DISEASES database. All files start with the following four columns: gene identifier, gene name, disease identifier, and disease name. The knowledge files further contain the source database, the evidence type, and the confidence score. The experiments files instead contain the source database, the source score, and the confidence score. Finally, the textmining files contain the z-score, the confidence score, and a URL to a viewer of the underlying abstracts."
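As a rough illustration of the column layouts quoted above, the following sketch reads one channel's tab-separated rows into dictionaries. The column labels and the `read_channel` helper are assumptions made for this example; the import script itself reads the files with pandas, `header=None`, and its own column names.

```python
import csv
import io

# Illustrative labels for the columns described in the overview (assumed names).
SHARED = ["gene_id", "gene_name", "disease_id", "disease_name"]
COLUMNS = {
    "knowledge": SHARED + ["source_db", "evidence_type", "confidence"],
    "experiments": SHARED + ["source_db", "source_score", "confidence"],
    "textmining": SHARED + ["z_score", "confidence", "url"],
}


def read_channel(handle, channel):
    """Yield one dict per tab-separated row, keyed by the channel's columns."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        yield dict(zip(COLUMNS[channel], fields))


# One row taken from the experiments example in this README.
sample = io.StringIO(
    "ENSP00000004982\tHSPB6\tICD10:N04\tICD10:N04\tTIGA\tMeanRankScore = 92\t1.926\n"
)
rows = list(read_channel(sample, "experiments"))
print(rows[0]["gene_name"], rows[0]["confidence"])
```

All values are kept as strings here; converting scores to numbers is left to the caller.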
+
+### Notes and Caveats
+
+The disease for each association is indicated by either a Disease Ontology ID (DOID) or an ICD10 code. For gene associations with an ICD10 code, the original data repeats the association at every level of the ICD10 hierarchy. For example, the experiments file contains the following rows:
+
+| Gene Identifier | Gene Name | Disease Identifier | Disease Name | Source Database | Source Score | Confidence Score |
+| --- | --- | --- | --- | --- | ---  | ---  |
+| ENSP00000004982 | HSPB6 | ICD10:N | ICD10:N | TIGA | MeanRankScore = 92 | 1.926 |
+| ENSP00000004982 | HSPB6 | ICD10:N0 | ICD10:N0 | TIGA | MeanRankScore = 92 | 1.926 |
+| ENSP00000004982 | HSPB6 | ICD10:N04 | ICD10:N04 | TIGA | MeanRankScore = 92 | 1.926 |
+| ENSP00000004982 | HSPB6 | ICD10:N3 | ICD10:N3 | TIGA | MeanRankScore = 92 | 1.926 |
+| ENSP00000004982 | HSPB6 | ICD10:N39 | ICD10:N39 | TIGA | MeanRankScore = 92 | 1.926 |
+| ENSP00000004982 | HSPB6 | ICD10:N399 | ICD10:N399 | TIGA | MeanRankScore = 92 | 1.926 |
+| ENSP00000004982 | HSPB6 | ICD10:root | ICD10:root | TIGA | MeanRankScore = 92  | 1.926 |
+
+These rows show a cascading representation of the associated ICD10 codes: one branch of 'ICD10:N', 'ICD10:N0', 'ICD10:N04' and a second branch of 'ICD10:N3', 'ICD10:N39', 'ICD10:N399'. 'ICD10:N', 'ICD10:N0', 'ICD10:N3', and 'ICD10:root' do not correspond to any ICD10 code, so these lines were removed, along with any other line in which the text following 'ICD10:' was 'root' or only one or two characters long. For each association, only the most specific code on each branch was then kept as an association with the gene, here 'HSPB6'. In this case those are 'ICD10:N04' and 'ICD10:N399'; the remaining line with 'ICD10:N39' was discarded as a less specific reference than 'ICD10:N399'. Finally, the ICD10 codes were reformatted to follow the standard convention, in which a '.' follows the initial three characters matching the regex [A-Z][0-9][0-9]. Codes like 'ICD10:N399' were therefore converted to 'ICD10:N39.9' by inserting the missing '.'.
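The clean-up described above can be sketched for a single gene as follows. This is a minimal standalone illustration, not the import script's implementation: `most_specific_icd10` is a hypothetical helper that compares every code against every other code, whereas the script compares consecutive rows of a DataFrame.

```python
PREFIX = "ICD10:"


def most_specific_icd10(codes):
    """Return the most specific, dot-formatted ICD10 codes for one gene."""
    suffixes = [c[len(PREFIX):] for c in codes if c.startswith(PREFIX)]
    # Drop 'root' and placeholders with only one or two characters.
    valid = [s for s in suffixes if s != "root" and len(s) > 2]
    # Keep a code only if no other, longer code starts with it.
    leaves = [
        s for s in valid
        if not any(other != s and other.startswith(s) for other in valid)
    ]
    # Insert the '.' after the three-character category, e.g. N399 -> N39.9.
    return [PREFIX + (s if len(s) == 3 else s[:3] + "." + s[3:]) for s in leaves]


hspb6_codes = [
    "ICD10:N", "ICD10:N0", "ICD10:N04",
    "ICD10:N3", "ICD10:N39", "ICD10:N399", "ICD10:root",
]
print(most_specific_icd10(hspb6_codes))  # ['ICD10:N04', 'ICD10:N39.9']
```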
+
+### dcid Generation
+
+Dcids for DiseaseGeneAssociation nodes were generated in one of two formats:
+- `bio/DOID_<DOID>_<geneSymbol>_<dataSource>`
+- `bio/ICD10_<trailing_ICD10Code>_<geneSymbol>_<dataSource>`
+
+Here `<DOID>` and `<trailing_ICD10Code>` are the ids following the ':', `<geneSymbol>` is the Gene's gene symbol, and `<dataSource>` is 'experiments', 'knowledge', or 'textmining'. For example: `bio/DOID_0050177_SEMA3F_experiments` and `bio/DOID_0050736_SEMA3F_experiments`.
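A minimal sketch of the naming pattern above, assuming the disease identifier arrives as `DOID:<id>` or `ICD10:<code>`. The function name `association_dcid` is hypothetical and exists only for this example.

```python
DATA_SOURCES = {"experiments", "knowledge", "textmining"}


def association_dcid(disease_id, gene_symbol, data_source):
    """Build a DiseaseGeneAssociation dcid from its three components."""
    if data_source not in DATA_SOURCES:
        raise ValueError(f"unknown data source: {data_source}")
    # Split 'DOID:0050177' into the ontology prefix and the id after ':'.
    prefix, _, identifier = disease_id.partition(":")
    return f"bio/{prefix}_{identifier}_{gene_symbol}_{data_source}"


print(association_dcid("DOID:0050177", "SEMA3F", "experiments"))
# bio/DOID_0050177_SEMA3F_experiments
```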
+
+### License
+
+This dataset is under a Creative Commons CC BY license.
+
+### Dataset Documentation and Relevant Links
+
+
+The dataset is further documented in the following two studies:
+- [DISEASES 2.0: a weekly updated database of disease–gene associations from text mining and data integration](https://academic.oup.com/database/article/doi/10.1093/database/baac019/6554833?login=false)
+- [DISEASES: Text mining and data integration of disease-gene associations](https://www.sciencedirect.com/science/article/pii/S1046202314003831)
+
+## About the import
+
+### Artifacts
+
+#### Scripts
+
+##### Bash Script
+
+[`download.sh`](scripts/download.sh) downloads the experimental, manually curated, and text mining data from DISEASES at Jensen Lab.
+
+[`run.sh`](scripts/run.sh) converts raw data from DISEASES into csv files formatted for import into the Data Commons knowledge graph.
+
+[`tests.sh`](scripts/tests.sh) runs standard tests on CSV + tMCF pairs to check for proper formatting.
+
+##### Python Script
+
+[`format_disease_jensen_lab.py`](scripts/format_disease_jensen_lab.py) parses the raw .tsv files from DISEASES at Jensen Lab into well-formatted csv files with generated dcids and links to Gene and ICD10Code nodes.
+
+#### tMCF Files
+
+[`codingGenes-knowledge.tmcf`](tmcfs/codingGenes-knowledge.tmcf) contains the tmcf mapping to the csv of coding genes curated manually.
+
+[`nonCodingGenes-knowledge.tmcf`](tmcfs/nonCodingGenes-knowledge.tmcf) contains the tmcf mapping to the csv of non-coding genes curated manually.
+
+[`codingGenes-textmining.tmcf`](tmcfs/codingGenes-textmining.tmcf) contains the tmcf mapping to the csv of coding genes using textmining.
+
+[`nonCodingGenes-textmining.tmcf`](tmcfs/nonCodingGenes-textmining.tmcf) contains the tmcf mapping to the csv of non-coding genes using textmining.
+
+[`experiment.tmcf`](tmcfs/experiment.tmcf) contains the tmcf mapping to the csv of coding genes curated experimentally.
+
+### Import Procedure
+
+Download the most recent versions of the DISEASES experiments, knowledge (manually curated), and text mining files:
+
+```bash
+sh scripts/download.sh
+```
+
+Generate the cleaned CSVs, splitting each input file into separate csv files for coding and non-coding genes:
+
+```bash
+sh scripts/run.sh
+```
+
+### Tests
+
+Run the Data Commons Java import tool (invoked with `java -jar`) to check that all schema used in the import is present in the graph and that all referenced nodes are present in the graph, and to surface other warnings. Please note that empty tokens for some columns are expected as this reflects the original data. The imports create the linked Gene and ICD10Code nodes alongside the DiseaseGeneAssociation nodes that reference them, which resolves any missing reference warnings the test raises for these node types. Finally, not every disease has an associated ICD10Code, so this column is sometimes blank. Warnings concerning empty dcid references can therefore be ignored.
+
+To run tests:
+
+```bash
+sh tests.sh
+```
+
+This will generate an output file with the test results for each csv + tMCF pair.
diff --git a/scripts/biomedical/diseasesAtJensenLab/scripts/download.sh b/scripts/biomedical/diseasesAtJensenLab/scripts/download.sh
@@ -0,0 +1,8 @@
+#!/bin/bash
+
+# create the input directory if needed and stop if it cannot be entered
+mkdir -p input && cd input || exit 1
+
+# downloads the diseases at Jensen Lab files
+curl https://download.jensenlab.org/human_disease_textmining_full.tsv --output human_disease_textmining_full.tsv
+curl https://download.jensenlab.org/human_disease_knowledge_full.tsv --output human_disease_knowledge_full.tsv
+curl https://download.jensenlab.org/human_disease_experiments_full.tsv --output human_disease_experiments_full.tsv
diff --git a/scripts/biomedical/diseasesAtJensenLab/scripts/format_disease_jensen_lab.py b/scripts/biomedical/diseasesAtJensenLab/scripts/format_disease_jensen_lab.py
@@ -0,0 +1,236 @@
+# Copyright 2024 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#      http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+Author: Suhana Bedi
+Date: 08/12/2023
+Edited By: Samantha Piekos
+Last Edited: 02/29/24
+Name: format_disease_jensen_lab
+Description: converts three input .tsv files from Diseases at
+Jensen Lab into output .csv files with formatted dcids,
+NonCoding RNA types, and gene ensemblIDs.
+@file_input: input .tsv files from Diseases at Jensen Lab
+@file_output: formatted .csv files for Diseases at Jensen Lab
+"""
+
+# load environment
+import sys
+import numpy as np
+import pandas as pd
+import time
+
+
+# declare universal variables
+HGNC_DICT = {'HGNC:9982':'RFX1', 'HGNC:9979':'RFPL2'}
+
+
+def filter_for_lowest_ICD10_level(df):
+	# initiate values
+	indices_to_drop = []
+	previous_ICD10 = ''
+	previous_gene = ''
+	previous_index = -1
+	# check if ICD10 code is more specific than previous row for the same gene
+	# if it is add the previous ICD10 code index to the list of indices to drop
+	for index, row in df.iterrows():
+		current_ICD10 = row['ICD10']
+		current_gene = row['Gene']
+		if current_gene == previous_gene and current_ICD10.startswith(previous_ICD10):
+			indices_to_drop.append(previous_index)
+		# update reference values
+		previous_ICD10 = current_ICD10
+		previous_gene = current_gene
+		previous_index = index
+	# drop rows with less specific ICD10 codes for a given gene
+	df.drop(indices_to_drop, axis=0, inplace=True)
+	return df
+
+
+def fix_ICD10_formatting(df, col):
+    # if string in specified column is 10 characters or longer
+    # add '.' in the 10th position (e.g. 'ICD10:N399' -> 'ICD10:N39.9')
+    mask = df[col].str.len() >= 10
+    df.loc[mask, col] = df.loc[mask, col].str[:9] + '.' + df.loc[mask, col].str[9:]
+    return df
+
+
+def format_icd10_code_dcids(df):
+	df_doid = df.dropna(subset=['DOID'])  # make a doid specific table
+	## filter out ICD-10 codes based on true existence - remove codes like C1, C2 and root
+	df['count'] = np.where(df['ICD10'] == df['ICD10'],df['ICD10'].str.split(':'),np.nan)
+	df['count'] = np.where(df['count'] == df['count'],df['count'].str[1],np.nan)
+	# remove non-existing ICD-10 codes
+	df = df[df['count']!='root']
+	df.loc[:, 'count'] = df['count'].str.len()
+	df = df[df['count'] > 2]
+	df = filter_for_lowest_ICD10_level(df)
+	# fix ill formatted ICD10 codes
+	df = fix_ICD10_formatting(df, 'Disease')
+	df = fix_ICD10_formatting(df, 'Disease_Name')
+	df = fix_ICD10_formatting(df, 'ICD10')
+	# join back with the doid rows
+	df_final = pd.concat([df, df_doid]).sort_index()
+	return df_final
+
+
+def format_doid_icd(df):
+	df['DOID'] = np.where(df['Disease'].str.contains('DOID'),df['Disease'],np.nan)
+	df['ICD10'] = np.where(df['Disease'].str.contains('ICD10'),df['Disease'],np.nan)
+	## identify ensemblIDs and set as new column
+	df['ensemblID'] = np.where(df['Id'].str.contains('ENSP00000'),df['Id'],np.nan)
+	## format icd10 dcids filtering for the most specific ICD10 code
+	df = format_icd10_code_dcids(df)
+	return df
+
+
+def check_for_illegal_charc(s):
+    """Checks for illegal characters in a string and prints an error statement if any are present
+    Args:
+        s: target string that needs to be checked
+
+    """
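+    # every character needs its own comma-separated entry; adjacent string
+    # literals without a comma are silently concatenated into one string
+    list_illegal = ["'", "*", ">", "<", "@", "]", "[", "|", ":", ";", " "]
+    if any(x in s for x in list_illegal):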
+        print('Error! dcid contains illegal characters!', s)
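Python joins adjacent string literals into one string, so the commas in a list such as `list_illegal` determine how many entries it holds. The short sketch below, with a hypothetical `has_illegal` helper mirroring the check above, shows how an omitted comma changes which characters are detected.

```python
# Two literals with no comma between them become a single two-character entry.
merged = ["*" ">", ";" " "]
# With commas, each character is its own entry.
separate = ["*", ">", ";", " "]


def has_illegal(s, illegal):
    """Return True if any entry of `illegal` occurs as a substring of `s`."""
    return any(entry in s for entry in illegal)


print(merged)                            # ['*>', '; ']
print(has_illegal("bio/A>B", merged))    # False: only the pair '*>' is searched for
print(has_illegal("bio/A>B", separate))  # True
```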
+
+
+def check_for_dcid(row):
+    check_for_illegal_charc(str(row['dcid']))
+    check_for_illegal_charc(str(row['GeneDcid']))
+    check_for_illegal_charc(str(row['ICD10']))
+    return row
+
+
+def format_disease_gene_cols(df, data_type):
+    df['Gene'] = df['Gene'].map(HGNC_DICT).fillna(df['Gene'])
+    df['Gene'] = df['Gene'].str.replace('@', '')
+    df['GeneDcid'] = 'bio/' + df['Gene'].astype(str)
+    df['GeneDcid'] = df['GeneDcid'].str.replace('-', '_')
+    df['DOID'] = 'dcid:bio/' + df['DOID'].str.replace(':', '_')
+    df['DOID'] = df['DOID'].replace('dcid:bio/nan', np.nan)
+    df['ICD10'] = df['ICD10'].str.replace(':', '/')
+    df['DiseaseDcid'] = df['DOID'].fillna('dcid:'+df['ICD10'])
+    # separate the gene symbol and data source with '_' to match the documented dcid format
+    df['dcid'] = 'bio/' + df['Disease'] + '_' + df['Gene'] + '_' + data_type
+    df['dcid'] = df['dcid'].str.replace(':', '_')
+    return df
+
+
+def format_RNA_type(df_tm):
+	gene_list = ['orf', 'ZNF']
+	sno_rna_list = ['sno', 'SNOR', 'SCAR']
+	linc_rna_list = ['LINC', 'linc', 'MIR']
+	df_tm['RNA_type'] = 'dcs:NonCodingRNATypeLongNonCodingRNA'
+	df_tm['RNA_type'] = np.where(df_tm["Gene"].str.contains('|'.join(gene_list)),'Gene', df_tm['RNA_type'])
+	df_tm['RNA_type'] = np.where(df_tm["Gene"].str.contains('rRNA'),'dcs:NonCodingRNATypeRibosomalRNA', df_tm['RNA_type'])
+	df_tm['RNA_type'] = np.where(df_tm["Gene"].str.contains('|'.join(linc_rna_list)),'dcs:NonCodingRNATypeLongIntergenicNonCodingRNA', df_tm['RNA_type'])
+	df_tm['RNA_type'] = np.where(df_tm["Gene"].str.contains('|'.join(sno_rna_list)),'dcs:NonCodingRNATypeSmallNucleolarRNA', df_tm['RNA_type'])
+	df_tm['RNA_type'] = np.where(df_tm["Gene"].str.contains('miR'),'dcs:NonCodingRNATypeMicroRNA', df_tm['RNA_type'])
+	df_tm['RNA_type'] = np.where(df_tm["Gene"].str.contains('circ'),'dcs:NonCodingRNATypeCircularRNA', df_tm['RNA_type'])
+	df_tm['RNA_type'] = np.where(df_tm["Gene"].str.contains('pRNA'),'dcs:NonCodingRNATypePromoterAssociatedRNA', df_tm['RNA_type'])
+	return df_tm
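Because each `np.where` call overwrites the result of the previous one, a gene name matching several patterns receives the type assigned by the last matching rule. The pure-Python sketch below mirrors that precedence with plain substring checks; `RULES` and `classify` are names invented for this illustration.

```python
# Rules in the same order as the np.where calls; later rules override earlier ones.
RULES = [
    (("orf", "ZNF"), "Gene"),
    (("rRNA",), "dcs:NonCodingRNATypeRibosomalRNA"),
    (("LINC", "linc", "MIR"), "dcs:NonCodingRNATypeLongIntergenicNonCodingRNA"),
    (("sno", "SNOR", "SCAR"), "dcs:NonCodingRNATypeSmallNucleolarRNA"),
    (("miR",), "dcs:NonCodingRNATypeMicroRNA"),
    (("circ",), "dcs:NonCodingRNATypeCircularRNA"),
    (("pRNA",), "dcs:NonCodingRNATypePromoterAssociatedRNA"),
]
DEFAULT = "dcs:NonCodingRNATypeLongNonCodingRNA"


def classify(gene):
    """Return the RNA type for a gene name; the last matching rule wins."""
    result = DEFAULT
    for substrings, rna_type in RULES:
        if any(s in gene for s in substrings):
            result = rna_type
    return result


for name in ["C1orf112", "LINC00001", "hsa-miR-21", "XIST"]:
    print(name, classify(name))
```

The matching is case-sensitive, so `MIR` and `miR` fall under different rules, as they do in the function above.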
+
+
+def format_dcids(df, data_type):
+	df = format_doid_icd(df)
+	df = format_disease_gene_cols(df, data_type)
+	df = df.apply(lambda x: check_for_dcid(x),axis=1)
+	return df 
+
+
+def format_data_type_specific_info(df, data_type):
+	if data_type == 'experiments':
+		df['associationType'] = 'dcs:AssociationTypeExperiment'
+		df['source-score'] = df['source-score'].str.split('=')
+		df['source-score'] = np.where(df['source-score'] == df['source-score'],df['source-score'].str[1],np.nan)
+		df.update('"' +
+				  df[['Disease_Name', 'score-db']].astype(str) + '"')
+	if data_type=='knowledge':
+		df['associationType'] = 'dcs:AssociationTypeManualCuration'
+		df.update('"' +
+				  df[['Disease_Name', 'score-db']].astype(str) + '"')
+	if data_type =='textmining':
+		df['associationType'] = 'dcs:AssociationTypeTextMining'
+		df.update('"' +
+				  df[['Disease_Name', 'url']].astype(str) + '"')
+	return df
+
+
+def clean_data(df, data_type):
+	df_tm = df
+	searchfor = ['ENSP00', 'LINC', 'linc'] ## filter out non coding RNAs
+	df = df[df['Id'].str.contains("ENSP00")]
+	df = df[~df.Gene.str.contains('|'.join(searchfor))]
+	df_tm = df_tm[~df_tm.isin(df)].dropna() ## df with only non coding RNAs
+	df_tm = df_tm[~df_tm['Gene'].str.contains("chr")]
+	df_tm = df_tm[~df_tm['Gene'].str.contains("ENSP00")]
+	df = format_dcids(df, data_type)
+	df_tm = format_dcids(df_tm, data_type)
+	df_tm = format_RNA_type(df_tm) ## assign an RNA type to each non coding RNA row
+	df_gene = df_tm.loc[df_tm['RNA_type']=='Gene'].copy() ## rows flagged as genes rather than non coding RNAs
+	df_tm = df_tm[~df_tm['RNA_type'].str.contains("Gene")]
+	df_gene.drop(['RNA_type'],axis=1,inplace=True)
+	df = pd.concat([df, df_gene]).reset_index(drop=True) ## public replacement for the private DataFrame._append
+	df = format_data_type_specific_info(df, data_type)
+	df_tm = format_data_type_specific_info(df_tm, data_type)
+	return df, df_tm
+
+
+def generate_column_names(data_type):
+	# return column names corresponding to the data type of the file
+	col_names = []
+	if data_type == 'experiments':
+		col_names= ['Id', 'Gene', 'Disease', 'Disease_Name', 'score-db', 'source-score', 'confidence']
+	if data_type == 'knowledge':
+		col_names =  ['Id', 'Gene', 'Disease', 'Disease_Name', 'score-db', 'evidence', 'confidence']
+	if data_type == 'textmining':
+		col_names = ['Id', 'Gene', 'Disease', 'Disease_Name', 'z-score', 'confidence', 'url']
+	return col_names
+
+
+def write_df_to_csv(df, data_type, coding=True):
+	# check if df is empty
+	if df.shape[0] == 0:
+		return
+	# write filepath for output file
+	if coding:
+		filepath = 'CSVs/codingGenes-' + data_type + '.csv'
+	else:
+		filepath = 'CSVs/nonCodingGenes-' + data_type + '.csv'
+	# write df to csv file
+	df.to_csv(filepath, doublequote=False, escapechar='\\')
+	return
+
+
+def format_csv(data_type):
+	start_time = time.time()
+	filepath = 'input/human_disease_'+data_type+'_full.tsv'
+	col_names = generate_column_names(data_type)
+	df = pd.read_csv(filepath, sep = '\t', header=None)
+	df.columns = col_names
+	df_coding_genes, df_non_coding_genes = clean_data(df, data_type)
+	write_df_to_csv(df_coding_genes, data_type)
+	write_df_to_csv(df_non_coding_genes, data_type, coding=False)
+	processing_time = "%s seconds" % round((time.time() - start_time), 2)
+	print('Finished processing ' + data_type + ' data in ' + processing_time + '!')
+
+
+def main():
+	format_csv('experiments')
+	format_csv('knowledge')
+	format_csv('textmining')
+
+
+if __name__ == '__main__':
+    main() 
+
diff --git a/scripts/biomedical/diseasesAtJensenLab/scripts/run.sh b/scripts/biomedical/diseasesAtJensenLab/scripts/run.sh
@@ -0,0 +1,6 @@
+#!/bin/bash
+
+mkdir -p CSVs
+
+# runs the script 
+python3 scripts/format_disease_jensen_lab.py