Diseases at jensen lab #938

Closed
wants to merge 36 commits into from
36 commits
b83cb5c
feat: add diseases import files
Aug 18, 2023
48e41ce
Update format_disease_jensen_lab.py
spiekos Nov 22, 2023
fa431df
Update and rename genes-experiment.tmcf to experiment.tmcf
spiekos Nov 22, 2023
108db68
Update experiment.tmcf
spiekos Nov 22, 2023
d3ff3e1
Update format_disease_jensen_lab.py
spiekos Nov 30, 2023
065648e
Update and rename codingGenes_manual.tmcf to codingGenes-manual.tmcf
spiekos Nov 30, 2023
799c480
Update and rename nonCodingGenes_manual.tmcf to nonCodingGenes-manual…
spiekos Nov 30, 2023
4ab118b
Update format_disease_jensen_lab.py
spiekos Nov 30, 2023
ccb0c30
Update and rename codingGenes_textmining.tmcf to codingGenes-textMini…
spiekos Nov 30, 2023
b2c4256
Update and rename nonCodingGenes_textmining.tmcf to nonCodingGenes-te…
spiekos Nov 30, 2023
fce7a6c
Update README.md
spiekos Nov 30, 2023
dba39a0
Update README.md
spiekos Nov 30, 2023
35c27b2
Update format_disease_jensen_lab.py
spiekos Nov 30, 2023
bf3332f
Update run.sh
spiekos Nov 30, 2023
5f14f55
Update codingGenes-manual.tmcf
spiekos Nov 30, 2023
650e42d
Update format_disease_jensen_lab.py
spiekos Feb 24, 2024
c84235c
Update run.sh
spiekos Feb 24, 2024
6b05413
Update README.md
spiekos Feb 24, 2024
292ebdc
move to scripts subdirectory
spiekos Feb 24, 2024
995381d
move to scripts subdirectory
spiekos Feb 24, 2024
daea7fe
Merge branch 'master' into diseasesAtJensenLab
spiekos Feb 24, 2024
ee092a8
Add files via upload
spiekos Feb 24, 2024
458e1bb
Update scripts
spiekos Mar 5, 2024
58ddcfb
update tmcf files
spiekos Mar 5, 2024
27ec5f5
Delete scripts/biomedical/diseasesAtJensenLab/tmcfs/codingGenes-manua…
spiekos Mar 5, 2024
2dc32f7
Delete scripts/biomedical/diseasesAtJensenLab/tmcfs/nonCodingGenes-te…
spiekos Mar 5, 2024
fd65d15
Delete scripts/biomedical/diseasesAtJensenLab/tmcfs/nonCodingGenes-ma…
spiekos Mar 5, 2024
652b07e
Delete scripts/biomedical/diseasesAtJensenLab/tmcfs/codingGenes-textM…
spiekos Mar 5, 2024
35126ad
Update README.md
spiekos Mar 5, 2024
1211c7a
Update README.md
spiekos Mar 5, 2024
7b831bd
Update README.md
spiekos Mar 5, 2024
1909e9c
Update README.md
spiekos Mar 5, 2024
b2cf3ce
Update README.md
spiekos Mar 5, 2024
04bbc17
Update README.md Table of Contents
spiekos Mar 5, 2024
a7ca907
Update README.md Table of Contents
spiekos Mar 5, 2024
e9adabb
Merge branch 'master' into diseasesAtJensenLab
spiekos Mar 5, 2024
120 changes: 120 additions & 0 deletions scripts/biomedical/diseasesAtJensenLab/README.md
@@ -0,0 +1,120 @@
# Importing the Diseases at Jensen Lab

## Table of Contents
1. [About the Dataset](#about-the-dataset)
    1. [Download Data](#download-data)
    2. [Overview](#overview)
    3. [Notes and Caveats](#notes-and-caveats)
    4. [dcid Generation](#dcid-generation)
    5. [License](#license)
    6. [Dataset Documentation and Relevant Links](#dataset-documentation-and-relevant-links)
2. [About the import](#about-the-import)
    1. [Artifacts](#artifacts)
        1. [Scripts](#scripts)
        2. [tMCF Files](#tmcf-files)
    2. [Import Procedure](#import-procedure)
    3. [Tests](#tests)

## About the Dataset

[The DISEASES at Jensen Lab](https://diseases.jensenlab.org/About) is a "weekly updated web resource that integrates evidence on disease-gene associations from automatic text mining, manually curated literature, cancer mutation data, and genome-wide association studies." The data has evidence with confidence scores that facilitate comparison of the different types and sources of evidence.

### Download Data

The Diseases database can be downloaded from their official website found [here](https://diseases.jensenlab.org/Downloads). We downloaded and cleaned the full versions of the following files:

- Text mining channel
- Knowledge channel
- Experiments channel

### Overview

"The files contain all links in the DISEASES database. All files start with the following four columns: gene identifier, gene name, disease identifier, and disease name. The knowledge files further contain the source database, the evidence type, and the confidence score. The experiments files instead contain the source database, the source score, and the confidence score. Finally, the textmining files contain the z-score, the confidence score, and a URL to a viewer of the underlying abstracts."

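As a minimal sketch of the column layout quoted above, the following reads one channel with Python's standard `csv` module. The short column labels and the `read_channel` helper are illustrative assumptions, not names used by DISEASES. It also assumes the files have no header row. The sample row is the HSPB6 experiments record shown under Notes and Caveats.

```python
import csv
import io

# The four columns shared by every DISEASES file (labels are hypothetical).
SHARED = ['gene_id', 'gene_name', 'disease_id', 'disease_name']

# Channel-specific trailing columns, following the description quoted above.
CHANNEL_COLUMNS = {
    'knowledge': SHARED + ['source_db', 'evidence_type', 'confidence'],
    'experiments': SHARED + ['source_db', 'source_score', 'confidence'],
    'textmining': SHARED + ['z_score', 'confidence', 'url'],
}


def read_channel(handle, channel):
    """Return each tab-separated row as a dict keyed by column label."""
    columns = CHANNEL_COLUMNS[channel]
    reader = csv.reader(handle, delimiter='\t')
    return [dict(zip(columns, row)) for row in reader]


# One experiments row, held in memory instead of read from disk.
sample = ('ENSP00000004982\tHSPB6\tICD10:N04\tICD10:N04\t'
          'TIGA\tMeanRankScore = 92\t1.926\n')
rows = read_channel(io.StringIO(sample), 'experiments')
print(rows[0]['gene_name'], rows[0]['confidence'])
```
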
### Notes and Caveats

The disease for each association is indicated by either a Disease Ontology ID (DOID) or an ICD10 code. For associations of a gene with an ICD10 code, the original data represents the code in a hierarchical, repetitive way. For example, the experiments file contains the following associations:

| Gene Identifier | Gene Name | Disease Identifier | Disease Name | Source Database | Source Score | Confidence Score |
| --- | --- | --- | --- | --- | --- | --- |
| ENSP00000004982 | HSPB6 | ICD10:N | ICD10:N | TIGA | MeanRankScore = 92 | 1.926 |
| ENSP00000004982 | HSPB6 | ICD10:N0 | ICD10:N0 | TIGA | MeanRankScore = 92 | 1.926 |
| ENSP00000004982 | HSPB6 | ICD10:N04 | ICD10:N04 | TIGA | MeanRankScore = 92 | 1.926 |
| ENSP00000004982 | HSPB6 | ICD10:N3 | ICD10:N3 | TIGA | MeanRankScore = 92 | 1.926 |
| ENSP00000004982 | HSPB6 | ICD10:N39 | ICD10:N39 | TIGA | MeanRankScore = 92 | 1.926 |
| ENSP00000004982 | HSPB6 | ICD10:N399 | ICD10:N399 | TIGA | MeanRankScore = 92 | 1.926 |
| ENSP00000004982 | HSPB6 | ICD10:root | ICD10:root | TIGA | MeanRankScore = 92 | 1.926 |

As you can see, there is a cascading representation of the associated ICD10 codes: one tree of 'ICD10:N', 'ICD10:N0', 'ICD10:N04' and a second tree of 'ICD10:N3', 'ICD10:N39', 'ICD10:N399'. 'ICD10:N', 'ICD10:N0', 'ICD10:N3', and 'ICD10:root' do not correspond to any ICD10 code, so these lines were removed, along with any other line in which the text following 'ICD10:' was 'root' or was only one or two characters long. Then, for this particular association, the lowest unique tree leaves were taken as associations with the Gene 'HSPB6'. In this case those are 'ICD10:N04' and 'ICD10:N399'. The remaining line with 'ICD10:N39' was discarded as a less specific reference than 'ICD10:N399'. Finally, the ICD10 codes were reformatted as necessary to follow the proper convention, in which a '.' follows the leading characters that match the regex [A-Z][0-9][0-9]. So codes like 'ICD10:N399' were converted into the appropriate format of 'ICD10:N39.9' by inserting the missing '.'.
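The filtering rules above can be sketched in plain Python for the codes of a single gene. This is an illustration under stated assumptions, not the import script itself: the helper names `insert_dot` and `most_specific_icd10` are hypothetical, and the prefix check here compares every pair of codes, whereas the script compares each row only with the row before it.

```python
import re


def insert_dot(suffix):
    """Insert '.' after the leading [A-Z][0-9][0-9] when more characters follow."""
    if len(suffix) > 3 and re.match(r'[A-Z][0-9][0-9]', suffix):
        return suffix[:3] + '.' + suffix[3:]
    return suffix


def most_specific_icd10(codes):
    """Keep only the most specific, well-formed ICD10 codes for one gene."""
    suffixes = [code.split(':', 1)[1] for code in codes]
    # Drop 'root' and stubs of one or two characters, which are not real codes.
    real = [s for s in suffixes if s != 'root' and len(s) > 2]
    # Drop any code that is a proper prefix of a longer code (a non-leaf).
    leaves = [s for s in real
              if not any(o != s and o.startswith(s) for o in real)]
    return ['ICD10:' + insert_dot(s) for s in leaves]


hspb6_codes = ['ICD10:N', 'ICD10:N0', 'ICD10:N04', 'ICD10:N3',
               'ICD10:N39', 'ICD10:N399', 'ICD10:root']
result = most_specific_icd10(hspb6_codes)
print(result)
```

For the HSPB6 rows in the table, this keeps 'ICD10:N04' and the reformatted 'ICD10:N39.9'.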

### dcid Generation

Dcids for DiseaseGeneAssociation nodes were generated in one of the following two formats:

- `bio/DOID_<DOID>_<geneSymbol>_<dataSource>`
- `bio/ICD10_<trailing_ICD10Code>_<geneSymbol>_<dataSource>`

where `<DOID>` and `<trailing_ICD10Code>` represent the id following the ':', `<geneSymbol>` represents the Gene's gene symbol, and `<dataSource>` is either 'experiments', 'knowledge', or 'textmining'. For example: `bio/DOID_0050177_SEMA3F_experiments` and `bio/DOID_0050736_SEMA3F_experiments`.
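As a small sketch of the dcid pattern, the function below builds an association dcid from a disease identifier, a gene symbol, and a data source. The name `make_association_dcid` is hypothetical and is not part of the import script. It assumes the disease identifier arrives in the `PREFIX:id` form used in the raw files.

```python
def make_association_dcid(disease_id, gene_symbol, data_source):
    """Build a DiseaseGeneAssociation dcid such as bio/DOID_0050177_SEMA3F_experiments."""
    # 'DOID:0050177' becomes 'DOID_0050177'; 'ICD10:N39.9' becomes 'ICD10_N39.9'.
    disease_part = disease_id.replace(':', '_')
    return 'bio/' + '_'.join([disease_part, gene_symbol, data_source])


doid_dcid = make_association_dcid('DOID:0050177', 'SEMA3F', 'experiments')
icd10_dcid = make_association_dcid('ICD10:N39.9', 'HSPB6', 'experiments')
print(doid_dcid)
print(icd10_dcid)
```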

### License

This dataset is under a Creative Commons CC BY license.

### Dataset Documentation and Relevant Links

"DISEASES is a weekly updated web resource that integrates evidence on disease-gene associations from automatic text mining, manually curated literature, cancer mutation data, and genome-wide association studies." The full description of the dataset can be found [here](https://diseases.jensenlab.org/About). A description of the contents of the files and the links to download the DISEASE data as csv files can be found [here](https://diseases.jensenlab.org/Downloads).

The dataset is further documented in the following two studies:
- [DISEASES 2.0: a weekly updated database of disease–gene associations from text mining and data integration](https://academic.oup.com/database/article/doi/10.1093/database/baac019/6554833?login=false)
- [DISEASES: Text mining and data integration of disease-gene associations](https://www.sciencedirect.com/science/article/pii/S1046202314003831)

## About the import

### Artifacts

#### Scripts

##### Bash Script

[`download.sh`](scripts/download.sh) downloads the experimental, manually curated, and text mining data from DISEASES at Jensen Lab.

[`run.sh`](scripts/run.sh) converts raw data from DISEASES into csv files formatted for import into the Data Commons knowledge graph.

[`tests.sh`](scripts/tests.sh) runs standard tests on CSV + tMCF pairs to check for proper formatting.

##### Python Script

[`format_disease_jensen_lab.py`](scripts/format_disease_jensen_lab.py) parses the raw .tsv files from DISEASES at Jensen Lab into well-formatted csv files with generated dcids and links to Gene and ICD10Code nodes.

#### tMCF Files

[`codingGenes-knowledge.tmcf`](tmcfs/codingGenes-knowledge.tmcf) contains the tmcf mapping to the csv of coding genes curated manually.

[`nonCodingGenes-knowledge.tmcf`](tmcfs/nonCodingGenes-knowledge.tmcf) contains the tmcf mapping to the csv of non-coding genes curated manually.

[`codingGenes-textmining.tmcf`](tmcfs/codingGenes-textmining.tmcf) contains the tmcf mapping to the csv of coding genes using textmining.

[`nonCodingGenes-textmining.tmcf`](tmcfs/nonCodingGenes-textmining.tmcf) contains the tmcf mapping to the csv of non-coding genes using textmining.

[`experiment.tmcf`](tmcfs/experiment.tmcf) contains the tmcf mapping to the csv of coding genes curated experimentally.

### Import Procedure

Download the most recent versions of the DISEASES experiments, manually curated (knowledge), and text mining files:

```bash
sh download.sh
```

Generate the cleaned CSVs, splitting each input file into separate csv files for coding and non-coding genes:

```bash
sh run.sh
```

### Tests

Run the Data Commons import tool (a Java jar run with `java -jar`) to check that all schema used in the import is present in the graph and that all referenced nodes are present in the graph, and to surface any other warnings. Please note that empty tokens for some columns are expected, as this reflects the original data. The imports create the linked Gene and ICD10Code nodes alongside the DiseaseGeneAssociation nodes that reference them, which resolves any missing reference warnings the test raises for these node types. Finally, not every disease has an associated ICD10Code, so this column is sometimes blank. Warnings concerning empty dcid references can therefore be ignored.

To run tests:

```bash
sh tests.sh
```

This will generate an output file with the results of the tests for each csv + tmcf pair.
8 changes: 8 additions & 0 deletions scripts/biomedical/diseasesAtJensenLab/scripts/download.sh
@@ -0,0 +1,8 @@
#!/bin/bash

mkdir -p input && cd input

# downloads the diseases at Jensen Lab files
curl https://download.jensenlab.org/human_disease_textmining_full.tsv --output human_disease_textmining_full.tsv
curl https://download.jensenlab.org/human_disease_knowledge_full.tsv --output human_disease_knowledge_full.tsv
curl https://download.jensenlab.org/human_disease_experiments_full.tsv --output human_disease_experiments_full.tsv
@@ -0,0 +1,236 @@
# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Author: Suhana Bedi
Date: 08/12/2023
Edited By: Samantha Piekos
Last Edited: 02/29/24
Name: format_disease_jensen_lab
Description: converts three input .tsv files from Diseases at
Jensen Lab into output .csv files with formatted dcids,
NonCoding RNA types, and gene ensemblIDs.
@file_input: input .tsv files from Diseases at Jensen Lab
@file_output: formatted .csv files for Diseases at Jensen Lab
"""

# load environment
import sys
import numpy as np
import pandas as pd
import time


# declare universal variables
HGNC_DICT = {'HGNC:9982':'RFX1', 'HGNC:9979':'RFPL2'}


def filter_for_lowest_ICD10_level(df):
    # initiate values
    indices_to_drop = []
    previous_ICD10 = ''
    previous_gene = ''
    previous_index = -1
    # check if ICD10 code is more specific than previous row for the same gene
    # if it is add the previous ICD10 code index to the list of indices to drop
    for index, row in df.iterrows():
        current_ICD10 = row['ICD10']
        current_gene = row['Gene']
        if current_gene == previous_gene and current_ICD10.startswith(previous_ICD10):
            indices_to_drop.append(previous_index)
        # update reference values
        previous_ICD10 = current_ICD10
        previous_gene = current_gene
        previous_index = index
    # drop rows with less specific ICD10 codes for a given gene
    df.drop(indices_to_drop, axis=0, inplace=True)
    return df


def fix_ICD10_formatting(df, col):
    # if string in specified column is 10 characters or longer
    # add '.' in the 10th position
    mask = df[col].str.len() >= 10
    df.loc[mask, col] = df.loc[mask, col].str[:9] + '.' + df.loc[mask, col].str[9:]
    return df


def format_icd10_code_dcids(df):
    df_doid = df.dropna(subset=['DOID']) # make a doid specific table
    ## filter out ICD-10 codes based on true existence - remove codes like C1, C2 and root
    df['count'] = np.where(df['ICD10'] == df['ICD10'],df['ICD10'].str.split(':'),np.nan)
    df['count'] = np.where(df['count'] == df['count'],df['count'].str[1],np.nan)
    # remove non-existing ICD-10 codes
    df = df[df['count']!='root']
    df.loc[:, 'count'] = df['count'].str.len()
    df = df[df['count'] > 2]
    df = filter_for_lowest_ICD10_level(df)
    # fix ill formatted ICD10 codes
    df = fix_ICD10_formatting(df, 'Disease')
    df = fix_ICD10_formatting(df, 'Disease_Name')
    df = fix_ICD10_formatting(df, 'ICD10')
    # join back with the doid rows
    df_final = pd.concat([df, df_doid]).sort_index()
    return df_final


def format_doid_icd(df):
    df['DOID'] = np.where(df['Disease'].str.contains('DOID'),df['Disease'],np.nan)
    df['ICD10'] = np.where(df['Disease'].str.contains('ICD10'),df['Disease'],np.nan)
    ## identify ensemblIDs and set as new column
    df['ensemblID'] = np.where(df['Id'].str.contains('ENSP00000'),df['Id'],np.nan)
    ## format icd10 dcids filtering for the most specific ICD10 code
    df = format_icd10_code_dcids(df)
    return df


def check_for_illegal_charc(s):
    """Checks for illegal characters in a string and prints an error statement if any are present
    Args:
        s: target string that needs to be checked

    """
    # each character is its own list item; a missing comma would silently merge neighbors
    list_illegal = ["'", "*", ">", "<", "@", "]", "[", "|", ":", ";", " "]
    if any([x in s for x in list_illegal]):
        print('Error! dcid contains illegal characters!', s)


def check_for_dcid(row):
    check_for_illegal_charc(str(row['dcid']))
    check_for_illegal_charc(str(row['GeneDcid']))
    check_for_illegal_charc(str(row['ICD10']))
    return row


def format_disease_gene_cols(df, data_type):
    df['Gene'] = df['Gene'].map(HGNC_DICT).fillna(df['Gene'])
    df['Gene'] = df['Gene'].str.replace('@', '')
    df['GeneDcid'] = 'bio/' + df['Gene'].astype(str)
    df['GeneDcid'] = df['GeneDcid'].str.replace('-', '_')
    df['DOID'] = 'dcid:bio/' + df['DOID'].str.replace(':', '_')
    df['DOID'] = df['DOID'].replace('dcid:bio/nan', np.nan)
    df['ICD10'] = df['ICD10'].str.replace(':', '/')
    df['DiseaseDcid'] = df['DOID'].fillna('dcid:'+df['ICD10'])
    # separate the gene symbol from the data source with '_' to match the README dcid format
    df['dcid'] = 'bio/' + df['Disease'] + '_' + df['Gene'] + '_' + data_type
    df['dcid'] = df['dcid'].str.replace(':', '_')
    return df


def format_RNA_type(df_tm):
    gene_list = ['orf', 'ZNF']
    sno_rna_list = ['sno', 'SNOR', 'SCAR']
    linc_rna_list = ['LINC', 'linc', 'MIR']
    df_tm['RNA_type'] = 'dcs:NonCodingRNATypeLongNonCodingRNA'
    df_tm['RNA_type'] = np.where(df_tm["Gene"].str.contains('|'.join(gene_list)),'Gene', df_tm['RNA_type'])
    df_tm['RNA_type'] = np.where(df_tm["Gene"].str.contains('rRNA'),'dcs:NonCodingRNATypeRibosomalRNA', df_tm['RNA_type'])
    df_tm['RNA_type'] = np.where(df_tm["Gene"].str.contains('|'.join(linc_rna_list)),'dcs:NonCodingRNATypeLongIntergenicNonCodingRNA', df_tm['RNA_type'])
    df_tm['RNA_type'] = np.where(df_tm["Gene"].str.contains('|'.join(sno_rna_list)),'dcs:NonCodingRNATypeSmallNucleolarRNA', df_tm['RNA_type'])
    df_tm['RNA_type'] = np.where(df_tm["Gene"].str.contains('miR'),'dcs:NonCodingRNATypeMicroRNA', df_tm['RNA_type'])
    df_tm['RNA_type'] = np.where(df_tm["Gene"].str.contains('circ'),'dcs:NonCodingRNATypeCircularRNA', df_tm['RNA_type'])
    df_tm['RNA_type'] = np.where(df_tm["Gene"].str.contains('pRNA'),'dcs:NonCodingRNATypePromoterAssociatedRNA', df_tm['RNA_type'])
    return df_tm


def format_dcids(df, data_type):
    df = format_doid_icd(df)
    df = format_disease_gene_cols(df, data_type)
    df = df.apply(lambda x: check_for_dcid(x),axis=1)
    return df


def format_data_type_specific_info(df, data_type):
    if data_type == 'experiments':
        df['associationType'] = 'dcs:AssociationTypeExperiment'
        df['source-score'] = df['source-score'].str.split('=')
        df['source-score'] = np.where(df['source-score'] == df['source-score'],df['source-score'].str[1],np.nan)
        df.update('"' +
                  df[['Disease_Name', 'score-db']].astype(str) + '"')
    if data_type=='knowledge':
        df['associationType'] = 'dcs:AssociationTypeManualCuration'
        df.update('"' +
                  df[['Disease_Name', 'score-db']].astype(str) + '"')
    if data_type =='textmining':
        df['associationType'] = 'dcs:AssociationTypeTextMining'
        df.update('"' +
                  df[['Disease_Name', 'url']].astype(str) + '"')
    return df


def clean_data(df, data_type):
    df_tm = df
    searchfor = ['ENSP00', 'LINC', 'linc'] ## filter out non coding RNAs
    df = df[df['Id'].str.contains("ENSP00")]
    df = df[~df.Gene.str.contains('|'.join(searchfor))]
    df_tm = df_tm[~df_tm.isin(df)].dropna() ## df with only non coding RNAs
    df_tm = df_tm[~df_tm['Gene'].str.contains("chr")]
    df_tm = df_tm[~df_tm['Gene'].str.contains("ENSP00")]
    df = format_dcids(df, data_type)
    df_tm = format_dcids(df_tm, data_type)
    df_tm = format_RNA_type(df_tm) ## label each row with a non coding RNA type or 'Gene'
    df_gene = df_tm.loc[df_tm['RNA_type']=='Gene'] ## filter out genes from df with non coding RNA
    df_tm = df_tm[~df_tm['RNA_type'].str.contains("Gene")]
    df_gene = df_gene.drop(['RNA_type'],axis=1) ## drop on a copy rather than in place on a slice
    df = pd.concat([df, df_gene]).reset_index(drop=True) ## public replacement for DataFrame._append
    df = format_data_type_specific_info(df, data_type)
    df_tm = format_data_type_specific_info(df_tm, data_type)
    return df, df_tm


def generate_column_names(data_type):
    # return column names corresponding to the data type of the file
    col_names = []
    if data_type == 'experiments':
        col_names= ['Id', 'Gene', 'Disease', 'Disease_Name', 'score-db', 'source-score', 'confidence']
    if data_type == 'knowledge':
        col_names = ['Id', 'Gene', 'Disease', 'Disease_Name', 'score-db', 'evidence', 'confidence']
    if data_type == 'textmining':
        col_names = ['Id', 'Gene', 'Disease', 'Disease_Name', 'z-score', 'confidence', 'url']
    return col_names


def write_df_to_csv(df, data_type, coding=True):
    # check if df is empty
    if df.shape[0] == 0:
        return
    # write filepath for output file
    if coding:
        filepath = 'CSVs/codingGenes-' + data_type + '.csv'
    else:
        filepath = 'CSVs/nonCodingGenes-' + data_type + '.csv'
    # write df to csv file
    df.to_csv(filepath, doublequote=False, escapechar='\\')
    return


def format_csv(data_type):
    start_time = time.time()
    filepath = 'input/human_disease_'+data_type+'_full.tsv'
    col_names = generate_column_names(data_type)
    df = pd.read_csv(filepath, sep = '\t', header=None)
    df.columns = col_names
    df_coding_genes, df_non_coding_genes = clean_data(df, data_type)
    write_df_to_csv(df_coding_genes, data_type)
    write_df_to_csv(df_non_coding_genes, data_type, coding=False)
    processing_time = "%s seconds" % round((time.time() - start_time), 2)
    print('Finished processing ' + data_type + ' data in ' + processing_time + '!')


def main():
    format_csv('experiments')
    format_csv('knowledge')
    format_csv('textmining')


if __name__ == '__main__':
    main()

6 changes: 6 additions & 0 deletions scripts/biomedical/diseasesAtJensenLab/scripts/run.sh
@@ -0,0 +1,6 @@
#!/bin/bash

mkdir -p CSVs

# runs the script
python3 scripts/format_disease_jensen_lab.py