Commit d89aa91 (1 parent: e0d2e84)
Showing 9 changed files with 13,032 additions and 0 deletions.
openrefine/v0.10/IHEC_metadata_harmonization.v0.10.csv
2,659 changes: 2,659 additions & 0 deletions
Large diffs are not rendered by default.
openrefine/v0.10/IHEC_metadata_harmonization.v0.10.extended.csv
2,659 changes: 2,659 additions & 0 deletions
Large diffs are not rendered by default.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,32 @@ | ||
# Version 0.10 | ||
|
||
This table describes the columns included in the metadata table at [IHEC_metadata_harmonization.v0.10.csv](IHEC_metadata_harmonization.v0.10.csv). | ||
|
||
For explanations concerning the [extended version](IHEC_metadata_harmonization.v0.10.extended.csv), please see [version 0.9](https://github.com/IHEC/epimap-metadata-harmonization/releases/tag/v0.9). | ||
|
||
Please always keep in mind that we try to stay as close to the [IHEC Metadata Standard](https://github.com/IHEC/ihec-ecosystems/blob/master/docs/metadata/2.0/Ihec_metadata_specification.md) as possible. | ||
|
||
| Column | Examples | Explanation | | ||
|-------------------------------------------------|--------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | ||
| EpiRR | `IHECRE00000001.4` | EpiRR identifier. The number after the dot (.) is the version. | | ||
| EpiRR_status | `Complete` `Partial` | Whether this epigenome is `Complete` or `Partial`. | | ||
| project | `CEEHRC` `BLUEPRINT` | The project from which the epigenome originated. | | ||
| biomaterial_type | `cell line` `primary cell` `primary cell culture` `primary tissue` | One of `primary cell`, `primary cell culture`, `cell line`, `primary tissue`. | | ||
| cell_type | `myeloid cell` `effector memory CD8-positive, alpha-beta T cell` | The cell type and main sample ontology classification for entries where `biomaterial_type` is `primary cell` or `primary cell culture`. | | ||
| line | `MCF 10A` | The cell line and main sample ontology classification for entries where `biomaterial_type` is `cell line`. | | ||
| tissue_type | `skeletal muscle tissue` `amygdala` | The tissue type and main sample ontology classification for entries where `biomaterial_type` is `primary tissue`. | | ||
| sample_ontology_curie | `CL:0000990` `UBERON:0001876` `EFO:0001200` | The CURIE identifying the sample ontology term. <br/>Different ontologies are used, depending on the `biomaterial_type`:<br/> 'CL' for `primary cell` or `primary cell culture`, 'EFO' for `cell line` and 'UBERON' for `primary tissue`. | | ||
| sample_ontology_term_high_order_manual | `other` `T cell` | A manually refined higher level annotation describing the samples using ancestors in the ontology. | | ||
| markers | `CD3+ CD4+ CD45RA+` `CD3- CD19- CD56-` | Markers used to isolate and identify the cell type, when applicable. | | ||
| disease | `Breast Carcinoma` `Acute Promyelocytic Leukemia with PML-RARA` | This attribute reflects **the disease for this particular sample**, not the donor health condition. | | ||
| disease_ontology_curie | `NCIM:C0678222` `NCIM:C0023487` | The CURIE identifying the NCIM disease ontology term. | | ||
| disease_high_order_manual | `Healthy/None` `Cancer` `Disease` | A manually refined higher level annotation describing the diseases using only three categories: _Healthy/None_, _Cancer_, _Disease_. | | ||
| disease_intermediate_order_manual | `Carcinoma` `Leukemia` | A manually refined higher level annotation describing the disease for this particular sample using ancestors in the NCIT ontology. <br/>NCIM CURIEs were mapped to NCIT CURIEs; see version 0.9 for an explanation. | | ||
| donor_id | `CEMT0007` `C07015` | Identifier for donors within their projects. | | ||
| donor_age | `60-65` `unknown` `46` | Age of donor. Can be an interval. | | ||
| donor_age_unit | `year` `day` | Age unit of donor. | | ||
| donor_life_stage | `embryonic` `adult` | Life stage of donor. | | ||
| sex | `female` `male` | Sex of donor. | | ||
| donor_health_status | `Breast Carcinoma` `Acute Promyelocytic Leukemia with PML-RARA` | Links to the health status of the donor that provided the sample. **Does not describe the disease for this particular sample.** | | ||
| donor_health_status_ontology_curie | `NCIM:C0023487` `NCIM:C0678222` | The CURIE identifying the NCIM donor health status ontology term. | | ||
| health_state | `dead` `alive` | Health state of donor: `dead` or `alive`. | |
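
The rule in the `sample_ontology_curie` row (which ontology goes with which `biomaterial_type`) can be checked mechanically. The following is a minimal sketch, not part of this commit: the function name `find_prefix_mismatches` and the example rows are hypothetical, and only the column names and the prefix rule come from the table above.

```python
import pandas as pd

# Expected CURIE prefix per biomaterial_type, as stated in the column table.
EXPECTED_PREFIX = {
    "primary cell": "CL",
    "primary cell culture": "CL",
    "cell line": "EFO",
    "primary tissue": "UBERON",
}


def find_prefix_mismatches(table: pd.DataFrame) -> pd.DataFrame:
    """Return rows whose CURIE prefix does not fit their biomaterial_type.

    Hypothetical helper: rows with a missing CURIE or an unknown
    biomaterial_type are also returned, because NaN never compares equal.
    """
    actual = table["sample_ontology_curie"].str.split(":", n=1).str[0]
    expected = table["biomaterial_type"].map(EXPECTED_PREFIX)
    return table[actual != expected]


# Made-up rows for illustration; the identifiers are not real EpiRR entries.
example = pd.DataFrame({
    "EpiRR": ["example.1", "example.2", "example.3"],
    "biomaterial_type": ["primary cell", "cell line", "primary tissue"],
    "sample_ontology_curie": ["CL:0000990", "EFO:0001200", "CL:0000990"],
})
mismatches = find_prefix_mismatches(example)
print(mismatches["EpiRR"].tolist())
```

On the real table one would presumably pass the result of `pd.read_csv` on the v0.10 CSV instead of the made-up frame.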
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,48 @@ | ||
[ | ||
{ | ||
"op": "core/mass-edit", | ||
"engineConfig": { | ||
"facets": [ | ||
{ | ||
"type": "list", | ||
"name": "project", | ||
"expression": "value", | ||
"columnName": "project", | ||
"invert": false, | ||
"omitBlank": false, | ||
"omitError": false, | ||
"selection": [ | ||
{ | ||
"v": { | ||
"v": "NIH Roadmap Epigenomics", | ||
"l": "NIH Roadmap Epigenomics" | ||
} | ||
}, | ||
{ | ||
"v": { | ||
"v": "ENCODE", | ||
"l": "ENCODE" | ||
} | ||
} | ||
], | ||
"selectBlank": false, | ||
"selectError": false | ||
} | ||
], | ||
"mode": "row-based" | ||
}, | ||
"columnName": "biomaterial_type", | ||
"expression": "value", | ||
"edits": [ | ||
{ | ||
"from": [ | ||
"primary cell culture" | ||
], | ||
"fromBlank": false, | ||
"fromError": false, | ||
"to": "primary cell" | ||
} | ||
], | ||
"description": "Mass edit cells in column biomaterial_type" | ||
} | ||
] |
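
For readers unfamiliar with OpenRefine operation files, the effect of the mass edit above can be approximated in pandas. This is only a sketch of how the operation reads (a list facet on `project` restricts the rows, then one value of `biomaterial_type` is rewritten); the function name `apply_mass_edit` and the example rows are hypothetical, and OpenRefine itself remains the tool the commit uses.

```python
import pandas as pd


def apply_mass_edit(table: pd.DataFrame) -> pd.DataFrame:
    """Approximate the mass edit: rewrite biomaterial_type inside the facet selection."""
    # Facet selection: only rows from these two projects are affected.
    selected = table["project"].isin(["NIH Roadmap Epigenomics", "ENCODE"])
    # Edit rule: "primary cell culture" becomes "primary cell".
    matches = table["biomaterial_type"] == "primary cell culture"
    result = table.copy()  # leave the input untouched
    result.loc[selected & matches, "biomaterial_type"] = "primary cell"
    return result


# Made-up rows for illustration.
example = pd.DataFrame({
    "project": ["ENCODE", "BLUEPRINT", "NIH Roadmap Epigenomics"],
    "biomaterial_type": ["primary cell culture", "primary cell culture", "cell line"],
})
edited = apply_mass_edit(example)
print(edited)
```

Rows from other projects, such as the BLUEPRINT row here, keep `primary cell culture` because they fall outside the facet selection.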
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,74 @@ | ||
import os.path | ||
from subprocess import run | ||
|
||
import numpy as np | ||
import pandas as pd | ||
|
||
# assumes the script is started two levels below the project root (e.g. in openrefine/v0.10); change to the project root so the relative paths below resolve | ||
os.chdir('../../') | ||
|
||
# create openrefine project and apply rules - OPENREFINE SERVER HAS TO BE RUNNING | ||
# creating openrefine projects via the openrefine-client needs a csv as input in order to work properly | ||
openrefine_client = './openrefine/openrefine-client_0-3-10_linux' # path to the openrefine-client executable | ||
initial_csv = './openrefine/v0.9/IHEC_metadata_harmonization.v0.9.extended.csv' # csv to build project from | ||
|
||
v10_intermediate_tbl = pd.read_csv(initial_csv) | ||
|
||
disease_higher_tbl = pd.read_csv( | ||
'./openrefine/v0.10/internal--IHEC_metadata_harmonization.v0.9.extended - Comp_health_disease.csv', | ||
usecols=['Sample_disease_high_level', 'Sample_disease_intermediate_level', 'disease', 'donor_health_status']) | ||
disease_higher_tbl.replace('-blank-', np.nan, inplace=True) | ||
|
||
disease_higher_tbl.rename(columns={'Sample_disease_high_level': 'disease_high_order_manual', | ||
'Sample_disease_intermediate_level': 'disease_intermediate_order_manual'}, | ||
inplace=True) | ||
|
||
v10_merged = pd.merge(v10_intermediate_tbl, disease_higher_tbl.drop_duplicates(), how='outer', | ||
on=['disease', 'donor_health_status'], validate='many_to_one') | ||
|
||
assert (len(v10_intermediate_tbl) == len(v10_merged)) | ||
|
||
v10_merged.rename(columns={'age': 'donor_age'}, inplace=True) | ||
|
||
v10_intermediate_csv = './openrefine/v0.10/IHEC_metadata_harmonization.v0.10.intermediate.csv' | ||
v10_merged.to_csv(v10_intermediate_csv, index=False) | ||
|
||
# create project with intermediate version | ||
run([openrefine_client, '--create', v10_intermediate_csv], check=True) | ||
intermediate_project_name = os.path.splitext(os.path.basename(v10_intermediate_csv))[0] | ||
|
||
# here we manually solve some mapping issues and conflicts and the resulting json is then used in this script | ||
|
||
run([openrefine_client, '--apply', 'openrefine/v0.10/biomaterial_type_ENCODE_Roadmap.json', intermediate_project_name], | ||
check=True) | ||
run([openrefine_client, '--apply', 'openrefine/v0.10/minor_disease_issues.json', intermediate_project_name], | ||
check=True) | ||
run([openrefine_client, '--apply', 'openrefine/v0.10/sample_ontology_high-level_manual.json', | ||
intermediate_project_name], | ||
check=True) | ||
|
||
v10_extended_csv = './openrefine/v0.10/IHEC_metadata_harmonization.v0.10.extended.csv' | ||
run([openrefine_client, '--export', f'--output={v10_extended_csv}', intermediate_project_name], check=True) | ||
|
||
v10_extended = pd.read_csv(v10_extended_csv) | ||
final_csv = './openrefine/v0.10/IHEC_metadata_harmonization.v0.10.csv' | ||
v10_extended[['EpiRR', 'EpiRR_status', 'project', 'biomaterial_type', 'cell_type', 'line', 'tissue_type', | ||
'sample_ontology_curie', 'sample_ontology_term_high_order_manual', 'markers', 'disease', | ||
'disease_ontology_curie', 'disease_high_order_manual', 'disease_intermediate_order_manual', 'donor_id', | ||
'donor_age', 'donor_age_unit', 'donor_life_stage', 'sex', 'donor_health_status', | ||
'donor_health_status_ontology_curie', 'health_state']].to_csv(final_csv, index=False) | ||
|
||
old = pd.read_csv(initial_csv) | ||
old.rename(columns={'age': 'donor_age'}, inplace=True) | ||
old.index = old.EpiRR | ||
old.sort_index(axis=0, inplace=True) | ||
old.sort_index(axis=1, inplace=True) | ||
new = pd.read_csv(v10_extended_csv) | ||
new.index = new.EpiRR | ||
new.drop(columns=['disease_high_order_manual', 'disease_intermediate_order_manual'], inplace=True) | ||
new.sort_index(axis=0, inplace=True) | ||
new.sort_index(axis=1, inplace=True) | ||
|
||
diff_tbl = old.compare(new) | ||
diff_tbl.rename(columns={'self': 'v0.9', 'other': 'v0.10'}, inplace=True) | ||
diff_tbl.apply(lambda x: [x.dropna()], axis=1).to_json('openrefine/v0.10/diff_v0.9_v0.10.json', indent=1) |
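
The merge step in the script combines an outer join, `validate='many_to_one'`, and a row-count assertion. A small self-contained sketch of that pattern follows; the helper name `attach_annotations`, the sample identifiers, and the example annotation values are hypothetical, and the explicit `ValueError` stands in for the script's `assert`.

```python
import pandas as pd

KEYS = ["disease", "donor_health_status"]


def attach_annotations(samples: pd.DataFrame, lookup: pd.DataFrame) -> pd.DataFrame:
    """Attach manual annotations to samples via the (disease, donor_health_status) pair."""
    # validate="many_to_one" makes pandas raise a MergeError if, after dropping
    # exact duplicates, one key pair still maps to more than one annotation.
    merged = pd.merge(samples, lookup.drop_duplicates(), how="outer",
                      on=KEYS, validate="many_to_one")
    # An outer join keeps lookup rows that match no sample as extra rows,
    # so a changed row count reveals stale or misspelled lookup keys.
    if len(merged) != len(samples):
        raise ValueError("lookup contains key pairs that match no sample")
    return merged


# Made-up rows for illustration; identifiers and annotations are not real data.
samples = pd.DataFrame({
    "EpiRR": ["sample_a", "sample_b", "sample_c"],
    "disease": ["Breast Carcinoma", "Breast Carcinoma", "Leukemia example"],
    "donor_health_status": ["Breast Carcinoma", "Breast Carcinoma", "Leukemia example"],
})
lookup = pd.DataFrame({
    "disease": ["Breast Carcinoma", "Breast Carcinoma", "Leukemia example"],
    "donor_health_status": ["Breast Carcinoma", "Breast Carcinoma", "Leukemia example"],
    "disease_high_order_manual": ["Cancer", "Cancer", "Cancer"],
})
merged = attach_annotations(samples, lookup)
print(merged.set_index("EpiRR")["disease_high_order_manual"])
```

Note that an outer join may reorder rows, so the result is best addressed by `EpiRR` rather than by position. Samples whose key pair is missing from the lookup pass the length check and simply receive empty annotations.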