version 0.10 pre-release
quirinmanz committed Sep 20, 2022
1 parent e0d2e84 commit d89aa91
Showing 9 changed files with 13,032 additions and 0 deletions.
2,659 changes: 2,659 additions & 0 deletions openrefine/v0.10/IHEC_metadata_harmonization.v0.10.csv

Large diffs are not rendered by default.

2,659 changes: 2,659 additions & 0 deletions openrefine/v0.10/IHEC_metadata_harmonization.v0.10.extended.csv

Large diffs are not rendered by default.

32 changes: 32 additions & 0 deletions openrefine/v0.10/README.md
@@ -0,0 +1,32 @@
# Version 0.10

The table below describes the columns of the metadata table in [IHEC_metadata_harmonization.v0.10.csv](IHEC_metadata_harmonization.v0.10.csv).

For explanations concerning the [extended version](IHEC_metadata_harmonization.v0.10.extended.csv), please see [version 0.9](https://github.com/IHEC/epimap-metadata-harmonization/releases/tag/v0.9).

Please always keep in mind that we try to stay as close to the [IHEC Metadata Standard](https://github.com/IHEC/ihec-ecosystems/blob/master/docs/metadata/2.0/Ihec_metadata_specification.md) as possible.

| Column | Examples | Explanation |
|-------------------------------------------------|--------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| EpiRR | `IHECRE00000001.4` | EpiRR identifier. The number after the dot (.) is the version. |
| EpiRR_status | `Complete` `Partial` | Whether this epigenome is `Complete` or `Partial`. |
| project | `CEEHRC` `BLUEPRINT` | The project from which the epigenome originated. |
| biomaterial_type | `cell line` `primary cell` `primary cell culture` `primary tissue` | One of `primary cell`, `primary cell culture`, `cell line`, `primary tissue`. |
| cell_type | `myeloid cell` `effector memory CD8-positive, alpha-beta T cell` | The cell type and main sample ontology classification for entries where `biomaterial_type` is `primary cell` or `primary cell culture`. |
| line | `MCF 10A` | The cell line and main sample ontology classification for entries where `biomaterial_type` is `cell line`. |
| tissue_type | `skeletal muscle tissue` `amygdala` | The tissue type and main sample ontology classification for entries where `biomaterial_type` is `primary tissue`. |
| sample_ontology_curie | `CL:0000990` `UBERON:0001876` `EFO:0001200` | The CURIE identifying the sample ontology term. <br/>Different ontologies are used, depending on the `biomaterial_type`:<br/> 'CL' for `primary cell` or `primary cell culture`, 'EFO' for `cell line` and 'UBERON' for `primary tissue`. |
| sample_ontology_term_high_order_manual | `other` `T cell` | A manually refined higher level annotation describing the samples using ancestors in the ontology. |
| markers | `CD3+ CD4+ CD45RA+` `CD3- CD19- CD56-` | Markers used to isolate and identify the cell type, when applicable. |
| disease | `Breast Carcinoma` `Acute Promyelocytic Leukemia with PML-RARA` | This attribute reflects **the disease for this particular sample**, not the donor health condition. |
| disease_ontology_curie | `NCIM:C0678222` `NCIM:C0023487` | The CURIE identifying the NCIM disease ontology term. |
| disease_high_order_manual | `Healthy/None` `Cancer` `Disease` | A manually refined higher level annotation describing the diseases using only three categories: _Healthy/None_, _Cancer_, _Disease_. |
| disease_intermediate_order_manual | `Carcinoma` `Leukemia` | A manually refined higher level annotation describing the disease for this particular sample using ancestors in the NCIT ontology. <br/>NCIM CURIEs were mapped to NCIT CURIEs; see version 0.9 for an explanation. |
| donor_id | `CEMT0007` `C07015` | Identifier for donors within their projects. |
| donor_age | `60-65` `unknown` `46` | Age of donor. Can be an interval. |
| donor_age_unit | `year` `day` | Age unit of donor. |
| donor_life_stage | `embryonic` `adult` | Life stage of donor. |
| sex | `female` `male` | Sex of donor. |
| donor_health_status | `Breast Carcinoma` `Acute Promyelocytic Leukemia with PML-RARA` | Links to the health status of the donor that provided the sample. **Does not describe the disease for this particular sample.** |
| donor_health_status_ontology_curie | `NCIM:C0023487` `NCIM:C0678222` | The CURIE identifying the NCIM donor health status ontology term. |
| health_state | `dead` `alive` | Health state of donor: `dead` or `alive`. |
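
The prefix rule described for `sample_ontology_curie` and the version suffix of `EpiRR` can be checked mechanically. The following is a minimal sketch using only the standard library, not part of the harmonization pipeline. The helper names `split_epirr` and `curie_matches_biomaterial` are hypothetical, the sketch assumes every EpiRR identifier carries a numeric version after the dot, and the rows are made up for illustration.

```python
import csv
import io

# Prefix expected for each biomaterial_type, as stated in the table above.
EXPECTED_PREFIX = {
    "primary cell": "CL",
    "primary cell culture": "CL",
    "cell line": "EFO",
    "primary tissue": "UBERON",
}


def split_epirr(epirr_id):
    """Split an identifier such as 'IHECRE00000001.4' into (accession, version).

    Assumes the identifier always ends in '.<integer>'.
    """
    accession, _, version = epirr_id.rpartition(".")
    return accession, int(version)


def curie_matches_biomaterial(row):
    """Return True if the CURIE prefix fits the row's biomaterial_type."""
    prefix = row["sample_ontology_curie"].split(":", 1)[0]
    return EXPECTED_PREFIX.get(row["biomaterial_type"]) == prefix


# Made-up rows for illustration only; they are not taken from the real table.
sample_csv = io.StringIO(
    "EpiRR,biomaterial_type,sample_ontology_curie\n"
    "IHECRE00000001.4,primary cell,CL:0000990\n"
    "IHECRE00000002.1,primary tissue,UBERON:0001876\n"
    "IHECRE00000003.2,cell line,CL:0000990\n"
)
rows = list(csv.DictReader(sample_csv))

# Collect the identifiers of rows whose CURIE prefix contradicts the rule.
mismatches = [row["EpiRR"] for row in rows if not curie_matches_biomaterial(row)]
print(mismatches)
```

In this made-up input only the third row is reported, because a `cell line` entry would be expected to carry an `EFO` CURIE.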
48 changes: 48 additions & 0 deletions openrefine/v0.10/biomaterial_type_ENCODE_Roadmap.json
@@ -0,0 +1,48 @@
[
  {
    "op": "core/mass-edit",
    "engineConfig": {
      "facets": [
        {
          "type": "list",
          "name": "project",
          "expression": "value",
          "columnName": "project",
          "invert": false,
          "omitBlank": false,
          "omitError": false,
          "selection": [
            {
              "v": {
                "v": "NIH Roadmap Epigenomics",
                "l": "NIH Roadmap Epigenomics"
              }
            },
            {
              "v": {
                "v": "ENCODE",
                "l": "ENCODE"
              }
            }
          ],
          "selectBlank": false,
          "selectError": false
        }
      ],
      "mode": "row-based"
    },
    "columnName": "biomaterial_type",
    "expression": "value",
    "edits": [
      {
        "from": [
          "primary cell culture"
        ],
        "fromBlank": false,
        "fromError": false,
        "to": "primary cell"
      }
    ],
    "description": "Mass edit cells in column biomaterial_type"
  }
]
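
For readers unfamiliar with OpenRefine operation histories, the effect of the mass edit above can be approximated in plain Python. This is a hedged sketch of the intended effect, not OpenRefine's implementation: the function name `apply_mass_edit` is hypothetical and the rows are made up. It rewrites `biomaterial_type` from `primary cell culture` to `primary cell` only for rows whose `project` is one of the two facet selections.

```python
# Projects selected by the facet and the single edit from the JSON above.
SELECTED_PROJECTS = {"NIH Roadmap Epigenomics", "ENCODE"}
EDITS = {"primary cell culture": "primary cell"}


def apply_mass_edit(rows):
    """Return copies of rows with biomaterial_type rewritten for the selected projects."""
    edited = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows stay untouched
        if new_row["project"] in SELECTED_PROJECTS:
            value = new_row["biomaterial_type"]
            new_row["biomaterial_type"] = EDITS.get(value, value)
        edited.append(new_row)
    return edited


# Made-up rows for illustration only.
rows = [
    {"project": "ENCODE", "biomaterial_type": "primary cell culture"},
    {"project": "BLUEPRINT", "biomaterial_type": "primary cell culture"},
    {"project": "NIH Roadmap Epigenomics", "biomaterial_type": "primary tissue"},
]
result = apply_mass_edit(rows)
for row in result:
    print(row["project"], "->", row["biomaterial_type"])
```

Only the first row changes: the second belongs to a project outside the facet selection, and the third has a value that no edit targets.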
74 changes: 74 additions & 0 deletions openrefine/v0.10/create_v0.10.py
@@ -0,0 +1,74 @@
import os.path
from subprocess import run

import numpy as np
import pandas as pd

# all paths below are relative to the project root of the git project;
# this assumes the script is started from its own directory (openrefine/v0.10/)
os.chdir('../../')

# create openrefine project and apply rules - OPENREFINE SERVER HAS TO BE RUNNING
# creating openrefine projects via the openrefine-client needs a csv as input in order to work properly
openrefine_client = './openrefine/openrefine-client_0-3-10_linux'  # path to the openrefine-client executable
initial_csv = './openrefine/v0.9/IHEC_metadata_harmonization.v0.9.extended.csv' # csv to build project from

v10_intermediate_tbl = pd.read_csv(initial_csv)

disease_higher_tbl = pd.read_csv(
    './openrefine/v0.10/internal--IHEC_metadata_harmonization.v0.9.extended - Comp_health_disease.csv',
    usecols=['Sample_disease_high_level', 'Sample_disease_intermediate_level', 'disease', 'donor_health_status'])
disease_higher_tbl.replace('-blank-', np.nan, inplace=True)

disease_higher_tbl.rename(columns={'Sample_disease_high_level': 'disease_high_order_manual',
                                   'Sample_disease_intermediate_level': 'disease_intermediate_order_manual'},
                          inplace=True)

v10_merged = pd.merge(v10_intermediate_tbl, disease_higher_tbl.drop_duplicates(), how='outer',
                      on=['disease', 'donor_health_status'], validate='many_to_one')

assert (len(v10_intermediate_tbl) == len(v10_merged))

v10_merged.rename(columns={'age': 'donor_age'}, inplace=True)

v10_intermediate_csv = './openrefine/v0.10/IHEC_metadata_harmonization.v0.10.intermediate.csv'
v10_merged.to_csv(v10_intermediate_csv, index=False)

# create project with intermediate version
run([openrefine_client, '--create', v10_intermediate_csv], check=True)
intermediate_project_name = os.path.splitext(os.path.basename(v10_intermediate_csv))[0]

# some mapping issues and conflicts were resolved manually in OpenRefine;
# the resulting JSON operation files are applied to the project here

run([openrefine_client, '--apply', 'openrefine/v0.10/biomaterial_type_ENCODE_Roadmap.json', intermediate_project_name],
    check=True)
run([openrefine_client, '--apply', 'openrefine/v0.10/minor_disease_issues.json', intermediate_project_name],
    check=True)
run([openrefine_client, '--apply', 'openrefine/v0.10/sample_ontology_high-level_manual.json',
     intermediate_project_name],
    check=True)

v10_extended_csv = './openrefine/v0.10/IHEC_metadata_harmonization.v0.10.extended.csv'
run([openrefine_client, '--export', f'--output={v10_extended_csv}', intermediate_project_name], check=True)

v10_extended = pd.read_csv(v10_extended_csv)
final_csv = './openrefine/v0.10/IHEC_metadata_harmonization.v0.10.csv'
v10_extended[['EpiRR', 'EpiRR_status', 'project', 'biomaterial_type', 'cell_type', 'line', 'tissue_type',
              'sample_ontology_curie', 'sample_ontology_term_high_order_manual', 'markers', 'disease',
              'disease_ontology_curie', 'disease_high_order_manual', 'disease_intermediate_order_manual', 'donor_id',
              'donor_age', 'donor_age_unit', 'donor_life_stage', 'sex', 'donor_health_status',
              'donor_health_status_ontology_curie', 'health_state']].to_csv(final_csv, index=False)

old = pd.read_csv(initial_csv)
old.rename(columns={'age': 'donor_age'}, inplace=True)
old.index = old.EpiRR
# pass axis by keyword: positional axis arguments are rejected by pandas 2.x
old.sort_index(axis=0, inplace=True)
old.sort_index(axis=1, inplace=True)
new = pd.read_csv(v10_extended_csv)
new.index = new.EpiRR
new.drop(columns=['disease_high_order_manual', 'disease_intermediate_order_manual'], inplace=True)
# pass axis by keyword: positional axis arguments are rejected by pandas 2.x
new.sort_index(axis=0, inplace=True)
new.sort_index(axis=1, inplace=True)

diff_tbl = old.compare(new)
diff_tbl.rename(columns={'self': 'v0.9', 'other': 'v0.10'}, inplace=True)
diff_tbl.apply(lambda x: [x.dropna()], axis=1).to_json('openrefine/v0.10/diff_v0.9_v0.10.json', indent=True)
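
The final step of the script records which cells changed between v0.9 and v0.10. The idea can be sketched without pandas as follows. This is an illustrative approximation under stated assumptions, not the script's method: the function `cell_level_diff` is hypothetical, the rows are made up, and unlike `DataFrame.compare`, which requires identically labeled frames, the sketch simply restricts itself to keys and columns present in both tables.

```python
def cell_level_diff(old_rows, new_rows, key="EpiRR"):
    """Map each key to {column: (old_value, new_value)} for cells that differ.

    Only keys and columns present in both tables are compared.
    """
    old_by_key = {row[key]: row for row in old_rows}
    new_by_key = {row[key]: row for row in new_rows}
    diff = {}
    for k in sorted(old_by_key.keys() & new_by_key.keys()):
        old_row, new_row = old_by_key[k], new_by_key[k]
        changed = {
            col: (old_row[col], new_row[col])
            for col in sorted(old_row.keys() & new_row.keys())
            if old_row[col] != new_row[col]
        }
        if changed:
            diff[k] = changed
    return diff


# Made-up rows for illustration only.
old_rows = [
    {"EpiRR": "IHECRE00000001.4", "project": "ENCODE", "biomaterial_type": "primary cell culture"},
    {"EpiRR": "IHECRE00000002.1", "project": "ENCODE", "biomaterial_type": "primary tissue"},
]
new_rows = [
    {"EpiRR": "IHECRE00000001.4", "project": "ENCODE", "biomaterial_type": "primary cell"},
    {"EpiRR": "IHECRE00000002.1", "project": "ENCODE", "biomaterial_type": "primary tissue"},
]
diff = cell_level_diff(old_rows, new_rows)
print(diff)
```

Rows without any change are omitted from the result, which mirrors the intent of dropping the all-NaN entries before writing the JSON diff.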