version 0.10 pre-release

IHEC · Sep 20, 2022 · d89aa91
1 parent e0d2e84
commit d89aa91
Showing 9 changed files with 13,032 additions and 0 deletions.
diff --git a/openrefine/v0.10/IHEC_metadata_harmonization.v0.10.csv b/openrefine/v0.10/IHEC_metadata_harmonization.v0.10.csv
diff --git a/openrefine/v0.10/IHEC_metadata_harmonization.v0.10.extended.csv b/openrefine/v0.10/IHEC_metadata_harmonization.v0.10.extended.csv
diff --git a/openrefine/v0.10/README.md b/openrefine/v0.10/README.md
@@ -0,0 +1,32 @@
+# Version 0.10
+
+This table describes the columns included in the metadata table at [IHEC_metadata_harmonization.v0.10.csv](IHEC_metadata_harmonization.v0.10.csv).
+
+For explanations concerning the [extended version](IHEC_metadata_harmonization.v0.10.extended.csv), please see [version 0.9](https://github.com/IHEC/epimap-metadata-harmonization/releases/tag/v0.9).
+
+Please always keep in mind that we try to stay as close to the [IHEC Metadata Standard](https://github.com/IHEC/ihec-ecosystems/blob/master/docs/metadata/2.0/Ihec_metadata_specification.md) as possible.
+
+| Column                                 	        | Examples                                                           | Explanation 	                                                                                                                                                                                                                             |
+|-------------------------------------------------|--------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| EpiRR                                  	        | `IHECRE00000001.4`                                                 | EpiRR identifier. The number behind the dot (.) is the version.                                                                                                                                                                           |
+| EpiRR_status                           	        | `Complete` `Partial`                                               | Whether this epigenome is `Complete` or `Partial`.                                                                                                                                                                                        |
+| project                                	        | `CEEHRC` `BLUEPRINT`                                               | The project from which the epigenome originated.                                                                                                                                                                                          |
+| biomaterial_type                       	        | `cell line` `primary cell` `primary cell culture` `primary tissue` | One of `primary cell`,`primary cell culture`, `cell line`, `primary tissue`.                                                                                                                                                              |
+| cell_type                              	        | `myeloid cell` `effector memory CD8-positive, alpha-beta T cell`   | The cell type and main sample ontology classification for entries where `biomaterial_type` is `primary cell` or `primary cell culture`.                                                                                                   |
+| line                            	               | `MCF 10A`                                                          | The cell line and main sample ontology classification for entries where `biomaterial_type` is `cell line`.                                                                                                                                |
+| tissue_type                                   	 | `skeletal muscle tissue` `amygdala`                                | The tissue type and main sample ontology classification for entries where `biomaterial_type` is `primary tissue`.	                                                                                                                        |
+| sample_ontology_curie                  	        | `CL:0000990` `UBERON:0001876` `EFO:0001200`                        | The CURIE identifying the sample ontology term. <br/>Different ontologies are used, depending on the `biomaterial_type`:<br/> 'CL' for `primary cell` or `primary cell culture`, 'EFO' for `cell line` and 'UBERON' for `primary tissue`. |
+| sample_ontology_term_high_order_manual 	        | `other` `T cell`                                                   | A manually refined higher level annotation describing the samples using ancestors in the ontology.	                                                                                                                                       |
+| markers                                	        | `CD3+ CD4+ CD45RA+` `CD3- CD19- CD56-`                             | Markers used to isolate and identify the cell type, when applicable.	                                                                                                                                                                     |
+| disease                                	        | `Breast Carcinoma` `Acute Promyelocytic Leukemia with PML-RARA`    | This attribute reflects **the disease for this particular sample**, not the donor health condition.	                                                                                                                                      |
+| disease_ontology_curie                 	        | `NCIM:C0678222` `NCIM:C0023487`                                    | The CURIE identifying the NCIM disease ontology term.	                                                                                                                                                                                    |
+| disease_high_order_manual              	        | `Healthy/None` `Cancer` `Disease`                                  | A manually refined higher level annotation describing the diseases using only three categories: _Healthy/None_, _Cancer_, _Disease_.	    	                                                                                                |
+| disease_intermediate_order_manual      	        | `Carcinoma` `Leukemia`                                             | A manually refined higher level annotation describing the disease for this particular sample using ancestors in the NCIT ontology. <br/>NCIM CURIEs were mapped to NCIT CURIEs; see version 0.9 for an explanation.	    	                 |
+| donor_id                               	        | `CEMT0007` `C07015`                                                | Identifier for donors within their projects. 	                                                                                                                                                                                            |
+| donor_age                              	        | `60-65` `unknown` `46`                                             | Age of donor. Can be an interval.	                                                                                                                                                                                                        |
+| donor_age_unit                         	        | `year` `day`                                                       | Age unit of donor.                                                                                                                                                                                                                        |
+| donor_life_stage                       	        | `embryonic` `adult`                                                | Life stage of donor.	                                                                                                                                                                                                                     |
+| sex                                    	        | `female` `male`                                                    | Sex of donor.	                                                                                                                                                                                                                            |
+| donor_health_status                    	        | `Breast Carcinoma` `Acute Promyelocytic Leukemia with PML-RARA`    | Links to the health status of the donor that provided the sample. **Does not describe the disease for this particular sample.**	                                                                                                          |
+| donor_health_status_ontology_curie     	        | `NCIM:C0023487` `NCIM:C0678222`                                    | The CURIE identifying the NCIM donor health status ontology term.	                                                                                                                                                                        |
+| health_state                           	        | `dead` `alive`                                                     | Health state of donor: `dead` or `alive`.	                                                                                                                                                                                                |
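
The prefix rule in the `sample_ontology_curie` row and the version suffix in the `EpiRR` row lend themselves to a mechanical check. The following is a minimal sketch, assuming the values are available as plain strings. The names `EXPECTED_PREFIX`, `curie_matches_biomaterial`, and `split_epirr` are hypothetical and not part of the repository.

```python
# Hypothetical helpers for illustration; they only encode the rules stated in the README table.

# Ontology prefix the README associates with each biomaterial_type.
EXPECTED_PREFIX = {
    "primary cell": "CL",
    "primary cell culture": "CL",
    "cell line": "EFO",
    "primary tissue": "UBERON",
}


def curie_matches_biomaterial(biomaterial_type, curie):
    """Return True if the CURIE prefix fits the given biomaterial_type."""
    prefix, sep, local_id = curie.partition(":")
    if not sep or not local_id:
        return False  # not of the form PREFIX:ID
    return EXPECTED_PREFIX.get(biomaterial_type) == prefix


def split_epirr(epirr):
    """Split an EpiRR identifier into (accession, version) at the last dot."""
    accession, _, version = epirr.rpartition(".")
    return accession, int(version)


if __name__ == "__main__":
    print(curie_matches_biomaterial("primary tissue", "UBERON:0001876"))
    print(split_epirr("IHECRE00000001.4"))
```

A check like this could be run over every row of the CSV to flag entries whose ontology does not fit their `biomaterial_type`. Whether the published table satisfies the rule without exception is not stated in the source.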
diff --git a/openrefine/v0.10/biomaterial_type_ENCODE_Roadmap.json b/openrefine/v0.10/biomaterial_type_ENCODE_Roadmap.json
@@ -0,0 +1,48 @@
+[
+  {
+    "op": "core/mass-edit",
+    "engineConfig": {
+      "facets": [
+        {
+          "type": "list",
+          "name": "project",
+          "expression": "value",
+          "columnName": "project",
+          "invert": false,
+          "omitBlank": false,
+          "omitError": false,
+          "selection": [
+            {
+              "v": {
+                "v": "NIH Roadmap Epigenomics",
+                "l": "NIH Roadmap Epigenomics"
+              }
+            },
+            {
+              "v": {
+                "v": "ENCODE",
+                "l": "ENCODE"
+              }
+            }
+          ],
+          "selectBlank": false,
+          "selectError": false
+        }
+      ],
+      "mode": "row-based"
+    },
+    "columnName": "biomaterial_type",
+    "expression": "value",
+    "edits": [
+      {
+        "from": [
+          "primary cell culture"
+        ],
+        "fromBlank": false,
+        "fromError": false,
+        "to": "primary cell"
+      }
+    ],
+    "description": "Mass edit cells in column biomaterial_type"
+  }
+]
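
Read as a rule, the operation above selects rows whose `project` is `NIH Roadmap Epigenomics` or `ENCODE` and rewrites `primary cell culture` to `primary cell` in `biomaterial_type`. The plain-Python sketch below mirrors that rule on a list of dicts. It is an illustration only: the function name `apply_mass_edit` and the sample rows are made up, and OpenRefine itself is not involved.

```python
# Illustrative re-statement of the mass-edit rule; not how OpenRefine applies it internally.

SELECTED_PROJECTS = {"NIH Roadmap Epigenomics", "ENCODE"}  # facet selection in the JSON
EDIT_FROM = "primary cell culture"
EDIT_TO = "primary cell"


def apply_mass_edit(rows):
    """Return copies of the rows with the biomaterial_type edit applied to selected projects."""
    edited = []
    for row in rows:
        row = dict(row)  # copy so the input rows stay unchanged
        if row["project"] in SELECTED_PROJECTS and row["biomaterial_type"] == EDIT_FROM:
            row["biomaterial_type"] = EDIT_TO
        edited.append(row)
    return edited


if __name__ == "__main__":
    # Made-up example rows, not taken from the real table.
    sample = [
        {"project": "ENCODE", "biomaterial_type": "primary cell culture"},
        {"project": "BLUEPRINT", "biomaterial_type": "primary cell culture"},
    ]
    for row in apply_mass_edit(sample):
        print(row)
```

Rows from other projects keep `primary cell culture`, because the facet restricts the edit to the two selected projects.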
diff --git a/openrefine/v0.10/create_v0.10.py b/openrefine/v0.10/create_v0.10.py
@@ -0,0 +1,74 @@
+import os.path
+from subprocess import run
+
+import numpy as np
+import pandas as pd
+
+# run this script from openrefine/v0.10; the chdir below switches to the git project root so the relative paths resolve
+os.chdir('../../')
+
+# create openrefine project and apply rules - OPENREFINE SERVER HAS TO BE RUNNING
+# creating openrefine projects via the openrefine-client needs a csv as input in order to work properly
+openrefine_client = './openrefine/openrefine-client_0-3-10_linux'  # path to the openrefine-client executable
+initial_csv = './openrefine/v0.9/IHEC_metadata_harmonization.v0.9.extended.csv'  # csv to build project from
+
+v10_intermediate_tbl = pd.read_csv(initial_csv)
+
+disease_higher_tbl = pd.read_csv(
+    './openrefine/v0.10/internal--IHEC_metadata_harmonization.v0.9.extended - Comp_health_disease.csv',
+    usecols=['Sample_disease_high_level', 'Sample_disease_intermediate_level', 'disease', 'donor_health_status'])
+disease_higher_tbl.replace('-blank-', np.nan, inplace=True)
+
+disease_higher_tbl.rename(columns={'Sample_disease_high_level': 'disease_high_order_manual',
+                                   'Sample_disease_intermediate_level': 'disease_intermediate_order_manual'},
+                          inplace=True)
+
+v10_merged = pd.merge(v10_intermediate_tbl, disease_higher_tbl.drop_duplicates(), 'outer',
+                      on=['disease', 'donor_health_status'], validate='many_to_one')
+
+assert (len(v10_intermediate_tbl) == len(v10_merged))
+
+v10_merged.rename(columns={'age': 'donor_age'}, inplace=True)
+
+v10_intermediate_csv = './openrefine/v0.10/IHEC_metadata_harmonization.v0.10.intermediate.csv'
+v10_merged.to_csv(v10_intermediate_csv, index=False)
+
+# create project with intermediate version
+run([openrefine_client, '--create', v10_intermediate_csv], check=True)
+intermediate_project_name = os.path.splitext(os.path.basename(v10_intermediate_csv))[0]
+
+# mapping issues and conflicts were resolved manually in OpenRefine; the resulting operation JSON files are applied below
+
+run([openrefine_client, '--apply', 'openrefine/v0.10/biomaterial_type_ENCODE_Roadmap.json', intermediate_project_name],
+    check=True)
+run([openrefine_client, '--apply', 'openrefine/v0.10/minor_disease_issues.json', intermediate_project_name],
+    check=True)
+run([openrefine_client, '--apply', 'openrefine/v0.10/sample_ontology_high-level_manual.json',
+     intermediate_project_name],
+    check=True)
+
+v10_extended_csv = './openrefine/v0.10/IHEC_metadata_harmonization.v0.10.extended.csv'
+run([openrefine_client, '--export', f'--output={v10_extended_csv}', intermediate_project_name], check=True)
+
+v10_extended = pd.read_csv(v10_extended_csv)
+final_csv = './openrefine/v0.10/IHEC_metadata_harmonization.v0.10.csv'
+v10_extended[['EpiRR', 'EpiRR_status', 'project', 'biomaterial_type', 'cell_type', 'line', 'tissue_type',
+              'sample_ontology_curie', 'sample_ontology_term_high_order_manual', 'markers', 'disease',
+              'disease_ontology_curie', 'disease_high_order_manual', 'disease_intermediate_order_manual', 'donor_id',
+              'donor_age', 'donor_age_unit', 'donor_life_stage', 'sex', 'donor_health_status',
+              'donor_health_status_ontology_curie', 'health_state']].to_csv(final_csv, index=False)
+
+old = pd.read_csv(initial_csv)
+old.rename(columns={'age': 'donor_age'}, inplace=True)
+old.index = old.EpiRR
+old.sort_index(axis=0, inplace=True)  # axis as keyword: positional axis is rejected by pandas >= 2.0
+old.sort_index(axis=1, inplace=True)
+new = pd.read_csv(v10_extended_csv)
+new.index = new.EpiRR
+new.drop(columns=['disease_high_order_manual', 'disease_intermediate_order_manual'], inplace=True)
+new.sort_index(axis=0, inplace=True)  # axis as keyword: positional axis is rejected by pandas >= 2.0
+new.sort_index(axis=1, inplace=True)
+
+diff_tbl = old.compare(new)
+diff_tbl.rename(columns={'self': 'v0.9', 'other': 'v0.10'}, inplace=True)
+diff_tbl.apply(lambda x: [x.dropna()], axis=1).to_json('openrefine/v0.10/diff_v0.9_v0.10.json', indent=True)
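
The final step of the script compares v0.9 and v0.10 cell by cell, keyed on `EpiRR`, and keeps only the cells that changed. A plain-Python analogue of that comparison is sketched below. It is not the pandas implementation: `diff_tables` is a hypothetical name, the sample rows are made up, and both tables are assumed to share the same keys and columns, which is also what `DataFrame.compare` requires.

```python
# Plain-Python analogue of a cell-wise table comparison; illustration only.

def diff_tables(old, new, key="EpiRR"):
    """Return {key_value: {column: {"v0.9": old, "v0.10": new}}} for cells that differ."""
    old_by_key = {row[key]: row for row in old}
    new_by_key = {row[key]: row for row in new}
    if old_by_key.keys() != new_by_key.keys():
        raise ValueError("both tables must contain the same keys")
    changes = {}
    for k in sorted(old_by_key):
        old_row, new_row = old_by_key[k], new_by_key[k]
        # Assumes both rows have the same columns.
        changed = {
            col: {"v0.9": old_row[col], "v0.10": new_row[col]}
            for col in sorted(old_row)
            if old_row[col] != new_row[col]
        }
        if changed:
            changes[k] = changed
    return changes


if __name__ == "__main__":
    # Made-up rows for demonstration.
    old = [{"EpiRR": "X.1", "biomaterial_type": "primary cell culture", "sex": "female"}]
    new = [{"EpiRR": "X.1", "biomaterial_type": "primary cell", "sex": "female"}]
    print(diff_tables(old, new))
```

Unchanged rows and unchanged columns are omitted from the result, which keeps the diff small enough to review by eye.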