Commit
* initial comagc commit
* test hf example
* example to document
* document to example
* requirements.txt
* test pie dataset
* test converted document
* Wrote README.md
* adjusted example to document test Co-authored-by: ArneBinder <[email protected]>
* adjusted example to document test Co-authored-by: ArneBinder <[email protected]>
* renamed test Co-authored-by: ArneBinder <[email protected]>
* Adjusted according to review
* Adjusted example_to_document method not finished yet
* Adjusted ComagcDocument
* added all data properties to doc and removed metadata
* introduced new Span Types
* adjusted example_to_document method
* fixed related tests
* Adjusted comagc.py
* adjusted document_to_example to match changes
* adjusted related tests
* introduced converter method to Comagc class
* Incorporated changes in document structure into README.md
* Adjusted test to increase code coverage
* Included doc-id in converted doc
* Adjusted README.md
* Adjusted README.md
* Adjustments in README.md and comagc.py
* improved understanding of relation label UNIDENTIFIED
* Adjustments in README.md and comagc.py
* if no inference rule is applicable no relation will be added to the doc
* Update dataset_builders/pie/comagc/comagc.py Co-authored-by: ArneBinder <[email protected]>
* Update dataset_builders/pie/comagc/comagc.py Co-authored-by: ArneBinder <[email protected]>
* Corrected expression in comagc.py
* label is None should be label is not None instead
* Wrote new test in test_comagc.py
* should cover the case when a document has no relation, i.e. no inference rule applies
* Wrote new test for get_relation_label method
* tests inferring a relation label

---------

Co-authored-by: ArneBinder <[email protected]>
1 parent f201a41, commit 88f6b77
Showing 4 changed files with 682 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,56 @@ | ||
# PIE Dataset Card for "CoMAGC" | ||
|
||
This is a [PyTorch-IE](https://github.com/ChristophAlt/pytorch-ie) wrapper for the | ||
[CoMAGC Huggingface dataset loading script](https://huggingface.co/datasets/DFKI-SLT/CoMAGC). | ||
|
||
## Data Schema | ||
|
||
The document type for this dataset is `ComagcDocument`, which defines the following data fields: | ||
|
||
- `pmid` (str): unique sentence identifier | ||
- `sentence` (str) | ||
- `cancer_type` (str) | ||
- `cge` (str): change in gene expression | ||
- `ccs` (str): change in cell state | ||
- `pt` (str, optional): proposition type | ||
- `ige` (str, optional): initial gene expression level | ||
|
||
and the following annotation layers: | ||
|
||
- `gene` (annotation type: `NamedSpan`, target: `sentence`) | ||
- `cancer` (annotation type: `NamedSpan`, target: `sentence`) | ||
- `expression_change_keyword1` (annotation type: `SpanWithNameAndType`, target: `sentence`) | ||
- `expression_change_keyword2` (annotation type: `SpanWithNameAndType`, target: `sentence`) | ||
|
||
`NamedSpan` is a custom annotation type that extends the standard `Span` with the following data field: | ||
|
||
- `name` (str): entity string between span start and end | ||
|
||
`SpanWithNameAndType` is a custom annotation type that extends the standard `Span` with the following data fields: | ||
|
||
- `name` (str): entity string between span start and end | ||
- `type` (str): entity type classifying the expression | ||
|
||
See [here](https://github.com/ArneBinder/pie-modules/blob/main/src/pie_modules/annotations.py) and | ||
[here](https://github.com/ChristophAlt/pytorch-ie/blob/main/src/pytorch_ie/annotations.py) for the annotation | ||
type definitions. | ||
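
The relationship between the two custom annotation types and their target text can be illustrated with a minimal, stdlib-only sketch. The `Sketch*` classes below are hypothetical stand-ins written for this illustration, not the `pytorch_ie` classes, and the sentence and the `type` value are made up; the only point is that `name` usually equals the slice of `sentence` between `start` and an exclusive `end`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchSpan:
    """Hypothetical stand-in for a span with an exclusive end offset."""

    start: int
    end: int

    def text(self, target: str) -> str:
        # Slice the target string the annotation layer points to.
        return target[self.start : self.end]


@dataclass(frozen=True)
class SketchNamedSpan(SketchSpan):
    name: str  # entity string, usually identical to the covered text


@dataclass(frozen=True)
class SketchSpanWithNameAndType(SketchNamedSpan):
    type: str  # entity type classifying the expression


sentence = "XYZ1 is overexpressed in lung cancer."  # hypothetical sentence
start = sentence.index("overexpressed")
keyword = SketchSpanWithNameAndType(
    start=start,
    end=start + len("overexpressed"),
    name="overexpressed",
    type="placeholder-type",  # placeholder, not a real CoMAGC type value
)
print(keyword.name, keyword.type, keyword.text(sentence))
```

Keeping `name` as a separate field, rather than always deriving it from the offsets, leaves room for the cases where the stored name and the covered text differ.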
|
||
## Document Converters | ||
|
||
The dataset provides predefined document converters for the following target document types: | ||
|
||
- `pie_modules.documents.TextDocumentWithLabeledSpansAndBinaryRelations`: | ||
|
||
- **labeled_spans**: There are always two labeled spans in each sentence. | ||
The first one refers to the gene, while the second one refers to the cancer. | ||
Therefore, the `label` is either `"GENE"` or `"CANCER"`. | ||
- **binary_relations**: There is at most one binary relation in each sentence. | ||
If present, it is always established between the gene as `head` and the cancer as `tail`. | ||
The specific `label` is the related **gene-class**. It is obtained from inference rules (cf. [here](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-14-323/tables/3)) | ||
that are based on the values of the columns CGE, CCS, IGE and PT. If no gene-class can be inferred, | ||
no binary relation is added to the document. In total, no rule is applicable for 303 of the 821 examples | ||
(cf. [here](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-14-323/tables/7)). | ||
|
||
See [here](https://github.com/ArneBinder/pie-modules/blob/main/src/pie_modules/documents.py) and | ||
[here](https://github.com/ChristophAlt/pytorch-ie/blob/main/src/pytorch_ie/documents.py) for the document type | ||
definitions. |
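
The rule-based inference of the gene-class described above can be sketched as a first-match lookup with wildcards. This is a simplified illustration, not the builder's code: `infer_gene_class` and `RULES` are hypothetical names, and only three of the rules are reproduced for brevity.

```python
from typing import Optional

# Each rule is (CGE, CCS, IGE, PT, gene class); "*" matches any value.
# Only a subset of the inference rules is listed here.
RULES = [
    ("increased", "normalTOcancer", "*", "causality", "oncogene"),
    ("decreased", "normalTOcancer", "*", "causality", "tumor suppressor gene"),
    ("*", "normalTOcancer", "*", "observation", "biomarker"),
]


def infer_gene_class(
    cge: str, ccs: str, ige: Optional[str], pt: Optional[str]
) -> Optional[str]:
    """Return the gene class of the first matching rule, or None if none applies."""
    values = (cge, ccs, ige, pt)
    for *pattern, gene_class in RULES:
        if all(p == "*" or p == v for p, v in zip(pattern, values)):
            return gene_class
    return None


print(infer_gene_class("increased", "normalTOcancer", "unidentifiable", "causality"))
print(infer_gene_class("increased", "cancerTOnormal", "up-regulated", "causality"))
```

Returning `None` when no rule matches mirrors the behaviour described above: the caller can then skip adding a binary relation instead of assigning a placeholder label.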
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,321 @@ | ||
import logging | ||
from dataclasses import dataclass | ||
from typing import Any, Dict, Optional | ||
|
||
import datasets | ||
from pytorch_ie import AnnotationLayer, Document, annotation_field | ||
from pytorch_ie.annotations import BinaryRelation, LabeledSpan, Span | ||
from pytorch_ie.documents import TextDocumentWithLabeledSpansAndBinaryRelations | ||
|
||
from pie_datasets import ArrowBasedBuilder | ||
|
||
logger = logging.getLogger(__name__) | ||
|
||
|
||
@dataclass(frozen=True) | ||
class NamedSpan(Span): | ||
name: str | ||
|
||
def resolve(self) -> Any: | ||
return self.name, super().resolve() | ||
|
||
|
||
@dataclass(frozen=True) | ||
class SpanWithNameAndType(Span): | ||
name: str | ||
type: str | ||
|
||
def resolve(self) -> Any: | ||
return self.name, self.type, super().resolve() | ||
|
||
|
||
@dataclass | ||
class ComagcDocument(Document): | ||
pmid: str | ||
sentence: str | ||
cge: str | ||
ccs: str | ||
cancer_type: str | ||
gene: AnnotationLayer[NamedSpan] = annotation_field(target="sentence") | ||
cancer: AnnotationLayer[NamedSpan] = annotation_field(target="sentence") | ||
pt: Optional[str] = None | ||
ige: Optional[str] = None | ||
expression_change_keyword1: AnnotationLayer[SpanWithNameAndType] = annotation_field( | ||
target="sentence" | ||
) | ||
expression_change_keyword2: AnnotationLayer[SpanWithNameAndType] = annotation_field( | ||
target="sentence" | ||
) | ||
|
||
|
||
def example_to_document(example) -> ComagcDocument: | ||
doc = ComagcDocument( | ||
pmid=example["pmid"], | ||
sentence=example["sentence"], | ||
cancer_type=example["cancer_type"], | ||
cge=example["CGE"], | ||
ccs=example["CCS"], | ||
pt=example["PT"], | ||
ige=example["IGE"], | ||
) | ||
|
||
# Gene and cancer entities | ||
# name is (almost) always the text of the gene/cancer (between the start and end position) | ||
gene = NamedSpan( | ||
start=example["gene"]["pos"][0], | ||
end=example["gene"]["pos"][1] + 1, | ||
name=example["gene"]["name"], | ||
) | ||
doc.gene.extend([gene]) | ||
|
||
cancer = NamedSpan( | ||
start=example["cancer"]["pos"][0], | ||
end=example["cancer"]["pos"][1] + 1, | ||
name=example["cancer"]["name"], | ||
) | ||
doc.cancer.extend([cancer]) | ||
|
||
# Expression change keywords | ||
# expression_change_keyword_1 might have no values | ||
if example["expression_change_keyword_1"]["pos"] is not None: | ||
expression_change_keyword1 = SpanWithNameAndType( | ||
start=example["expression_change_keyword_1"]["pos"][0], | ||
end=example["expression_change_keyword_1"]["pos"][1] + 1, | ||
name=example["expression_change_keyword_1"]["name"], | ||
type=example["expression_change_keyword_1"]["type"], | ||
) | ||
doc.expression_change_keyword1.extend([expression_change_keyword1]) | ||
|
||
expression_change_keyword2 = SpanWithNameAndType( | ||
start=example["expression_change_keyword_2"]["pos"][0], | ||
end=example["expression_change_keyword_2"]["pos"][1] + 1, | ||
name=example["expression_change_keyword_2"]["name"], | ||
type=example["expression_change_keyword_2"]["type"], | ||
) | ||
doc.expression_change_keyword2.extend([expression_change_keyword2]) | ||
|
||
return doc | ||
|
||
|
||
def document_to_example(doc: ComagcDocument) -> Dict[str, Any]: | ||
gene = { | ||
"name": doc.gene[0].name, | ||
"pos": [doc.gene[0].start, doc.gene[0].end - 1], | ||
} | ||
cancer = { | ||
"name": doc.cancer[0].name, | ||
"pos": [doc.cancer[0].start, doc.cancer[0].end - 1], | ||
} | ||
|
||
if not doc.expression_change_keyword1.resolve(): | ||
expression_change_keyword_1 = { | ||
"name": "\nNone\n", | ||
"pos": None, | ||
"type": None, | ||
} | ||
else: | ||
expression_change_keyword_1 = { | ||
"name": doc.expression_change_keyword1[0].name, | ||
"pos": [ | ||
doc.expression_change_keyword1[0].start, | ||
doc.expression_change_keyword1[0].end - 1, | ||
], | ||
"type": doc.expression_change_keyword1[0].type, | ||
} | ||
|
||
expression_change_keyword_2 = { | ||
"name": doc.expression_change_keyword2[0].name, | ||
"pos": [ | ||
doc.expression_change_keyword2[0].start, | ||
doc.expression_change_keyword2[0].end - 1, | ||
], | ||
"type": doc.expression_change_keyword2[0].type, | ||
} | ||
|
||
return { | ||
"pmid": doc.pmid, | ||
"sentence": doc.sentence, | ||
"cancer_type": doc.cancer_type, | ||
"gene": gene, | ||
"cancer": cancer, | ||
"CGE": doc.cge, | ||
"CCS": doc.ccs, | ||
"PT": doc.pt, | ||
"IGE": doc.ige, | ||
"expression_change_keyword_1": expression_change_keyword_1, | ||
"expression_change_keyword_2": expression_change_keyword_2, | ||
} | ||
|
||
|
||
def convert_to_text_document_with_labeled_spans_and_binary_relations( | ||
document: ComagcDocument, | ||
) -> TextDocumentWithLabeledSpansAndBinaryRelations: | ||
metadata = { | ||
"cancer_type": document.cancer_type, | ||
"CGE": document.cge, | ||
"CCS": document.ccs, | ||
"PT": document.pt, | ||
"IGE": document.ige, | ||
"expression_change_keyword_1": document_to_example(document)[ | ||
"expression_change_keyword_1" | ||
], | ||
"expression_change_keyword_2": document_to_example(document)[ | ||
"expression_change_keyword_2" | ||
], | ||
} | ||
|
||
text_document = TextDocumentWithLabeledSpansAndBinaryRelations( | ||
id=document.pmid, text=document.sentence, metadata=metadata | ||
) | ||
|
||
gene = LabeledSpan( | ||
start=document.gene[0].start, | ||
end=document.gene[0].end, | ||
label="GENE", | ||
) | ||
text_document.labeled_spans.append(gene) | ||
|
||
cancer = LabeledSpan( | ||
start=document.cancer[0].start, | ||
end=document.cancer[0].end, | ||
label="CANCER", | ||
) | ||
text_document.labeled_spans.append(cancer) | ||
|
||
label = get_relation_label( | ||
cge=document.cge, ccs=document.ccs, ige=document.ige, pt=document.pt | ||
) | ||
|
||
if label is not None: | ||
relation = BinaryRelation( | ||
head=gene, | ||
tail=cancer, | ||
label=label, | ||
) | ||
text_document.binary_relations.append(relation) | ||
|
||
return text_document | ||
|
||
|
||
class Comagc(ArrowBasedBuilder): | ||
DOCUMENT_TYPE = ComagcDocument | ||
BASE_DATASET_PATH = "DFKI-SLT/CoMAGC" | ||
BASE_DATASET_REVISION = "8e2950b8a3967c2f45de86f60dd5c8ccb9ad3815" | ||
|
||
BUILDER_CONFIGS = [ | ||
datasets.BuilderConfig( | ||
version=datasets.Version("1.0.0"), | ||
description="CoMAGC dataset", | ||
) | ||
] | ||
|
||
DOCUMENT_CONVERTERS = { | ||
TextDocumentWithLabeledSpansAndBinaryRelations: convert_to_text_document_with_labeled_spans_and_binary_relations | ||
} | ||
|
||
def _generate_document(self, example, **kwargs): | ||
return example_to_document(example) | ||
|
||
def _generate_example(self, document: ComagcDocument, **kwargs) -> Dict[str, Any]: | ||
return document_to_example(document) | ||
|
||
|
||
def get_relation_label(cge: str, ccs: str, pt: Optional[str], ige: Optional[str]) -> Optional[str]: | ||
"""Simple rule-based function to determine the relation between the gene and the cancer. | ||
As this dataset contains a multi-faceted annotation scheme | ||
for gene-cancer relations, it does not only label the relation | ||
between gene and cancer, but provides further information. | ||
However, the relation of interest stays the gene-class, | ||
which can be derived from inference rules | ||
(https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-14-323/tables/3), based on the | ||
information given in columns CGE, CCS, IGE, PT. | ||
""" | ||
|
||
rules = [ | ||
{ | ||
"CGE": "increased", | ||
"CCS": "normalTOcancer", | ||
"IGE": "*", | ||
"PT": "causality", | ||
"Gene class": "oncogene", | ||
}, | ||
{ | ||
"CGE": "decreased", | ||
"CCS": "cancerTOnormal", | ||
"IGE": "unidentifiable", | ||
"PT": "causality", | ||
"Gene class": "oncogene", | ||
}, | ||
{ | ||
"CGE": "decreased", | ||
"CCS": "cancerTOnormal", | ||
"IGE": "up-regulated", | ||
"PT": "*", | ||
"Gene class": "oncogene", | ||
}, | ||
{ | ||
"CGE": "decreased", | ||
"CCS": "normalTOcancer", | ||
"IGE": "*", | ||
"PT": "causality", | ||
"Gene class": "tumor suppressor gene", | ||
}, | ||
{ | ||
"CGE": "increased", | ||
"CCS": "cancerTOnormal", | ||
"IGE": "unidentifiable", | ||
"PT": "causality", | ||
"Gene class": "tumor suppressor gene", | ||
}, | ||
{ | ||
"CGE": "increased", | ||
"CCS": "cancerTOnormal", | ||
"IGE": "down-regulated", | ||
"PT": "*", | ||
"Gene class": "tumor suppressor gene", | ||
}, | ||
{ | ||
"CGE": "*", | ||
"CCS": "normalTOcancer", | ||
"IGE": "*", | ||
"PT": "observation", | ||
"Gene class": "biomarker", | ||
}, | ||
{ | ||
"CGE": "*", | ||
"CCS": "cancerTOnormal", | ||
"IGE": "unidentifiable", | ||
"PT": "observation", | ||
"Gene class": "biomarker", | ||
}, | ||
{ | ||
"CGE": "decreased", | ||
"CCS": "cancerTOcancer", | ||
"IGE": "up-regulated", | ||
"PT": "observation", | ||
"Gene class": "biomarker", | ||
}, | ||
{ | ||
"CGE": "increased", | ||
"CCS": "cancerTOcancer", | ||
"IGE": "down-regulated", | ||
"PT": "observation", | ||
"Gene class": "biomarker", | ||
}, | ||
] | ||
|
||
for rule in rules: | ||
if ( | ||
(rule["CGE"] == "*" or cge == rule["CGE"]) | ||
and (rule["CCS"] == "*" or ccs == rule["CCS"]) | ||
and (rule["IGE"] == "*" or ige == rule["IGE"]) | ||
and (rule["PT"] == "*" or pt == rule["PT"]) | ||
): | ||
return rule["Gene class"] | ||
|
||
# Commented out to avoid spamming the logs | ||
# logger.warning("No rule matched. cge: " + cge + " - ccs: " + ccs + " - ige: " + ige + " - pt: " + pt) | ||
# NOTE: In case no inference rule is applicable, no relation is returned and | ||
# eventually no relation is added to the document. | ||
return None |
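
The `+ 1` and `- 1` adjustments in `example_to_document` and `document_to_example` above convert between the inclusive `pos` pairs of the source examples and the exclusive-end offsets used by spans. A minimal sketch of that round trip follows; the helper names `pos_to_span` and `span_to_pos` and the sentence are hypothetical and do not appear in the commit.

```python
from typing import List, Tuple


def pos_to_span(pos: List[int]) -> Tuple[int, int]:
    """Convert an inclusive [first, last] position into (start, exclusive end)."""
    first, last = pos
    return first, last + 1


def span_to_pos(start: int, end: int) -> List[int]:
    """Convert (start, exclusive end) back into an inclusive [first, last] position."""
    return [start, end - 1]


sentence = "XYZ1 is overexpressed in lung cancer."  # hypothetical sentence
first = sentence.index("lung cancer")
pos = [first, first + len("lung cancer") - 1]  # inclusive, as in the source data

start, end = pos_to_span(pos)
print(sentence[start:end])
print(span_to_pos(start, end) == pos)
```

With an exclusive end, `sentence[start:end]` yields the entity text directly, which is why the conversion is applied once when loading and reversed once when exporting.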
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
pie-datasets>=0.6.0,<0.11.0 |