Add CoMAGC dataset (#141)
* initial comagc commit

* test hf example

* example to document

* document to example

* requirements.txt

* test pie dataset

* test converted document

* Wrote README.md

* adjusted example to document test

Co-authored-by: ArneBinder <[email protected]>

* adjusted example to document test

Co-authored-by: ArneBinder <[email protected]>

* renamed test

Co-authored-by: ArneBinder <[email protected]>

* Adjusted according to review

* Adjusted example_to_document method

not finished yet

* Adjusted ComagcDocument

* added all data properties to doc and removed metadata

* introduced new Span Types

* adjusted example_to_document method

* fixed related tests

* Adjusted comagc.py

* adjusted document_to_example to match changes

* adjusted related tests

* introduced converter method to Comagc class

* Incorporated changes in document structure into README.md

* Adjusted test to increase code coverage

* Included doc-id in converted doc

* Adjusted README.md

* Adjusted README.md

* Adjustments in README.md and comagc.py

* improved understanding of relation label UNIDENTIFIED

* Adjustments in README.md and comagc.py

* if no inference rule is applicable no relation will be added to the doc

* Update dataset_builders/pie/comagc/comagc.py

Co-authored-by: ArneBinder <[email protected]>

* Update dataset_builders/pie/comagc/comagc.py

Co-authored-by: ArneBinder <[email protected]>

* Corrected expression in comagc.py

* label is None should be  label is not None instead

* Wrote new test in test_comagc.py

* should cover the case when a document has no relation, i.e. no inference rule applies

* Wrote new test for get_relation_label method

* tests inferring a relation label

---------

Co-authored-by: ArneBinder <[email protected]>
kai-car and ArneBinder authored Aug 7, 2024
1 parent f201a41 commit 88f6b77
Showing 4 changed files with 682 additions and 0 deletions.
56 changes: 56 additions & 0 deletions dataset_builders/pie/comagc/README.md
@@ -0,0 +1,56 @@
# PIE Dataset Card for "CoMAGC"

This is a [PyTorch-IE](https://github.com/ChristophAlt/pytorch-ie) wrapper for the
[CoMAGC Huggingface dataset loading script](https://huggingface.co/datasets/DFKI-SLT/CoMAGC).

## Data Schema

The document type for this dataset is `ComagcDocument` which defines the following data fields:

- `pmid` (str): unique sentence identifier
- `sentence` (str)
- `cancer_type` (str)
- `cge` (str): change in gene expression
- `ccs` (str): change in cell state
- `pt` (str, optional): proposition type
- `ige` (str, optional): initial gene expression level

and the following annotation layers:

- `gene` (annotation type: `NamedSpan`, target: `sentence`)
- `cancer` (annotation type: `NamedSpan`, target: `sentence`)
- `expression_change_keyword1` (annotation type: `SpanWithNameAndType`, target: `sentence`)
- `expression_change_keyword2` (annotation type: `SpanWithNameAndType`, target: `sentence`)

`NamedSpan` is a custom annotation type that extends typical `Span` with the following data fields:

- `name` (str): entity string between span start and end

`SpanWithNameAndType` is a custom annotation type that extends typical `Span` with the following data fields:

- `name` (str): entity string between span start and end
- `type` (str): entity type classifying the expression

See [here](https://github.com/ArneBinder/pie-modules/blob/main/src/pie_modules/annotations.py) and
[here](https://github.com/ChristophAlt/pytorch-ie/blob/main/src/pytorch_ie/annotations.py) for the annotation
type definitions.
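
As a rough illustration of the span fields described above, the sketch below uses a plain dataclass as a stand-in for `NamedSpan` (the real type extends PyTorch-IE's `Span`). The names `SimpleNamedSpan` and `to_named_span` and the example sentence are invented for illustration. It assumes, as the builder's `example_to_document` does, that the source `pos` values are inclusive on both ends, so the end offset is shifted by one to get an exclusive end.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class SimpleNamedSpan:
    """Hypothetical stand-in for NamedSpan: a character span plus the entity name."""

    start: int  # inclusive start offset into the sentence
    end: int  # exclusive end offset into the sentence
    name: str


def to_named_span(entity: Dict[str, Any]) -> SimpleNamedSpan:
    # The source dataset stores [first_char, last_char]; add 1 for an exclusive end.
    first, last = entity["pos"]
    return SimpleNamedSpan(start=first, end=last + 1, name=entity["name"])


# Made-up example data, not taken from CoMAGC.
sentence = "BRCA1 is mutated in breast cancer."
gene = to_named_span({"name": "BRCA1", "pos": [0, 4]})
cancer = to_named_span({"name": "breast cancer", "pos": [20, 32]})

print(sentence[gene.start : gene.end])
print(sentence[cancer.start : cancer.end])
```

With exclusive end offsets, slicing the sentence with `start` and `end` recovers the entity text directly.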

## Document Converters

The dataset provides predefined document converters for the following target document types:

- `pie_modules.documents.TextDocumentWithLabeledSpansAndBinaryRelations`:

  - **labeled_spans**: There are always two labeled spans in each sentence.
    The first one refers to the gene, while the second one refers to the cancer.
    Therefore, the `label` is either `"GENE"` or `"CANCER"`.
  - **binary_relations**: There is at most one binary relation in each sentence.
    This relation is always established between the gene as `head` and the cancer as `tail`.
    The specific `label` is the related **gene-class**. It is obtained from inference rules (cf. [here](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-14-323/tables/3))
    that are based on the values of the columns CGE, CCS, IGE and PT. In case no gene-class can be inferred,
    no binary relation is added to the document. In total, no rule is applicable to 303 of the 821 examples
    (cf. [here](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-14-323/tables/7)).

See [here](https://github.com/ArneBinder/pie-modules/blob/main/src/pie_modules/documents.py) and
[here](https://github.com/ChristophAlt/pytorch-ie/blob/main/src/pytorch_ie/documents.py) for the document type
definitions.
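
The rule lookup behind the relation label can be sketched roughly as follows. This is a simplified illustration, not the builder's code: `infer_gene_class` is a hypothetical name, and `RULES` contains only two of the inference rules, so results for other inputs may differ from the full rule table used by `get_relation_label`. A `"*"` entry matches any value, the first matching rule wins, and `None` means no relation would be added.

```python
from typing import Dict, List, Optional

# Illustrative subset of the inference rules; "*" is a wildcard.
RULES: List[Dict[str, str]] = [
    {
        "CGE": "increased",
        "CCS": "normalTOcancer",
        "IGE": "*",
        "PT": "causality",
        "Gene class": "oncogene",
    },
    {
        "CGE": "*",
        "CCS": "normalTOcancer",
        "IGE": "*",
        "PT": "observation",
        "Gene class": "biomarker",
    },
]

COLUMNS = ("CGE", "CCS", "IGE", "PT")


def infer_gene_class(values: Dict[str, Optional[str]]) -> Optional[str]:
    """Return the gene class of the first rule matching all four columns, else None."""
    for rule in RULES:
        if all(rule[col] == "*" or rule[col] == values.get(col) for col in COLUMNS):
            return rule["Gene class"]
    return None


example = {
    "CGE": "increased",
    "CCS": "normalTOcancer",
    "IGE": "unidentifiable",
    "PT": "causality",
}
print(infer_gene_class(example))
```

Keeping the rules as data rather than nested conditionals makes the table easy to compare against the published one.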
321 changes: 321 additions & 0 deletions dataset_builders/pie/comagc/comagc.py
@@ -0,0 +1,321 @@
import logging
from dataclasses import dataclass
from typing import Any, Dict, Optional

import datasets
from pytorch_ie import AnnotationLayer, Document, annotation_field
from pytorch_ie.annotations import BinaryRelation, LabeledSpan, Span
from pytorch_ie.documents import TextDocumentWithLabeledSpansAndBinaryRelations

from pie_datasets import ArrowBasedBuilder

logger = logging.getLogger(__name__)


@dataclass(frozen=True)
class NamedSpan(Span):
    name: str

    def resolve(self) -> Any:
        return self.name, super().resolve()


@dataclass(frozen=True)
class SpanWithNameAndType(Span):
    name: str
    type: str

    def resolve(self) -> Any:
        return self.name, self.type, super().resolve()


@dataclass
class ComagcDocument(Document):
    pmid: str
    sentence: str
    cge: str
    ccs: str
    cancer_type: str
    gene: AnnotationLayer[NamedSpan] = annotation_field(target="sentence")
    cancer: AnnotationLayer[NamedSpan] = annotation_field(target="sentence")
    pt: Optional[str] = None
    ige: Optional[str] = None
    expression_change_keyword1: AnnotationLayer[SpanWithNameAndType] = annotation_field(
        target="sentence"
    )
    expression_change_keyword2: AnnotationLayer[SpanWithNameAndType] = annotation_field(
        target="sentence"
    )


def example_to_document(example) -> ComagcDocument:
    doc = ComagcDocument(
        pmid=example["pmid"],
        sentence=example["sentence"],
        cancer_type=example["cancer_type"],
        cge=example["CGE"],
        ccs=example["CCS"],
        pt=example["PT"],
        ige=example["IGE"],
    )

    # Gene and cancer entities
    # name is (almost) always the text of the gene/cancer (between the start and end position)
    gene = NamedSpan(
        start=example["gene"]["pos"][0],
        end=example["gene"]["pos"][1] + 1,
        name=example["gene"]["name"],
    )
    doc.gene.extend([gene])

    cancer = NamedSpan(
        start=example["cancer"]["pos"][0],
        end=example["cancer"]["pos"][1] + 1,
        name=example["cancer"]["name"],
    )
    doc.cancer.extend([cancer])

    # Expression change keywords
    # expression_change_keyword_1 might have no values
    if example["expression_change_keyword_1"]["pos"] is not None:
        expression_change_keyword1 = SpanWithNameAndType(
            start=example["expression_change_keyword_1"]["pos"][0],
            end=example["expression_change_keyword_1"]["pos"][1] + 1,
            name=example["expression_change_keyword_1"]["name"],
            type=example["expression_change_keyword_1"]["type"],
        )
        doc.expression_change_keyword1.extend([expression_change_keyword1])

    expression_change_keyword2 = SpanWithNameAndType(
        start=example["expression_change_keyword_2"]["pos"][0],
        end=example["expression_change_keyword_2"]["pos"][1] + 1,
        name=example["expression_change_keyword_2"]["name"],
        type=example["expression_change_keyword_2"]["type"],
    )
    doc.expression_change_keyword2.extend([expression_change_keyword2])

    return doc


def document_to_example(doc: ComagcDocument) -> Dict[str, Any]:
    gene = {
        "name": doc.gene[0].name,
        "pos": [doc.gene[0].start, doc.gene[0].end - 1],
    }
    cancer = {
        "name": doc.cancer[0].name,
        "pos": [doc.cancer[0].start, doc.cancer[0].end - 1],
    }

    if not doc.expression_change_keyword1.resolve():
        expression_change_keyword_1 = {
            "name": "\nNone\n",
            "pos": None,
            "type": None,
        }
    else:
        expression_change_keyword_1 = {
            "name": doc.expression_change_keyword1[0].name,
            "pos": [
                doc.expression_change_keyword1[0].start,
                doc.expression_change_keyword1[0].end - 1,
            ],
            "type": doc.expression_change_keyword1[0].type,
        }

    expression_change_keyword_2 = {
        "name": doc.expression_change_keyword2[0].name,
        "pos": [
            doc.expression_change_keyword2[0].start,
            doc.expression_change_keyword2[0].end - 1,
        ],
        "type": doc.expression_change_keyword2[0].type,
    }

    return {
        "pmid": doc.pmid,
        "sentence": doc.sentence,
        "cancer_type": doc.cancer_type,
        "gene": gene,
        "cancer": cancer,
        "CGE": doc.cge,
        "CCS": doc.ccs,
        "PT": doc.pt,
        "IGE": doc.ige,
        "expression_change_keyword_1": expression_change_keyword_1,
        "expression_change_keyword_2": expression_change_keyword_2,
    }


def convert_to_text_document_with_labeled_spans_and_binary_relations(
    document: ComagcDocument,
) -> TextDocumentWithLabeledSpansAndBinaryRelations:
    metadata = {
        "cancer_type": document.cancer_type,
        "CGE": document.cge,
        "CCS": document.ccs,
        "PT": document.pt,
        "IGE": document.ige,
        "expression_change_keyword_1": document_to_example(document)[
            "expression_change_keyword_1"
        ],
        "expression_change_keyword_2": document_to_example(document)[
            "expression_change_keyword_2"
        ],
    }

    text_document = TextDocumentWithLabeledSpansAndBinaryRelations(
        id=document.pmid, text=document.sentence, metadata=metadata
    )

    gene = LabeledSpan(
        start=document.gene[0].start,
        end=document.gene[0].end,
        label="GENE",
    )
    text_document.labeled_spans.append(gene)

    cancer = LabeledSpan(
        start=document.cancer[0].start,
        end=document.cancer[0].end,
        label="CANCER",
    )
    text_document.labeled_spans.append(cancer)

    label = get_relation_label(
        cge=document.cge, ccs=document.ccs, ige=document.ige, pt=document.pt
    )

    if label is not None:
        relation = BinaryRelation(
            head=gene,
            tail=cancer,
            label=label,
        )
        text_document.binary_relations.append(relation)

    return text_document


class Comagc(ArrowBasedBuilder):
    DOCUMENT_TYPE = ComagcDocument
    BASE_DATASET_PATH = "DFKI-SLT/CoMAGC"
    BASE_DATASET_REVISION = "8e2950b8a3967c2f45de86f60dd5c8ccb9ad3815"

    BUILDER_CONFIGS = [
        datasets.BuilderConfig(
            version=datasets.Version("1.0.0"),
            description="CoMAGC dataset",
        )
    ]

    DOCUMENT_CONVERTERS = {
        TextDocumentWithLabeledSpansAndBinaryRelations: convert_to_text_document_with_labeled_spans_and_binary_relations
    }

    def _generate_document(self, example, **kwargs):
        return example_to_document(example)

    def _generate_example(self, document: ComagcDocument, **kwargs) -> Dict[str, Any]:
        return document_to_example(document)


def get_relation_label(cge: str, ccs: str, pt: str, ige: str) -> Optional[str]:
    """Simple rule-based function to determine the relation between the gene and the cancer.
    As this dataset contains a multi-faceted annotation scheme
    for gene-cancer relations, it does not only label the relation
    between gene and cancer, but provides further information.
    However, the relation of interest stays the gene-class,
    which can be derived from inference rules
    (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-14-323/tables/3), based on the
    information given in columns CGE, CCS, IGE, PT.
    """

    rules = [
        {
            "CGE": "increased",
            "CCS": "normalTOcancer",
            "IGE": "*",
            "PT": "causality",
            "Gene class": "oncogene",
        },
        {
            "CGE": "decreased",
            "CCS": "cancerTOnormal",
            "IGE": "unidentifiable",
            "PT": "causality",
            "Gene class": "oncogene",
        },
        {
            "CGE": "decreased",
            "CCS": "cancerTOnormal",
            "IGE": "up-regulated",
            "PT": "*",
            "Gene class": "oncogene",
        },
        {
            "CGE": "decreased",
            "CCS": "normalTOcancer",
            "IGE": "*",
            "PT": "causality",
            "Gene class": "tumor suppressor gene",
        },
        {
            "CGE": "increased",
            "CCS": "cancerTOnormal",
            "IGE": "unidentifiable",
            "PT": "causality",
            "Gene class": "tumor suppressor gene",
        },
        {
            "CGE": "increased",
            "CCS": "cancerTOnormal",
            "IGE": "down-regulated",
            "PT": "*",
            "Gene class": "tumor suppressor gene",
        },
        {
            "CGE": "*",
            "CCS": "normalTOcancer",
            "IGE": "*",
            "PT": "observation",
            "Gene class": "biomarker",
        },
        {
            "CGE": "*",
            "CCS": "cancerTOnormal",
            "IGE": "unidentifiable",
            "PT": "observation",
            "Gene class": "biomarker",
        },
        {
            "CGE": "decreased",
            "CCS": "cancerTOcancer",
            "IGE": "up-regulated",
            "PT": "observation",
            "Gene class": "biomarker",
        },
        {
            "CGE": "increased",
            "CCS": "cancerTOcancer",
            "IGE": "down-regulated",
            "PT": "observation",
            "Gene class": "biomarker",
        },
    ]

    for rule in rules:
        if (
            (rule["CGE"] == "*" or cge == rule["CGE"])
            and (rule["CCS"] == "*" or ccs == rule["CCS"])
            and (rule["IGE"] == "*" or ige == rule["IGE"])
            and (rule["PT"] == "*" or pt == rule["PT"])
        ):
            return rule["Gene class"]

    # Commented out to avoid spamming the logs
    # logger.warning("No rule matched. cge: " + cge + " - ccs: " + ccs + " - ige: " + ige + " - pt: " + pt)
    # NOTE: In case no inference rule is applicable, no relation is returned and
    # eventually no relation is added to the document.
    return None
1 change: 1 addition & 0 deletions dataset_builders/pie/comagc/requirements.txt
@@ -0,0 +1 @@
pie-datasets>=0.6.0,<0.11.0