Merge pull request #75 from ArneBinder/fix_brat_dataset
fix `brat` dataset card and set base revision
ArneBinder authored Nov 26, 2023
2 parents d38dd09 + e73dea7 commit af4e8a2
Showing 2 changed files with 18 additions and 49 deletions.
64 changes: 16 additions & 48 deletions dataset_builders/pie/brat/README.md
@@ -1,8 +1,16 @@
# PIE Dataset Card for "conll2003"
# PIE Dataset Card for "brat"

This is a [PyTorch-IE](https://github.com/ChristophAlt/pytorch-ie) wrapper for the
[BRAT Huggingface dataset loading script](https://huggingface.co/datasets/DFKI-SLT/brat).

## Dataset Variants

The dataset provides the following variants:

- `default`: The original dataset. Documents are of type `BratDocument` (with `LabeledMultiSpan` annotations, see below).
- `merge_fragmented_spans`: Documents are of type `BratDocumentWithMergedSpans` (this variant merges spans that are
fragmented into simple `LabeledSpans`, see below).
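
As a rough illustration of what merging a fragmented span could look like, here is a minimal, self-contained sketch. It assumes that merging means covering everything from the smallest start index to the largest end index. The helper name `merge_slices` and the example text are hypothetical and not part of the dataset builder, whose actual merging logic may differ.

```python
from typing import Tuple

Slices = Tuple[Tuple[int, int], ...]


def merge_slices(slices: Slices) -> Tuple[int, int]:
    """Collapse a fragmented span into one contiguous (start, end) pair.

    Assumption: the merged span covers everything from the smallest start
    to the largest end, including the text between the fragments.
    """
    if not slices:
        raise ValueError("a span needs at least one slice")
    start = min(s for s, _ in slices)
    end = max(e for _, e in slices)
    return start, end


text = "pain in the left and right knee"
# two fragments: "pain in the left" and "knee"
fragmented = ((0, 16), (27, 31))

merged_start, merged_end = merge_slices(fragmented)
merged_text = text[merged_start:merged_end]
```

Note that under this assumption the merged span also covers the text between the fragments, which is the trade-off of using simple contiguous spans.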

## Data Schema

The document type for this dataset is `BratDocument` or `BratDocumentWithMergedSpans`, depending on whether the
@@ -19,6 +27,12 @@ and the following annotation layers:
- `span_attributes` (annotation type: `Attribute`, target: `spans`)
- `relation_attributes` (annotation type: `Attribute`, target: `relations`)

The `LabeledMultiSpan` annotation type is defined as follows:

- `slices` (type: `Tuple[Tuple[int, int], ...]`): the slices, each consisting of the start (inclusive) and end (exclusive) indices of a span fragment
- `label` (type: `str`)
- `score` (type: `float`, optional, not included in comparison)
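
The field list above can be mirrored with a plain dataclass to show how the slices index into the text and what "not included in comparison" means in practice. This is only a stand-in sketch: `MultiSpanSketch` and `fragments` are hypothetical names, not the `LabeledMultiSpan` class from the library.

```python
import dataclasses
from typing import Optional, Tuple


@dataclasses.dataclass(frozen=True)
class MultiSpanSketch:
    # each slice is (start, end) with start inclusive and end exclusive
    slices: Tuple[Tuple[int, int], ...]
    label: str
    # compare=False keeps the score out of equality checks and hashing
    score: Optional[float] = dataclasses.field(default=None, compare=False)


def fragments(text: str, span: MultiSpanSketch) -> Tuple[str, ...]:
    """Return the text covered by each slice of the span."""
    return tuple(text[start:end] for start, end in span.slices)


text = "pain in the left and right knee"
a = MultiSpanSketch(slices=((0, 16), (27, 31)), label="symptom", score=0.9)
b = MultiSpanSketch(slices=((0, 16), (27, 31)), label="symptom", score=0.1)

same_annotation = a == b  # scores differ, but they are ignored in comparison
covered = fragments(text, a)
```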

The `Attribute` annotation type is defined as follows:

- `annotation` (type: `Annotation`): the annotation to which the attribute is attached
@@ -31,50 +45,4 @@ See [here](https://github.com/ChristophAlt/pytorch-ie/blob/main/src/pytorch_ie/a
## Document Converters

The dataset provides no predefined document converters because the BRAT format is very flexible and can be used
for many different tasks. You can add your own document converter by doing the following:

```python
import dataclasses
from typing import Optional

from pytorch_ie.core import AnnotationList, annotation_field
from pytorch_ie.documents import TextBasedDocument
from pytorch_ie.annotations import LabeledSpan

from pie_datasets import DatasetDict
# NOTE: assumed import path for the source document type; adjust to your pie_datasets version
from pie_datasets.builders.brat import BratDocumentWithMergedSpans


# define your document class
@dataclasses.dataclass
class MyDocument(TextBasedDocument):
    my_field: Optional[str] = None
    my_span_annotations: AnnotationList[LabeledSpan] = annotation_field(target="text")


# define your document converter
def my_converter(document: BratDocumentWithMergedSpans) -> MyDocument:
    # create your document with the data from the original document.
    # The fields "text", "id" and "metadata" are derived from the TextBasedDocument.
    my_document = MyDocument(id=document.id, text=document.text, metadata=document.metadata, my_field="my_value")

    # create a new span annotation
    new_span = LabeledSpan(label="my_label", start=2, end=10)
    # add the new span annotation to your document
    my_document.my_span_annotations.append(new_span)

    # add annotations from the document to your document
    for span in document.spans:
        # we need to copy the span because an annotation can only be attached to one document
        my_document.my_span_annotations.append(span.copy())

    return my_document


# load the dataset. We use the "merge_fragmented_spans" dataset variant here
# because it provides documents of type BratDocumentWithMergedSpans.
dataset = DatasetDict.load_dataset("pie/brat", name="merge_fragmented_spans", data_dir="path/to/brat/data")

# attach your document converter to the dataset
dataset.register_document_converter(my_converter)

# convert the dataset
converted_dataset = dataset.to_document_type(MyDocument)
```
for many different tasks.
3 changes: 2 additions & 1 deletion dataset_builders/pie/brat/brat.py
@@ -2,4 +2,5 @@


class Brat(BratBuilder):
    pass
    BASE_DATASET_PATH = "DFKI-SLT/brat"
    BASE_DATASET_REVISION = "052163d34b4429d81003981bc10674cef54aa0b8"
