diff --git a/dataset_builders/pie/aae2/README.md b/dataset_builders/pie/aae2/README.md
index f47a7f9f..81c958a9 100644
--- a/dataset_builders/pie/aae2/README.md
+++ b/dataset_builders/pie/aae2/README.md
@@ -4,6 +4,29 @@ This is a [PyTorch-IE](https://github.com/ChristophAlt/pytorch-ie) wrapper for t
Therefore, the `aae2` dataset as described here follows the data structure from the [PIE brat dataset card](https://huggingface.co/datasets/pie/brat).
+### Usage
+
+```python
+from pie_datasets import load_dataset
+from pie_datasets.builders.brat import BratDocumentWithMergedSpans
+from pytorch_ie.documents import TextDocumentWithLabeledSpansBinaryRelationsAndLabeledPartitions
+
+# load default version
+dataset = load_dataset("pie/aae2")
+assert isinstance(dataset["train"][0], BratDocumentWithMergedSpans)
+
+# if required, normalize the document type (see section Document Converters below)
+dataset_converted = dataset.to_document_type(TextDocumentWithLabeledSpansBinaryRelationsAndLabeledPartitions)
+assert isinstance(dataset_converted["train"][0], TextDocumentWithLabeledSpansBinaryRelationsAndLabeledPartitions)
+
+# get first relation in the first document
+doc = dataset_converted["train"][0]
+print(doc.binary_relations[0])
+# BinaryRelation(head=LabeledSpan(start=716, end=851, label='Premise', score=1.0), tail=LabeledSpan(start=591, end=714, label='Claim', score=1.0), label='supports', score=1.0)
+print(doc.binary_relations[0].resolve())
+# ('supports', (('Premise', 'What we acquired from team work is not only how to achieve the same goal with others but more importantly, how to get along with others'), ('Claim', 'through cooperation, children can learn about interpersonal skills which are significant in the future life of all students')))
+```
+
### Dataset Summary
Argument Annotated Essays Corpus (AAEC) ([Stab and Gurevych, 2017](https://aclanthology.org/J17-3005.pdf)) contains student essays. A stance for a controversial theme is expressed by a major claim component as well as claim components, and premise components justify or refute the claims. Attack and support labels are defined as relations. The span covers a statement, *which can stand in isolation as a complete sentence*, according to the AAEC annotation guidelines. All components are annotated with minimum boundaries of a clause or sentence excluding so-called "shell" language such as *On the other hand* and *Hence*. (Morio et al., 2022, p. 642)
@@ -28,17 +51,6 @@ The `aae2` dataset comes in a single version (`default`) with `BratDocumentWithM
See [PIE-Brat Data Schema](https://huggingface.co/datasets/pie/brat#data-schema).
-### Usage
-
-```python
-from pie_datasets import load_dataset, builders
-
-# load default version
-datasets = load_dataset("pie/aae2")
-doc = datasets["train"][0]
-assert isinstance(doc, builders.brat.BratDocumentWithMergedSpans)
-```
-
### Data Splits
| Statistics | Train | Test |
@@ -109,7 +121,7 @@ The dataset provides document converters for the following target document types
See [here](https://github.com/ChristophAlt/pytorch-ie/blob/main/src/pytorch_ie/documents.py) for the document type
definitions.
-#### Label Statistics after Document Conversion
+#### Relation Label Statistics after Document Conversion
When converting from `BratDocumentWithMergedSpans` to `TextDocumentWithLabeledSpansAndBinaryRelations` and `TextDocumentWithLabeledSpansBinaryRelationsAndLabeledPartitions`,
we apply a relation-conversion method (see above) that changes the label counts for the relations, as follows:
@@ -129,6 +141,154 @@ we apply a relation-conversion method (see above) that changes the label counts
| support: `supports` | 5958 | 89.3 % |
| attack: `attacks` | 715 | 10.7 % |
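The percentage shares in the table above follow directly from the raw counts; as a quick sanity check, they can be recomputed with a few lines of Python (the counts are copied from the table, the code itself is only an illustrative sketch):

```python
# Relation label counts after conversion, copied from the table above.
counts = {"supports": 5958, "attacks": 715}

total = sum(counts.values())  # 6673 relations in total
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}
print(shares)  # {'supports': 89.3, 'attacks': 10.7}
```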
+### Collected Statistics after Document Conversion
+
+We use the script `evaluate_documents.py` from [PyTorch-IE-Hydra-Template](https://github.com/ArneBinder/pytorch-ie-hydra-template-1) to generate these statistics.
+After checking out that code, the statistics and plots can be generated by the command:
+
+```commandline
+python src/evaluate_documents.py dataset=aae2_base metric=METRIC
+```
+
+where `METRIC` is one of the available metric configs in `configs/metric` (see [metrics](https://github.com/ArneBinder/pytorch-ie-hydra-template-1/tree/main/configs/metric)).
+
+This also requires the following dataset config at `configs/dataset/aae2_base.yaml` in the repository:
+
+```yaml
+_target_: src.utils.execute_pipeline
+input:
+  _target_: pie_datasets.DatasetDict.load_dataset
+  path: pie/aae2
+  revision: 1015ee38bd8a36549b344008f7a49af72956a7fe
+```
+
+For token-based metrics, this uses `bert-base-uncased` from `transformers.AutoTokenizer` (see [AutoTokenizer](https://huggingface.co/docs/transformers/v4.37.1/en/model_doc/auto#transformers.AutoTokenizer) and [bert-base-uncased](https://huggingface.co/bert-base-uncased)) to tokenize the `text` in `TextDocumentWithLabeledSpansAndBinaryRelations` (see [document type](https://github.com/ChristophAlt/pytorch-ie/blob/main/src/pytorch_ie/documents.py)).
+
+Relation-label statistics are collected using the default relation-conversion method, `connect_first`, which results in three distinct relation labels.
+
+#### Relation argument (outer) token distance per label
+
+The distance is measured from the first token of the first argumentative unit to the last token of the last unit, a.k.a. outer distance.
+
+We collect the following statistics: number of documents in the split (*no. doc*), no. of relations (*len*), mean of token distance (*mean*), standard deviation of the distance (*std*), minimum outer distance (*min*), and maximum outer distance (*max*).
+We also present histograms in the collapsible sections below, showing the distribution of these relation distances (x-axis) and their counts (y-axis).
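For clarity, the outer distance can be sketched as a small helper over token spans (the spans below are hypothetical, not taken from the dataset):

```python
def outer_token_distance(head, tail):
    """Outer token distance: from the first token of the first argument
    to the last token of the last argument. Spans are (start, end) pairs
    in token positions, with an exclusive end."""
    return max(head[1], tail[1]) - min(head[0], tail[0])

# hypothetical token spans for the two arguments of one relation
print(outer_token_distance((120, 150), (95, 118)))  # 55
```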
+
+<details>
+<summary>Command</summary>
+
+```
+python src/evaluate_documents.py dataset=aae2_base metric=relation_argument_token_distances
+```
+
+</details>
+
+##### train (322 documents)
+
+| | len | max | mean | min | std |
+| :---------------- | ---: | --: | ------: | --: | ------: |
+| ALL | 9002 | 514 | 102.582 | 9 | 93.76 |
+| attacks | 810 | 442 | 127.622 | 10 | 109.283 |
+| semantically_same | 552 | 514 | 301.638 | 25 | 73.756 |
+| supports | 7640 | 493 | 85.545 | 9 | 74.023 |
+
+<details>
+<summary>Histogram (split: train, 322 documents)</summary>
+
+![rtd-label_aae2_train.png](img%2Frtd-label_aae2_train.png)
+
+</details>
+
+##### test (80 documents)
+
+| | len | max | mean | min | std |
+| :---------------- | ---: | --: | ------: | --: | -----: |
+| ALL | 2372 | 442 | 100.711 | 10 | 92.698 |
+| attacks | 184 | 402 | 115.891 | 12 | 98.751 |
+| semantically_same | 146 | 442 | 299.671 | 34 | 72.921 |
+| supports | 2042 | 437 | 85.118 | 10 | 75.023 |
+
+<details>
+<summary>Histogram (split: test, 80 documents)</summary>
+
+![rtd-label_aae2_test.png](img%2Frtd-label_aae2_test.png)
+
+</details>
+
+#### Span lengths (tokens)
+
+The span length is the number of tokens in an argumentative unit, measured from its first to its last token.
+
+We collect the following statistics: number of documents in the split (*no. doc*), no. of spans (*len*), mean of number of tokens in a span (*mean*), standard deviation of the number of tokens (*std*), minimum tokens in a span (*min*), and maximum tokens in a span (*max*).
+We also present histograms in the collapsible sections below, showing the distribution of these span lengths (x-axis) and their counts (y-axis).
+
+<details>
+<summary>Command</summary>
+
+```
+python src/evaluate_documents.py dataset=aae2_base metric=span_lengths_tokens
+```
+
+</details>
+
+| statistics | train | test |
+| :--------- | -----: | -----: |
+| no. doc | 322 | 80 |
+| len | 4823 | 1266 |
+| mean | 17.157 | 16.317 |
+| std | 8.079 | 7.953 |
+| min | 3 | 3 |
+| max | 75 | 50 |
+
+<details>
+<summary>Histogram (split: train, 322 documents)</summary>
+
+![slt_aae2_train.png](img%2Fslt_aae2_train.png)
+
+</details>
+
+<details>
+<summary>Histogram (split: test, 80 documents)</summary>
+
+![slt_aae2_test.png](img%2Fslt_aae2_test.png)
+
+</details>
+
+#### Token length (tokens)
+
+The token length is the number of tokens in a document, measured from the first token to the last.
+
+We collect the following statistics: number of documents in the split (*no. doc*), mean of document token-length (*mean*), standard deviation of the length (*std*), minimum number of tokens in a document (*min*), and maximum number of tokens in a document (*max*).
+We also present histograms in the collapsible sections below, showing the distribution of these token lengths (x-axis) and their counts (y-axis).
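These summary statistics can be reproduced with the Python standard library; the sketch below uses hypothetical per-document token counts, not the actual dataset values (note that `statistics.stdev` computes the sample standard deviation, so the template's metric may differ in this detail):

```python
from statistics import mean, stdev

# hypothetical per-document token counts
token_counts = [236, 310, 377, 402, 580]

stats = {
    "no. doc": len(token_counts),
    "mean": round(mean(token_counts), 3),
    "std": round(stdev(token_counts), 3),
    "min": min(token_counts),
    "max": max(token_counts),
}
print(stats)
```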
+
+<details>
+<summary>Command</summary>
+
+```
+python src/evaluate_documents.py dataset=aae2_base metric=count_text_tokens
+```
+
+</details>
+
+| statistics | train | test |
+| :--------- | ------: | -----: |
+| no. doc | 322 | 80 |
+| mean | 377.686 | 378.4 |
+| std | 64.534 | 66.054 |
+| min | 236 | 269 |
+| max | 580 | 532 |
+
+<details>
+<summary>Histogram (split: train, 322 documents)</summary>
+
+![tl_aae2_train.png](img%2Ftl_aae2_train.png)
+
+</details>
+
+<details>
+<summary>Histogram (split: test, 80 documents)</summary>
+
+![tl_aae2_test.png](img%2Ftl_aae2_test.png)
+
+</details>
+
## Dataset Creation
### Curation Rationale
diff --git a/dataset_builders/pie/aae2/img/rtd-label_aae2_test.png b/dataset_builders/pie/aae2/img/rtd-label_aae2_test.png
new file mode 100644
index 00000000..d62218f8
Binary files /dev/null and b/dataset_builders/pie/aae2/img/rtd-label_aae2_test.png differ
diff --git a/dataset_builders/pie/aae2/img/rtd-label_aae2_train.png b/dataset_builders/pie/aae2/img/rtd-label_aae2_train.png
new file mode 100644
index 00000000..c214fe8f
Binary files /dev/null and b/dataset_builders/pie/aae2/img/rtd-label_aae2_train.png differ
diff --git a/dataset_builders/pie/aae2/img/slt_aae2_test.png b/dataset_builders/pie/aae2/img/slt_aae2_test.png
new file mode 100644
index 00000000..805c6b98
Binary files /dev/null and b/dataset_builders/pie/aae2/img/slt_aae2_test.png differ
diff --git a/dataset_builders/pie/aae2/img/slt_aae2_train.png b/dataset_builders/pie/aae2/img/slt_aae2_train.png
new file mode 100644
index 00000000..30140435
Binary files /dev/null and b/dataset_builders/pie/aae2/img/slt_aae2_train.png differ
diff --git a/dataset_builders/pie/aae2/img/tl_aae2_test.png b/dataset_builders/pie/aae2/img/tl_aae2_test.png
new file mode 100644
index 00000000..d3f8cf5c
Binary files /dev/null and b/dataset_builders/pie/aae2/img/tl_aae2_test.png differ
diff --git a/dataset_builders/pie/aae2/img/tl_aae2_train.png b/dataset_builders/pie/aae2/img/tl_aae2_train.png
new file mode 100644
index 00000000..ea135de5
Binary files /dev/null and b/dataset_builders/pie/aae2/img/tl_aae2_train.png differ
diff --git a/dataset_builders/pie/abstrct/README.md b/dataset_builders/pie/abstrct/README.md
index 4819920e..0bbf2daa 100644
--- a/dataset_builders/pie/abstrct/README.md
+++ b/dataset_builders/pie/abstrct/README.md
@@ -10,6 +10,29 @@ A novel corpus of healthcare texts (i.e., RCT abstracts on various diseases) fro
are annotated with argumentative components (i.e., `MajorClaim`, `Claim`, and `Premise`) and relations (i.e., `Support`, `Attack`, and `Partial-attack`),
in order to support clinicians' daily tasks in information finding and evidence-based reasoning for decision making.
+### Usage
+
+```python
+from pie_datasets import load_dataset
+from pie_datasets.builders.brat import BratDocumentWithMergedSpans
+from pytorch_ie.documents import TextDocumentWithLabeledSpansAndBinaryRelations
+
+# load default version
+dataset = load_dataset("pie/abstrct")
+assert isinstance(dataset["neoplasm_train"][0], BratDocumentWithMergedSpans)
+
+# if required, normalize the document type (see section Document Converters below)
+dataset_converted = dataset.to_document_type(TextDocumentWithLabeledSpansAndBinaryRelations)
+assert isinstance(dataset_converted["neoplasm_train"][0], TextDocumentWithLabeledSpansAndBinaryRelations)
+
+# get first relation in the first document
+doc = dataset_converted["neoplasm_train"][0]
+print(doc.binary_relations[0])
+# BinaryRelation(head=LabeledSpan(start=1769, end=1945, label='Claim', score=1.0), tail=LabeledSpan(start=1, end=162, label='MajorClaim', score=1.0), label='Support', score=1.0)
+print(doc.binary_relations[0].resolve())
+# ('Support', (('Claim', 'Treatment with mitoxantrone plus prednisone was associated with greater and longer-lasting improvement in several HQL domains and symptoms than treatment with prednisone alone.'), ('MajorClaim', 'A combination of mitoxantrone plus prednisone is preferable to prednisone alone for reduction of pain in men with metastatic, hormone-resistant, prostate cancer.')))
+```
+
### Supported Tasks and Leaderboards
- **Tasks**: Argumentation Mining, Component Identification, Boundary Detection, Relation Identification, Link Prediction
@@ -30,16 +53,17 @@ Without any need to merge fragments, the document type `BratDocumentWithMergedSp
See [PIE-Brat Data Schema](https://huggingface.co/datasets/pie/brat#data-schema).
-### Usage
+### Document Converters
-```python
-from pie_datasets import load_dataset, builders
+The dataset provides document converters for the following target document types:
-# load default version
-datasets = load_dataset("pie/abstrct")
-doc = datasets["neoplasm_train"][0]
-assert isinstance(doc, builders.brat.BratDocumentWithMergedSpans)
-```
+- `pytorch_ie.documents.TextDocumentWithLabeledSpansAndBinaryRelations`
+  - `labeled_spans`: `LabeledSpan` annotations, converted from `BratDocumentWithMergedSpans`'s `spans`
+    - labels: `MajorClaim`, `Claim`, `Premise`
+  - `binary_relations`: `BinaryRelation` annotations, converted from `BratDocumentWithMergedSpans`'s `relations`
+    - labels: `Support`, `Partial-Attack`, `Attack`
+
+See [here](https://github.com/ChristophAlt/pytorch-ie/blob/main/src/pytorch_ie/documents.py) for the document type definitions.
### Data Splits
@@ -92,22 +116,239 @@ Morio et al. ([2022](https://aclanthology.org/2022.tacl-1.37.pdf); p. 642, Table
(Mayer et al. 2020, p.2110)
-#### Examples
+#### Example
-![Examples](img/abstr-sam.png)
+![abstr-sam.png](img%2Fabstr-sam.png)
-### Document Converters
+### Collected Statistics after Document Conversion
-The dataset provides document converters for the following target document types:
+We use the script `evaluate_documents.py` from [PyTorch-IE-Hydra-Template](https://github.com/ArneBinder/pytorch-ie-hydra-template-1) to generate these statistics.
+After checking out that code, the statistics and plots can be generated by the command:
-- `pytorch_ie.documents.TextDocumentWithLabeledSpansAndBinaryRelations`
- - `labeled_spans`: `LabeledSpan` annotations, converted from `BratDocumentWithMergedSpans`'s `spans`
- - labels: `MajorClaim`, `Claim`, `Premise`
- - `binary_relations`: `BinaryRelation` annotations, converted from `BratDocumentWithMergedSpans`'s `relations`
- - labels: `Support`, `Partial-Attack`, `Attack`
+```commandline
+python src/evaluate_documents.py dataset=abstrct_base metric=METRIC
+```
+
+where `METRIC` is one of the available metric configs in `configs/metric` (see [metrics](https://github.com/ArneBinder/pytorch-ie-hydra-template-1/tree/main/configs/metric)).
+
+This also requires the following dataset config at `configs/dataset/abstrct_base.yaml` in the repository:
+
+```yaml
+_target_: src.utils.execute_pipeline
+input:
+  _target_: pie_datasets.DatasetDict.load_dataset
+  path: pie/abstrct
+  revision: 277dc703fd78614635e86fe57c636b54931538b2
+```
+
+For token-based metrics, this uses `bert-base-uncased` from `transformers.AutoTokenizer` (see [AutoTokenizer](https://huggingface.co/docs/transformers/v4.37.1/en/model_doc/auto#transformers.AutoTokenizer) and [bert-base-uncased](https://huggingface.co/bert-base-uncased)) to tokenize the `text` in `TextDocumentWithLabeledSpansAndBinaryRelations` (see [document type](https://github.com/ChristophAlt/pytorch-ie/blob/main/src/pytorch_ie/documents.py)).
+
+#### Relation argument (outer) token distance per label
+
+The distance is measured from the first token of the first argumentative unit to the last token of the last unit, a.k.a. outer distance.
+
+We collect the following statistics: number of documents in the split (*no. doc*), no. of relations (*len*), mean of token distance (*mean*), standard deviation of the distance (*std*), minimum outer distance (*min*), and maximum outer distance (*max*).
+We also present histograms in the collapsible sections below, showing the distribution of these relation distances (x-axis) and their counts (y-axis).
+
+<details>
+<summary>Command</summary>
+
+```
+python src/evaluate_documents.py dataset=abstrct_base metric=relation_argument_token_distances
+```
+
+</details>
+
+##### neoplasm_train (350 documents)
+
+| | len | max | mean | min | std |
+| :------------- | ---: | --: | ------: | --: | -----: |
+| ALL | 2836 | 511 | 132.903 | 17 | 80.869 |
+| Attack | 72 | 346 | 89.639 | 29 | 75.554 |
+| Partial-Attack | 338 | 324 | 59.024 | 17 | 42.773 |
+| Support | 2426 | 511 | 144.481 | 26 | 79.187 |
+
+<details>
+<summary>Histogram (split: neoplasm_train, 350 documents)</summary>
+
+![rtd-label_abs-neo_train.png](img/rtd-label_abs-neo_train.png)
+
+</details>
+
+##### neoplasm_dev (50 documents)
+
+| | len | max | mean | min | std |
+| :------------- | --: | --: | ------: | --: | -----: |
+| ALL | 438 | 625 | 146.393 | 24 | 98.788 |
+| Attack | 16 | 200 | 90.375 | 26 | 62.628 |
+| Partial-Attack | 50 | 240 | 72.04 | 24 | 47.685 |
+| Support | 372 | 625 | 158.796 | 34 | 99.922 |
+
+<details>
+<summary>Histogram (split: neoplasm_dev, 50 documents)</summary>
+
+![rtd-label_abs-neo_dev.png](img/rtd-label_abs-neo_dev.png)
+
+</details>
+
+##### neoplasm_test (100 documents)
+
+| | len | max | mean | min | std |
+| :------------- | --: | --: | ------: | --: | -----: |
+| ALL | 848 | 459 | 126.731 | 22 | 75.363 |
+| Attack | 32 | 390 | 115.688 | 22 | 97.262 |
+| Partial-Attack | 88 | 205 | 56.955 | 24 | 34.534 |
+| Support | 728 | 459 | 135.651 | 33 | 73.365 |
+
+<details>
+<summary>Histogram (split: neoplasm_test, 100 documents)</summary>
+
+![rtd-label_abs-neo_test.png](img/rtd-label_abs-neo_test.png)
+
+</details>
+
+##### glaucoma_test (100 documents)
+
+| | len | max | mean | min | std |
+| :------------- | --: | --: | ------: | --: | -----: |
+| ALL | 734 | 488 | 159.166 | 26 | 83.885 |
+| Attack | 14 | 177 | 89 | 47 | 40.171 |
+| Partial-Attack | 52 | 259 | 74 | 26 | 51.239 |
+| Support | 668 | 488 | 167.266 | 38 | 82.222 |
+
+<details>
+<summary>Histogram (split: glaucoma_test, 100 documents)</summary>
+
+![rtd-label_abs-glu_test.png](img/rtd-label_abs-glu_test.png)
+
+</details>
+
+##### mixed_test (100 documents)
+
+| | len | max | mean | min | std |
+| :------------- | --: | --: | ------: | --: | ------: |
+| ALL | 658 | 459 | 145.067 | 23 | 77.921 |
+| Attack | 6 | 411 | 164 | 34 | 174.736 |
+| Partial-Attack | 42 | 259 | 65.762 | 23 | 62.426 |
+| Support | 610 | 459 | 150.341 | 35 | 74.273 |
+
+<details>
+<summary>Histogram (split: mixed_test, 100 documents)</summary>
+
+![rtd-label_abs-mix_test.png](img/rtd-label_abs-mix_test.png)
+
+</details>
+
+#### Span lengths (tokens)
+
+The span length is the number of tokens in an argumentative unit, measured from its first to its last token.
+
+We collect the following statistics: number of documents in the split (*no. doc*), no. of spans (*len*), mean of number of tokens in a span (*mean*), standard deviation of the number of tokens (*std*), minimum tokens in a span (*min*), and maximum tokens in a span (*max*).
+We also present histograms in the collapsible sections below, showing the distribution of these span lengths (x-axis) and their counts (y-axis).
+
+<details>
+<summary>Command</summary>
+
+```
+python src/evaluate_documents.py dataset=abstrct_base metric=span_lengths_tokens
+```
+
+</details>
+
+| statistics | neoplasm_train | neoplasm_dev | neoplasm_test | glaucoma_test | mixed_test |
+| :--------- | -------------: | -----------: | ------------: | ------------: | ---------: |
+| no. doc | 350 | 50 | 100 | 100 | 100 |
+| len | 2267 | 326 | 686 | 594 | 600 |
+| mean | 34.303 | 37.135 | 32.566 | 38.997 | 38.507 |
+| std | 22.425 | 29.941 | 20.264 | 22.604 | 24.036 |
+| min | 5 | 5 | 6 | 6 | 7 |
+| max | 250 | 288 | 182 | 169 | 159 |
+
+<details>
+<summary>Histogram (split: neoplasm_train, 350 documents)</summary>
+
+![slt_abs-neo_train.png](img%2Fslt_abs-neo_train.png)
+
+</details>
+
+<details>
+<summary>Histogram (split: neoplasm_dev, 50 documents)</summary>
+
+![slt_abs-neo_dev.png](img%2Fslt_abs-neo_dev.png)
+
+</details>
+
+<details>
+<summary>Histogram (split: neoplasm_test, 100 documents)</summary>
+
+![slt_abs-neo_test.png](img%2Fslt_abs-neo_test.png)
+
+</details>
+
+<details>
+<summary>Histogram (split: glaucoma_test, 100 documents)</summary>
+
+![slt_abs-glu_test.png](img%2Fslt_abs-glu_test.png)
+
+</details>
+
+<details>
+<summary>Histogram (split: mixed_test, 100 documents)</summary>
+
+![slt_abs-mix_test.png](img%2Fslt_abs-mix_test.png)
+
+</details>
+
+#### Token length (tokens)
+
+The token length is the number of tokens in a document, measured from the first token to the last.
+
+We collect the following statistics: number of documents in the split (*no. doc*), mean of document token-length (*mean*), standard deviation of the length (*std*), minimum number of tokens in a document (*min*), and maximum number of tokens in a document (*max*).
+We also present histograms in the collapsible sections below, showing the distribution of these token lengths (x-axis) and their counts (y-axis).
+
+<details>
+<summary>Command</summary>
+
+```
+python src/evaluate_documents.py dataset=abstrct_base metric=count_text_tokens
+```
+
+</details>
+
+| statistics | neoplasm_train | neoplasm_dev | neoplasm_test | glaucoma_test | mixed_test |
+| :--------- | -------------: | -----------: | ------------: | ------------: | ---------: |
+| no. doc | 350 | 50 | 100 | 100 | 100 |
+| mean | 447.291 | 481.66 | 442.79 | 456.78 | 450.29 |
+| std | 91.266 | 116.239 | 89.692 | 115.535 | 87.002 |
+| min | 301 | 329 | 292 | 212 | 268 |
+| max | 843 | 952 | 776 | 1022 | 776 |
+
+<details>
+<summary>Histogram (split: neoplasm_train, 350 documents)</summary>
+
+![tl_abs-neo_train.png](img%2Ftl_abs-neo_train.png)
+
+</details>
+
+<details>
+<summary>Histogram (split: neoplasm_dev, 50 documents)</summary>
+
+![tl_abs-neo_dev.png](img%2Ftl_abs-neo_dev.png)
+
+</details>
+
+<details>
+<summary>Histogram (split: neoplasm_test, 100 documents)</summary>
+
+![tl_abs-neo_test.png](img%2Ftl_abs-neo_test.png)
+
+</details>
+
+<details>
+<summary>Histogram (split: glaucoma_test, 100 documents)</summary>
+
+![tl_abs-glu_test.png](img%2Ftl_abs-glu_test.png)
+
+</details>
+
+<details>
+<summary>Histogram (split: mixed_test, 100 documents)</summary>
+
+![tl_abs-mix_test.png](img%2Ftl_abs-mix_test.png)
+
+</details>
-See [here](https://github.com/ChristophAlt/pytorch-ie/blob/main/src/pytorch_ie/documents.py) for the document type
-definitions.
+
## Dataset Creation
diff --git a/dataset_builders/pie/abstrct/img/rtd-label_abs-glu_test.png b/dataset_builders/pie/abstrct/img/rtd-label_abs-glu_test.png
new file mode 100644
index 00000000..5fb18436
Binary files /dev/null and b/dataset_builders/pie/abstrct/img/rtd-label_abs-glu_test.png differ
diff --git a/dataset_builders/pie/abstrct/img/rtd-label_abs-mix_test.png b/dataset_builders/pie/abstrct/img/rtd-label_abs-mix_test.png
new file mode 100644
index 00000000..f56f6bc8
Binary files /dev/null and b/dataset_builders/pie/abstrct/img/rtd-label_abs-mix_test.png differ
diff --git a/dataset_builders/pie/abstrct/img/rtd-label_abs-neo_dev.png b/dataset_builders/pie/abstrct/img/rtd-label_abs-neo_dev.png
new file mode 100644
index 00000000..f3a59b57
Binary files /dev/null and b/dataset_builders/pie/abstrct/img/rtd-label_abs-neo_dev.png differ
diff --git a/dataset_builders/pie/abstrct/img/rtd-label_abs-neo_test.png b/dataset_builders/pie/abstrct/img/rtd-label_abs-neo_test.png
new file mode 100644
index 00000000..1bd8cb12
Binary files /dev/null and b/dataset_builders/pie/abstrct/img/rtd-label_abs-neo_test.png differ
diff --git a/dataset_builders/pie/abstrct/img/rtd-label_abs-neo_train.png b/dataset_builders/pie/abstrct/img/rtd-label_abs-neo_train.png
new file mode 100644
index 00000000..9b414dfa
Binary files /dev/null and b/dataset_builders/pie/abstrct/img/rtd-label_abs-neo_train.png differ
diff --git a/dataset_builders/pie/abstrct/img/slt_abs-glu_test.png b/dataset_builders/pie/abstrct/img/slt_abs-glu_test.png
new file mode 100644
index 00000000..2b57feb0
Binary files /dev/null and b/dataset_builders/pie/abstrct/img/slt_abs-glu_test.png differ
diff --git a/dataset_builders/pie/abstrct/img/slt_abs-mix_test.png b/dataset_builders/pie/abstrct/img/slt_abs-mix_test.png
new file mode 100644
index 00000000..7e23f3b9
Binary files /dev/null and b/dataset_builders/pie/abstrct/img/slt_abs-mix_test.png differ
diff --git a/dataset_builders/pie/abstrct/img/slt_abs-neo_dev.png b/dataset_builders/pie/abstrct/img/slt_abs-neo_dev.png
new file mode 100644
index 00000000..0ea90445
Binary files /dev/null and b/dataset_builders/pie/abstrct/img/slt_abs-neo_dev.png differ
diff --git a/dataset_builders/pie/abstrct/img/slt_abs-neo_test.png b/dataset_builders/pie/abstrct/img/slt_abs-neo_test.png
new file mode 100644
index 00000000..829e4f9a
Binary files /dev/null and b/dataset_builders/pie/abstrct/img/slt_abs-neo_test.png differ
diff --git a/dataset_builders/pie/abstrct/img/slt_abs-neo_train.png b/dataset_builders/pie/abstrct/img/slt_abs-neo_train.png
new file mode 100644
index 00000000..c74e887e
Binary files /dev/null and b/dataset_builders/pie/abstrct/img/slt_abs-neo_train.png differ
diff --git a/dataset_builders/pie/abstrct/img/tl_abs-glu_test.png b/dataset_builders/pie/abstrct/img/tl_abs-glu_test.png
new file mode 100644
index 00000000..cbe4756b
Binary files /dev/null and b/dataset_builders/pie/abstrct/img/tl_abs-glu_test.png differ
diff --git a/dataset_builders/pie/abstrct/img/tl_abs-mix_test.png b/dataset_builders/pie/abstrct/img/tl_abs-mix_test.png
new file mode 100644
index 00000000..260561e3
Binary files /dev/null and b/dataset_builders/pie/abstrct/img/tl_abs-mix_test.png differ
diff --git a/dataset_builders/pie/abstrct/img/tl_abs-neo_dev.png b/dataset_builders/pie/abstrct/img/tl_abs-neo_dev.png
new file mode 100644
index 00000000..4c0cec01
Binary files /dev/null and b/dataset_builders/pie/abstrct/img/tl_abs-neo_dev.png differ
diff --git a/dataset_builders/pie/abstrct/img/tl_abs-neo_test.png b/dataset_builders/pie/abstrct/img/tl_abs-neo_test.png
new file mode 100644
index 00000000..50b16ddf
Binary files /dev/null and b/dataset_builders/pie/abstrct/img/tl_abs-neo_test.png differ
diff --git a/dataset_builders/pie/abstrct/img/tl_abs-neo_train.png b/dataset_builders/pie/abstrct/img/tl_abs-neo_train.png
new file mode 100644
index 00000000..b8a71584
Binary files /dev/null and b/dataset_builders/pie/abstrct/img/tl_abs-neo_train.png differ
diff --git a/dataset_builders/pie/argmicro/README.md b/dataset_builders/pie/argmicro/README.md
index fe9da6f9..9de51471 100644
--- a/dataset_builders/pie/argmicro/README.md
+++ b/dataset_builders/pie/argmicro/README.md
@@ -3,6 +3,27 @@
This is a [PyTorch-IE](https://github.com/ChristophAlt/pytorch-ie) wrapper for the
[ArgMicro Huggingface dataset loading script](https://huggingface.co/datasets/DFKI-SLT/argmicro).
+## Usage
+
+```python
+from pie_datasets import load_dataset
+from pytorch_ie.documents import TextDocumentWithLabeledSpansAndBinaryRelations
+
+# load English variant
+dataset = load_dataset("pie/argmicro", name="en")
+
+# if required, normalize the document type (see section Document Converters below)
+dataset_converted = dataset.to_document_type(TextDocumentWithLabeledSpansAndBinaryRelations)
+assert isinstance(dataset_converted["train"][0], TextDocumentWithLabeledSpansAndBinaryRelations)
+
+# get first relation in the first document
+doc = dataset_converted["train"][0]
+print(doc.binary_relations[0])
+# BinaryRelation(head=LabeledSpan(start=0, end=81, label='opp', score=1.0), tail=LabeledSpan(start=326, end=402, label='pro', score=1.0), label='reb', score=1.0)
+print(doc.binary_relations[0].resolve())
+# ('reb', (('opp', "Yes, it's annoying and cumbersome to separate your rubbish properly all the time."), ('pro', 'We Berliners should take the chance and become pioneers in waste separation!')))
+```
+
## Dataset Variants
The dataset contains two `BuilderConfig`'s:
@@ -53,3 +74,122 @@ The dataset provides document converters for the following target document types
See [here](https://github.com/ChristophAlt/pytorch-ie/blob/main/src/pytorch_ie/documents.py) for the document type
definitions.
+
+### Collected Statistics after Document Conversion
+
+We use the script `evaluate_documents.py` from [PyTorch-IE-Hydra-Template](https://github.com/ArneBinder/pytorch-ie-hydra-template-1) to generate these statistics.
+After checking out that code, the statistics and plots can be generated by the command:
+
+```commandline
+python src/evaluate_documents.py dataset=argmicro_base metric=METRIC
+```
+
+where `METRIC` is one of the available metric configs in `configs/metric` (see [metrics](https://github.com/ArneBinder/pytorch-ie-hydra-template-1/tree/main/configs/metric)).
+
+This also requires the following dataset config at `configs/dataset/argmicro_base.yaml` in the repository:
+
+```yaml
+_target_: src.utils.execute_pipeline
+input:
+  _target_: pie_datasets.DatasetDict.load_dataset
+  path: pie/argmicro
+  revision: 28ef031d2a2c97be7e9ed360e1a5b20bd55b57b2
+  name: en
+```
+
+For token-based metrics, this uses `bert-base-uncased` from `transformers.AutoTokenizer` (see [AutoTokenizer](https://huggingface.co/docs/transformers/v4.37.1/en/model_doc/auto#transformers.AutoTokenizer) and [bert-base-uncased](https://huggingface.co/bert-base-uncased)) to tokenize the `text` in `TextDocumentWithLabeledSpansAndBinaryRelations` (see [document type](https://github.com/ChristophAlt/pytorch-ie/blob/main/src/pytorch_ie/documents.py)).
+
+#### Relation argument (outer) token distance per label
+
+The distance is measured from the first token of the first argumentative unit to the last token of the last unit, a.k.a. outer distance.
+
+We collect the following statistics: number of documents in the split (*no. doc*), no. of relations (*len*), mean of token distance (*mean*), standard deviation of the distance (*std*), minimum outer distance (*min*), and maximum outer distance (*max*).
+We also present histograms in the collapsible sections below, showing the distribution of these relation distances (x-axis) and their counts (y-axis).
+
+<details>
+<summary>Command</summary>
+
+```
+python src/evaluate_documents.py dataset=argmicro_base metric=relation_argument_token_distances
+```
+
+</details>
+
+| | len | max | mean | min | std |
+| :---- | ---: | --: | -----: | --: | -----: |
+| ALL | 1018 | 127 | 44.434 | 14 | 21.501 |
+| exa | 18 | 63 | 33.556 | 16 | 13.056 |
+| joint | 88 | 48 | 30.091 | 17 | 9.075 |
+| reb | 220 | 127 | 49.327 | 16 | 24.653 |
+| sup | 562 | 124 | 46.534 | 14 | 22.079 |
+| und | 130 | 84 | 38.292 | 17 | 12.321 |
+
+<details>
+<summary>Histogram (split: train, 112 documents)</summary>
+
+![rtd-label_argmicro.png](img%2Frtd-label_argmicro.png)
+
+</details>
+
+#### Span lengths (tokens)
+
+The span length is the number of tokens in an argumentative unit, measured from its first to its last token.
+
+We collect the following statistics: number of documents in the split (*no. doc*), no. of spans (*len*), mean of number of tokens in a span (*mean*), standard deviation of the number of tokens (*std*), minimum tokens in a span (*min*), and maximum tokens in a span (*max*).
+We also present histograms in the collapsible sections below, showing the distribution of these span lengths (x-axis) and their counts (y-axis).
+
+<details>
+<summary>Command</summary>
+
+```
+python src/evaluate_documents.py dataset=argmicro_base metric=span_lengths_tokens
+```
+
+</details>
+
+| statistics | train |
+| :--------- | -----: |
+| no. doc | 112 |
+| len | 576 |
+| mean | 16.365 |
+| std | 6.545 |
+| min | 4 |
+| max | 41 |
+
+<details>
+<summary>Histogram (split: train, 112 documents)</summary>
+
+![slt_argmicro.png](img%2Fslt_argmicro.png)
+
+</details>
+
+#### Token length (tokens)
+
+The token length is the number of tokens in a document, measured from the first token to the last.
+
+We collect the following statistics: number of documents in the split (*no. doc*), mean of document token-length (*mean*), standard deviation of the length (*std*), minimum number of tokens in a document (*min*), and maximum number of tokens in a document (*max*).
+We also present histograms in the collapsible sections below, showing the distribution of these token lengths (x-axis) and their counts (y-axis).
+
+<details>
+<summary>Command</summary>
+
+```
+python src/evaluate_documents.py dataset=argmicro_base metric=count_text_tokens
+```
+
+</details>
+
+| statistics | train |
+| :--------- | -----: |
+| no. doc | 112 |
+| mean | 84.161 |
+| std | 22.596 |
+| min | 36 |
+| max | 153 |
+
+<details>
+<summary>Histogram (split: train, 112 documents)</summary>
+
+![tl_argmicro.png](img%2Ftl_argmicro.png)
+
+</details>
diff --git a/dataset_builders/pie/argmicro/img/rtd-label_argmicro.png b/dataset_builders/pie/argmicro/img/rtd-label_argmicro.png
new file mode 100644
index 00000000..0b6be71d
Binary files /dev/null and b/dataset_builders/pie/argmicro/img/rtd-label_argmicro.png differ
diff --git a/dataset_builders/pie/argmicro/img/slt_argmicro.png b/dataset_builders/pie/argmicro/img/slt_argmicro.png
new file mode 100644
index 00000000..5dfaac41
Binary files /dev/null and b/dataset_builders/pie/argmicro/img/slt_argmicro.png differ
diff --git a/dataset_builders/pie/argmicro/img/tl_argmicro.png b/dataset_builders/pie/argmicro/img/tl_argmicro.png
new file mode 100644
index 00000000..3b19768f
Binary files /dev/null and b/dataset_builders/pie/argmicro/img/tl_argmicro.png differ
diff --git a/dataset_builders/pie/cdcp/README.md b/dataset_builders/pie/cdcp/README.md
index cc3e8bd3..1ea8748c 100644
--- a/dataset_builders/pie/cdcp/README.md
+++ b/dataset_builders/pie/cdcp/README.md
@@ -3,6 +3,27 @@
This is a [PyTorch-IE](https://github.com/ChristophAlt/pytorch-ie) wrapper for the
[CDCP Huggingface dataset loading script](https://huggingface.co/datasets/DFKI-SLT/cdcp).
+## Usage
+
+```python
+from pie_datasets import load_dataset
+from pytorch_ie.documents import TextDocumentWithLabeledSpansAndBinaryRelations
+
+# load default version
+dataset = load_dataset("pie/cdcp")
+
+# if required, normalize the document type (see section Document Converters below)
+dataset_converted = dataset.to_document_type(TextDocumentWithLabeledSpansAndBinaryRelations)
+assert isinstance(dataset_converted["train"][0], TextDocumentWithLabeledSpansAndBinaryRelations)
+
+# get first relation in the first document
+doc = dataset_converted["train"][0]
+print(doc.binary_relations[0])
+# BinaryRelation(head=LabeledSpan(start=0, end=78, label='value', score=1.0), tail=LabeledSpan(start=79, end=242, label='value', score=1.0), label='reason', score=1.0)
+print(doc.binary_relations[0].resolve())
+# ('reason', (('value', 'State and local court rules sometimes make default judgments much more likely.'), ('value', 'For example, when a person who allegedly owes a debt is told to come to court on a work day, they may be forced to choose between a default judgment and their job.')))
+```
+
## Data Schema
The document type for this dataset is `CDCPDocument` which defines the following data fields:
@@ -32,3 +53,147 @@ The dataset provides document converters for the following target document types
See [here](https://github.com/ChristophAlt/pytorch-ie/blob/main/src/pytorch_ie/documents.py) for the document type
definitions.
+
+### Collected Statistics after Document Conversion
+
+We use the script `evaluate_documents.py` from [PyTorch-IE-Hydra-Template](https://github.com/ArneBinder/pytorch-ie-hydra-template-1) to generate these statistics.
+After checking out that code, the statistics and plots can be generated by the command:
+
+```commandline
+python src/evaluate_documents.py dataset=cdcp_base metric=METRIC
+```
+
+where `METRIC` is one of the available metric configs in `configs/metric/` (see [metrics](https://github.com/ArneBinder/pytorch-ie-hydra-template-1/tree/main/configs/metric)).
+
+This also requires the following dataset config at `configs/dataset/cdcp_base.yaml` in the repo directory:
+
+```yaml
+_target_: src.utils.execute_pipeline
+input:
+ _target_: pie_datasets.DatasetDict.load_dataset
+ path: pie/cdcp
+ revision: 001722894bdca6df6a472d0d186a3af103e392c5
+```
+
+For token-based metrics, this uses `bert-base-uncased` from `transformers.AutoTokenizer` (see [AutoTokenizer](https://huggingface.co/docs/transformers/v4.37.1/en/model_doc/auto#transformers.AutoTokenizer) and [bert-base-uncased](https://huggingface.co/bert-base-uncased)) to tokenize the `text` in `TextDocumentWithLabeledSpansAndBinaryRelations` (see [document type](https://github.com/ChristophAlt/pytorch-ie/blob/main/src/pytorch_ie/documents.py)).
+
+#### Relation argument (outer) token distance per label
+
+The distance is measured from the first token of the first argumentative unit to the last token of the last unit, a.k.a. outer distance.
+
+We collect the following statistics: number of documents in the split (*no. doc*), no. of relations (*len*), mean of token distance (*mean*), standard deviation of the distance (*std*), minimum outer distance (*min*), and maximum outer distance (*max*).
+We also present histograms in the collapsibles below, showing the distribution of relation distances (x-axis) and their counts (y-axis).
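As a minimal sketch (not the script's actual implementation), the outer distance between two relation arguments can be computed from their token offsets; the `(start, end)` tuples below are hypothetical:

```python
# Minimal sketch of the outer token distance between two relation
# arguments; the (start, end) token offsets below are hypothetical.

def outer_token_distance(head: tuple, tail: tuple) -> int:
    """Distance from the first token of the first unit to the last token of the last unit."""
    return max(head[1], tail[1]) - min(head[0], tail[0])

print(outer_token_distance((120, 135), (100, 110)))  # 35
```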
+
+<details>
+<summary>Command</summary>
+
+```
+python src/evaluate_documents.py dataset=cdcp_base metric=relation_argument_token_distances
+```
+
+</details>
+
+##### train (580 documents)
+
+| | len | max | mean | min | std |
+| :------- | ---: | --: | -----: | --: | -----: |
+| ALL | 2204 | 240 | 48.839 | 8 | 31.462 |
+| evidence | 94 | 196 | 66.723 | 14 | 42.444 |
+| reason | 2110 | 240 | 48.043 | 8 | 30.64 |
+
+<details>
+<summary>Histogram (split: train, 580 documents)</summary>
+
+![rtd-label_cdcp_train.png](img%2Frtd-label_cdcp_train.png)
+
+</details>
+
+##### test (150 documents)
+
+| | len | max | mean | min | std |
+| :------- | --: | --: | -----: | --: | -----: |
+| ALL | 648 | 212 | 51.299 | 8 | 31.159 |
+| evidence | 52 | 170 | 73.923 | 20 | 39.855 |
+| reason | 596 | 212 | 49.326 | 8 | 29.47 |
+
+<details>
+<summary>Histogram (split: test, 150 documents)</summary>
+
+![rtd-label_cdcp_test.png](img%2Frtd-label_cdcp_test.png)
+
+</details>
+
+#### Span lengths (tokens)
+
+The span length is the number of tokens from the first to the last token of the respective argumentative unit.
+
+We collect the following statistics: number of documents in the split (*no. doc*), no. of spans (*len*), mean of number of tokens in a span (*mean*), standard deviation of the number of tokens (*std*), minimum tokens in a span (*min*), and maximum tokens in a span (*max*).
+We also present histograms in the collapsibles below, showing the distribution of span lengths (x-axis) and their counts (y-axis).
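A minimal sketch of the span length statistic, using whitespace splitting as a stand-in for the bert-base-uncased tokenizer; the text and character offsets below are hypothetical:

```python
# Minimal sketch of span lengths in tokens, using whitespace splitting
# as a stand-in for the bert-base-uncased tokenizer; the text and
# (start, end) character offsets below are hypothetical.
from statistics import mean

text = "State and local court rules sometimes make default judgments much more likely."
spans = [(0, 27), (28, 79)]  # hypothetical (start, end) character offsets

lengths = [len(text[start:end].split()) for start, end in spans]
print(lengths)        # [5, 7]
print(mean(lengths))  # 6
```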
+
+<details>
+<summary>Command</summary>
+
+```
+python src/evaluate_documents.py dataset=cdcp_base metric=span_lengths_tokens
+```
+
+</details>
+
+| statistics | train | test |
+| :--------- | -----: | -----: |
+| no. doc | 580 | 150 |
+| len | 3901 | 1026 |
+| mean | 19.441 | 18.758 |
+| std | 11.71 | 10.388 |
+| min | 2 | 3 |
+| max | 142 | 83 |
+
+<details>
+<summary>Histogram (split: train, 580 documents)</summary>
+
+![slt_cdcp_train.png](img%2Fslt_cdcp_train.png)
+
+</details>
+
+<details>
+<summary>Histogram (split: test, 150 documents)</summary>
+
+![slt_cdcp_test.png](img%2Fslt_cdcp_test.png)
+
+</details>
+
+#### Token length (tokens)
+
+The token length is measured from the first token of the document to the last one.
+
+We collect the following statistics: number of documents in the split (*no. doc*), mean of document token-length (*mean*), standard deviation of the length (*std*), minimum number of tokens in a document (*min*), and maximum number of tokens in a document (*max*).
+We also present histograms in the collapsibles below, showing the distribution of document token lengths (x-axis) and their counts (y-axis).
+
+<details>
+<summary>Command</summary>
+
+```
+python src/evaluate_documents.py dataset=cdcp_base metric=count_text_tokens
+```
+
+</details>
+
+| statistics | train | test |
+| :--------- | ------: | ------: |
+| no. doc | 580 | 150 |
+| mean | 130.781 | 128.673 |
+| std | 101.121 | 98.708 |
+| min | 13 | 15 |
+| max | 562 | 571 |
+
+<details>
+<summary>Histogram (split: train, 580 documents)</summary>
+
+![tl_cdcp_train.png](img%2Ftl_cdcp_train.png)
+
+</details>
+
+<details>
+<summary>Histogram (split: test, 150 documents)</summary>
+
+![tl_cdcp_test.png](img%2Ftl_cdcp_test.png)
+
+</details>
diff --git a/dataset_builders/pie/cdcp/img/rtd-label_cdcp_test.png b/dataset_builders/pie/cdcp/img/rtd-label_cdcp_test.png
new file mode 100644
index 00000000..539ee94f
Binary files /dev/null and b/dataset_builders/pie/cdcp/img/rtd-label_cdcp_test.png differ
diff --git a/dataset_builders/pie/cdcp/img/rtd-label_cdcp_train.png b/dataset_builders/pie/cdcp/img/rtd-label_cdcp_train.png
new file mode 100644
index 00000000..f9bc45c8
Binary files /dev/null and b/dataset_builders/pie/cdcp/img/rtd-label_cdcp_train.png differ
diff --git a/dataset_builders/pie/cdcp/img/slt_cdcp_test.png b/dataset_builders/pie/cdcp/img/slt_cdcp_test.png
new file mode 100644
index 00000000..fa82e864
Binary files /dev/null and b/dataset_builders/pie/cdcp/img/slt_cdcp_test.png differ
diff --git a/dataset_builders/pie/cdcp/img/slt_cdcp_train.png b/dataset_builders/pie/cdcp/img/slt_cdcp_train.png
new file mode 100644
index 00000000..79404c63
Binary files /dev/null and b/dataset_builders/pie/cdcp/img/slt_cdcp_train.png differ
diff --git a/dataset_builders/pie/cdcp/img/tl_cdcp_test.png b/dataset_builders/pie/cdcp/img/tl_cdcp_test.png
new file mode 100644
index 00000000..279f511a
Binary files /dev/null and b/dataset_builders/pie/cdcp/img/tl_cdcp_test.png differ
diff --git a/dataset_builders/pie/cdcp/img/tl_cdcp_train.png b/dataset_builders/pie/cdcp/img/tl_cdcp_train.png
new file mode 100644
index 00000000..b85a3ee7
Binary files /dev/null and b/dataset_builders/pie/cdcp/img/tl_cdcp_train.png differ
diff --git a/dataset_builders/pie/sciarg/README.md b/dataset_builders/pie/sciarg/README.md
index 5f8cfee9..bfa5918f 100644
--- a/dataset_builders/pie/sciarg/README.md
+++ b/dataset_builders/pie/sciarg/README.md
@@ -4,6 +4,37 @@ This is a [PyTorch-IE](https://github.com/ChristophAlt/pytorch-ie) wrapper for t
Therefore, the `sciarg` dataset as described here follows the data structure from the [PIE brat dataset card](https://huggingface.co/datasets/pie/brat).
+### Usage
+
+```python
+from pie_datasets import load_dataset
+from pie_datasets.builders.brat import BratDocumentWithMergedSpans, BratDocument
+from pytorch_ie.documents import TextDocumentWithLabeledMultiSpansBinaryRelationsAndLabeledPartitions, TextDocumentWithLabeledSpansBinaryRelationsAndLabeledPartitions
+
+# load default version
+dataset = load_dataset("pie/sciarg")
+assert isinstance(dataset["train"][0], BratDocumentWithMergedSpans)
+
+# if required, normalize the document type (see section Document Converters below)
+dataset_converted = dataset.to_document_type(TextDocumentWithLabeledSpansBinaryRelationsAndLabeledPartitions)
+assert isinstance(dataset_converted["train"][0], TextDocumentWithLabeledSpansBinaryRelationsAndLabeledPartitions)
+
+# load version with resolved parts_of_same relations
+dataset = load_dataset("pie/sciarg", name='resolve_parts_of_same')
+assert isinstance(dataset["train"][0], BratDocument)
+
+# if required, normalize the document type (see section Document Converters below)
+dataset_converted = dataset.to_document_type(TextDocumentWithLabeledMultiSpansBinaryRelationsAndLabeledPartitions)
+assert isinstance(dataset_converted["train"][0], TextDocumentWithLabeledMultiSpansBinaryRelationsAndLabeledPartitions)
+
+# get first relation in the first document
+doc = dataset_converted["train"][0]
+print(doc.binary_relations[0])
+# BinaryRelation(head=LabeledMultiSpan(slices=((15071, 15076),), label='data', score=1.0), tail=LabeledMultiSpan(slices=((14983, 15062),), label='background_claim', score=1.0), label='supports', score=1.0)
+print(doc.binary_relations[0].resolve())
+# ('supports', (('data', ('[ 3 ]',)), ('background_claim', ('PSD and improved example-based schemes have been discussed in many publications',))))
+```
+
### Dataset Summary
The SciArg dataset is an extension of the Dr. Inventor corpus (Fisas et al., [2015](https://aclanthology.org/W15-1605.pdf), [2016](https://aclanthology.org/L16-1492.pdf)) with an annotation layer containing
@@ -39,21 +70,25 @@ are connected via the `parts_of_same` relations are converted to `LabeledMultiSp
See [PIE-Brat Data Schema](https://huggingface.co/datasets/pie/brat#data-schema).
-### Usage
+### Document Converters
-```python
-from pie_datasets import load_dataset, builders
+The dataset provides document converters for the following target document types:
-# load default version
-datasets = load_dataset("pie/sciarg")
-doc = datasets["train"][0]
-assert isinstance(doc, builders.brat.BratDocumentWithMergedSpans)
+- `pytorch_ie.documents.TextDocumentWithLabeledSpansAndBinaryRelations`
+  - `LabeledSpans`, converted from `BratDocument`'s `spans`
+    - labels: `background_claim`, `own_claim`, `data`
+    - if `spans` contain whitespace at the beginning and/or the end, the whitespace is trimmed.
+  - `BinaryRelations`, converted from `BratDocument`'s `relations`
+    - labels: `supports`, `contradicts`, `semantically_same`, `parts_of_same`
+    - relations labeled `semantically_same` or `parts_of_same` are merged if they connect the same pair of arguments after sorting.
+- `pytorch_ie.documents.TextDocumentWithLabeledSpansBinaryRelationsAndLabeledPartitions`
+  - `LabeledSpans`, as above
+  - `BinaryRelations`, as above
+  - `LabeledPartitions`, created by partitioning the `BratDocument`'s `text` into paragraphs using a regex
+    - labels: `title`, `abstract`, `H1`
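A minimal sketch of the merging step described above, de-duplicating symmetric relations that connect the same pair of arguments; the span ids and relation tuples below are hypothetical:

```python
# Minimal sketch of merging symmetric relations ('semantically_same',
# 'parts_of_same') that connect the same pair of arguments after
# sorting; span ids and relation tuples below are hypothetical.
SYMMETRIC = {"semantically_same", "parts_of_same"}

relations = [
    ("s1", "s2", "parts_of_same"),
    ("s2", "s1", "parts_of_same"),  # same pair, reversed -> duplicate
    ("s1", "s3", "supports"),
]

seen, merged = set(), []
for head, tail, label in relations:
    # sort the arguments only for symmetric labels, so that reversed
    # duplicates map to the same key
    key = (tuple(sorted((head, tail))), label) if label in SYMMETRIC else (head, tail, label)
    if key not in seen:
        seen.add(key)
        merged.append((head, tail, label))

print(merged)  # [('s1', 's2', 'parts_of_same'), ('s1', 's3', 'supports')]
```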
-# load version with resolved parts_of_same relations
-datasets = load_dataset("pie/sciarg", name='resolve_parts_of_same')
-doc = datasets["train"][0]
-assert isinstance(doc, builders.brat.BratDocument)
-```
+See [here](https://github.com/ChristophAlt/pytorch-ie/blob/main/src/pytorch_ie/documents.py) for the document type
+definitions.
### Data Splits
@@ -133,6 +168,13 @@ possibly since [Lauscher et al., 2018](https://aclanthology.org/W18-5206/) prese
(*Annotation Guidelines*, pp. 4-6)
+There are currently discrepancies in the label counts between
+
+- the previous report in [Lauscher et al., 2018](https://aclanthology.org/W18-5206/), p. 43, and
+- the current report above (labels counted in `BratDocument`s),
+
+possibly because [Lauscher et al., 2018](https://aclanthology.org/W18-5206/) presents the numbers of the actual argumentative components, whereas here discontinuous components are still split (marked with the `parts_of_same` helper relation) and are thus counted per fragment.
+
#### Examples
![sample1](img/leaannof3.png)
@@ -143,9 +185,14 @@ Below: Subset of relations in `A01`
![sample2](img/sciarg-sam.png)
-### Document Converters
+### Collected Statistics after Document Conversion
-The dataset provides document converters for the following target document types:
+We use the script `evaluate_documents.py` from [PyTorch-IE-Hydra-Template](https://github.com/ArneBinder/pytorch-ie-hydra-template-1) to generate these statistics.
+After checking out that code, the statistics and plots can be generated by the command:
+
+```commandline
+python src/evaluate_documents.py dataset=sciarg_base metric=METRIC
+```
From `default` version:
@@ -178,8 +225,111 @@ From `resolve_parts_of_same` version:
- `labeled_partitions`, `LabeledSpan` annotations, created from splitting `BratDocument`'s `text` at new paragraph in `xml` format.
- labels: `title`, `abstract`, `H1`
-See [here](https://github.com/ArneBinder/pie-modules/blob/main/src/pie_modules/documents.py) for the document type
-definitions.
+This also requires the following dataset config at `configs/dataset/sciarg_base.yaml` in the repo directory:
+
+```yaml
+_target_: src.utils.execute_pipeline
+input:
+ _target_: pie_datasets.DatasetDict.load_dataset
+ path: pie/sciarg
+ revision: 982d5682ba414ee13cf92cb93ec18fc8e78e2b81
+```
+
+For token-based metrics, this uses `bert-base-uncased` from `transformers.AutoTokenizer` (see [AutoTokenizer](https://huggingface.co/docs/transformers/v4.37.1/en/model_doc/auto#transformers.AutoTokenizer) and [bert-base-uncased](https://huggingface.co/bert-base-uncased)) to tokenize the `text` in `TextDocumentWithLabeledSpansAndBinaryRelations` (see [document type](https://github.com/ArneBinder/pie-modules/blob/main/src/pie_modules/documents.py)).
+
+#### Relation argument (outer) token distance per label
+
+The distance is measured from the first token of the first argumentative unit to the last token of the last unit, a.k.a. outer distance.
+
+We collect the following statistics: number of documents in the split (*no. doc*), no. of relations (*len*), mean of token distance (*mean*), standard deviation of the distance (*std*), minimum outer distance (*min*), and maximum outer distance (*max*).
+We also present histograms in the collapsibles below, showing the distribution of relation distances (x-axis) and their counts (y-axis).
+
+<details>
+<summary>Command</summary>
+
+```
+python src/evaluate_documents.py dataset=sciarg_base metric=relation_argument_token_distances
+```
+
+</details>
+
+| | len | max | mean | min | std |
+| :---------------- | ----: | ---: | ------: | --: | ------: |
+| ALL | 15640 | 2864 | 30.524 | 3 | 45.351 |
+| contradicts | 1392 | 238 | 32.565 | 6 | 19.771 |
+| parts_of_same | 2594 | 374 | 28.18 | 3 | 26.845 |
+| semantically_same | 84 | 2864 | 206.333 | 11 | 492.268 |
+| supports | 11570 | 407 | 29.527 | 4 | 24.189 |
+
+<details>
+<summary>Histogram (split: train, 40 documents)</summary>
+
+![rtd-label_sciarg.png](img%2Frtd-label_sciarg.png)
+
+</details>
+
+#### Span lengths (tokens)
+
+The span length is the number of tokens from the first to the last token of the respective argumentative unit.
+
+We collect the following statistics: number of documents in the split (*no. doc*), no. of spans (*len*), mean of number of tokens in a span (*mean*), standard deviation of the number of tokens (*std*), minimum tokens in a span (*min*), and maximum tokens in a span (*max*).
+We also present histograms in the collapsibles below, showing the distribution of span lengths (x-axis) and their counts (y-axis).
+
+<details>
+<summary>Command</summary>
+
+```
+python src/evaluate_documents.py dataset=sciarg_base metric=span_lengths_tokens
+```
+
+</details>
+
+| statistics | train |
+| :--------- | -----: |
+| no. doc | 40 |
+| len | 13586 |
+| mean | 11.677 |
+| std | 8.731 |
+| min | 1 |
+| max | 138 |
+
+<details>
+<summary>Histogram (split: train, 40 documents)</summary>
+
+![slt_sciarg.png](img%2Fslt_sciarg.png)
+
+</details>
+
+#### Token length (tokens)
+
+The token length is measured from the first token of the document to the last one.
+
+We collect the following statistics: number of documents in the split (*no. doc*), mean of document token-length (*mean*), standard deviation of the length (*std*), minimum number of tokens in a document (*min*), and maximum number of tokens in a document (*max*).
+We also present histograms in the collapsibles below, showing the distribution of document token lengths (x-axis) and their counts (y-axis).
+
+<details>
+<summary>Command</summary>
+
+```
+python src/evaluate_documents.py dataset=sciarg_base metric=count_text_tokens
+```
+
+</details>
+
+| statistics | train |
+| :--------- | ------: |
+| no. doc | 40 |
+| mean | 10521.1 |
+| std | 2472.2 |
+| min | 6452 |
+| max | 16421 |
+
+<details>
+<summary>Histogram (split: train, 40 documents)</summary>
+
+![tl_sciarg.png](img%2Ftl_sciarg.png)
+
+</details>
## Dataset Creation
diff --git a/dataset_builders/pie/sciarg/img/rtd-label_sciarg.png b/dataset_builders/pie/sciarg/img/rtd-label_sciarg.png
new file mode 100644
index 00000000..34500b39
Binary files /dev/null and b/dataset_builders/pie/sciarg/img/rtd-label_sciarg.png differ
diff --git a/dataset_builders/pie/sciarg/img/slt_sciarg.png b/dataset_builders/pie/sciarg/img/slt_sciarg.png
new file mode 100644
index 00000000..67cd3e37
Binary files /dev/null and b/dataset_builders/pie/sciarg/img/slt_sciarg.png differ
diff --git a/dataset_builders/pie/sciarg/img/tl_sciarg.png b/dataset_builders/pie/sciarg/img/tl_sciarg.png
new file mode 100644
index 00000000..46f701d8
Binary files /dev/null and b/dataset_builders/pie/sciarg/img/tl_sciarg.png differ
diff --git a/dataset_builders/pie/scidtb_argmin/README.md b/dataset_builders/pie/scidtb_argmin/README.md
index e5235c75..cb577a6b 100644
--- a/dataset_builders/pie/scidtb_argmin/README.md
+++ b/dataset_builders/pie/scidtb_argmin/README.md
@@ -3,6 +3,27 @@
This is a [PyTorch-IE](https://github.com/ChristophAlt/pytorch-ie) wrapper for the
[SciDTB ArgMin Huggingface dataset loading script](https://huggingface.co/datasets/DFKI-SLT/scidtb_argmin).
+## Usage
+
+```python
+from pie_datasets import load_dataset
+from pytorch_ie.documents import TextDocumentWithLabeledSpansAndBinaryRelations
+
+# load default version
+dataset = load_dataset("pie/scidtb_argmin")
+
+# if required, normalize the document type (see section Document Converters below)
+dataset_converted = dataset.to_document_type(TextDocumentWithLabeledSpansAndBinaryRelations)
+assert isinstance(dataset_converted["train"][0], TextDocumentWithLabeledSpansAndBinaryRelations)
+
+# get first relation in the first document
+doc = dataset_converted["train"][0]
+print(doc.binary_relations[0])
+# BinaryRelation(head=LabeledSpan(start=251, end=454, label='means', score=1.0), tail=LabeledSpan(start=455, end=712, label='proposal', score=1.0), label='detail', score=1.0)
+print(doc.binary_relations[0].resolve())
+# ('detail', (('means', 'We observe , identify , and detect naturally occurring signals of interestingness in click transitions on the Web between source and target documents , which we collect from commercial Web browser logs .'), ('proposal', 'The DSSM is trained on millions of Web transitions , and maps source-target document pairs to feature vectors in a latent space in such a way that the distance between source documents and their corresponding interesting targets in that space is minimized .')))
+```
+
## Data Schema
The document type for this dataset is `SciDTBArgminDocument` which defines the following data fields:
@@ -31,3 +52,120 @@ The dataset provides document converters for the following target document types
See [here](https://github.com/ChristophAlt/pytorch-ie/blob/main/src/pytorch_ie/documents.py) for the document type
definitions.
+
+### Collected Statistics after Document Conversion
+
+We use the script `evaluate_documents.py` from [PyTorch-IE-Hydra-Template](https://github.com/ArneBinder/pytorch-ie-hydra-template-1) to generate these statistics.
+After checking out that code, the statistics and plots can be generated by the command:
+
+```commandline
+python src/evaluate_documents.py dataset=scidtb_argmin_base metric=METRIC
+```
+
+where `METRIC` is one of the available metric configs in `configs/metric/` (see [metrics](https://github.com/ArneBinder/pytorch-ie-hydra-template-1/tree/main/configs/metric)).
+
+This also requires the following dataset config at `configs/dataset/scidtb_argmin_base.yaml` in the repo directory:
+
+```yaml
+_target_: src.utils.execute_pipeline
+input:
+ _target_: pie_datasets.DatasetDict.load_dataset
+ path: pie/scidtb_argmin
+ revision: 335a8e6168919d7f204c6920eceb96745dbd161b
+```
+
+For token-based metrics, this uses `bert-base-uncased` from `transformers.AutoTokenizer` (see [AutoTokenizer](https://huggingface.co/docs/transformers/v4.37.1/en/model_doc/auto#transformers.AutoTokenizer) and [bert-base-uncased](https://huggingface.co/bert-base-uncased)) to tokenize the `text` in `TextDocumentWithLabeledSpansAndBinaryRelations` (see [document type](https://github.com/ChristophAlt/pytorch-ie/blob/main/src/pytorch_ie/documents.py)).
+
+#### Relation argument (outer) token distance per label
+
+The distance is measured from the first token of the first argumentative unit to the last token of the last unit, a.k.a. outer distance.
+
+We collect the following statistics: number of documents in the split (*no. doc*), no. of relations (*len*), mean of token distance (*mean*), standard deviation of the distance (*std*), minimum outer distance (*min*), and maximum outer distance (*max*).
+We also present histograms in the collapsibles below, showing the distribution of relation distances (x-axis) and their counts (y-axis).
+
+<details>
+<summary>Command</summary>
+
+```
+python src/evaluate_documents.py dataset=scidtb_argmin_base metric=relation_argument_token_distances
+```
+
+</details>
+
+| | len | max | mean | min | std |
+| :--------- | --: | --: | -----: | --: | -----: |
+| ALL | 586 | 277 | 75.239 | 21 | 40.312 |
+| additional | 54 | 180 | 59.593 | 36 | 29.306 |
+| detail | 258 | 163 | 65.62 | 22 | 29.21 |
+| sequence | 22 | 93 | 59.727 | 38 | 17.205 |
+| support | 252 | 277 | 89.794 | 21 | 48.118 |
+
+<details>
+<summary>Histogram (split: train, 60 documents)</summary>
+
+![rtd-label_scitdb-argmin.png](img%2Frtd-label_scitdb-argmin.png)
+
+</details>
+
+#### Span lengths (tokens)
+
+The span length is the number of tokens from the first to the last token of the respective argumentative unit.
+
+We collect the following statistics: number of documents in the split (*no. doc*), no. of spans (*len*), mean of number of tokens in a span (*mean*), standard deviation of the number of tokens (*std*), minimum tokens in a span (*min*), and maximum tokens in a span (*max*).
+We also present histograms in the collapsibles below, showing the distribution of span lengths (x-axis) and their counts (y-axis).
+
+<details>
+<summary>Command</summary>
+
+```
+python src/evaluate_documents.py dataset=scidtb_argmin_base metric=span_lengths_tokens
+```
+
+</details>
+
+| statistics | train |
+| :--------- | -----: |
+| no. doc | 60 |
+| len | 353 |
+| mean | 27.946 |
+| std | 13.054 |
+| min | 7 |
+| max | 123 |
+
+<details>
+<summary>Histogram (split: train, 60 documents)</summary>
+
+![slt_scitdb-argmin.png](img%2Fslt_scitdb-argmin.png)
+
+</details>
+
+#### Token length (tokens)
+
+The token length is measured from the first token of the document to the last one.
+
+We collect the following statistics: number of documents in the split (*no. doc*), mean of document token-length (*mean*), standard deviation of the length (*std*), minimum number of tokens in a document (*min*), and maximum number of tokens in a document (*max*).
+We also present histograms in the collapsibles below, showing the distribution of document token lengths (x-axis) and their counts (y-axis).
+
+<details>
+<summary>Command</summary>
+
+```
+python src/evaluate_documents.py dataset=scidtb_argmin_base metric=count_text_tokens
+```
+
+</details>
+
+| statistics | train |
+| :--------- | ------: |
+| no. doc | 60 |
+| mean | 164.417 |
+| std | 64.572 |
+| min | 80 |
+| max | 532 |
+
+<details>
+<summary>Histogram (split: train, 60 documents)</summary>
+
+![tl_scidtb-argmin.png](img%2Ftl_scidtb-argmin.png)
+
+</details>
diff --git a/dataset_builders/pie/scidtb_argmin/img/rtd-label_scitdb-argmin.png b/dataset_builders/pie/scidtb_argmin/img/rtd-label_scitdb-argmin.png
new file mode 100644
index 00000000..d80c88f4
Binary files /dev/null and b/dataset_builders/pie/scidtb_argmin/img/rtd-label_scitdb-argmin.png differ
diff --git a/dataset_builders/pie/scidtb_argmin/img/slt_scitdb-argmin.png b/dataset_builders/pie/scidtb_argmin/img/slt_scitdb-argmin.png
new file mode 100644
index 00000000..aa9d9823
Binary files /dev/null and b/dataset_builders/pie/scidtb_argmin/img/slt_scitdb-argmin.png differ
diff --git a/dataset_builders/pie/scidtb_argmin/img/tl_scidtb-argmin.png b/dataset_builders/pie/scidtb_argmin/img/tl_scidtb-argmin.png
new file mode 100644
index 00000000..d14f3fc7
Binary files /dev/null and b/dataset_builders/pie/scidtb_argmin/img/tl_scidtb-argmin.png differ