
Collecting statistics from AM dataset #100

Merged · 42 commits · Nov 7, 2024

Commits
b4631db
added statistics: Relation_argument_outer_token_distance.md
idalr Jan 25, 2024
afba426
minor fix
idalr Jan 25, 2024
4733900
micro fix
idalr Jan 25, 2024
bac5acf
micro fix
idalr Jan 25, 2024
a443e86
micro fix
idalr Jan 25, 2024
27f80e7
added commands and collapsed all histograms
idalr Jan 28, 2024
0d266ed
minor edits
idalr Jan 28, 2024
839a801
added text_length_tokens.md and
idalr Jan 31, 2024
372485d
micro fix
idalr Jan 31, 2024
acb951a
add span_lengths_tokens.md
idalr Jan 31, 2024
b5efb8b
edited texts in markdowns
idalr Jan 31, 2024
f28a0df
micro fix
idalr Jan 31, 2024
88f6012
added relation_argument_outer_token_distance_per_label.md
idalr Feb 1, 2024
e1e5706
micro fix
idalr Feb 1, 2024
4ffab8d
deleted ~~relation_argument_outer_token_distance.md~~
idalr Feb 1, 2024
bd36737
moved collected statistics content to pie/abstrct/readme.md
idalr Feb 19, 2024
43fd083
minor fix
idalr Feb 19, 2024
9aff110
edited text in abstrct/readme
idalr Feb 23, 2024
19d31a5
minor edit abstrct/readme
idalr Mar 11, 2024
7a11771
removed abstrct histograms from statistics folder
idalr Mar 11, 2024
e943371
moved content and files to aae2.readme, and removed from statistics f…
idalr Mar 11, 2024
51e84bc
minor fix
idalr Mar 11, 2024
0c8efdf
minor edit
idalr Mar 11, 2024
01598ec
moved content and files to argmicro/readme.md
idalr Mar 11, 2024
3a6d0f3
moved content and files to cdcp/readme.md
idalr Mar 11, 2024
6105b61
minor fix
idalr Mar 11, 2024
53afeab
moved content and files to scidtb_argmin/readme.md
idalr Mar 11, 2024
4477c28
minor fix
idalr Mar 11, 2024
075737c
moved content and files to sciarg/readme.md
idalr Mar 11, 2024
d87bb85
deleted AM_statistics folder
idalr Mar 11, 2024
c55db07
minor edit
idalr Mar 11, 2024
2d04194
fixed typos
idalr Mar 11, 2024
768eecf
minor adjustments
ArneBinder Nov 7, 2024
10f21bd
minor adjustments
ArneBinder Nov 7, 2024
27f9215
improve usage examples
ArneBinder Nov 7, 2024
19ecf02
add usage example to argmicro
ArneBinder Nov 7, 2024
5d0c88b
fix argmicro usage example
ArneBinder Nov 7, 2024
d5157cb
add cdcp usage example
ArneBinder Nov 7, 2024
e72f2d4
add usage example to scidtb_argmin minor and fix
ArneBinder Nov 7, 2024
045496d
add details to document converter
ArneBinder Nov 7, 2024
f44870f
improve usage example for sciarg
ArneBinder Nov 7, 2024
189c155
move usage example for sciarg
ArneBinder Nov 7, 2024
184 changes: 172 additions & 12 deletions dataset_builders/pie/aae2/README.md
This is a [PyTorch-IE](https://github.com/ChristophAlt/pytorch-ie) wrapper for the AAE2 dataset.

Therefore, the `aae2` dataset as described here follows the data structure from the [PIE brat dataset card](https://huggingface.co/datasets/pie/brat).

### Usage

```python
from pie_datasets import load_dataset
from pie_datasets.builders.brat import BratDocumentWithMergedSpans
from pytorch_ie.documents import TextDocumentWithLabeledSpansBinaryRelationsAndLabeledPartitions

# load default version
dataset = load_dataset("pie/aae2")
assert isinstance(dataset["train"][0], BratDocumentWithMergedSpans)

# if required, normalize the document type (see section Document Converters below)
dataset_converted = dataset.to_document_type(TextDocumentWithLabeledSpansBinaryRelationsAndLabeledPartitions)
assert isinstance(dataset_converted["train"][0], TextDocumentWithLabeledSpansBinaryRelationsAndLabeledPartitions)

# get first relation in the first document
doc = dataset_converted["train"][0]
print(doc.binary_relations[0])
# BinaryRelation(head=LabeledSpan(start=716, end=851, label='Premise', score=1.0), tail=LabeledSpan(start=591, end=714, label='Claim', score=1.0), label='supports', score=1.0)
print(doc.binary_relations[0].resolve())
# ('supports', (('Premise', 'What we acquired from team work is not only how to achieve the same goal with others but more importantly, how to get along with others'), ('Claim', 'through cooperation, children can learn about interpersonal skills which are significant in the future life of all students')))
```

### Dataset Summary

Argument Annotated Essays Corpus (AAEC) ([Stab and Gurevych, 2017](https://aclanthology.org/J17-3005.pdf)) contains student essays. A stance for a controversial theme is expressed by a major claim component as well as claim components, and premise components justify or refute the claims. Attack and support labels are defined as relations. The span covers a statement, *which can stand in isolation as a complete sentence*, according to the AAEC annotation guidelines. All components are annotated with minimum boundaries of a clause or sentence excluding so-called "shell" language such as *On the other hand* and *Hence*. (Morio et al., 2022, p. 642)
The `aae2` dataset comes in a single version (`default`) with `BratDocumentWithMergedSpans` documents.

See [PIE-Brat Data Schema](https://huggingface.co/datasets/pie/brat#data-schema).

### Usage

```python
from pie_datasets import load_dataset, builders

# load default version
datasets = load_dataset("pie/aae2")
doc = datasets["train"][0]
assert isinstance(doc, builders.brat.BratDocumentWithMergedSpans)
```

### Data Splits

| Statistics | Train | Test |
The dataset provides document converters for the following target document types.
See [here](https://github.com/ChristophAlt/pytorch-ie/blob/main/src/pytorch_ie/documents.py) for the document type
definitions.

#### Relation Label Statistics after Document Conversion

When converting from `BratDocumentWithMergedSpans` to `TextDocumentWithLabeledSpansAndBinaryRelations` or `TextDocumentWithLabeledSpansBinaryRelationsAndLabeledPartitions`,
we apply a relation-conversion method (see above) that changes the label counts for the relations as follows:
| Relations           | Count | Percentage |
| :------------------ | ----: | ---------: |
| support: `supports` |  5958 |     89.3 % |
| attack: `attacks`   |   715 |     10.7 % |

### Collected Statistics after Document Conversion

We use the script `evaluate_documents.py` from [PyTorch-IE-Hydra-Template](https://github.com/ArneBinder/pytorch-ie-hydra-template-1) to generate these statistics.
After checking out that code, the statistics and plots can be generated by the command:

```commandline
python src/evaluate_documents.py dataset=aae2_base metric=METRIC
```

where `METRIC` is one of the available metric configs in `configs/metric/` (see [metrics](https://github.com/ArneBinder/pytorch-ie-hydra-template-1/tree/main/configs/metric)).

This also requires the following dataset config in `configs/dataset/aae2_base.yaml` within the repository directory:

```yaml
_target_: src.utils.execute_pipeline
input:
  _target_: pie_datasets.DatasetDict.load_dataset
  path: pie/aae2
  revision: 1015ee38bd8a36549b344008f7a49af72956a7fe
```

For token-based metrics, this uses `bert-base-uncased` with `transformers.AutoTokenizer` (see [AutoTokenizer](https://huggingface.co/docs/transformers/v4.37.1/en/model_doc/auto#transformers.AutoTokenizer) and [bert-base-uncased](https://huggingface.co/bert-base-uncased)) to tokenize the `text` field of `TextDocumentWithLabeledSpansAndBinaryRelations` (see [document type](https://github.com/ChristophAlt/pytorch-ie/blob/main/src/pytorch_ie/documents.py)).

For relation-label statistics, we use the default relation-conversion method, `connect_first`, which yields three distinct relation labels.
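
As a rough illustration of what a `connect_first`-style conversion might do, here is a minimal sketch; the function name, data shapes, and stance labels are assumptions for illustration, not the builder's actual implementation:

```python
# Hedged sketch of a "connect_first"-style relation conversion (illustrative):
# claim stances are materialized as relations to the first MajorClaim, and
# any additional MajorClaims are tied to the first one.

def connect_first(major_claims, claim_stances):
    """major_claims: span ids in document order; claim_stances: {claim_id: 'For'/'Against'}."""
    relations = []
    first = major_claims[0]
    # duplicate major claims are linked to the first one
    for mc in major_claims[1:]:
        relations.append((mc, first, "semantically_same"))
    # stances become directed relations to the first major claim
    stance_to_label = {"For": "supports", "Against": "attacks"}
    for claim, stance in claim_stances.items():
        relations.append((claim, first, stance_to_label[stance]))
    return relations
```

This reproduces the three relation labels (`supports`, `attacks`, `semantically_same`) seen in the statistics below.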

#### Relation argument (outer) token distance per label

The distance is measured from the first token of the first argumentative unit to the last token of the last unit, a.k.a. outer distance.

We collect the following statistics: number of documents in the split (*no. doc*, given in the subsection headings), number of relations (*len*), mean token distance (*mean*), standard deviation of the distance (*std*), minimum outer distance (*min*), and maximum outer distance (*max*).
We also present histograms in collapsible sections, showing the distribution of these relation distances (x-axis) and their counts (y-axis).
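
The outer distance can be sketched as follows; the `(start, end)` token offsets and the function name are illustrative assumptions, not the metric's actual implementation:

```python
# Minimal sketch of the outer token distance for a single relation:
# span from the first token of the earlier argument to the last token
# of the later argument, regardless of head/tail order.

def outer_token_distance(head, tail):
    """head/tail: (start, end) token offsets of the two arguments."""
    start = min(head[0], tail[0])
    end = max(head[1], tail[1])
    return end - start

# e.g. a Premise at token offsets [120, 150) supporting a Claim at [100, 118)
assert outer_token_distance((120, 150), (100, 118)) == 50
```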

<details>
<summary>Command</summary>

```commandline
python src/evaluate_documents.py dataset=aae2_base metric=relation_argument_token_distances
```

</details>

##### train (322 documents)

| | len | max | mean | min | std |
| :---------------- | ---: | --: | ------: | --: | ------: |
| ALL | 9002 | 514 | 102.582 | 9 | 93.76 |
| attacks | 810 | 442 | 127.622 | 10 | 109.283 |
| semantically_same | 552 | 514 | 301.638 | 25 | 73.756 |
| supports | 7640 | 493 | 85.545 | 9 | 74.023 |

<details>
<summary>Histogram (split: train, 322 documents)</summary>

![rtd-label_aae2_train.png](img%2Frtd-label_aae2_train.png)

</details>

##### test (80 documents)

| | len | max | mean | min | std |
| :---------------- | ---: | --: | ------: | --: | -----: |
| ALL | 2372 | 442 | 100.711 | 10 | 92.698 |
| attacks | 184 | 402 | 115.891 | 12 | 98.751 |
| semantically_same | 146 | 442 | 299.671 | 34 | 72.921 |
| supports | 2042 | 437 | 85.118 | 10 | 75.023 |

<details>
<summary>Histogram (split: test, 80 documents)</summary>

![rtd-label_aae2_test.png](img%2Frtd-label_aae2_test.png)

</details>

#### Span lengths (tokens)

The span length is the number of tokens in an argumentative unit, measured from the unit's first token to its last.

We collect the following statistics: number of documents in the split (*no. doc*), number of spans (*len*), mean number of tokens per span (*mean*), standard deviation of the token count (*std*), minimum tokens in a span (*min*), and maximum tokens in a span (*max*).
We also present histograms in collapsible sections, showing the distribution of these span lengths (x-axis) and their counts (y-axis).
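
The aggregation into these statistics can be sketched as below; the sample token counts are made up for illustration, not taken from the dataset:

```python
# Illustrative aggregation of span lengths (token counts per argumentative
# unit) into the summary statistics reported in the tables.
from statistics import mean, stdev

span_lengths = [12, 17, 9, 25, 17, 14]  # hypothetical token counts per span
summary = {
    "len": len(span_lengths),
    "mean": round(mean(span_lengths), 3),
    "std": round(stdev(span_lengths), 3),
    "min": min(span_lengths),
    "max": max(span_lengths),
}
```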

<details>
<summary>Command</summary>

```commandline
python src/evaluate_documents.py dataset=aae2_base metric=span_lengths_tokens
```

</details>

| statistics | train | test |
| :--------- | -----: | -----: |
| no. doc | 322 | 80 |
| len | 4823 | 1266 |
| mean | 17.157 | 16.317 |
| std | 8.079 | 7.953 |
| min | 3 | 3 |
| max | 75 | 50 |

<details>
<summary>Histogram (split: train, 322 documents)</summary>

![slt_aae2_train.png](img%2Fslt_aae2_train.png)

</details>
<details>
<summary>Histogram (split: test, 80 documents)</summary>

![slt_aae2_test.png](img%2Fslt_aae2_test.png)

</details>

#### Text length (tokens)

The text length is the number of tokens in a document's text, measured from the first token to the last.

We collect the following statistics: number of documents in the split (*no. doc*), mean document token length (*mean*), standard deviation of the length (*std*), minimum number of tokens in a document (*min*), and maximum number of tokens in a document (*max*).
We also present histograms in collapsible sections, showing the distribution of these document lengths (x-axis) and their counts (y-axis).

<details>
<summary>Command</summary>

```commandline
python src/evaluate_documents.py dataset=aae2_base metric=count_text_tokens
```

</details>

| statistics | train | test |
| :--------- | ------: | -----: |
| no. doc | 322 | 80 |
| mean | 377.686 | 378.4 |
| std | 64.534 | 66.054 |
| min | 236 | 269 |
| max | 580 | 532 |

<details>
<summary>Histogram (split: train, 322 documents)</summary>

![tl_aae2_train.png](img%2Ftl_aae2_train.png)

</details>
<details>
<summary>Histogram (split: test, 80 documents)</summary>

![tl_aae2_test.png](img%2Ftl_aae2_test.png)

</details>

## Dataset Creation

### Curation Rationale
Binary file added dataset_builders/pie/aae2/img/slt_aae2_test.png
Binary file added dataset_builders/pie/aae2/img/slt_aae2_train.png
Binary file added dataset_builders/pie/aae2/img/tl_aae2_test.png
Binary file added dataset_builders/pie/aae2/img/tl_aae2_train.png