diff --git a/dataset_builders/pie/aae2/README.md b/dataset_builders/pie/aae2/README.md index f47a7f9f..81c958a9 100644 --- a/dataset_builders/pie/aae2/README.md +++ b/dataset_builders/pie/aae2/README.md @@ -4,6 +4,29 @@ This is a [PyTorch-IE](https://github.com/ChristophAlt/pytorch-ie) wrapper for t Therefore, the `aae2` dataset as described here follows the data structure from the [PIE brat dataset card](https://huggingface.co/datasets/pie/brat). +### Usage + +```python +from pie_datasets import load_dataset +from pie_datasets.builders.brat import BratDocumentWithMergedSpans +from pytorch_ie.documents import TextDocumentWithLabeledSpansBinaryRelationsAndLabeledPartitions + +# load default version +dataset = load_dataset("pie/aae2") +assert isinstance(dataset["train"][0], BratDocumentWithMergedSpans) + +# if required, normalize the document type (see section Document Converters below) +dataset_converted = dataset.to_document_type(TextDocumentWithLabeledSpansBinaryRelationsAndLabeledPartitions) +assert isinstance(dataset_converted["train"][0], TextDocumentWithLabeledSpansBinaryRelationsAndLabeledPartitions) + +# get first relation in the first document +doc = dataset_converted["train"][0] +print(doc.binary_relations[0]) +# BinaryRelation(head=LabeledSpan(start=716, end=851, label='Premise', score=1.0), tail=LabeledSpan(start=591, end=714, label='Claim', score=1.0), label='supports', score=1.0) +print(doc.binary_relations[0].resolve()) +# ('supports', (('Premise', 'What we acquired from team work is not only how to achieve the same goal with others but more importantly, how to get along with others'), ('Claim', 'through cooperation, children can learn about interpersonal skills which are significant in the future life of all students'))) +``` + ### Dataset Summary Argument Annotated Essays Corpus (AAEC) ([Stab and Gurevych, 2017](https://aclanthology.org/J17-3005.pdf)) contains student essays. 
A stance for a controversial theme is expressed by a major claim component as well as claim components, and premise components justify or refute the claims. Attack and support labels are defined as relations. The span covers a statement, *which can stand in isolation as a complete sentence*, according to the AAEC annotation guidelines. All components are annotated with minimum boundaries of a clause or sentence excluding so-called "shell" language such as *On the other hand* and *Hence*. (Morio et al., 2022, p. 642) @@ -28,17 +51,6 @@ The `aae2` dataset comes in a single version (`default`) with `BratDocumentWithM See [PIE-Brat Data Schema](https://huggingface.co/datasets/pie/brat#data-schema). -### Usage - -```python -from pie_datasets import load_dataset, builders - -# load default version -datasets = load_dataset("pie/aae2") -doc = datasets["train"][0] -assert isinstance(doc, builders.brat.BratDocumentWithMergedSpans) -``` - ### Data Splits | Statistics | Train | Test | @@ -109,7 +121,7 @@ The dataset provides document converters for the following target document types See [here](https://github.com/ChristophAlt/pytorch-ie/blob/main/src/pytorch_ie/documents.py) for the document type definitions. 
-#### Label Statistics after Document Conversion +#### Relation Label Statistics after Document Conversion When converting from `BratDocumentWithMergedSpans` to `TextDocumentWithLabeledSpansAndBinaryRelations` and `TextDocumentWithLabeledSpansBinaryRelationsAndLabeledPartitions`, we apply a relation-conversion method (see above) that changes the label counts for the relations, as follows: @@ -129,6 +141,154 @@ we apply a relation-conversion method (see above) that changes the label counts | support: `supports` | 5958 | 89.3 % | | attack: `attacks` | 715 | 10.7 % | +### Collected Statistics after Document Conversion + +We use the script `evaluate_documents.py` from [PyTorch-IE-Hydra-Template](https://github.com/ArneBinder/pytorch-ie-hydra-template-1) to generate these statistics. +After checking out that code, the statistics and plots can be generated by the command: + +```commandline +python src/evaluate_documents.py dataset=aae2_base metric=METRIC +``` + +where `METRIC` is one of the available metric configs in `configs/metric` (see [metrics](https://github.com/ArneBinder/pytorch-ie-hydra-template-1/tree/main/configs/metric)). + +This also requires the following dataset config at `configs/dataset/aae2_base.yaml` in the repository: + +```yaml +_target_: src.utils.execute_pipeline +input: + _target_: pie_datasets.DatasetDict.load_dataset + path: pie/aae2 + revision: 1015ee38bd8a36549b344008f7a49af72956a7fe +``` + +For token-based metrics, this uses `bert-base-uncased` from `transformers.AutoTokenizer` (see [AutoTokenizer](https://huggingface.co/docs/transformers/v4.37.1/en/model_doc/auto#transformers.AutoTokenizer) and [bert-base-uncased](https://huggingface.co/bert-base-uncased)) to tokenize `text` in `TextDocumentWithLabeledSpansAndBinaryRelations` (see [document type](https://github.com/ChristophAlt/pytorch-ie/blob/main/src/pytorch_ie/documents.py)).
+ +For relation-label statistics, we collect counts after applying the default relation-conversion method, i.e., `connect_first`, which results in three distinct relation labels. + +#### Relation argument (outer) token distance per label + +The distance is measured from the first token of the first argumentative unit to the last token of the last unit, also known as the outer distance. + +We collect the following statistics: number of documents in the split (*no. doc*), no. of relations (*len*), mean token distance (*mean*), standard deviation of the distance (*std*), minimum outer distance (*min*), and maximum outer distance (*max*). +We also present histograms in the collapsible sections below, showing the distribution of these relation distances (x-axis: token distance; y-axis: count). + +<details>
+Command + +``` +python src/evaluate_documents.py dataset=aae2_base metric=relation_argument_token_distances +``` + +
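As an illustrative sketch (not the actual metric implementation from the template repository; the function name and token offsets are hypothetical), the outer distance of a relation can be computed from the `(start, end)` token offsets of its two arguments:

```python
# Hypothetical sketch of the "outer" token distance: it runs from the first
# token of either relation argument to the last token of either argument.

def outer_token_distance(head: tuple, tail: tuple) -> int:
    """head and tail are (start, end) token offsets of the two arguments."""
    return max(head[1], tail[1]) - min(head[0], tail[0])

# e.g. a Premise at token offsets (150, 180) supporting a Claim at (120, 145)
assert outer_token_distance((150, 180), (120, 145)) == 60
```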
+ +##### train (322 documents) + +| | len | max | mean | min | std | +| :---------------- | ---: | --: | ------: | --: | ------: | +| ALL | 9002 | 514 | 102.582 | 9 | 93.76 | +| attacks | 810 | 442 | 127.622 | 10 | 109.283 | +| semantically_same | 552 | 514 | 301.638 | 25 | 73.756 | +| supports | 7640 | 493 | 85.545 | 9 | 74.023 | + +
+ Histogram (split: train, 322 documents) + +![rtd-label_aae2_train.png](img%2Frtd-label_aae2_train.png) + +
+ +##### test (80 documents) + +| | len | max | mean | min | std | +| :---------------- | ---: | --: | ------: | --: | -----: | +| ALL | 2372 | 442 | 100.711 | 10 | 92.698 | +| attacks | 184 | 402 | 115.891 | 12 | 98.751 | +| semantically_same | 146 | 442 | 299.671 | 34 | 72.921 | +| supports | 2042 | 437 | 85.118 | 10 | 75.023 | + +
+ Histogram (split: test, 80 documents) + +![rtd-label_aae2_test.png](img%2Frtd-label_aae2_test.png) + +
+ +#### Span lengths (tokens) + +The span length is the number of tokens from the first to the last token of an argumentative unit. + +We collect the following statistics: number of documents in the split (*no. doc*), no. of spans (*len*), mean number of tokens per span (*mean*), standard deviation of the number of tokens (*std*), minimum tokens in a span (*min*), and maximum tokens in a span (*max*). +We also present histograms in the collapsible sections below, showing the distribution of these span lengths (x-axis: length in tokens; y-axis: count). + +<details>
+Command + +``` +python src/evaluate_documents.py dataset=aae2_base metric=span_lengths_tokens +``` + +
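A minimal sketch of how span lengths could be counted, using a simple regex tokenizer as a stand-in for the BERT tokenizer mentioned above (the function name and example text are made up for illustration):

```python
import re

def span_length_in_tokens(text: str, start: int, end: int) -> int:
    # count whitespace-delimited tokens whose character offsets fall
    # entirely inside the span [start, end)
    return sum(
        1 for m in re.finditer(r"\S+", text) if m.start() >= start and m.end() <= end
    )

text = "Cooperation teaches interpersonal skills. Hence, team work matters."
# a labeled span covering the first sentence (characters 0-41)
assert span_length_in_tokens(text, 0, 41) == 4
```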
+ +| statistics | train | test | +| :--------- | -----: | -----: | +| no. doc | 322 | 80 | +| len | 4823 | 1266 | +| mean | 17.157 | 16.317 | +| std | 8.079 | 7.953 | +| min | 3 | 3 | +| max | 75 | 50 | + +
+ Histogram (split: train, 322 documents) + +![slt_aae2_train.png](img%2Fslt_aae2_train.png) + +</details>
+
+ Histogram (split: test, 80 documents) + +![slt_aae2_test.png](img%2Fslt_aae2_test.png) + +
+ +#### Token length (tokens) + +The token length is the number of tokens in a document, measured from its first token to its last. + +We collect the following statistics: number of documents in the split (*no. doc*), mean document token length (*mean*), standard deviation of the length (*std*), minimum number of tokens in a document (*min*), and maximum number of tokens in a document (*max*). +We also present histograms in the collapsible sections below, showing the distribution of these token lengths (x-axis: document length in tokens; y-axis: count). + +<details>
+Command + +``` +python src/evaluate_documents.py dataset=aae2_base metric=count_text_tokens +``` + +
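The summary rows reported for this metric (mean, std, min, max) are plain descriptive statistics over the per-document token counts. A sketch with hypothetical counts (whether the template computes the sample or the population standard deviation is an assumption here):

```python
import statistics

# Hypothetical per-document token counts, for illustration only.
doc_token_counts = [236, 377, 390, 412, 580]

summary = {
    "no. doc": len(doc_token_counts),
    "mean": statistics.mean(doc_token_counts),
    "std": statistics.stdev(doc_token_counts),  # sample standard deviation
    "min": min(doc_token_counts),
    "max": max(doc_token_counts),
}
assert summary["no. doc"] == 5
```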
+ +| statistics | train | test | +| :--------- | ------: | -----: | +| no. doc | 322 | 80 | +| mean | 377.686 | 378.4 | +| std | 64.534 | 66.054 | +| min | 236 | 269 | +| max | 580 | 532 | + +
+ Histogram (split: train, 322 documents) + +![tl_aae2_train.png](img%2Ftl_aae2_train.png) + +</details>
+
+ Histogram (split: test, 80 documents) + +![tl_aae2_test.png](img%2Ftl_aae2_test.png) + +
+ ## Dataset Creation ### Curation Rationale diff --git a/dataset_builders/pie/aae2/img/rtd-label_aae2_test.png b/dataset_builders/pie/aae2/img/rtd-label_aae2_test.png new file mode 100644 index 00000000..d62218f8 Binary files /dev/null and b/dataset_builders/pie/aae2/img/rtd-label_aae2_test.png differ diff --git a/dataset_builders/pie/aae2/img/rtd-label_aae2_train.png b/dataset_builders/pie/aae2/img/rtd-label_aae2_train.png new file mode 100644 index 00000000..c214fe8f Binary files /dev/null and b/dataset_builders/pie/aae2/img/rtd-label_aae2_train.png differ diff --git a/dataset_builders/pie/aae2/img/slt_aae2_test.png b/dataset_builders/pie/aae2/img/slt_aae2_test.png new file mode 100644 index 00000000..805c6b98 Binary files /dev/null and b/dataset_builders/pie/aae2/img/slt_aae2_test.png differ diff --git a/dataset_builders/pie/aae2/img/slt_aae2_train.png b/dataset_builders/pie/aae2/img/slt_aae2_train.png new file mode 100644 index 00000000..30140435 Binary files /dev/null and b/dataset_builders/pie/aae2/img/slt_aae2_train.png differ diff --git a/dataset_builders/pie/aae2/img/tl_aae2_test.png b/dataset_builders/pie/aae2/img/tl_aae2_test.png new file mode 100644 index 00000000..d3f8cf5c Binary files /dev/null and b/dataset_builders/pie/aae2/img/tl_aae2_test.png differ diff --git a/dataset_builders/pie/aae2/img/tl_aae2_train.png b/dataset_builders/pie/aae2/img/tl_aae2_train.png new file mode 100644 index 00000000..ea135de5 Binary files /dev/null and b/dataset_builders/pie/aae2/img/tl_aae2_train.png differ diff --git a/dataset_builders/pie/abstrct/README.md b/dataset_builders/pie/abstrct/README.md index 4819920e..0bbf2daa 100644 --- a/dataset_builders/pie/abstrct/README.md +++ b/dataset_builders/pie/abstrct/README.md @@ -10,6 +10,29 @@ A novel corpus of healthcare texts (i.e., RCT abstracts on various diseases) fro are annotated with argumentative components (i.e., `MajorClaim`, `Claim`, and `Premise`) and relations (i.e., `Support`, `Attack`, and `Partial-attack`), 
in order to support clinicians' daily tasks in information finding and evidence-based reasoning for decision making. +### Usage + +```python +from pie_datasets import load_dataset +from pie_datasets.builders.brat import BratDocumentWithMergedSpans +from pytorch_ie.documents import TextDocumentWithLabeledSpansAndBinaryRelations + +# load default version +dataset = load_dataset("pie/abstrct") +assert isinstance(dataset["neoplasm_train"][0], BratDocumentWithMergedSpans) + +# if required, normalize the document type (see section Document Converters below) +dataset_converted = dataset.to_document_type("pytorch_ie.documents.TextDocumentWithLabeledSpansAndBinaryRelations") +assert isinstance(dataset_converted["neoplasm_train"][0], TextDocumentWithLabeledSpansAndBinaryRelations) + +# get first relation in the first document +doc = dataset_converted["neoplasm_train"][0] +print(doc.binary_relations[0]) +# BinaryRelation(head=LabeledSpan(start=1769, end=1945, label='Claim', score=1.0), tail=LabeledSpan(start=1, end=162, label='MajorClaim', score=1.0), label='Support', score=1.0) +print(doc.binary_relations[0].resolve()) +# ('Support', (('Claim', 'Treatment with mitoxantrone plus prednisone was associated with greater and longer-lasting improvement in several HQL domains and symptoms than treatment with prednisone alone.'), ('MajorClaim', 'A combination of mitoxantrone plus prednisone is preferable to prednisone alone for reduction of pain in men with metastatic, hormone-resistant, prostate cancer.'))) +``` + ### Supported Tasks and Leaderboards - **Tasks**: Argumentation Mining, Component Identification, Boundary Detection, Relation Identification, Link Prediction @@ -30,16 +53,17 @@ Without any need to merge fragments, the document type `BratDocumentWithMergedSp See [PIE-Brat Data Schema](https://huggingface.co/datasets/pie/brat#data-schema). 
-### Usage +### Document Converters -```python -from pie_datasets import load_dataset, builders +The dataset provides document converters for the following target document types: -# load default version -datasets = load_dataset("pie/abstrct") -doc = datasets["neoplasm_train"][0] -assert isinstance(doc, builders.brat.BratDocumentWithMergedSpans) -``` +- `pytorch_ie.documents.TextDocumentWithLabeledSpansAndBinaryRelations` + - `LabeledSpans`, converted from `BratDocumentWithMergedSpans`'s `spans` + - labels: `MajorClaim`, `Claim`, `Premise` + - `BinaryRelations`, converted from `BratDocumentWithMergedSpans`'s `relations` + - labels: `Support`, `Partial-Attack`, `Attack` + +See [here](https://github.com/ChristophAlt/pytorch-ie/blob/main/src/pytorch_ie/documents.py) for the document type definitions. ### Data Splits @@ -92,22 +116,239 @@ Morio et al. ([2022](https://aclanthology.org/2022.tacl-1.37.pdf); p. 642, Table (Mayer et al. 2020, p.2110) -#### Examples +#### Example -![Examples](img/abstr-sam.png) +![abstr-sam.png](img%2Fabstr-sam.png) -### Document Converters +### Collected Statistics after Document Conversion -The dataset provides document converters for the following target document types: +We use the script `evaluate_documents.py` from [PyTorch-IE-Hydra-Template](https://github.com/ArneBinder/pytorch-ie-hydra-template-1) to generate these statistics.
+After checking out that code, the statistics and plots can be generated by the command: -- `pytorch_ie.documents.TextDocumentWithLabeledSpansAndBinaryRelations` - - `labeled_spans`: `LabeledSpan` annotations, converted from `BratDocumentWithMergedSpans`'s `spans` - - labels: `MajorClaim`, `Claim`, `Premise` - - `binary_relations`: `BinaryRelation` annotations, converted from `BratDocumentWithMergedSpans`'s `relations` - - labels: `Support`, `Partial-Attack`, `Attack` +```commandline +python src/evaluate_documents.py dataset=abstrct_base metric=METRIC +``` + +where `METRIC` is one of the available metric configs in `configs/metric` (see [metrics](https://github.com/ArneBinder/pytorch-ie-hydra-template-1/tree/main/configs/metric)). + +This also requires the following dataset config at `configs/dataset/abstrct_base.yaml` in the repository: + +```yaml +_target_: src.utils.execute_pipeline +input: + _target_: pie_datasets.DatasetDict.load_dataset + path: pie/abstrct + revision: 277dc703fd78614635e86fe57c636b54931538b2 +``` + +For token-based metrics, this uses `bert-base-uncased` from `transformers.AutoTokenizer` (see [AutoTokenizer](https://huggingface.co/docs/transformers/v4.37.1/en/model_doc/auto#transformers.AutoTokenizer) and [bert-base-uncased](https://huggingface.co/bert-base-uncased)) to tokenize `text` in `TextDocumentWithLabeledSpansAndBinaryRelations` (see [document type](https://github.com/ChristophAlt/pytorch-ie/blob/main/src/pytorch_ie/documents.py)). + +#### Relation argument (outer) token distance per label + +The distance is measured from the first token of the first argumentative unit to the last token of the last unit, also known as the outer distance. + +We collect the following statistics: number of documents in the split (*no. doc*), no.
of relations (*len*), mean token distance (*mean*), standard deviation of the distance (*std*), minimum outer distance (*min*), and maximum outer distance (*max*). +We also present histograms in the collapsible sections below, showing the distribution of these relation distances (x-axis: token distance; y-axis: count). + +<details>
+Command + +``` +python src/evaluate_documents.py dataset=abstrct_base metric=relation_argument_token_distances +``` + +
+ +##### neoplasm_train (350 documents) + +| | len | max | mean | min | std | +| :------------- | ---: | --: | ------: | --: | -----: | +| ALL | 2836 | 511 | 132.903 | 17 | 80.869 | +| Attack | 72 | 346 | 89.639 | 29 | 75.554 | +| Partial-Attack | 338 | 324 | 59.024 | 17 | 42.773 | +| Support | 2426 | 511 | 144.481 | 26 | 79.187 | + +
+ Histogram (split: neoplasm_train, 350 documents) + +![img_2.png](img/rtd-label_abs-neo_train.png) + +
+ +##### neoplasm_dev (50 documents) + +| | len | max | mean | min | std | +| :------------- | --: | --: | ------: | --: | -----: | +| ALL | 438 | 625 | 146.393 | 24 | 98.788 | +| Attack | 16 | 200 | 90.375 | 26 | 62.628 | +| Partial-Attack | 50 | 240 | 72.04 | 24 | 47.685 | +| Support | 372 | 625 | 158.796 | 34 | 99.922 | + +
+ Histogram (split: neoplasm_dev, 50 documents) + +![img_3.png](img/rtd-label_abs-neo_dev.png) + +
+ +##### neoplasm_test (100 documents) + +| | len | max | mean | min | std | +| :------------- | --: | --: | ------: | --: | -----: | +| ALL | 848 | 459 | 126.731 | 22 | 75.363 | +| Attack | 32 | 390 | 115.688 | 22 | 97.262 | +| Partial-Attack | 88 | 205 | 56.955 | 24 | 34.534 | +| Support | 728 | 459 | 135.651 | 33 | 73.365 | + +
+ Histogram (split: neoplasm_test, 100 documents) + +![img_4.png](img/rtd-label_abs-neo_test.png) + +
+ +##### glaucoma_test (100 documents) + +| | len | max | mean | min | std | +| :------------- | --: | --: | ------: | --: | -----: | +| ALL | 734 | 488 | 159.166 | 26 | 83.885 | +| Attack | 14 | 177 | 89 | 47 | 40.171 | +| Partial-Attack | 52 | 259 | 74 | 26 | 51.239 | +| Support | 668 | 488 | 167.266 | 38 | 82.222 | + +
+ Histogram (split: glaucoma_test, 100 documents) + +![img_5.png](img/rtd-label_abs-glu_test.png) + +
+ +##### mixed_test (100 documents) + +| | len | max | mean | min | std | +| :------------- | --: | --: | ------: | --: | ------: | +| ALL | 658 | 459 | 145.067 | 23 | 77.921 | +| Attack | 6 | 411 | 164 | 34 | 174.736 | +| Partial-Attack | 42 | 259 | 65.762 | 23 | 62.426 | +| Support | 610 | 459 | 150.341 | 35 | 74.273 | + +
+ Histogram (split: mixed_test, 100 documents) + +![img_6.png](img/rtd-label_abs-mix_test.png) + +
+ +#### Span lengths (tokens) + +The span length is the number of tokens from the first to the last token of an argumentative unit. + +We collect the following statistics: number of documents in the split (*no. doc*), no. of spans (*len*), mean number of tokens per span (*mean*), standard deviation of the number of tokens (*std*), minimum tokens in a span (*min*), and maximum tokens in a span (*max*). +We also present histograms in the collapsible sections below, showing the distribution of these span lengths (x-axis: length in tokens; y-axis: count). + +<details>
+Command + +``` +python src/evaluate_documents.py dataset=abstrct_base metric=span_lengths_tokens +``` + +
+ +| statistics | neoplasm_train | neoplasm_dev | neoplasm_test | glaucoma_test | mixed_test | +| :--------- | -------------: | -----------: | ------------: | ------------: | ---------: | +| no. doc | 350 | 50 | 100 | 100 | 100 | +| len | 2267 | 326 | 686 | 594 | 600 | +| mean | 34.303 | 37.135 | 32.566 | 38.997 | 38.507 | +| std | 22.425 | 29.941 | 20.264 | 22.604 | 24.036 | +| min | 5 | 5 | 6 | 6 | 7 | +| max | 250 | 288 | 182 | 169 | 159 | + +
+ Histogram (split: neoplasm_train, 350 documents) + +![slt_abs-neo_train.png](img%2Fslt_abs-neo_train.png) + +
+
+ Histogram (split: neoplasm_dev, 50 documents) + +![slt_abs-neo_dev.png](img%2Fslt_abs-neo_dev.png) + +
+
+ Histogram (split: neoplasm_test, 100 documents) + +![slt_abs-neo_test.png](img%2Fslt_abs-neo_test.png) + +
+
+ Histogram (split: glaucoma_test, 100 documents) + +![slt_abs-glu_test.png](img%2Fslt_abs-glu_test.png) + +</details>
+
+ Histogram (split: mixed_test, 100 documents) + +![slt_abs-mix_test.png](img%2Fslt_abs-mix_test.png) + +
+ +#### Token length (tokens) + +The token length is the number of tokens in a document, measured from its first token to its last. + +We collect the following statistics: number of documents in the split (*no. doc*), mean document token length (*mean*), standard deviation of the length (*std*), minimum number of tokens in a document (*min*), and maximum number of tokens in a document (*max*). +We also present histograms in the collapsible sections below, showing the distribution of these token lengths (x-axis: document length in tokens; y-axis: count). + +<details>
+Command + +``` +python src/evaluate_documents.py dataset=abstrct_base metric=count_text_tokens +``` + +
+ +| statistics | neoplasm_train | neoplasm_dev | neoplasm_test | glaucoma_test | mixed_test | +| :--------- | -------------: | -----------: | ------------: | ------------: | ---------: | +| no. doc | 350 | 50 | 100 | 100 | 100 | +| mean | 447.291 | 481.66 | 442.79 | 456.78 | 450.29 | +| std | 91.266 | 116.239 | 89.692 | 115.535 | 87.002 | +| min | 301 | 329 | 292 | 212 | 268 | +| max | 843 | 952 | 776 | 1022 | 776 | + +
+ Histogram (split: neoplasm_train, 350 documents) + +![tl_abs-neo_train.png](img%2Ftl_abs-neo_train.png) + +
+
+ Histogram (split: neoplasm_dev, 50 documents) + +![tl_abs-neo_dev.png](img%2Ftl_abs-neo_dev.png) + +
+
+ Histogram (split: neoplasm_test, 100 documents) + +![tl_abs-neo_test.png](img%2Ftl_abs-neo_test.png) + +
+
+ Histogram (split: glaucoma_test, 100 documents) + +![tl_abs-glu_test.png](img%2Ftl_abs-glu_test.png) + +</details>
+
+ Histogram (split: mixed_test, 100 documents) + +![tl_abs-mix_test.png](img%2Ftl_abs-mix_test.png) -See [here](https://github.com/ChristophAlt/pytorch-ie/blob/main/src/pytorch_ie/documents.py) for the document type -definitions. +
## Dataset Creation diff --git a/dataset_builders/pie/abstrct/img/rtd-label_abs-glu_test.png b/dataset_builders/pie/abstrct/img/rtd-label_abs-glu_test.png new file mode 100644 index 00000000..5fb18436 Binary files /dev/null and b/dataset_builders/pie/abstrct/img/rtd-label_abs-glu_test.png differ diff --git a/dataset_builders/pie/abstrct/img/rtd-label_abs-mix_test.png b/dataset_builders/pie/abstrct/img/rtd-label_abs-mix_test.png new file mode 100644 index 00000000..f56f6bc8 Binary files /dev/null and b/dataset_builders/pie/abstrct/img/rtd-label_abs-mix_test.png differ diff --git a/dataset_builders/pie/abstrct/img/rtd-label_abs-neo_dev.png b/dataset_builders/pie/abstrct/img/rtd-label_abs-neo_dev.png new file mode 100644 index 00000000..f3a59b57 Binary files /dev/null and b/dataset_builders/pie/abstrct/img/rtd-label_abs-neo_dev.png differ diff --git a/dataset_builders/pie/abstrct/img/rtd-label_abs-neo_test.png b/dataset_builders/pie/abstrct/img/rtd-label_abs-neo_test.png new file mode 100644 index 00000000..1bd8cb12 Binary files /dev/null and b/dataset_builders/pie/abstrct/img/rtd-label_abs-neo_test.png differ diff --git a/dataset_builders/pie/abstrct/img/rtd-label_abs-neo_train.png b/dataset_builders/pie/abstrct/img/rtd-label_abs-neo_train.png new file mode 100644 index 00000000..9b414dfa Binary files /dev/null and b/dataset_builders/pie/abstrct/img/rtd-label_abs-neo_train.png differ diff --git a/dataset_builders/pie/abstrct/img/slt_abs-glu_test.png b/dataset_builders/pie/abstrct/img/slt_abs-glu_test.png new file mode 100644 index 00000000..2b57feb0 Binary files /dev/null and b/dataset_builders/pie/abstrct/img/slt_abs-glu_test.png differ diff --git a/dataset_builders/pie/abstrct/img/slt_abs-mix_test.png b/dataset_builders/pie/abstrct/img/slt_abs-mix_test.png new file mode 100644 index 00000000..7e23f3b9 Binary files /dev/null and b/dataset_builders/pie/abstrct/img/slt_abs-mix_test.png differ diff --git a/dataset_builders/pie/abstrct/img/slt_abs-neo_dev.png 
b/dataset_builders/pie/abstrct/img/slt_abs-neo_dev.png new file mode 100644 index 00000000..0ea90445 Binary files /dev/null and b/dataset_builders/pie/abstrct/img/slt_abs-neo_dev.png differ diff --git a/dataset_builders/pie/abstrct/img/slt_abs-neo_test.png b/dataset_builders/pie/abstrct/img/slt_abs-neo_test.png new file mode 100644 index 00000000..829e4f9a Binary files /dev/null and b/dataset_builders/pie/abstrct/img/slt_abs-neo_test.png differ diff --git a/dataset_builders/pie/abstrct/img/slt_abs-neo_train.png b/dataset_builders/pie/abstrct/img/slt_abs-neo_train.png new file mode 100644 index 00000000..c74e887e Binary files /dev/null and b/dataset_builders/pie/abstrct/img/slt_abs-neo_train.png differ diff --git a/dataset_builders/pie/abstrct/img/tl_abs-glu_test.png b/dataset_builders/pie/abstrct/img/tl_abs-glu_test.png new file mode 100644 index 00000000..cbe4756b Binary files /dev/null and b/dataset_builders/pie/abstrct/img/tl_abs-glu_test.png differ diff --git a/dataset_builders/pie/abstrct/img/tl_abs-mix_test.png b/dataset_builders/pie/abstrct/img/tl_abs-mix_test.png new file mode 100644 index 00000000..260561e3 Binary files /dev/null and b/dataset_builders/pie/abstrct/img/tl_abs-mix_test.png differ diff --git a/dataset_builders/pie/abstrct/img/tl_abs-neo_dev.png b/dataset_builders/pie/abstrct/img/tl_abs-neo_dev.png new file mode 100644 index 00000000..4c0cec01 Binary files /dev/null and b/dataset_builders/pie/abstrct/img/tl_abs-neo_dev.png differ diff --git a/dataset_builders/pie/abstrct/img/tl_abs-neo_test.png b/dataset_builders/pie/abstrct/img/tl_abs-neo_test.png new file mode 100644 index 00000000..50b16ddf Binary files /dev/null and b/dataset_builders/pie/abstrct/img/tl_abs-neo_test.png differ diff --git a/dataset_builders/pie/abstrct/img/tl_abs-neo_train.png b/dataset_builders/pie/abstrct/img/tl_abs-neo_train.png new file mode 100644 index 00000000..b8a71584 Binary files /dev/null and b/dataset_builders/pie/abstrct/img/tl_abs-neo_train.png differ diff 
--git a/dataset_builders/pie/argmicro/README.md b/dataset_builders/pie/argmicro/README.md index fe9da6f9..9de51471 100644 --- a/dataset_builders/pie/argmicro/README.md +++ b/dataset_builders/pie/argmicro/README.md @@ -3,6 +3,27 @@ This is a [PyTorch-IE](https://github.com/ChristophAlt/pytorch-ie) wrapper for the [ArgMicro Huggingface dataset loading script](https://huggingface.co/datasets/DFKI-SLT/argmicro). +## Usage + +```python +from pie_datasets import load_dataset +from pytorch_ie.documents import TextDocumentWithLabeledSpansAndBinaryRelations + +# load English variant +dataset = load_dataset("pie/argmicro", name="en") + +# if required, normalize the document type (see section Document Converters below) +dataset_converted = dataset.to_document_type(TextDocumentWithLabeledSpansAndBinaryRelations) +assert isinstance(dataset_converted["train"][0], TextDocumentWithLabeledSpansAndBinaryRelations) + +# get first relation in the first document +doc = dataset_converted["train"][0] +print(doc.binary_relations[0]) +# BinaryRelation(head=LabeledSpan(start=0, end=81, label='opp', score=1.0), tail=LabeledSpan(start=326, end=402, label='pro', score=1.0), label='reb', score=1.0) +print(doc.binary_relations[0].resolve()) +# ('reb', (('opp', "Yes, it's annoying and cumbersome to separate your rubbish properly all the time."), ('pro', 'We Berliners should take the chance and become pioneers in waste separation!'))) +``` + ## Dataset Variants The dataset contains two `BuilderConfig`'s: @@ -53,3 +74,122 @@ The dataset provides document converters for the following target document types See [here](https://github.com/ChristophAlt/pytorch-ie/blob/main/src/pytorch_ie/documents.py) for the document type definitions. + +### Collected Statistics after Document Conversion + +We use the script `evaluate_documents.py` from [PyTorch-IE-Hydra-Template](https://github.com/ArneBinder/pytorch-ie-hydra-template-1) to generate these statistics. 
+After checking out that code, the statistics and plots can be generated by the command: + +```commandline +python src/evaluate_documents.py dataset=argmicro_base metric=METRIC +``` + +where `METRIC` is one of the available metric configs in `configs/metric` (see [metrics](https://github.com/ArneBinder/pytorch-ie-hydra-template-1/tree/main/configs/metric)). + +This also requires the following dataset config at `configs/dataset/argmicro_base.yaml` in the repository: + +```yaml +_target_: src.utils.execute_pipeline +input: + _target_: pie_datasets.DatasetDict.load_dataset + path: pie/argmicro + revision: 28ef031d2a2c97be7e9ed360e1a5b20bd55b57b2 + name: en +``` + +For token-based metrics, this uses `bert-base-uncased` from `transformers.AutoTokenizer` (see [AutoTokenizer](https://huggingface.co/docs/transformers/v4.37.1/en/model_doc/auto#transformers.AutoTokenizer) and [bert-base-uncased](https://huggingface.co/bert-base-uncased)) to tokenize `text` in `TextDocumentWithLabeledSpansAndBinaryRelations` (see [document type](https://github.com/ChristophAlt/pytorch-ie/blob/main/src/pytorch_ie/documents.py)). + +#### Relation argument (outer) token distance per label + +The distance is measured from the first token of the first argumentative unit to the last token of the last unit, also known as the outer distance. + +We collect the following statistics: number of documents in the split (*no. doc*), no. of relations (*len*), mean token distance (*mean*), standard deviation of the distance (*std*), minimum outer distance (*min*), and maximum outer distance (*max*). +We also present histograms in the collapsible sections below, showing the distribution of these relation distances (x-axis: token distance; y-axis: count). + +<details>
+Command + +``` +python src/evaluate_documents.py dataset=argmicro_base metric=relation_argument_token_distances +``` + +
+ +| | len | max | mean | min | std | +| :---- | ---: | --: | -----: | --: | -----: | +| ALL | 1018 | 127 | 44.434 | 14 | 21.501 | +| exa | 18 | 63 | 33.556 | 16 | 13.056 | +| joint | 88 | 48 | 30.091 | 17 | 9.075 | +| reb | 220 | 127 | 49.327 | 16 | 24.653 | +| sup | 562 | 124 | 46.534 | 14 | 22.079 | +| und | 130 | 84 | 38.292 | 17 | 12.321 | + +
+ Histogram (split: train, 112 documents) + +![rtd-label_argmicro.png](img%2Frtd-label_argmicro.png) + +
+ +#### Span lengths (tokens) + +The span length is the number of tokens from the first to the last token of an argumentative unit. + +We collect the following statistics: number of documents in the split (*no. doc*), no. of spans (*len*), mean number of tokens per span (*mean*), standard deviation of the number of tokens (*std*), minimum tokens in a span (*min*), and maximum tokens in a span (*max*). +We also present histograms in the collapsible sections below, showing the distribution of these span lengths (x-axis: length in tokens; y-axis: count). + +<details>
+Command + +``` +python src/evaluate_documents.py dataset=argmicro_base metric=span_lengths_tokens +``` + +
+ +| statistics | train | +| :--------- | -----: | +| no. doc | 112 | +| len | 576 | +| mean | 16.365 | +| std | 6.545 | +| min | 4 | +| max | 41 | + +
+ Histogram (split: train, 112 documents) + +![slt_argmicro.png](img%2Fslt_argmicro.png) + +
+ +#### Token length (tokens) + +The token length is the number of tokens in a document, measured from its first token to its last. + +We collect the following statistics: number of documents in the split (*no. doc*), mean document token length (*mean*), standard deviation of the length (*std*), minimum number of tokens in a document (*min*), and maximum number of tokens in a document (*max*). +We also present histograms in the collapsible sections below, showing the distribution of these token lengths (x-axis: document length in tokens; y-axis: count). + +<details>
+Command + +``` +python src/evaluate_documents.py dataset=argmicro_base metric=count_text_tokens +``` + +
+ +| statistics | train | +| :--------- | -----: | +| no. doc | 112 | +| mean | 84.161 | +| std | 22.596 | +| min | 36 | +| max | 153 | + +
+ Histogram (split: train, 112 documents) + +![tl_argmicro.png](img%2Ftl_argmicro.png) + +
diff --git a/dataset_builders/pie/argmicro/img/rtd-label_argmicro.png b/dataset_builders/pie/argmicro/img/rtd-label_argmicro.png new file mode 100644 index 00000000..0b6be71d Binary files /dev/null and b/dataset_builders/pie/argmicro/img/rtd-label_argmicro.png differ diff --git a/dataset_builders/pie/argmicro/img/slt_argmicro.png b/dataset_builders/pie/argmicro/img/slt_argmicro.png new file mode 100644 index 00000000..5dfaac41 Binary files /dev/null and b/dataset_builders/pie/argmicro/img/slt_argmicro.png differ diff --git a/dataset_builders/pie/argmicro/img/tl_argmicro.png b/dataset_builders/pie/argmicro/img/tl_argmicro.png new file mode 100644 index 00000000..3b19768f Binary files /dev/null and b/dataset_builders/pie/argmicro/img/tl_argmicro.png differ diff --git a/dataset_builders/pie/cdcp/README.md b/dataset_builders/pie/cdcp/README.md index cc3e8bd3..1ea8748c 100644 --- a/dataset_builders/pie/cdcp/README.md +++ b/dataset_builders/pie/cdcp/README.md @@ -3,6 +3,27 @@ This is a [PyTorch-IE](https://github.com/ChristophAlt/pytorch-ie) wrapper for the [CDCP Huggingface dataset loading script](https://huggingface.co/datasets/DFKI-SLT/cdcp). 
+## Usage
+
+```python
+from pie_datasets import load_dataset
+from pytorch_ie.documents import TextDocumentWithLabeledSpansAndBinaryRelations
+
+# load default version
+dataset = load_dataset("pie/cdcp")
+
+# if required, normalize the document type (see section Document Converters below)
+dataset_converted = dataset.to_document_type(TextDocumentWithLabeledSpansAndBinaryRelations)
+assert isinstance(dataset_converted["train"][0], TextDocumentWithLabeledSpansAndBinaryRelations)
+
+# get first relation in the first document
+doc = dataset_converted["train"][0]
+print(doc.binary_relations[0])
+# BinaryRelation(head=LabeledSpan(start=0, end=78, label='value', score=1.0), tail=LabeledSpan(start=79, end=242, label='value', score=1.0), label='reason', score=1.0)
+print(doc.binary_relations[0].resolve())
+# ('reason', (('value', 'State and local court rules sometimes make default judgments much more likely.'), ('value', 'For example, when a person who allegedly owes a debt is told to come to court on a work day, they may be forced to choose between a default judgment and their job.')))
+```
+
 ## Data Schema
 
 The document type for this dataset is `CDCPDocument` which defines the following data fields:
@@ -32,3 +53,147 @@ The dataset provides document converters for the following target document types
 
 See [here](https://github.com/ChristophAlt/pytorch-ie/blob/main/src/pytorch_ie/documents.py) for the document type
 definitions.
+
+### Collected Statistics after Document Conversion
+
+We use the script `evaluate_documents.py` from [PyTorch-IE-Hydra-Template](https://github.com/ArneBinder/pytorch-ie-hydra-template-1) to generate these statistics.
+After checking out that code, the statistics and plots can be generated by the command:
+
+```commandline
+python src/evaluate_documents.py dataset=cdcp_base metric=METRIC
+```
+
+where `METRIC` is the name of one of the available metric configs in `configs/metric` (see [metrics](https://github.com/ArneBinder/pytorch-ie-hydra-template-1/tree/main/configs/metric)).
+
+This also requires the following dataset config at `configs/dataset/cdcp_base.yaml` in the checked-out repository:
+
+```yaml
+_target_: src.utils.execute_pipeline
+input:
+  _target_: pie_datasets.DatasetDict.load_dataset
+  path: pie/cdcp
+  revision: 001722894bdca6df6a472d0d186a3af103e392c5
+```
+
+For token-based metrics, this uses the `bert-base-uncased` tokenizer obtained via `transformers.AutoTokenizer` (see [AutoTokenizer](https://huggingface.co/docs/transformers/v4.37.1/en/model_doc/auto#transformers.AutoTokenizer) and [bert-base-uncased](https://huggingface.co/bert-base-uncased)) to tokenize the `text` of `TextDocumentWithLabeledSpansAndBinaryRelations` (see [document type](https://github.com/ChristophAlt/pytorch-ie/blob/main/src/pytorch_ie/documents.py)).
+
+#### Relation argument (outer) token distance per label
+
+The distance is measured from the first token of the first argumentative unit to the last token of the last unit, a.k.a. the outer distance.
+
+We collect the following statistics: number of documents in the split (*no. doc*), number of relations (*len*), mean token distance (*mean*), standard deviation of the distance (*std*), minimum outer distance (*min*), and maximum outer distance (*max*).
+We also present histograms in the collapsible sections below, showing the distribution of these relation distances (x-axis) and their counts (y-axis).
+
+<details>
+Command + +``` +python src/evaluate_documents.py dataset=cdcp_base metric=relation_argument_token_distances +``` + +
+ +##### train (580 documents) + +| | len | max | mean | min | std | +| :------- | ---: | --: | -----: | --: | -----: | +| ALL | 2204 | 240 | 48.839 | 8 | 31.462 | +| evidence | 94 | 196 | 66.723 | 14 | 42.444 | +| reason | 2110 | 240 | 48.043 | 8 | 30.64 | + +
+ Histogram (split: train, 580 documents) + +![rtd-label_cdcp_train.png](img%2Frtd-label_cdcp_train.png) + +
+ +##### test (150 documents) + +| | len | max | mean | min | std | +| :------- | --: | --: | -----: | --: | -----: | +| ALL | 648 | 212 | 51.299 | 8 | 31.159 | +| evidence | 52 | 170 | 73.923 | 20 | 39.855 | +| reason | 596 | 212 | 49.326 | 8 | 29.47 | + +
+ Histogram (split: test, 150 documents) + +![rtd-label_cdcp_test.png](img%2Frtd-label_cdcp_test.png) + +
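To make the "outer distance" definition concrete, here is a hypothetical helper (not part of the metric code) that computes it from token offsets, assuming half-open `(start, end)` spans:

```python
def outer_token_distance(head: tuple, tail: tuple) -> int:
    """Outer distance: from the first token of the first argument
    to the last token of the last argument (offsets are half-open)."""
    start = min(head[0], tail[0])
    end = max(head[1], tail[1])
    return end - start

# e.g. a head span at tokens [10, 15) and a tail span at tokens [40, 60)
print(outer_token_distance((10, 15), (40, 60)))  # 50
```

The distance is symmetric in the arguments, so the order of head and tail does not matter.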
+
+#### Span lengths (tokens)
+
+The span length is the number of tokens covered by an argumentative unit, measured from its first to its last token.
+
+We collect the following statistics: number of documents in the split (*no. doc*), number of spans (*len*), mean number of tokens per span (*mean*), standard deviation of the number of tokens (*std*), minimum tokens in a span (*min*), and maximum tokens in a span (*max*).
+We also present histograms in the collapsible sections below, showing the distribution of these span lengths (x-axis) and their counts (y-axis).
+
+<details>
+Command + +``` +python src/evaluate_documents.py dataset=cdcp_base metric=span_lengths_tokens +``` + +
+ +| statistics | train | test | +| :--------- | -----: | -----: | +| no. doc | 580 | 150 | +| len | 3901 | 1026 | +| mean | 19.441 | 18.758 | +| std | 11.71 | 10.388 | +| min | 2 | 3 | +| max | 142 | 83 | + +
+ Histogram (split: train, 580 documents) + +![slt_cdcp_train.png](img%2Fslt_cdcp_train.png) + +
+
+ Histogram (split: test, 150 documents) + +![slt_cdcp_test.png](img%2Fslt_cdcp_test.png) + +
+
+#### Token length (tokens)
+
+The token length is the total number of tokens in a document, from its first token to its last.
+
+We collect the following statistics: number of documents in the split (*no. doc*), mean document length in tokens (*mean*), standard deviation of the length (*std*), minimum number of tokens in a document (*min*), and maximum number of tokens in a document (*max*).
+We also present histograms in the collapsible sections below, showing the distribution of these document lengths (x-axis) and their counts (y-axis).
+
+<details>
+Command + +``` +python src/evaluate_documents.py dataset=cdcp_base metric=count_text_tokens +``` + +
+ +| statistics | train | test | +| :--------- | ------: | ------: | +| no. doc | 580 | 150 | +| mean | 130.781 | 128.673 | +| std | 101.121 | 98.708 | +| min | 13 | 15 | +| max | 562 | 571 | + +
+ Histogram (split: train, 580 documents) + +![tl_cdcp_train.png](img%2Ftl_cdcp_train.png) + +
+
+ Histogram (split: test, 150 documents) + +![tl_cdcp_test.png](img%2Ftl_cdcp_test.png) + +
diff --git a/dataset_builders/pie/cdcp/img/rtd-label_cdcp_test.png b/dataset_builders/pie/cdcp/img/rtd-label_cdcp_test.png new file mode 100644 index 00000000..539ee94f Binary files /dev/null and b/dataset_builders/pie/cdcp/img/rtd-label_cdcp_test.png differ diff --git a/dataset_builders/pie/cdcp/img/rtd-label_cdcp_train.png b/dataset_builders/pie/cdcp/img/rtd-label_cdcp_train.png new file mode 100644 index 00000000..f9bc45c8 Binary files /dev/null and b/dataset_builders/pie/cdcp/img/rtd-label_cdcp_train.png differ diff --git a/dataset_builders/pie/cdcp/img/slt_cdcp_test.png b/dataset_builders/pie/cdcp/img/slt_cdcp_test.png new file mode 100644 index 00000000..fa82e864 Binary files /dev/null and b/dataset_builders/pie/cdcp/img/slt_cdcp_test.png differ diff --git a/dataset_builders/pie/cdcp/img/slt_cdcp_train.png b/dataset_builders/pie/cdcp/img/slt_cdcp_train.png new file mode 100644 index 00000000..79404c63 Binary files /dev/null and b/dataset_builders/pie/cdcp/img/slt_cdcp_train.png differ diff --git a/dataset_builders/pie/cdcp/img/tl_cdcp_test.png b/dataset_builders/pie/cdcp/img/tl_cdcp_test.png new file mode 100644 index 00000000..279f511a Binary files /dev/null and b/dataset_builders/pie/cdcp/img/tl_cdcp_test.png differ diff --git a/dataset_builders/pie/cdcp/img/tl_cdcp_train.png b/dataset_builders/pie/cdcp/img/tl_cdcp_train.png new file mode 100644 index 00000000..b85a3ee7 Binary files /dev/null and b/dataset_builders/pie/cdcp/img/tl_cdcp_train.png differ diff --git a/dataset_builders/pie/sciarg/README.md b/dataset_builders/pie/sciarg/README.md index 5f8cfee9..bfa5918f 100644 --- a/dataset_builders/pie/sciarg/README.md +++ b/dataset_builders/pie/sciarg/README.md @@ -4,6 +4,37 @@ This is a [PyTorch-IE](https://github.com/ChristophAlt/pytorch-ie) wrapper for t Therefore, the `sciarg` dataset as described here follows the data structure from the [PIE brat dataset card](https://huggingface.co/datasets/pie/brat). 
+### Usage + +```python +from pie_datasets import load_dataset +from pie_datasets.builders.brat import BratDocumentWithMergedSpans, BratDocument +from pytorch_ie.documents import TextDocumentWithLabeledMultiSpansBinaryRelationsAndLabeledPartitions, TextDocumentWithLabeledSpansBinaryRelationsAndLabeledPartitions + +# load default version +dataset = load_dataset("pie/sciarg") +assert isinstance(dataset["train"][0], BratDocumentWithMergedSpans) + +# if required, normalize the document type (see section Document Converters below) +dataset_converted = dataset.to_document_type(TextDocumentWithLabeledSpansBinaryRelationsAndLabeledPartitions) +assert isinstance(dataset_converted["train"][0], TextDocumentWithLabeledSpansBinaryRelationsAndLabeledPartitions) + +# load version with resolved parts_of_same relations +dataset = load_dataset("pie/sciarg", name='resolve_parts_of_same') +assert isinstance(dataset["train"][0], BratDocument) + +# if required, normalize the document type (see section Document Converters below) +dataset_converted = dataset.to_document_type(TextDocumentWithLabeledMultiSpansBinaryRelationsAndLabeledPartitions) +assert isinstance(dataset_converted["train"][0], TextDocumentWithLabeledMultiSpansBinaryRelationsAndLabeledPartitions) + +# get first relation in the first document +doc = dataset_converted["train"][0] +print(doc.binary_relations[0]) +# BinaryRelation(head=LabeledMultiSpan(slices=((15071, 15076),), label='data', score=1.0), tail=LabeledMultiSpan(slices=((14983, 15062),), label='background_claim', score=1.0), label='supports', score=1.0) +print(doc.binary_relations[0].resolve()) +# ('supports', (('data', ('[ 3 ]',)), ('background_claim', ('PSD and improved example-based schemes have been discussed in many publications',)))) +``` + ### Dataset Summary The SciArg dataset is an extension of the Dr. 
 Inventor corpus (Fisas et al., [2015](https://aclanthology.org/W15-1605.pdf), [2016](https://aclanthology.org/L16-1492.pdf)) with an annotation layer containing
@@ -39,21 +70,25 @@ are connected via the `parts_of_same` relations are converted to `LabeledMultiSp
 
 See [PIE-Brat Data Schema](https://huggingface.co/datasets/pie/brat#data-schema).
 
-### Usage
+### Document Converters
 
-```python
-from pie_datasets import load_dataset, builders
+The dataset provides document converters for the following target document types:
 
-# load default version
-datasets = load_dataset("pie/sciarg")
-doc = datasets["train"][0]
-assert isinstance(doc, builders.brat.BratDocumentWithMergedSpans)
+- `pytorch_ie.documents.TextDocumentWithLabeledSpansAndBinaryRelations`
+  - `LabeledSpans`, converted from `BratDocument`'s `spans`
+    - labels: `background_claim`, `own_claim`, `data`
+    - if a span has leading and/or trailing whitespace, the whitespace is trimmed
+  - `BinaryRelations`, converted from `BratDocument`'s `relations`
+    - labels: `supports`, `contradicts`, `semantically_same`, `parts_of_same`
+    - `semantically_same` and `parts_of_same` relations with identical arguments (after sorting) are merged
+- `pytorch_ie.documents.TextDocumentWithLabeledSpansBinaryRelationsAndLabeledPartitions`
+  - `LabeledSpans`, as above
+  - `BinaryRelations`, as above
+  - `LabeledPartitions`, created by partitioning `BratDocument`'s `text` into paragraphs via a regex
+    - labels: `title`, `abstract`, `H1`
 
-# load version with resolved parts_of_same relations
-datasets = load_dataset("pie/sciarg", name='resolve_parts_of_same')
-doc = datasets["train"][0]
-assert isinstance(doc, builders.brat.BratDocument)
-```
+See [here](https://github.com/ChristophAlt/pytorch-ie/blob/main/src/pytorch_ie/documents.py) for the document type
+definitions.
 ### Data Splits
 
@@ -133,6 +168,13 @@ possibly since [Lauscher et al., 2018](https://aclanthology.org/W18-5206/) prese
 
 (*Annotation Guidelines*, pp. 4-6)
 
+There are currently discrepancies in label counts between
+
+- the previous report in [Lauscher et al., 2018](https://aclanthology.org/W18-5206/) (p. 43), and
+- the current report above (labels counted in `BratDocument`s);
+
+presumably because [Lauscher et al., 2018](https://aclanthology.org/W18-5206/) reports counts of complete argumentative components, whereas here discontinuous components are still split (marked with the `parts_of_same` helper relation) and thus counted per fragment.
+
 #### Examples
 
 ![sample1](img/leaannof3.png)
@@ -143,9 +185,14 @@ Below: Subset of relations in `A01`
 
 ![sample2](img/sciarg-sam.png)
 
-### Document Converters
+### Collected Statistics after Document Conversion
 
-The dataset provides document converters for the following target document types:
+We use the script `evaluate_documents.py` from [PyTorch-IE-Hydra-Template](https://github.com/ArneBinder/pytorch-ie-hydra-template-1) to generate these statistics.
+After checking out that code, the statistics and plots can be generated by the command:
+
+```commandline
+python src/evaluate_documents.py dataset=sciarg_base metric=METRIC
+```
 
 From `default` version:
 
@@ -178,8 +225,111 @@ From `resolve_parts_of_same` version:
 
 - `labeled_partitions`, `LabeledSpan` annotations, created from splitting `BratDocument`'s `text` at new paragraph in `xml` format.
   - labels: `title`, `abstract`, `H1`
 
-See [here](https://github.com/ArneBinder/pie-modules/blob/main/src/pie_modules/documents.py) for the document type
-definitions.
+This also requires the following dataset config at `configs/dataset/sciarg_base.yaml` in the checked-out repository:
+
+```yaml
+_target_: src.utils.execute_pipeline
+input:
+  _target_: pie_datasets.DatasetDict.load_dataset
+  path: pie/sciarg
+  revision: 982d5682ba414ee13cf92cb93ec18fc8e78e2b81
+```
+
+For token-based metrics, this uses the `bert-base-uncased` tokenizer obtained via `transformers.AutoTokenizer` (see [AutoTokenizer](https://huggingface.co/docs/transformers/v4.37.1/en/model_doc/auto#transformers.AutoTokenizer) and [bert-base-uncased](https://huggingface.co/bert-base-uncased)) to tokenize the `text` of `TextDocumentWithLabeledSpansAndBinaryRelations` (see [document type](https://github.com/ArneBinder/pie-modules/blob/main/src/pie_modules/documents.py)).
+
+#### Relation argument (outer) token distance per label
+
+The distance is measured from the first token of the first argumentative unit to the last token of the last unit, a.k.a. the outer distance.
+
+We collect the following statistics: number of documents in the split (*no. doc*), number of relations (*len*), mean token distance (*mean*), standard deviation of the distance (*std*), minimum outer distance (*min*), and maximum outer distance (*max*).
+We also present histograms in the collapsible sections below, showing the distribution of these relation distances (x-axis) and their counts (y-axis).
+
+<details>
+Command + +``` +python src/evaluate_documents.py dataset=sciarg_base metric=relation_argument_token_distances +``` + +
+ +| | len | max | mean | min | std | +| :---------------- | ----: | ---: | ------: | --: | ------: | +| ALL | 15640 | 2864 | 30.524 | 3 | 45.351 | +| contradicts | 1392 | 238 | 32.565 | 6 | 19.771 | +| parts_of_same | 2594 | 374 | 28.18 | 3 | 26.845 | +| semantically_same | 84 | 2864 | 206.333 | 11 | 492.268 | +| supports | 11570 | 407 | 29.527 | 4 | 24.189 | + +
+ Histogram (split: train, 40 documents) + +![rtd-label_sciarg.png](img%2Frtd-label_sciarg.png) + +
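The `ALL` row and the per-label rows in the table above are group-wise aggregations over (label, distance) pairs; a minimal sketch with invented values:

```python
from collections import defaultdict
from statistics import mean

# (label, outer token distance) pairs; values are illustrative only
distances = [("supports", 30), ("contradicts", 45), ("supports", 80)]

groups = defaultdict(list)
for label, d in distances:
    groups[label].append(d)
groups["ALL"] = [d for _, d in distances]  # aggregate row over all labels

table = {
    label: {"len": len(v), "max": max(v), "mean": round(mean(v), 3), "min": min(v)}
    for label, v in groups.items()
}
print(table)
```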
+
+#### Span lengths (tokens)
+
+The span length is the number of tokens covered by an argumentative unit, measured from its first to its last token.
+
+We collect the following statistics: number of documents in the split (*no. doc*), number of spans (*len*), mean number of tokens per span (*mean*), standard deviation of the number of tokens (*std*), minimum tokens in a span (*min*), and maximum tokens in a span (*max*).
+We also present histograms in the collapsible sections below, showing the distribution of these span lengths (x-axis) and their counts (y-axis).
+
+<details>
+Command + +``` +python src/evaluate_documents.py dataset=sciarg_base metric=span_lengths_tokens +``` + +
+ +| statistics | train | +| :--------- | -----: | +| no. doc | 40 | +| len | 13586 | +| mean | 11.677 | +| std | 8.731 | +| min | 1 | +| max | 138 | + +
+ Histogram (split: train, 40 documents) + +![slt_sciarg.png](img%2Fslt_sciarg.png) + +
+
+#### Token length (tokens)
+
+The token length is the total number of tokens in a document, from its first token to its last.
+
+We collect the following statistics: number of documents in the split (*no. doc*), mean document length in tokens (*mean*), standard deviation of the length (*std*), minimum number of tokens in a document (*min*), and maximum number of tokens in a document (*max*).
+We also present histograms in the collapsible sections below, showing the distribution of these document lengths (x-axis) and their counts (y-axis).
+
+<details>
+Command + +``` +python src/evaluate_documents.py dataset=sciarg_base metric=count_text_tokens +``` + +
+ +| statistics | train | +| :--------- | ------: | +| no. doc | 40 | +| mean | 10521.1 | +| std | 2472.2 | +| min | 6452 | +| max | 16421 | + +
+ Histogram (split: train, 40 documents) + +![tl_sciarg.png](img%2Ftl_sciarg.png) + +
## Dataset Creation diff --git a/dataset_builders/pie/sciarg/img/rtd-label_sciarg.png b/dataset_builders/pie/sciarg/img/rtd-label_sciarg.png new file mode 100644 index 00000000..34500b39 Binary files /dev/null and b/dataset_builders/pie/sciarg/img/rtd-label_sciarg.png differ diff --git a/dataset_builders/pie/sciarg/img/slt_sciarg.png b/dataset_builders/pie/sciarg/img/slt_sciarg.png new file mode 100644 index 00000000..67cd3e37 Binary files /dev/null and b/dataset_builders/pie/sciarg/img/slt_sciarg.png differ diff --git a/dataset_builders/pie/sciarg/img/tl_sciarg.png b/dataset_builders/pie/sciarg/img/tl_sciarg.png new file mode 100644 index 00000000..46f701d8 Binary files /dev/null and b/dataset_builders/pie/sciarg/img/tl_sciarg.png differ diff --git a/dataset_builders/pie/scidtb_argmin/README.md b/dataset_builders/pie/scidtb_argmin/README.md index e5235c75..cb577a6b 100644 --- a/dataset_builders/pie/scidtb_argmin/README.md +++ b/dataset_builders/pie/scidtb_argmin/README.md @@ -3,6 +3,27 @@ This is a [PyTorch-IE](https://github.com/ChristophAlt/pytorch-ie) wrapper for the [SciDTB ArgMin Huggingface dataset loading script](https://huggingface.co/datasets/DFKI-SLT/scidtb_argmin). 
+## Usage
+
+```python
+from pie_datasets import load_dataset
+from pytorch_ie.documents import TextDocumentWithLabeledSpansAndBinaryRelations
+
+# load default version
+dataset = load_dataset("pie/scidtb_argmin")
+
+# if required, normalize the document type (see section Document Converters below)
+dataset_converted = dataset.to_document_type(TextDocumentWithLabeledSpansAndBinaryRelations)
+assert isinstance(dataset_converted["train"][0], TextDocumentWithLabeledSpansAndBinaryRelations)
+
+# get first relation in the first document
+doc = dataset_converted["train"][0]
+print(doc.binary_relations[0])
+# BinaryRelation(head=LabeledSpan(start=251, end=454, label='means', score=1.0), tail=LabeledSpan(start=455, end=712, label='proposal', score=1.0), label='detail', score=1.0)
+print(doc.binary_relations[0].resolve())
+# ('detail', (('means', 'We observe , identify , and detect naturally occurring signals of interestingness in click transitions on the Web between source and target documents , which we collect from commercial Web browser logs .'), ('proposal', 'The DSSM is trained on millions of Web transitions , and maps source-target document pairs to feature vectors in a latent space in such a way that the distance between source documents and their corresponding interesting targets in that space is minimized .')))
+```
+
 ## Data Schema
 
 The document type for this dataset is `SciDTBArgminDocument` which defines the following data fields:
@@ -31,3 +52,120 @@ The dataset provides document converters for the following target document types
 
 See [here](https://github.com/ChristophAlt/pytorch-ie/blob/main/src/pytorch_ie/documents.py) for the document type
 definitions.
+
+### Collected Statistics after Document Conversion
+
+We use the script `evaluate_documents.py` from [PyTorch-IE-Hydra-Template](https://github.com/ArneBinder/pytorch-ie-hydra-template-1) to generate these statistics.
+After checking out that code, the statistics and plots can be generated by the command:
+
+```commandline
+python src/evaluate_documents.py dataset=scidtb_argmin_base metric=METRIC
+```
+
+where `METRIC` is the name of one of the available metric configs in `configs/metric` (see [metrics](https://github.com/ArneBinder/pytorch-ie-hydra-template-1/tree/main/configs/metric)).
+
+This also requires the following dataset config at `configs/dataset/scidtb_argmin_base.yaml` in the checked-out repository:
+
+```yaml
+_target_: src.utils.execute_pipeline
+input:
+  _target_: pie_datasets.DatasetDict.load_dataset
+  path: pie/scidtb_argmin
+  revision: 335a8e6168919d7f204c6920eceb96745dbd161b
+```
+
+For token-based metrics, this uses the `bert-base-uncased` tokenizer obtained via `transformers.AutoTokenizer` (see [AutoTokenizer](https://huggingface.co/docs/transformers/v4.37.1/en/model_doc/auto#transformers.AutoTokenizer) and [bert-base-uncased](https://huggingface.co/bert-base-uncased)) to tokenize the `text` of `TextDocumentWithLabeledSpansAndBinaryRelations` (see [document type](https://github.com/ChristophAlt/pytorch-ie/blob/main/src/pytorch_ie/documents.py)).
+
+#### Relation argument (outer) token distance per label
+
+The distance is measured from the first token of the first argumentative unit to the last token of the last unit, a.k.a. the outer distance.
+
+We collect the following statistics: number of documents in the split (*no. doc*), number of relations (*len*), mean token distance (*mean*), standard deviation of the distance (*std*), minimum outer distance (*min*), and maximum outer distance (*max*).
+We also present histograms in the collapsible sections below, showing the distribution of these relation distances (x-axis) and their counts (y-axis).
+
+<details>
+Command + +``` +python src/evaluate_documents.py dataset=scidtb_argmin_base metric=relation_argument_token_distances +``` + +
+ +| | len | max | mean | min | std | +| :--------- | --: | --: | -----: | --: | -----: | +| ALL | 586 | 277 | 75.239 | 21 | 40.312 | +| additional | 54 | 180 | 59.593 | 36 | 29.306 | +| detail | 258 | 163 | 65.62 | 22 | 29.21 | +| sequence | 22 | 93 | 59.727 | 38 | 17.205 | +| support | 252 | 277 | 89.794 | 21 | 48.118 | + +
+ Histogram (split: train, 60 documents) + +![rtd-label_scitdb-argmin.png](img%2Frtd-label_scitdb-argmin.png) + +
+
+#### Span lengths (tokens)
+
+The span length is the number of tokens covered by an argumentative unit, measured from its first to its last token.
+
+We collect the following statistics: number of documents in the split (*no. doc*), number of spans (*len*), mean number of tokens per span (*mean*), standard deviation of the number of tokens (*std*), minimum tokens in a span (*min*), and maximum tokens in a span (*max*).
+We also present histograms in the collapsible sections below, showing the distribution of these span lengths (x-axis) and their counts (y-axis).
+
+<details>
+Command + +``` +python src/evaluate_documents.py dataset=scidtb_argmin_base metric=span_lengths_tokens +``` + +
+ +| statistics | train | +| :--------- | -----: | +| no. doc | 60 | +| len | 353 | +| mean | 27.946 | +| std | 13.054 | +| min | 7 | +| max | 123 | + +
+ Histogram (split: train, 60 documents) + +![slt_scitdb-argmin.png](img%2Fslt_scitdb-argmin.png) + +
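Note that these are subword-token counts from `bert-base-uncased`, which generally differ from whitespace word counts. A cheap approximation for comparison (illustrative only, not the metric implementation):

```python
# Cheap approximation: whitespace word count. The statistics above instead use
# subword tokens from bert-base-uncased (transformers.AutoTokenizer), which
# typically yields more tokens than whitespace splitting.
def approx_token_count(text: str) -> int:
    return len(text.split())

print(approx_token_count("We propose a neural model for argument mining ."))
```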
+
+#### Token length (tokens)
+
+The token length is the total number of tokens in a document, from its first token to its last.
+
+We collect the following statistics: number of documents in the split (*no. doc*), mean document length in tokens (*mean*), standard deviation of the length (*std*), minimum number of tokens in a document (*min*), and maximum number of tokens in a document (*max*).
+We also present histograms in the collapsible sections below, showing the distribution of these document lengths (x-axis) and their counts (y-axis).
+
+<details>
+Command + +``` +python src/evaluate_documents.py dataset=scidtb_argmin_base metric=count_text_tokens +``` + +
+ +| statistics | train | +| :--------- | ------: | +| no. doc | 60 | +| mean | 164.417 | +| std | 64.572 | +| min | 80 | +| max | 532 | + +
+ Histogram (split: train, 60 documents) + +![tl_scidtb-argmin.png](img%2Ftl_scidtb-argmin.png) + +
diff --git a/dataset_builders/pie/scidtb_argmin/img/rtd-label_scitdb-argmin.png b/dataset_builders/pie/scidtb_argmin/img/rtd-label_scitdb-argmin.png new file mode 100644 index 00000000..d80c88f4 Binary files /dev/null and b/dataset_builders/pie/scidtb_argmin/img/rtd-label_scitdb-argmin.png differ diff --git a/dataset_builders/pie/scidtb_argmin/img/slt_scitdb-argmin.png b/dataset_builders/pie/scidtb_argmin/img/slt_scitdb-argmin.png new file mode 100644 index 00000000..aa9d9823 Binary files /dev/null and b/dataset_builders/pie/scidtb_argmin/img/slt_scitdb-argmin.png differ diff --git a/dataset_builders/pie/scidtb_argmin/img/tl_scidtb-argmin.png b/dataset_builders/pie/scidtb_argmin/img/tl_scidtb-argmin.png new file mode 100644 index 00000000..d14f3fc7 Binary files /dev/null and b/dataset_builders/pie/scidtb_argmin/img/tl_scidtb-argmin.png differ