
Collecting statistics from AM dataset #100

Merged · 42 commits · Nov 7, 2024

Commits
b4631db
added statistics: Relation_argument_outer_token_distance.md
idalr Jan 25, 2024
afba426
minor fix
idalr Jan 25, 2024
4733900
micro fix
idalr Jan 25, 2024
bac5acf
micro fix
idalr Jan 25, 2024
a443e86
micro fix
idalr Jan 25, 2024
27f80e7
added commands and collapsed all histograms
idalr Jan 28, 2024
0d266ed
minor edits
idalr Jan 28, 2024
839a801
added text_length_tokens.md and
idalr Jan 31, 2024
372485d
micro fix
idalr Jan 31, 2024
acb951a
add span_lengths_tokens.md
idalr Jan 31, 2024
b5efb8b
edited texts in markdowns
idalr Jan 31, 2024
f28a0df
micro fix
idalr Jan 31, 2024
88f6012
added relation_argument_outer_token_distance_per_label.md
idalr Feb 1, 2024
e1e5706
micro fix
idalr Feb 1, 2024
4ffab8d
deleted ~~relation_argument_outer_token_distance.md~~
idalr Feb 1, 2024
bd36737
moved collected statistics content to pie/abstrct/readme.md
idalr Feb 19, 2024
43fd083
minor fix
idalr Feb 19, 2024
9aff110
edited text in abstrct/readme
idalr Feb 23, 2024
19d31a5
minor edit abstrct/readme
idalr Mar 11, 2024
7a11771
removed abstrct histograms from statistics folder
idalr Mar 11, 2024
e943371
moved content and files to aae2.readme, and removed from statistics f…
idalr Mar 11, 2024
51e84bc
minor fix
idalr Mar 11, 2024
0c8efdf
minor edit
idalr Mar 11, 2024
01598ec
moved content and files to argmicro/readme.md
idalr Mar 11, 2024
3a6d0f3
moved content and files to cdcp/readme.md
idalr Mar 11, 2024
6105b61
minor fix
idalr Mar 11, 2024
53afeab
moved content and files to scidtb_argmin/readme.md
idalr Mar 11, 2024
4477c28
minor fix
idalr Mar 11, 2024
075737c
moved content and files to sciarg/readme.md
idalr Mar 11, 2024
d87bb85
deleted AM_statistics folder
idalr Mar 11, 2024
c55db07
minor edit
idalr Mar 11, 2024
2d04194
fixed typos
idalr Mar 11, 2024
768eecf
minor adjustments
ArneBinder Nov 7, 2024
10f21bd
minor adjustments
ArneBinder Nov 7, 2024
27f9215
improve usage examples
ArneBinder Nov 7, 2024
19ecf02
add usage example to argmicro
ArneBinder Nov 7, 2024
5d0c88b
fix argmicro usage example
ArneBinder Nov 7, 2024
d5157cb
add cdcp usage example
ArneBinder Nov 7, 2024
e72f2d4
add usage example to scidtb_argmin minor and fix
ArneBinder Nov 7, 2024
045496d
add details to document converter
ArneBinder Nov 7, 2024
f44870f
improve usage example for sciarg
ArneBinder Nov 7, 2024
189c155
move usage example for sciarg
ArneBinder Nov 7, 2024
184 changes: 172 additions & 12 deletions dataset_builders/pie/aae2/README.md
This is a [PyTorch-IE](https://github.com/ChristophAlt/pytorch-ie) wrapper for the AAE2 dataset.

Therefore, the `aae2` dataset as described here follows the data structure from the [PIE brat dataset card](https://huggingface.co/datasets/pie/brat).

### Usage

```python
from pie_datasets import load_dataset
from pie_datasets.builders.brat import BratDocumentWithMergedSpans
from pytorch_ie.documents import TextDocumentWithLabeledSpansBinaryRelationsAndLabeledPartitions

# load default version
dataset = load_dataset("pie/aae2")
assert isinstance(dataset["train"][0], BratDocumentWithMergedSpans)

# if required, normalize the document type (see section Document Converters below)
dataset_converted = dataset.to_document_type(TextDocumentWithLabeledSpansBinaryRelationsAndLabeledPartitions)
assert isinstance(dataset_converted["train"][0], TextDocumentWithLabeledSpansBinaryRelationsAndLabeledPartitions)

# get first relation in the first document
doc = dataset_converted["train"][0]
print(doc.binary_relations[0])
# BinaryRelation(head=LabeledSpan(start=716, end=851, label='Premise', score=1.0), tail=LabeledSpan(start=591, end=714, label='Claim', score=1.0), label='supports', score=1.0)
print(doc.binary_relations[0].resolve())
# ('supports', (('Premise', 'What we acquired from team work is not only how to achieve the same goal with others but more importantly, how to get along with others'), ('Claim', 'through cooperation, children can learn about interpersonal skills which are significant in the future life of all students')))
```

### Dataset Summary

Argument Annotated Essays Corpus (AAEC) ([Stab and Gurevych, 2017](https://aclanthology.org/J17-3005.pdf)) contains student essays. A stance for a controversial theme is expressed by a major claim component as well as claim components, and premise components justify or refute the claims. Attack and support labels are defined as relations. The span covers a statement, *which can stand in isolation as a complete sentence*, according to the AAEC annotation guidelines. All components are annotated with minimum boundaries of a clause or sentence excluding so-called "shell" language such as *On the other hand* and *Hence*. (Morio et al., 2022, p. 642)
The `aae2` dataset comes in a single version (`default`) with `BratDocumentWithMergedSpans` documents.

See [PIE-Brat Data Schema](https://huggingface.co/datasets/pie/brat#data-schema).

### Usage

```python
from pie_datasets import load_dataset, builders

# load default version
datasets = load_dataset("pie/aae2")
doc = datasets["train"][0]
assert isinstance(doc, builders.brat.BratDocumentWithMergedSpans)
```

### Data Splits

| Statistics | Train | Test |
The dataset provides document converters for the following target document types.
See [here](https://github.com/ChristophAlt/pytorch-ie/blob/main/src/pytorch_ie/documents.py) for the document type
definitions.

#### Relation Label Statistics after Document Conversion

When converting from `BratDocumentWithMergedSpans` to `TextDocumentWithLabeledSpansAndBinaryRelations` or `TextDocumentWithLabeledSpansBinaryRelationsAndLabeledPartitions`,
we apply a relation-conversion method (see above) that changes the label counts for the relations as follows:
| Relations           | Count | Percentage |
| :------------------ | ----: | ---------: |
| support: `supports` |  5958 |     89.3 % |
| attack: `attacks`   |   715 |     10.7 % |

### Collected Statistics after Document Conversion

We use the script `evaluate_documents.py` from [PyTorch-IE-Hydra-Template](https://github.com/ArneBinder/pytorch-ie-hydra-template-1) to generate these statistics.
After checking out that code, the statistics and plots can be generated by the command:

```commandline
python src/evaluate_documents.py dataset=aae2_base metric=METRIC
```

where `METRIC` is one of the available metric configs in `configs/metric/` (see [metrics](https://github.com/ArneBinder/pytorch-ie-hydra-template-1/tree/main/configs/metric)).

This also requires the following dataset config in `configs/dataset/aae2_base.yaml` within the repository directory:

```yaml
_target_: src.utils.execute_pipeline
input:
  _target_: pie_datasets.DatasetDict.load_dataset
  path: pie/aae2
  revision: 1015ee38bd8a36549b344008f7a49af72956a7fe
```

For token-based metrics, this uses `bert-base-uncased` with `transformers.AutoTokenizer` (see [AutoTokenizer](https://huggingface.co/docs/transformers/v4.37.1/en/model_doc/auto#transformers.AutoTokenizer) and [bert-base-uncased](https://huggingface.co/bert-base-uncased)) to tokenize the `text` field of `TextDocumentWithLabeledSpansAndBinaryRelations` (see [document type](https://github.com/ChristophAlt/pytorch-ie/blob/main/src/pytorch_ie/documents.py)).

For relation-label statistics, we use the default relation-conversion method, `connect_first`, which yields three distinct relation labels.
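
As a rough illustration of what a `connect_first`-style conversion might do, here is a minimal sketch; the function name, data shapes, and stance labels are assumptions for illustration, not the builder's actual implementation:

```python
# Hedged sketch of a "connect_first"-style relation conversion (illustrative):
# claim stances are materialized as relations to the first MajorClaim, and
# any additional MajorClaims are tied to the first one.

def connect_first(major_claims, claim_stances):
    """major_claims: span ids in document order; claim_stances: {claim_id: 'For'/'Against'}."""
    relations = []
    first = major_claims[0]
    # duplicate major claims are linked to the first one
    for mc in major_claims[1:]:
        relations.append((mc, first, "semantically_same"))
    # stances become directed relations to the first major claim
    stance_to_label = {"For": "supports", "Against": "attacks"}
    for claim, stance in claim_stances.items():
        relations.append((claim, first, stance_to_label[stance]))
    return relations
```

This reproduces the three relation labels (`supports`, `attacks`, `semantically_same`) seen in the statistics below.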

#### Relation argument (outer) token distance per label

The distance is measured from the first token of the first argumentative unit to the last token of the last unit, a.k.a. outer distance.

We collect the following statistics: number of documents in the split (*no. doc*, given in the subsection headings), number of relations (*len*), mean token distance (*mean*), standard deviation of the distance (*std*), minimum outer distance (*min*), and maximum outer distance (*max*).
We also present histograms in collapsible sections, showing the distribution of these relation distances (x-axis) and their counts (y-axis).
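
The outer distance can be sketched as follows; the `(start, end)` token offsets and the function name are illustrative assumptions, not the metric's actual implementation:

```python
# Minimal sketch of the outer token distance for a single relation:
# span from the first token of the earlier argument to the last token
# of the later argument, regardless of head/tail order.

def outer_token_distance(head, tail):
    """head/tail: (start, end) token offsets of the two arguments."""
    start = min(head[0], tail[0])
    end = max(head[1], tail[1])
    return end - start

# e.g. a Premise at token offsets [120, 150) supporting a Claim at [100, 118)
assert outer_token_distance((120, 150), (100, 118)) == 50
```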

<details>
<summary>Command</summary>

```commandline
python src/evaluate_documents.py dataset=aae2_base metric=relation_argument_token_distances
```

</details>

##### train (322 documents)

| | len | max | mean | min | std |
| :---------------- | ---: | --: | ------: | --: | ------: |
| ALL | 9002 | 514 | 102.582 | 9 | 93.76 |
| attacks | 810 | 442 | 127.622 | 10 | 109.283 |
| semantically_same | 552 | 514 | 301.638 | 25 | 73.756 |
| supports | 7640 | 493 | 85.545 | 9 | 74.023 |

<details>
<summary>Histogram (split: train, 322 documents)</summary>

![rtd-label_aae2_train.png](img%2Frtd-label_aae2_train.png)

</details>

##### test (80 documents)

| | len | max | mean | min | std |
| :---------------- | ---: | --: | ------: | --: | -----: |
| ALL | 2372 | 442 | 100.711 | 10 | 92.698 |
| attacks | 184 | 402 | 115.891 | 12 | 98.751 |
| semantically_same | 146 | 442 | 299.671 | 34 | 72.921 |
| supports | 2042 | 437 | 85.118 | 10 | 75.023 |

<details>
<summary>Histogram (split: test, 80 documents)</summary>

![rtd-label_aae2_test.png](img%2Frtd-label_aae2_test.png)

</details>

#### Span lengths (tokens)

The span length is the number of tokens in an argumentative unit, measured from the unit's first token to its last.

We collect the following statistics: number of documents in the split (*no. doc*), number of spans (*len*), mean number of tokens per span (*mean*), standard deviation of the token count (*std*), minimum tokens in a span (*min*), and maximum tokens in a span (*max*).
We also present histograms in collapsible sections, showing the distribution of these span lengths (x-axis) and their counts (y-axis).
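
The aggregation into these statistics can be sketched as below; the sample token counts are made up for illustration, not taken from the dataset:

```python
# Illustrative aggregation of span lengths (token counts per argumentative
# unit) into the summary statistics reported in the tables.
from statistics import mean, stdev

span_lengths = [12, 17, 9, 25, 17, 14]  # hypothetical token counts per span
summary = {
    "len": len(span_lengths),
    "mean": round(mean(span_lengths), 3),
    "std": round(stdev(span_lengths), 3),
    "min": min(span_lengths),
    "max": max(span_lengths),
}
```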

<details>
<summary>Command</summary>

```commandline
python src/evaluate_documents.py dataset=aae2_base metric=span_lengths_tokens
```

</details>

| statistics | train | test |
| :--------- | -----: | -----: |
| no. doc | 322 | 80 |
| len | 4823 | 1266 |
| mean | 17.157 | 16.317 |
| std | 8.079 | 7.953 |
| min | 3 | 3 |
| max | 75 | 50 |

<details>
<summary>Histogram (split: train, 322 documents)</summary>

![slt_aae2_train.png](img%2Fslt_aae2_train.png)

</details>
<details>
<summary>Histogram (split: test, 80 documents)</summary>

![slt_aae2_test.png](img%2Fslt_aae2_test.png)

</details>

#### Text length (tokens)

The text length is the number of tokens in a document's text, measured from the first token to the last.

We collect the following statistics: number of documents in the split (*no. doc*), mean document token length (*mean*), standard deviation of the length (*std*), minimum number of tokens in a document (*min*), and maximum number of tokens in a document (*max*).
We also present histograms in collapsible sections, showing the distribution of these document lengths (x-axis) and their counts (y-axis).

<details>
<summary>Command</summary>

```commandline
python src/evaluate_documents.py dataset=aae2_base metric=count_text_tokens
```

</details>

| statistics | train | test |
| :--------- | ------: | -----: |
| no. doc | 322 | 80 |
| mean | 377.686 | 378.4 |
| std | 64.534 | 66.054 |
| min | 236 | 269 |
| max | 580 | 532 |

<details>
<summary>Histogram (split: train, 322 documents)</summary>

![tl_aae2_train.png](img%2Ftl_aae2_train.png)

</details>
<details>
<summary>Histogram (split: test, 80 documents)</summary>

![tl_aae2_test.png](img%2Ftl_aae2_test.png)

</details>

## Dataset Creation

### Curation Rationale
Binary file added dataset_builders/pie/aae2/img/slt_aae2_test.png
Binary file added dataset_builders/pie/aae2/img/slt_aae2_train.png
Binary file added dataset_builders/pie/aae2/img/tl_aae2_test.png
Binary file added dataset_builders/pie/aae2/img/tl_aae2_train.png