edited text in abstrct/readme
idalr committed Feb 23, 2024
1 parent 78f074a commit d4167ce
Showing 1 changed file with 12 additions and 7 deletions.
19 changes: 12 additions & 7 deletions dataset_builders/pie/abstrct/README.md
@@ -111,8 +111,16 @@ Morio et al. ([2022](https://aclanthology.org/2022.tacl-1.37.pdf); p. 642, Table

### Collected Statistics after Document Conversion

In this section, we collect further statistics of the dataset after the conversion to `TextDocumentWithLabeledSpansAndBinaryRelations`.
In the commands, we used the following dataset config: `configs/dataset/abstrct_base.yaml`:
We use the script `evaluate_documents.py` from [PyTorch-IE-Hydra-Template](https://github.com/ArneBinder/pytorch-ie-hydra-template-1) to generate these statistics.
After checking out that code, the statistics and plots can be generated by the command:

```commandline
python src/evaluate_documents.py dataset=abstrct_base metric=METRIC
```

where `METRIC` is the name of one of the available metric configs in `configs/metric/METRIC` (see [metrics](https://github.com/ArneBinder/pytorch-ie-hydra-template-1/tree/main/configs/metric)).

This also requires the following dataset config for this dataset at `configs/dataset/abstrct_base.yaml` within the repo directory:

```yaml
_target_: src.utils.execute_pipeline
@@ -122,10 +130,7 @@ input:
revision: 277dc703fd78614635e86fe57c636b54931538b2
```

The script `evaluate_documents.py` comes from [PyTorch-IE-Hydra-Template](https://github.com/ArneBinder/pytorch-ie-hydra-template-1).

For the tokenization, we use `bert-base-uncased` from `transformers.AutoTokenizer` (see [AutoTokenizer](https://huggingface.co/docs/transformers/v4.37.1/en/model_doc/auto#transformers.AutoTokenizer), and [bert-base-uncased](https://huggingface.co/bert-base-uncased))
to tokenize `text` in `TextDocumentWithLabeledSpansAndBinaryRelations` (see [document type](https://github.com/ChristophAlt/pytorch-ie/blob/main/src/pytorch_ie/documents.py)).
For token-based metrics, this uses `bert-base-uncased` from `transformers.AutoTokenizer` (see [AutoTokenizer](https://huggingface.co/docs/transformers/v4.37.1/en/model_doc/auto#transformers.AutoTokenizer) and [bert-base-uncased](https://huggingface.co/bert-base-uncased)) to tokenize `text` in `TextDocumentWithLabeledSpansAndBinaryRelations` (see [document type](https://github.com/ChristophAlt/pytorch-ie/blob/main/src/pytorch_ie/documents.py)).

#### Relation argument (outer) token distance per label

@@ -134,7 +139,7 @@ The distance is measured from the first token of the first argumentative unit to
We collect the following statistics: number of documents in the split (*no. doc*), no. of relations (*len*), mean of token distance (*mean*), standard deviation of the distance (*std*), minimum outer distance (*min*), and maximum outer distance (*max*).
We also present histograms in the collapsible section below, showing the distribution of these relation distances (x-axis: distance; y-axis: unit counts).
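
As a rough illustration of how the per-label statistics listed above (*len*, *mean*, *std*, *min*, *max*) could be derived once the token distances are known, here is a minimal sketch. It is not the implementation used by `evaluate_documents.py`: the function name `summarize_distances`, the label names, and the toy distances are hypothetical, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev


def summarize_distances(distances_by_label):
    """Compute len/mean/std/min/max of token distances for each relation label.

    `distances_by_label` maps a relation label to a list of token distances.
    Hypothetical helper for illustration only.
    """
    summary = {}
    for label, distances in distances_by_label.items():
        if not distances:
            continue  # no relations with this label, nothing to summarize
        summary[label] = {
            "len": len(distances),
            "mean": mean(distances),
            # Population standard deviation; whether the reported numbers
            # use population or sample std is an assumption here.
            "std": pstdev(distances),
            "min": min(distances),
            "max": max(distances),
        }
    return summary


# Hypothetical toy data, not taken from the dataset.
example = {"label_a": [10, 20, 30], "label_b": [5, 15]}
stats = summarize_distances(example)
for label, values in stats.items():
    print(label, values)
```

Keeping the distances grouped by label before aggregating mirrors the per-label layout of the statistics; a histogram per label could be drawn from the same lists.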

*Note that*: to collect the relation argument distance by tokens, the [respective branch](https://github.com/ArneBinder/pytorch-ie-hydra-template-1/pull/135) must be checked out instead of the `main` branch.
**Important**: To reproduce the statistics and figures for this metric yourself, the [`relation_argument_distance_collector` branch](https://github.com/ArneBinder/pytorch-ie-hydra-template-1/pull/135) of `pytorch-ie-hydra-template` needs to be checked out instead of the `main` branch.

<details>
<summary>Command</summary>