edited text in abstrct/readme

ArneBinder · Feb 23, 2024 · d4167ce
1 parent 78f074a
Showing 1 changed file with 12 additions and 7 deletions.
diff --git a/dataset_builders/pie/abstrct/README.md b/dataset_builders/pie/abstrct/README.md
@@ -111,8 +111,16 @@ Morio et al. ([2022](https://aclanthology.org/2022.tacl-1.37.pdf); p. 642, Table
 
 ### Collected Statistics after Document Conversion
 
-In this section, we collect further statistics of the dataset after the conversion to `TextDocumentWithLabeledSpansAndBinaryRelations`.
-In the commands, we used the following dataset config: `configs/dataset/abstrct_base.yaml`:
+We use the script `evaluate_documents.py` from [PyTorch-IE-Hydra-Template](https://github.com/ArneBinder/pytorch-ie-hydra-template-1) to generate these statistics.
+After checking out that code, the statistics and plots can be generated by the command:
+
+```commandline
+python src/evaluate_documents.py dataset=abstrct_base metric=METRIC
+```
+
+where `METRIC` names one of the available metric configs under `configs/metric/` (see [metrics](https://github.com/ArneBinder/pytorch-ie-hydra-template-1/tree/main/configs/metric)).
+
+This also requires the following config for this dataset at `configs/dataset/abstrct_base.yaml` within the repo directory:
 
 ```commandline
 _target_: src.utils.execute_pipeline
@@ -122,10 +130,7 @@ input:
   revision: 277dc703fd78614635e86fe57c636b54931538b2
 ```
 
-The script `evaluate_documents.py` comes from [PyTorch-IE-Hydra-Template](https://github.com/ArneBinder/pytorch-ie-hydra-template-1).
-
-For the tokenization, we use `bert-base-uncased` from `transformer.AutoTokenizer` (see [AutoTokenizer](https://huggingface.co/docs/transformers/v4.37.1/en/model_doc/auto#transformers.AutoTokenizer), and [bert-based-uncased](https://huggingface.co/bert-base-uncased))
-to tokenize `text` in `TextDocumentWithLabeledSpansAndBinaryRelations` (see [document type](https://github.com/ChristophAlt/pytorch-ie/blob/main/src/pytorch_ie/documents.py)).
+For token-based metrics, this uses `bert-base-uncased` from `transformers.AutoTokenizer` (see [AutoTokenizer](https://huggingface.co/docs/transformers/v4.37.1/en/model_doc/auto#transformers.AutoTokenizer) and [bert-base-uncased](https://huggingface.co/bert-base-uncased)) to tokenize `text` in `TextDocumentWithLabeledSpansAndBinaryRelations` (see [document type](https://github.com/ChristophAlt/pytorch-ie/blob/main/src/pytorch_ie/documents.py)).
 
 #### Relation argument (outer) token distance per label
 
@@ -134,7 +139,7 @@ The distance is measured from the first token of the first argumentative unit to
 We collect the following statistics: number of documents in the split (*no. doc*), no. of relations (*len*), mean of token distance (*mean*), standard deviation of the distance (*std*), minimum outer distance (*min*), and maximum outer distance (*max*).
 We also present histograms in the collapsible sections, showing the distribution of these relation distances (distance on the x-axis, unit counts on the y-axis).
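
As a rough illustration of the summary columns listed above, the following minimal sketch computes them from a plain list of token distances. The function name `summarize_distances` and the input values are hypothetical and not taken from `evaluate_documents.py` or the dataset, and the use of the population standard deviation is an assumption; the actual script may compute *std* differently.

```python
from statistics import mean, pstdev


def summarize_distances(distances, num_docs):
    """Summarize outer token distances for one relation label in one split.

    `distances` holds one token distance per relation; `num_docs` is the
    number of documents in the split. Both are supplied by the caller.
    """
    if not distances:
        raise ValueError("need at least one relation distance")
    return {
        "no. doc": num_docs,
        "len": len(distances),      # number of relations
        "mean": mean(distances),
        # Assumption: population standard deviation; the real script may
        # use the sample standard deviation instead.
        "std": pstdev(distances),
        "min": min(distances),
        "max": max(distances),
    }


# Hypothetical distances for illustration only, not dataset values.
example = summarize_distances([10, 20, 30, 40], num_docs=2)
print(example)
```

Keeping the summary in a dictionary keyed by the column names makes it straightforward to turn one result per split and label into a table row.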
 
-*Note that*: to collect the relation argument distance by tokens, the [respective branch](https://github.com/ArneBinder/pytorch-ie-hydra-template-1/pull/135) must be checked out instead of the `main` branch.
+**Important**: To reproduce the statistics and figures for this metric yourself, the [`relation_argument_distance_collector` branch](https://github.com/ArneBinder/pytorch-ie-hydra-template-1/pull/135) of `pytorch-ie-hydra-template` needs to be checked out instead of the `main` branch!
 
 <details>
 <summary>Command</summary>