diff --git a/dataset_builders/pie/abstrct/README.md b/dataset_builders/pie/abstrct/README.md
index 0a07da6b..e7f00bcc 100644
--- a/dataset_builders/pie/abstrct/README.md
+++ b/dataset_builders/pie/abstrct/README.md
@@ -111,8 +111,16 @@ Morio et al. ([2022](https://aclanthology.org/2022.tacl-1.37.pdf); p. 642, Table
 ### Collected Statistics after Document Conversion
 
-In this section, we collect further statistics of the dataset after the conversion to `TextDocumentWithLabeledSpansAndBinaryRelations`.
-In the commands, we used the following dataset config: `configs/dataset/abstrct_base.yaml`:
+We use the script `evaluate_documents.py` from [PyTorch-IE-Hydra-Template](https://github.com/ArneBinder/pytorch-ie-hydra-template-1) to generate these statistics.
+After checking out that code, the statistics and plots can be generated with the command:
+
+```commandline
+python src/evaluate_documents.py dataset=abstrct_base metric=METRIC
+```
+
+where `METRIC` is the name of one of the available metric configs in `configs/metric/` (see [metrics](https://github.com/ArneBinder/pytorch-ie-hydra-template-1/tree/main/configs/metric)).
+
+This also requires the following dataset config to be available at `configs/dataset/abstrct_base.yaml` in the repository:
 
 ```commandline
 _target_: src.utils.execute_pipeline
 input:
@@ -122,10 +130,7 @@ input:
   revision: 277dc703fd78614635e86fe57c636b54931538b2
 ```
 
-The script `evaluate_documents.py` comes from [PyTorch-IE-Hydra-Template](https://github.com/ArneBinder/pytorch-ie-hydra-template-1).
-
-For the tokenization, we use `bert-base-uncased` from `transformer.AutoTokenizer` (see [AutoTokenizer](https://huggingface.co/docs/transformers/v4.37.1/en/model_doc/auto#transformers.AutoTokenizer), and [bert-based-uncased](https://huggingface.co/bert-base-uncased))
-to tokenize `text` in `TextDocumentWithLabeledSpansAndBinaryRelations` (see [document type](https://github.com/ChristophAlt/pytorch-ie/blob/main/src/pytorch_ie/documents.py)).
+For token-based metrics, this uses `bert-base-uncased` from `transformers.AutoTokenizer` (see [AutoTokenizer](https://huggingface.co/docs/transformers/v4.37.1/en/model_doc/auto#transformers.AutoTokenizer) and [bert-base-uncased](https://huggingface.co/bert-base-uncased)) to tokenize the `text` in `TextDocumentWithLabeledSpansAndBinaryRelations` (see [document type](https://github.com/ChristophAlt/pytorch-ie/blob/main/src/pytorch_ie/documents.py)).
 
 #### Relation argument (outer) token distance per label
 
@@ -134,7 +139,7 @@ The distance is measured from the first token of the first argumentative unit to
 We collect the following statistics: number of documents in the split (*no. doc*), no. of relations (*len*), mean of token distance (*mean*), standard deviation of the distance (*std*), minimum outer distance (*min*), and maximum outer distance (*max*).
 We also present histograms in the collasible, showing the distribution of these relation distances (x-axis; and unit-counts in y-axis), accordingly.
-*Note that*: to collect the relation argument distance by tokens, the [respective branch](https://github.com/ArneBinder/pytorch-ie-hydra-template-1/pull/135) must be checked out instead of the `main` branch.
+**Important**: To reproduce the statistics and figures for this metric yourself, the [`relation_argument_distance_collector` branch](https://github.com/ArneBinder/pytorch-ie-hydra-template-1/pull/135) of `pytorch-ie-hydra-template` needs to be checked out instead of the `main` branch!
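For readers unfamiliar with the tokenization step referenced in the last hunks above, here is a minimal sketch (not part of the PR) of how the token-based metrics are described to tokenize a document's `text`. The example string is invented; real inputs come from the converted `TextDocumentWithLabeledSpansAndBinaryRelations` documents:

```python
from transformers import AutoTokenizer

# Tokenizer named in the README for the token-based metrics.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Hypothetical abstract sentence; real texts come from the `text` field of the
# converted TextDocumentWithLabeledSpansAndBinaryRelations documents.
text = "Letrozole significantly improved disease-free survival compared with tamoxifen."

encoding = tokenizer(text)
# Token count that token-based statistics are computed over (incl. [CLS]/[SEP]).
print(len(encoding["input_ids"]))
```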
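Similarly, the "outer" token distance used by the relation-argument metric can be sketched as follows. The helper name and the `(start, end)` token offsets are hypothetical illustrations; the actual collector is implemented in the linked `relation_argument_distance_collector` branch:

```python
# Hedged sketch: the "outer" distance spans from the first token of the first
# argumentative unit to the last token of the last one. Offsets are given as
# (start, end_exclusive) token indices of the two relation arguments.
def outer_token_distance(head: tuple[int, int], tail: tuple[int, int]) -> int:
    first_start = min(head[0], tail[0])  # first token of the first unit
    last_end = max(head[1], tail[1])     # last token of the last unit
    return last_end - first_start

# E.g. an Evidence span at tokens [5, 25) supporting a Claim at tokens [40, 55):
print(outer_token_distance(head=(40, 55), tail=(5, 25)))  # -> 50
```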