From 96646fc0878dac57cba0ee2d5f4ee2367f5b0b63 Mon Sep 17 00:00:00 2001 From: Ruangrin L <88072261+idalr@users.noreply.github.com> Date: Thu, 21 Dec 2023 15:09:48 +0100 Subject: [PATCH 01/14] created files --- dataset_builders/pie/aae2/README.md | 0 dataset_builders/pie/aae2/aae2.py | 0 dataset_builders/pie/aae2/requirements.txt | 0 tests/dataset_builders/pie/test_aae2.py | 0 4 files changed, 0 insertions(+), 0 deletions(-) create mode 100644 dataset_builders/pie/aae2/README.md create mode 100644 dataset_builders/pie/aae2/aae2.py create mode 100644 dataset_builders/pie/aae2/requirements.txt create mode 100644 tests/dataset_builders/pie/test_aae2.py diff --git a/dataset_builders/pie/aae2/README.md b/dataset_builders/pie/aae2/README.md new file mode 100644 index 00000000..e69de29b diff --git a/dataset_builders/pie/aae2/aae2.py b/dataset_builders/pie/aae2/aae2.py new file mode 100644 index 00000000..e69de29b diff --git a/dataset_builders/pie/aae2/requirements.txt b/dataset_builders/pie/aae2/requirements.txt new file mode 100644 index 00000000..e69de29b diff --git a/tests/dataset_builders/pie/test_aae2.py b/tests/dataset_builders/pie/test_aae2.py new file mode 100644 index 00000000..e69de29b From 1f56d5bef34f1f2e94de537d4ae61f442e703437 Mon Sep 17 00:00:00 2001 From: Ruangrin L <88072261+idalr@users.noreply.github.com> Date: Thu, 21 Dec 2023 15:57:31 +0100 Subject: [PATCH 02/14] imported scripts from PR64 --- dataset_builders/pie/aae2/README.md | 270 +++++++++ dataset_builders/pie/aae2/aae2.py | 181 ++++++ dataset_builders/pie/aae2/requirements.txt | 2 + tests/dataset_builders/pie/test_aae2.py | 650 +++++++++++++++++++++ 4 files changed, 1103 insertions(+) diff --git a/dataset_builders/pie/aae2/README.md b/dataset_builders/pie/aae2/README.md index e69de29b..120b3a21 100644 --- a/dataset_builders/pie/aae2/README.md +++ b/dataset_builders/pie/aae2/README.md @@ -0,0 +1,270 @@ +# PIE Dataset Card for "aae2" + +This is a 
[PyTorch-IE](https://github.com/ChristophAlt/pytorch-ie) wrapper for the Argument Annotated Essays v2 (AAE2) dataset ([paper](https://aclanthology.org/J17-3005.pdf) and [homepage](https://tudatalib.ulb.tu-darmstadt.de/handle/tudatalib/2422)). Since the AAE2 dataset is published in the [BRAT standoff format](https://brat.nlplab.org/standoff.html), this dataset builder is based on the [PyTorch-IE brat dataset loading script](https://huggingface.co/datasets/pie/brat). + +Therefore, the `aae2` dataset as described here follows the data structure from the [PIE brat dataset card](https://huggingface.co/datasets/pie/brat). + +### Dataset Summary + +The Argument Annotated Essays Corpus (AAEC) ([Stab and Gurevych, 2017](https://aclanthology.org/J17-3005.pdf)) contains student essays. A stance for a controversial theme is expressed by a `MajorClaim` component as well as `Claim` components, and `Premise` components justify or refute the claims. `Attack` and `Support` labels are defined as relations. The span covers a statement, *which can stand in isolation as a complete sentence*, according to the AAEC annotation guidelines. All components are annotated with minimum boundaries of a clause or sentence excluding so-called "shell" language such as *On the other hand* and *Hence*. (Morio et al., 2022, p. 642) + +The data can be used at two levels: essay-level and paragraph-level ([Eger et al., 2017](https://aclanthology.org/P17-1002/)). The tree structure is complete within each paragraph: no `Premise` links to a `Premise` or `Claim` in a different paragraph, as seen in **Example** below. It is therefore possible to train a model at the paragraph level, which is also less memory-intensive (Eger et al., 2017, p. 16).
+ +### Supported Tasks and Leaderboards + +- **Tasks**: Argumentation Mining, Component Identification, Component Classification, Structure Identification +- **Leaderboard:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards) + +### Languages + +The language in the dataset is English (persuasive essays). + +### Dataset Variants + +See [PIE-Brat Dataset Variants](https://huggingface.co/datasets/pie/brat#dataset-variants). + +## Data Schema + +See [PIE-Brat Data Schema](https://huggingface.co/datasets/pie/brat#data-schema). + +### Usage + +```python +from pie_datasets import load_dataset, builders + +# load default version +datasets = load_dataset("pie/aae2") +doc = datasets["train"][0] +assert isinstance(doc, builders.brat.BratDocument) + +# load version with merged span fragments +dataset_merged_spans = load_dataset("pie/aae2", name="merge_fragmented_spans") +doc_merged_spans = dataset_merged_spans["train"][0] +assert isinstance(doc_merged_spans, builders.brat.BratDocumentWithMergedSpans) +``` + +## Document Converters + +The dataset provides document converters for the following target document types: + +- `pytorch_ie.documents.TextDocumentWithLabeledSpansAndBinaryRelations` + - `LabeledSpans`, converted from `BratDocument`'s `spans` + - labels: `MajorClaim`, `Claim`, `Premise` + - `BinaryRelations`, converted from `BratDocument`'s `relations` + - labels: `supports`, `attacks`, `semantically_same` + - there are two conversion methods that convert `Claim`'s attributes to their relations to `MajorClaim` (see [Relation Conversions](#relation-conversions) below, for more details) +- `pytorch_ie.documents.TextDocumentWithLabeledSpansBinaryRelationsAndLabeledPartitions` + - `LabeledSpans`, as above + - `BinaryRelations`, as above + - `LabeledPartitions`, partitioned `BratDocument`'s `text`, according to the paragraph, using regex.
+ - every partition is labeled as `labeled_partition` + +See [here](https://github.com/ChristophAlt/pytorch-ie/blob/main/src/pytorch_ie/documents.py) for the document type +definitions. + +### Data Splits + +| Statistics | Train | Test | +| ---------------------------------------------------------------- | -------------------------: | -----------------------: | +| No. of documents | 322 | 80 | +| Components<br/>- `MajorClaim`<br/>- `Claim`<br/>- `Premise` | <br/>598<br/>1202<br/>3023 | <br/>153<br/>304<br/>809 | +| Relations\*<br/>- `Support`<br/>- `Attack` | <br/>3820<br/>405 | <br/>1021<br/>92 | + +\* Includes all relations between claims and premises and all claim attributions. + +See further statistics in Stab & Gurevych (2017), p. 650, Table A.1. + +### Label Descriptions + +#### Components + +| Components | Count | Percentage | +| ------------ | ----: | ---------: | +| `MajorClaim` | 751 | 12.3 % | +| `Claim` | 1506 | 24.7 % | +| `Premise` | 3832 | 62.9 % | + +- `MajorClaim` is the root node of the argumentation structure and represents the author’s standpoint on the topic. Essay bodies either support or attack the author’s standpoint expressed in the major claim. +- `Claim` constitutes the central component of each argument. Each claim has at least one premise and takes the value "for" or "against". +- `Premise` gives a reason for the argument and is linked either to a claim or to another premise. + +**Note that** relations between `MajorClaim` and `Claim` were not annotated; however, each claim is annotated with an `Attribute`, `for` or `against`, which indicates the relation between itself and `MajorClaim`. In addition, when two unrelated `Claim`s appear in one paragraph, no relation between them is annotated either. + +#### Relations + +| Relations | Count | Percentage | +| ------------------ | ----: | ---------: | +| support: `Support` | 3613 | 94.3 % | +| attack: `Attack` | 219 | 5.7 % | + +- "Each premise `p` has one **outgoing relation** (i.e., there is a relation that has p as source component) and none or several **incoming relations** (i.e., there can be a relation with `p` as target component)." +- "A `Claim` can exhibit several **incoming relations** but no **outgoing relation**." (S&G, 2017, p. 68) +- "The relations from the claims of the arguments to the major claim are dotted since we will not explicitly annotated them. The relation of each argument to the major claim is indicated by a stance attribute of each claim. This attribute can either be for or against as illustrated in figure 1.4."
+ (Stab & Gurevych, *Guidelines for Annotating Argumentation Structures in Persuasive Essays*, 2015, p. 5) + +See further description in Stab & Gurevych (2017), p. 627, and the [annotation guideline](https://github.com/ArneBinder/pie-datasets/blob/db94035602610cefca2b1678aa2fe4455c96155d/data/datasets/ArgumentAnnotatedEssays-2.0/guideline.pdf). + +#### Relation Conversions + +When converting from `BratDocument(WithMergedSpan)` to `TextDocumentWithLabeledSpansAndBinaryRelations` and `TextDocumentWithLabeledSpansBinaryRelationsAndLabeledPartitions`, +we apply a relation-conversion method to build relations between `Claim`s and `MajorClaim`s, based on the annotated attribution of each `Claim`. + +The two conversion methods are: + +1. `connect_first` (default): + - build a `Support` or `Attack` relation from each `Claim` to the first `MajorClaim`, and + - build a `semantically_same` relation from each subsequent `MajorClaim` to the first `MajorClaim` + +The relation counts for this conversion method are as follows: + +| Relations | Count | Percentage | +| -------------------------- | ----: | ---------: | +| support: `Support` | 4841 | 85.1 % | +| attack: `Attack` | 497 | 8.7 % | +| other: `semantically_same` | 349 | 6.2 % | + +2. `connect_all` + - build a `Support` or `Attack` relation from each `Claim` to every `MajorClaim` + - no relations between `MajorClaim`s + +The relation counts for this conversion method are as follows: + +| Relations | Count | Percentage | +| ------------------ | ----: | ---------: | +| support: `Support` | 5958 | 89.3 % | +| attack: `Attack` | 715 | 10.7 % | + +## Dataset Creation + +### Curation Rationale + +"The identification of argumentation structures involves several subtasks like separating argumentative from non-argumentative text units (Moens et al. 2007; Florou +et al.
2013), classifying argument components into claims and premises (Mochales-Palau and Moens 2011; Rooney, Wang, and Browne 2012; Stab and Gurevych 2014b), +and identifying argumentative relations (Mochales-Palau and Moens 2009; Peldszus +2014; Stab and Gurevych 2014b). However, an approach that covers all subtasks is still +missing. Furthermore, most approaches operate locally and do not optimize the global +argumentation structure. + +"In addition, +to the lack of end-to-end approaches for parsing argumentation structures, there are +relatively few corpora annotated with argumentation structures at the discourse-level." (p. 621) + +"Our primary motivation for this work is to create argument analysis methods +for argumentative writing support systems and to achieve a better understanding +of argumentation structures." (p. 622) + +### Source Data + +Persuasive essays were collected from [essayforum.com](https://essayforum.com/) (see the essay prompts, along with the essays' `id`s, [here](https://github.com/ArneBinder/pie-datasets/blob/db94035602610cefca2b1678aa2fe4455c96155d/data/datasets/ArgumentAnnotatedEssays-2.0/prompts.csv)). + +#### Initial Data Collection and Normalization + +"We randomly selected 402 English essays with a description of the writing prompt from +essayforum.com. This online forum is an active community that provides correction and +feedback about different texts such as research papers, essays, or poetry. For example, +students post their essays in order to receive feedback about their writing skills while +preparing for standardized language tests. The corpus includes 7,116 sentences with +147,271 tokens." (p. 630) + +#### Who are the source language producers?
+ +[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards) + +### Annotations + +#### Annotation process + +The annotation was done using the BRAT Rapid Annotation Tool ([Stenetorp et al., 2012](https://aclanthology.org/E12-2021/)). + +All three annotators independently annotated a random subset of 80 essays. The +remaining 322 essays were annotated by the expert annotator. + +The authors evaluated the inter-annotator agreement using observed agreement and Fleiss’ κ (Fleiss 1971) for each label in each subtask, +namely, component identification, component classification, and relation identification. +The results are reported in their [paper](https://aclanthology.org/J17-3005.pdf) in Tables 2-4. + +#### Who are the annotators? + +Three non-native speakers, one of whom was an expert annotator. + +### Personal and Sensitive Information + +[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards) + +## Considerations for Using the Data + +### Social Impact of Dataset + +"\[Computational Argumentation\] have +broad application potential in various areas such as legal decision support (Mochales-Palau and Moens 2009), information retrieval (Carstens and Toni 2015), policy making (Sardianos et al. 2015), and debating technologies (Levy et al. 2014; Rinott et al. +2015)." (p. 619) + +### Discussion of Biases + +[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards) + +### Other Known Limitations + +The relations between claims and major claims are not explicitly annotated. + +"The proportion of non-argumentative text amounts to 47,474 tokens (32.2%) and +1,631 sentences (22.9%).
The number of sentences with several argument components +is 583, of which 302 include several components with different types (e.g., a claim followed by premise)... +\[T\]he identification of argument components requires the +separation of argumentative from non-argumentative text units and the recognition of +component boundaries at the token level...The proportion of paragraphs with unlinked +argument components (e.g., unsupported claims without incoming relations) is 421 +(23%). Thus, methods that link all argument components in a paragraph are only of +limited use for identifying the argumentation structures in our corpus. + +"Most of the arguments are convergent—that is, the depth of the +argument is 1. The number of arguments with serial structure is 236 (20.9%)." (p. 634) + +## Additional Information + +### Dataset Curators + +[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards) + +### Licensing Information + +**License**: [License description by TU Darmstadt](https://tudatalib.ulb.tu-darmstadt.de/bitstream/handle/tudatalib/2422/arg_annotated_essays_v2_license.pdf?sequence=2&isAllowed=y) + +**Funding**: This work has been supported by the +Volkswagen Foundation as part of the +Lichtenberg-Professorship Program under +grant no. I/82806 and by the German Federal +Ministry of Education and Research (BMBF) +as a part of the Software Campus project +AWS under grant no. 01—S12054. 
+ +### Citation Information + +``` +@article{stab2017parsing, + title={Parsing argumentation structures in persuasive essays}, + author={Stab, Christian and Gurevych, Iryna}, + journal={Computational Linguistics}, + volume={43}, + number={3}, + pages={619--659}, + year={2017}, + publisher={MIT Press One Rogers Street, Cambridge, MA 02142-1209, USA journals-info~…} +} +``` + +``` +@misc{https://tudatalib.ulb.tu-darmstadt.de/handle/tudatalib/2422, +url = { https://tudatalib.ulb.tu-darmstadt.de/handle/tudatalib/2422 }, +author = { Stab, Christian and Gurevych, Iryna }, +keywords = { Argument Mining, 409-06 Informationssysteme, Prozess- und Wissensmanagement, 004 }, +publisher = { Technical University of Darmstadt }, +year = { 2017 }, +copyright = { License description }, +title = { Argument Annotated Essays (version 2) } +} +``` + +### Contributions + +Thanks to [@ArneBinder](https://github.com/ArneBinder) and [@idalr](https://github.com/idalr) for adding this dataset. diff --git a/dataset_builders/pie/aae2/aae2.py b/dataset_builders/pie/aae2/aae2.py index e69de29b..fa44ea63 100644 --- a/dataset_builders/pie/aae2/aae2.py +++ b/dataset_builders/pie/aae2/aae2.py @@ -0,0 +1,181 @@ +import os +from typing import Dict + +import pandas as pd +from pie_modules.document.processing import RegexPartitioner +from pytorch_ie.annotations import BinaryRelation +from pytorch_ie.documents import ( + TextDocumentWithLabeledSpansAndBinaryRelations, + TextDocumentWithLabeledSpansBinaryRelationsAndLabeledPartitions, +) + +from pie_datasets.builders import BratBuilder +from pie_datasets.builders.brat import BratConfig, BratDocumentWithMergedSpans +from pie_datasets.core.dataset import DocumentConvertersType +from pie_datasets.document.processing import Caster, Converter, Pipeline + + +def get_split_paths(url_split_ids: str, subdirectory: str) -> Dict[str, str]: + df_splits = pd.read_csv(url_split_ids, sep=";") + splits2ids = df_splits.groupby(df_splits["SET"]).agg(list).to_dict()["ID"] + 
return { + split.lower(): [os.path.join(subdirectory, split_id) for split_id in split_ids] + for split, split_ids in splits2ids.items() + } + + +URL = "https://github.com/ArneBinder/pie-datasets/raw/83fb46f904b13f335b6da3cce2fc7004d802ce4e/data/datasets/ArgumentAnnotatedEssays-2.0/brat-project-final.zip" +URL_SPLIT_IDS = "https://raw.githubusercontent.com/ArneBinder/pie-datasets/83fb46f904b13f335b6da3cce2fc7004d802ce4e/data/datasets/ArgumentAnnotatedEssays-2.0/train-test-split.csv" +SPLIT_PATHS = get_split_paths(URL_SPLIT_IDS, subdirectory="brat-project-final") + +DEFAULT_ATTRIBUTIONS_TO_RELATIONS_DICT = {"For": "supports", "Against": "attacks"} + + +def convert_aae2_claim_attributions_to_relations( + document: BratDocumentWithMergedSpans, + method: str, + attributions_to_relations_mapping: Dict[str, str] = DEFAULT_ATTRIBUTIONS_TO_RELATIONS_DICT, + major_claim_label: str = "MajorClaim", + claim_label: str = "Claim", + semantically_same_label: str = "semantically_same", +) -> TextDocumentWithLabeledSpansAndBinaryRelations: + """This function collects the attributions of Claims from BratDocumentWithMergedSpans, and + builds new relations between MajorClaims and Claims based on these attributions in the following + way: + 1) "connect_first": + Each Claim points to the first MajorClaim, + and every other MajorClaim is labeled as semantically the same as the first MajorClaim. + The number of new relations created is: NoOfMajorClaim - 1 + NoOfClaim. + 2) "connect_all": + Each Claim points to every MajorClaim, creating many-to-many relations. + The number of new relations created is: NoOfMajorClaim x NoOfClaim. + + The attributions are transformed into the relation labels as listed in + the DEFAULT_ATTRIBUTIONS_TO_RELATIONS_DICT dictionary.
+ """ + document = document.copy() + new_document = TextDocumentWithLabeledSpansAndBinaryRelations( + text=document.text, id=document.id, metadata=document.metadata + ) + # import from document + spans = document.spans.clear() + new_document.labeled_spans.extend(spans) + relations = document.relations.clear() + new_document.binary_relations.extend(relations) + + claim_attributes = [ + attribute + for attribute in document.span_attributes + if attribute.annotation.label == claim_label + ] + + # get all MajorClaims + # sorted by start position to ensure the first MajorClaim is really the first one that occurs in the text + major_claims = sorted( + [mc for mc in new_document.labeled_spans if mc.label == major_claim_label], + key=lambda span: span.start, + ) + + if method == "connect_first": + if len(major_claims) > 0: + first_major_claim = major_claims.pop(0) + + # Add relation between Claims and first MajorClaim + for claim_attribute in claim_attributes: + new_relation = BinaryRelation( + head=claim_attribute.annotation, + tail=first_major_claim, + label=attributions_to_relations_mapping[claim_attribute.value], + ) + new_document.binary_relations.append(new_relation) + + # Add relations between MajorClaims + for majorclaim in major_claims: + new_relation = BinaryRelation( + head=majorclaim, + tail=first_major_claim, + label=semantically_same_label, + ) + new_document.binary_relations.append(new_relation) + + elif method == "connect_all": + for major_claim in major_claims: + for claim_attribute in claim_attributes: + new_relation = BinaryRelation( + head=claim_attribute.annotation, + tail=major_claim, + label=attributions_to_relations_mapping[claim_attribute.value], + ) + new_document.binary_relations.append(new_relation) + + else: + raise ValueError(f"unknown method: {method}") + + return new_document + + +def get_common_pipeline_steps(conversion_method: str) -> dict: + return dict( + convert=Converter( + function=convert_aae2_claim_attributions_to_relations, + 
method=conversion_method, + ), + ) + + +class ArgumentAnnotatedEssaysV2Config(BratConfig): + def __init__(self, conversion_method: str, **kwargs): + """BuilderConfig for ArgumentAnnotatedEssaysV2. + + Args: + conversion_method: either "connect_first" or "connect_all", see convert_aae2_claim_attributions_to_relations + **kwargs: keyword arguments forwarded to super. + """ + super().__init__(**kwargs) + self.conversion_method = conversion_method + + +class ArgumentAnnotatedEssaysV2(BratBuilder): + BASE_DATASET_PATH = "DFKI-SLT/brat" + BASE_DATASET_REVISION = "bb8c37d84ddf2da1e691d226c55fef48fd8149b5" + + # we need to add None to the list of dataset variants to support the default dataset variant + BASE_BUILDER_KWARGS_DICT = { + dataset_variant: {"url": URL, "split_paths": SPLIT_PATHS} + for dataset_variant in ["default", "merge_fragmented_spans", None] + } + + BUILDER_CONFIGS = [ + ArgumentAnnotatedEssaysV2Config(name="default", conversion_method="connect_first"), + ArgumentAnnotatedEssaysV2Config( + name="merge_fragmented_spans", + merge_fragmented_spans=True, + conversion_method="connect_first", + ), + ] + + @property + def document_converters(self) -> DocumentConvertersType: + if self.config.name == "default": + # we do not support any auto-conversion for the default BratDocument for now + return {} + elif self.config.name == "merge_fragmented_spans": + return { + TextDocumentWithLabeledSpansAndBinaryRelations: Pipeline( + **get_common_pipeline_steps(conversion_method=self.config.conversion_method) + ), + TextDocumentWithLabeledSpansBinaryRelationsAndLabeledPartitions: Pipeline( + **get_common_pipeline_steps(conversion_method=self.config.conversion_method), + cast=Caster( + document_type=TextDocumentWithLabeledSpansBinaryRelationsAndLabeledPartitions + ), + add_partitions=RegexPartitioner( + partition_layer_name="labeled_partitions", + pattern="\n", + strip_whitespace=True, + verbose=False, + ), + ), + } + else: + raise ValueError(f"Unknown dataset variant: 
{self.config.name}") diff --git a/dataset_builders/pie/aae2/requirements.txt b/dataset_builders/pie/aae2/requirements.txt index e69de29b..0c3196c2 100644 --- a/dataset_builders/pie/aae2/requirements.txt +++ b/dataset_builders/pie/aae2/requirements.txt @@ -0,0 +1,2 @@ +pie-datasets>=0.6.0,<0.9.0 +pie-modules>=0.8.0,<0.9.0 diff --git a/tests/dataset_builders/pie/test_aae2.py b/tests/dataset_builders/pie/test_aae2.py index e69de29b..d1e90551 100644 --- a/tests/dataset_builders/pie/test_aae2.py +++ b/tests/dataset_builders/pie/test_aae2.py @@ -0,0 +1,650 @@ +from typing import List, Optional, Union + +import pytest +from datasets import disable_caching +from pie_modules.document.processing import tokenize_document +from pytorch_ie.documents import ( + TextDocumentWithLabeledSpansAndBinaryRelations, + TextDocumentWithLabeledSpansBinaryRelationsAndLabeledPartitions, +) +from transformers import AutoTokenizer, PreTrainedTokenizer + +from dataset_builders.pie.aae2.aae2 import ArgumentAnnotatedEssaysV2 +from pie_datasets import DatasetDict +from pie_datasets.builders.brat import BratDocument, BratDocumentWithMergedSpans +from tests.dataset_builders.common import ( + PIE_BASE_PATH, + TestTokenDocumentWithLabeledSpansAndBinaryRelations, + TestTokenDocumentWithLabeledSpansBinaryRelationsAndLabeledPartitions, +) + +disable_caching() + +DATASET_NAME = "aae2" +PIE_DATASET_PATH = PIE_BASE_PATH / DATASET_NAME +SPLIT_SIZES = {"test": 80, "train": 322} + + +@pytest.fixture( + scope="module", params=[config.name for config in ArgumentAnnotatedEssaysV2.BUILDER_CONFIGS] +) +def dataset_variant(request) -> str: + return request.param + + +@pytest.fixture(scope="module") +def dataset(dataset_variant) -> DatasetDict: + return DatasetDict.load_dataset(str(PIE_DATASET_PATH), name=dataset_variant) + + +def test_dataset(dataset): + assert dataset is not None + assert {name: len(ds) for name, ds in dataset.items()} == SPLIT_SIZES + + +def test_no_fragmented_spans(dataset, dataset_variant): + if 
dataset_variant == "default": + for split, docs in dataset.items(): + for doc in docs: + # test the number of slices of the LabeledMultiSpan annotations + assert all([len(span.slices) == 1 for span in doc.spans]) + + +@pytest.fixture(scope="module") +def document(dataset, dataset_variant) -> Union[BratDocument, BratDocumentWithMergedSpans]: + result = dataset["train"][0] + if dataset_variant == "default": + assert isinstance(result, BratDocument) + elif dataset_variant == "merge_fragmented_spans": + assert isinstance(result, BratDocumentWithMergedSpans) + else: + raise ValueError(f"Unknown dataset variant: {dataset_variant}") + return result + + +def test_document(document, dataset_variant): + assert document is not None + assert document.id == "essay001" + + # check the annotation + if dataset_variant == "default": + span_texts_labels_tuples = [ + (" ".join([document.text[start:end] for start, end in span.slices]), span.label) + for span in document.spans + ] + elif dataset_variant == "merge_fragmented_spans": + span_texts_labels_tuples = [(str(span), span.label) for span in document.spans] + + # check spans + assert len(document.spans) == 11 + assert span_texts_labels_tuples[0] == ( + "we should attach more importance to cooperation during primary education", + "MajorClaim", + ) + assert span_texts_labels_tuples[1] == ( + "a more cooperative attitudes towards life is more profitable in one's success", + "MajorClaim", + ) + assert span_texts_labels_tuples[2] == ( + "through cooperation, children can learn about interpersonal skills which are significant in the future life " + "of all students", + "Claim", + ) + assert span_texts_labels_tuples[3] == ( + "What we acquired from team work is not only how to achieve the same goal with others but more importantly, " + "how to get along with others", + "Premise", + ) + assert span_texts_labels_tuples[4] == ( + "During the process of cooperation, children can learn about how to listen to opinions of others, how to " + 
"communicate with others, how to think comprehensively, and even how to compromise with other team members " + "when conflicts occurred", + "Premise", + ) + assert span_texts_labels_tuples[5] == ( + "All of these skills help them to get on well with other people and will benefit them for the whole life", + "Premise", + ) + assert span_texts_labels_tuples[6] == ("competition makes the society more effective", "Claim") + assert span_texts_labels_tuples[7] == ( + "the significance of competition is that how to become more excellence to gain the victory", + "Premise", + ) + assert span_texts_labels_tuples[8] == ( + "when we consider about the question that how to win the game, we always find that we need the cooperation", + "Premise", + ) + assert span_texts_labels_tuples[9] == ( + "Take Olympic games which is a form of competition for instance, it is hard to imagine how an athlete could " + "win the game without the training of his or her coach, and the help of other professional staffs such as " + "the people who take care of his diet, and those who are in charge of the medical care", + "Premise", + ) + assert span_texts_labels_tuples[10] == ( + "without the cooperation, there would be no victory of competition", + "Claim", + ) + + # check relations + assert len(document.relations) == 6 + assert document.relations[0].label == "supports" + assert document.relations[0].head == document.spans[3] + assert document.relations[0].tail == document.spans[2] + assert document.relations[1].label == "supports" + assert document.relations[1].head == document.spans[4] + assert document.relations[1].tail == document.spans[2] + assert document.relations[2].label == "supports" + assert document.relations[2].head == document.spans[5] + assert document.relations[2].tail == document.spans[2] + assert document.relations[3].label == "supports" + assert document.relations[3].head == document.spans[9] + assert document.relations[3].tail == document.spans[10] + assert document.relations[4].label == "supports" + assert document.relations[4].head == document.spans[8] +
assert document.relations[4].tail == document.spans[10] + assert document.relations[5].label == "supports" + assert document.relations[5].head == document.spans[7] + # the tail is the Claim "competition makes the society more effective" (see the converted relations below) + assert document.relations[5].tail == document.spans[6] + + +@pytest.fixture(scope="module") +def dataset_of_text_documents_with_labeled_spans_and_binary_relations( + dataset, dataset_variant +) -> Optional[DatasetDict]: + if dataset_variant == "default": + with pytest.raises(ValueError) as excinfo: + dataset.to_document_type(TextDocumentWithLabeledSpansAndBinaryRelations) + assert ( + str(excinfo.value) + == "No valid key (either subclass or superclass) was found for the document type " + "'' in the " + "document_converters of the dataset. Available keys: set(). Consider adding a respective " + "converter to the dataset with dataset.register_document_converter(my_converter_method) " + "where my_converter_method should accept " + "as input and return ''." + ) + converted_dataset = None + elif dataset_variant == "merge_fragmented_spans": + converted_dataset = dataset.to_document_type( + TextDocumentWithLabeledSpansAndBinaryRelations + ) + else: + raise ValueError(f"Unknown dataset variant: {dataset_variant}") + return converted_dataset + + +def test_dataset_of_text_documents_with_labeled_spans_and_binary_relations( + dataset_of_text_documents_with_labeled_spans_and_binary_relations, +): + if dataset_of_text_documents_with_labeled_spans_and_binary_relations is not None: + # Check that the conversion is correct and the data makes sense + # get a document to check + doc = dataset_of_text_documents_with_labeled_spans_and_binary_relations["train"][0] + assert isinstance(doc, TextDocumentWithLabeledSpansAndBinaryRelations) + + entities = doc.labeled_spans + assert len(entities) == 11 + # sort the entities by their start position and convert them to tuples + sorted_entity_tuples = [ + (str(ent), ent.label) for ent in sorted(doc.labeled_spans, key=lambda ent: ent.start) + ] + assert sorted_entity_tuples[0] == ( + "we should attach more
importance to cooperation during primary education", + "MajorClaim", + ) + assert sorted_entity_tuples[1] == ( + "through cooperation, children can learn about interpersonal skills which are significant in the future life " + "of all students", + "Claim", + ) + assert sorted_entity_tuples[2] == ( + "What we acquired from team work is not only how to achieve the same goal with others but more importantly, " + "how to get along with others", + "Premise", + ) + assert sorted_entity_tuples[3] == ( + "During the process of cooperation, children can learn about how to listen to opinions of others, how to " + "communicate with others, how to think comprehensively, and even how to compromise with other team members " + "when conflicts occurred", + "Premise", + ) + assert sorted_entity_tuples[4] == ( + "All of these skills help them to get on well with other people and will benefit them for the whole life", + "Premise", + ) + assert sorted_entity_tuples[5] == ( + "the significance of competition is that how to become more excellence to gain the victory", + "Premise", + ) + assert sorted_entity_tuples[6] == ("competition makes the society more effective", "Claim") + assert sorted_entity_tuples[7] == ( + "when we consider about the question that how to win the game, we always find that we need the cooperation", + "Premise", + ) + assert sorted_entity_tuples[8] == ( + "Take Olympic games which is a form of competition for instance, it is hard to imagine how an athlete could " + "win the game without the training of his or her coach, and the help of other professional staffs such as " + "the people who take care of his diet, and those who are in charge of the medical care", + "Premise", + ) + assert sorted_entity_tuples[9] == ( + "without the cooperation, there would be no victory of competition", + "Claim", + ) + assert sorted_entity_tuples[10] == ( + "a more cooperative attitudes towards life is more profitable in one's success", + "MajorClaim", + ) + + # check the relations 
+ # for conversion_method="connect_first" + assert len(doc.binary_relations) == 10 + relation_tuples = [ + (str(rel.head), rel.label, str(rel.tail)) for rel in doc.binary_relations + ] + assert relation_tuples[0] == ( + "What we acquired from team work is not only how to achieve the same goal with others but more importantly, " + "how to get along with others", + "supports", + "through cooperation, children can learn about interpersonal skills which are significant in the future life " + "of all students", + ) + assert relation_tuples[1] == ( + "During the process of cooperation, children can learn about how to listen to opinions of others, how to " + "communicate with others, how to think comprehensively, and even how to compromise with other team members " + "when conflicts occurred", + "supports", + "through cooperation, children can learn about interpersonal skills which are significant in the future life " + "of all students", + ) + assert relation_tuples[2] == ( + "All of these skills help them to get on well with other people and will benefit them for the whole life", + "supports", + "through cooperation, children can learn about interpersonal skills which are significant in the future life " + "of all students", + ) + assert relation_tuples[3] == ( + "Take Olympic games which is a form of competition for instance, it is hard to imagine how an athlete could " + "win the game without the training of his or her coach, and the help of other professional staffs such as " + "the people who take care of his diet, and those who are in charge of the medical care", + "supports", + "without the cooperation, there would be no victory of competition", + ) + assert relation_tuples[4] == ( + "when we consider about the question that how to win the game, we always find that we need the cooperation", + "supports", + "without the cooperation, there would be no victory of competition", + ) + assert relation_tuples[5] == ( + "the significance of competition is that how to 
become more excellence to gain the victory", + "supports", + "competition makes the society more effective", + ) + assert relation_tuples[6] == ( + "through cooperation, children can learn about interpersonal skills which are significant in the future " + "life of all students", + "supports", + "we should attach more importance to cooperation during primary education", + ) + assert relation_tuples[7] == ( + "competition makes the society more effective", + "attacks", + "we should attach more importance to cooperation during primary education", + ) + assert relation_tuples[8] == ( + "without the cooperation, there would be no victory of competition", + "supports", + "we should attach more importance to cooperation during primary education", + ) + assert relation_tuples[9] == ( + "a more cooperative attitudes towards life is more profitable in one's success", + "semantically_same", + "we should attach more importance to cooperation during primary education", + ) + + +@pytest.fixture(scope="module") +def dataset_of_text_documents_with_labeled_spans_binary_relations_and_labeled_partitions( + dataset, dataset_variant +) -> Optional[DatasetDict]: + if dataset_variant == "default": + with pytest.raises(ValueError) as excinfo: + dataset.to_document_type( + TextDocumentWithLabeledSpansBinaryRelationsAndLabeledPartitions + ) + assert ( + str(excinfo.value) + == "No valid key (either subclass or superclass) was found for the document type " + "'' " + "in the document_converters of the dataset. Available keys: set(). Consider adding a respective " + "converter to the dataset with dataset.register_document_converter(my_converter_method) where " + "my_converter_method should accept as input and " + "return ''." 
+ ) + converted_dataset = None + elif dataset_variant == "merge_fragmented_spans": + converted_dataset = dataset.to_document_type( + TextDocumentWithLabeledSpansBinaryRelationsAndLabeledPartitions + ) + else: + raise ValueError(f"Unknown dataset variant: {dataset_variant}") + return converted_dataset + + +def test_dataset_of_text_documents_with_labeled_spans_binary_relations_and_labeled_partitions( + dataset_of_text_documents_with_labeled_spans_binary_relations_and_labeled_partitions, + dataset_of_text_documents_with_labeled_spans_and_binary_relations, +): + if ( + dataset_of_text_documents_with_labeled_spans_binary_relations_and_labeled_partitions + is not None + ): + # Check that the conversion is correct and the data makes sense + # get a document to check + doc_without_partitions = dataset_of_text_documents_with_labeled_spans_and_binary_relations[ + "train" + ][0] + doc_with_partitions = ( + dataset_of_text_documents_with_labeled_spans_binary_relations_and_labeled_partitions[ + "train" + ][0] + ) + assert isinstance(doc_without_partitions, TextDocumentWithLabeledSpansAndBinaryRelations) + assert isinstance( + doc_with_partitions, TextDocumentWithLabeledSpansBinaryRelationsAndLabeledPartitions + ) + + partitions = doc_with_partitions.labeled_partitions + assert len(partitions) == 5 + assert all(partition.label == "partition" for partition in partitions) + assert str(partitions[0]) == "Should students be taught to compete or to cooperate?" + assert ( + str(partitions[1]) + == "It is always said that competition can effectively promote the development of economy. In order to " + "survive in the competition, companies continue to improve their products and service, and as a result, " + "the whole society prospers. However, when we discuss the issue of competition or cooperation, what " + "we are concerned about is not the whole society, but the development of an individual's whole life.
" + "From this point of view, I firmly believe that we should attach more importance to cooperation during " + "primary education." + ) + assert ( + str(partitions[2]) + == "First of all, through cooperation, children can learn about interpersonal skills which are " + "significant in the future life of all students. What we acquired from team work is not only how to " + "achieve the same goal with others but more importantly, how to get along with others. During the " + "process of cooperation, children can learn about how to listen to opinions of others, how to " + "communicate with others, how to think comprehensively, and even how to compromise with other team " + "members when conflicts occurred. All of these skills help them to get on well with other people and " + "will benefit them for the whole life." + ) + assert ( + str(partitions[3]) + == "On the other hand, the significance of competition is that how to become more excellence to gain the " + "victory. Hence it is always said that competition makes the society more effective. However, when we " + "consider about the question that how to win the game, we always find that we need the cooperation. " + "The greater our goal is, the more competition we need. Take Olympic games which is a form of " + "competition for instance, it is hard to imagine how an athlete could win the game without the " + "training of his or her coach, and the help of other professional staffs such as the people who take " + "care of his diet, and those who are in charge of the medical care. The winner is the athlete but the " + "success belongs to the whole team. Therefore without the cooperation, there would be no victory of " + "competition." + ) + assert ( + str(partitions[4]) + == "Consequently, no matter from the view of individual development or the relationship between " + "competition and cooperation we can receive the same conclusion that a more cooperative attitudes " + "towards life is more profitable in one's success." 
+ ) + + # check the entities + assert doc_with_partitions.labeled_spans == doc_without_partitions.labeled_spans + + # check the relations + assert doc_with_partitions.binary_relations == doc_without_partitions.binary_relations + + +@pytest.fixture(scope="module") +def tokenizer() -> PreTrainedTokenizer: + return AutoTokenizer.from_pretrained("bert-base-uncased") + + +@pytest.fixture(scope="module") +def tokenized_documents_with_labeled_spans_and_binary_relations( + dataset_of_text_documents_with_labeled_spans_and_binary_relations, tokenizer +) -> Optional[List[TestTokenDocumentWithLabeledSpansAndBinaryRelations]]: + if dataset_of_text_documents_with_labeled_spans_and_binary_relations is None: + return None + + # get a document to check + doc = dataset_of_text_documents_with_labeled_spans_and_binary_relations["train"][0] + # Note, that this is a list of documents, because the document may be split into chunks + # if the input text is too long. + tokenized_docs = tokenize_document( + doc, + tokenizer=tokenizer, + return_overflowing_tokens=True, + result_document_type=TestTokenDocumentWithLabeledSpansAndBinaryRelations, + strict_span_conversion=True, + verbose=True, + ) + return tokenized_docs + + +def test_tokenized_documents_with_labeled_spans_and_binary_relations( + tokenized_documents_with_labeled_spans_and_binary_relations, +): + if tokenized_documents_with_labeled_spans_and_binary_relations is not None: + docs = tokenized_documents_with_labeled_spans_and_binary_relations + # check that the tokenization was fine + assert len(docs) == 1 + doc = docs[0] + assert len(doc.labeled_spans) == 11 + assert len(doc.binary_relations) == 10 + assert len(doc.tokens) == 427 + # Check the first ten tokens + assert doc.tokens[:10] == ( + "[CLS]", + "should", + "students", + "be", + "taught", + "to", + "compete", + "or", + "to", + "cooperate", + ) + # sort the entities by their start position + sorted_entities = sorted(doc.labeled_spans, key=lambda ent: ent.start) + assert ( + 
str(sorted_entities[0]) + == "('we', 'should', 'attach', 'more', 'importance', 'to', 'cooperation', 'during', 'primary', 'education')" + ) + assert ( + str(sorted_entities[1]) + == "('through', 'cooperation', ',', 'children', 'can', 'learn', 'about', 'inter', '##personal', 'skills', " + "'which', 'are', 'significant', 'in', 'the', 'future', 'life', 'of', 'all', 'students')" + ) + assert ( + str(sorted_entities[2]) + == "('what', 'we', 'acquired', 'from', 'team', 'work', 'is', 'not', 'only', 'how', 'to', 'achieve', 'the', " + "'same', 'goal', 'with', 'others', 'but', 'more', 'importantly', ',', 'how', 'to', 'get', 'along', " + "'with', 'others')" + ) + assert ( + str(sorted_entities[3]) + == "('during', 'the', 'process', 'of', 'cooperation', ',', 'children', 'can', 'learn', 'about', 'how', 'to', " + "'listen', 'to', 'opinions', 'of', 'others', ',', 'how', 'to', 'communicate', 'with', 'others', ',', " + "'how', 'to', 'think', 'comprehensive', '##ly', ',', 'and', 'even', 'how', 'to', 'compromise', 'with', " + "'other', 'team', 'members', 'when', 'conflicts', 'occurred')" + ) + assert ( + str(sorted_entities[4]) + == "('all', 'of', 'these', 'skills', 'help', 'them', 'to', 'get', 'on', 'well', 'with', 'other', 'people', " + "'and', 'will', 'benefit', 'them', 'for', 'the', 'whole', 'life')" + ) + assert ( + str(sorted_entities[5]) + == "('the', 'significance', 'of', 'competition', 'is', 'that', 'how', 'to', 'become', 'more', 'excellence', " + "'to', 'gain', 'the', 'victory')" + ) + assert ( + str(sorted_entities[6]) + == "('competition', 'makes', 'the', 'society', 'more', 'effective')" + ) + assert ( + str(sorted_entities[7]) + == "('when', 'we', 'consider', 'about', 'the', 'question', 'that', 'how', 'to', 'win', 'the', 'game', ',', " + "'we', 'always', 'find', 'that', 'we', 'need', 'the', 'cooperation')" + ) + assert ( + str(sorted_entities[8]) + == "('take', 'olympic', 'games', 'which', 'is', 'a', 'form', 'of', 'competition', 'for', 'instance', ',', " + "'it', 'is', 
'hard', 'to', 'imagine', 'how', 'an', 'athlete', 'could', 'win', 'the', 'game', 'without', " + "'the', 'training', 'of', 'his', 'or', 'her', 'coach', ',', 'and', 'the', 'help', 'of', 'other', " + "'professional', 'staff', '##s', 'such', 'as', 'the', 'people', 'who', 'take', 'care', 'of', 'his', " + "'diet', ',', 'and', 'those', 'who', 'are', 'in', 'charge', 'of', 'the', 'medical', 'care')" + ) + assert ( + str(sorted_entities[9]) + == "('without', 'the', 'cooperation', ',', 'there', 'would', 'be', 'no', 'victory', 'of', 'competition')" + ) + assert ( + str(sorted_entities[10]) + == "('a', 'more', 'cooperative', 'attitudes', 'towards', 'life', 'is', 'more', 'profitable', " + "'in', 'one', \"'\", 's', 'success')" + ) + + +def test_tokenized_documents_with_entities_and_relations_all( + dataset_of_text_documents_with_labeled_spans_and_binary_relations, tokenizer, dataset_variant +): + if dataset_of_text_documents_with_labeled_spans_and_binary_relations is not None: + for ( + split, + docs, + ) in dataset_of_text_documents_with_labeled_spans_and_binary_relations.items(): + for doc in docs: + # Note, that this is a list of documents, because the document may be split into chunks + # if the input text is too long. 
+ tokenized_docs = tokenize_document( + doc, + tokenizer=tokenizer, + return_overflowing_tokens=True, + result_document_type=TestTokenDocumentWithLabeledSpansAndBinaryRelations, + strict_span_conversion=True, + verbose=True, + ) + # we just ensure that we get at least one tokenized document + assert tokenized_docs is not None + assert len(tokenized_docs) > 0 + + +@pytest.fixture(scope="module") +def tokenized_documents_with_labeled_spans_binary_relations_and_labeled_partitions( + dataset_of_text_documents_with_labeled_spans_binary_relations_and_labeled_partitions, tokenizer +) -> List[TestTokenDocumentWithLabeledSpansBinaryRelationsAndLabeledPartitions]: + if ( + dataset_of_text_documents_with_labeled_spans_binary_relations_and_labeled_partitions + is not None + ): + # get a document to check + doc = dataset_of_text_documents_with_labeled_spans_binary_relations_and_labeled_partitions[ + "train" + ][0] + # Note, that this is a list of documents, because the document may be split into chunks + # if the input text is too long. 
+ tokenized_docs = tokenize_document( + doc, + tokenizer=tokenizer, + partition_layer="labeled_partitions", + return_overflowing_tokens=True, + result_document_type=TestTokenDocumentWithLabeledSpansBinaryRelationsAndLabeledPartitions, + strict_span_conversion=False, + verbose=True, + ) + return tokenized_docs + + +def test_tokenized_documents_with_labeled_spans_binary_relations_and_labeled_partitions( + tokenized_documents_with_labeled_spans_binary_relations_and_labeled_partitions, + tokenized_documents_with_labeled_spans_and_binary_relations, +): + if tokenized_documents_with_labeled_spans_binary_relations_and_labeled_partitions is not None: + docs_with_partitions = ( + tokenized_documents_with_labeled_spans_binary_relations_and_labeled_partitions + ) + + # check that the tokenization was fine + assert len(docs_with_partitions) == 5 + doc_with_partitions = docs_with_partitions[0] + assert len(doc_with_partitions.labeled_partitions) == 1 + assert len(doc_with_partitions.labeled_spans) == 0 + assert len(doc_with_partitions.binary_relations) == 0 + assert doc_with_partitions.tokens == ( + "[CLS]", + "should", + "students", + "be", + "taught", + "to", + "compete", + "or", + "to", + "cooperate", + "?", + "[SEP]", + ) + + +def test_tokenized_documents_with_entities_relations_and_partitions_all( + dataset_of_text_documents_with_labeled_spans_binary_relations_and_labeled_partitions, tokenizer +): + if ( + dataset_of_text_documents_with_labeled_spans_binary_relations_and_labeled_partitions + is not None + ): + for ( + split, + docs, + ) in ( + dataset_of_text_documents_with_labeled_spans_binary_relations_and_labeled_partitions.items() + ): + for doc in docs: + # Note, that this is a list of documents, because the document may be split into chunks + # if the input text is too long. 
+ tokenized_docs = tokenize_document( + doc, + tokenizer=tokenizer, + partition_layer="labeled_partitions", + return_overflowing_tokens=True, + result_document_type=TestTokenDocumentWithLabeledSpansBinaryRelationsAndLabeledPartitions, + strict_span_conversion=False, + verbose=True, + ) + # we just ensure that we get at least one tokenized document + assert tokenized_docs is not None + assert len(tokenized_docs) > 0 + for tokenized_doc in tokenized_docs: + assert tokenized_doc.labeled_partitions is not None + # We use the partitions to partition the input, so each tokenized + # document should have exactly one partition annotation. + assert len(tokenized_doc.labeled_partitions) == 1 + + +def test_document_converters(dataset_variant): + builder = ArgumentAnnotatedEssaysV2(config_name=dataset_variant) + document_converters = builder.document_converters + + if dataset_variant == "default": + assert document_converters == {} + elif dataset_variant == "merge_fragmented_spans": + assert len(document_converters) == 2 + assert set(document_converters) == { + TextDocumentWithLabeledSpansAndBinaryRelations, + TextDocumentWithLabeledSpansBinaryRelationsAndLabeledPartitions, + } + assert all(callable(v) for k, v in document_converters.items()) + else: + raise ValueError(f"Unknown dataset variant: {dataset_variant}") From 6d1f06d3898e1d1c74c75b0858b2afaf9486ffed Mon Sep 17 00:00:00 2001 From: Ruangrin L <88072261+idalr@users.noreply.github.com> Date: Tue, 26 Dec 2023 12:26:39 +0100 Subject: [PATCH 03/14] added test for conversion_method --- tests/dataset_builders/pie/test_aae2.py | 25 ++++++++++++++++++++----- 1 file changed, 20 insertions(+), 5 deletions(-) diff --git a/tests/dataset_builders/pie/test_aae2.py b/tests/dataset_builders/pie/test_aae2.py index d1e90551..08d41c1a 100644 --- a/tests/dataset_builders/pie/test_aae2.py +++ b/tests/dataset_builders/pie/test_aae2.py @@ -9,7 +9,10 @@ ) from transformers import AutoTokenizer, PreTrainedTokenizer -from 
dataset_builders.pie.aae2.aae2 import ArgumentAnnotatedEssaysV2 +from dataset_builders.pie.aae2.aae2 import ( + ArgumentAnnotatedEssaysV2, + convert_aae2_claim_attributions_to_relations, +) from pie_datasets import DatasetDict from pie_datasets.builders.brat import BratDocument, BratDocumentWithMergedSpans from tests.dataset_builders.common import ( @@ -21,13 +24,12 @@ disable_caching() DATASET_NAME = "aae2" +BUILDER_CLASS = ArgumentAnnotatedEssaysV2 PIE_DATASET_PATH = PIE_BASE_PATH / DATASET_NAME SPLIT_SIZES = {"test": 80, "train": 322} -@pytest.fixture( - scope="module", params=[config.name for config in ArgumentAnnotatedEssaysV2.BUILDER_CONFIGS] -) +@pytest.fixture(scope="module", params=[config.name for config in BUILDER_CLASS.BUILDER_CONFIGS]) def dataset_variant(request) -> str: return request.param @@ -173,6 +175,19 @@ def dataset_of_text_documents_with_labeled_spans_and_binary_relations( return converted_dataset +@pytest.mark.parametrize("method", ["connect_first", "connect_all"]) +def test_convert_aae2_claim_attributions_to_relations_all(document, method): + if dataset_variant == "merge_fragmented_spans": + converted_doc = convert_aae2_claim_attributions_to_relations(document, method) + converted_binary_relations = converted_doc.binary_relations + if method == "connect_first": + assert len(converted_binary_relations) == 10 + elif method == "connect_all": + assert len(converted_binary_relations) == 12 + else: + raise ValueError(f"Unknown method: {method}") + + def test_dataset_of_text_documents_with_labeled_spans_and_binary_relations( dataset_of_text_documents_with_labeled_spans_and_binary_relations, ): @@ -634,7 +649,7 @@ def test_tokenized_documents_with_entities_relations_and_partitions_all( def test_document_converters(dataset_variant): - builder = ArgumentAnnotatedEssaysV2(config_name=dataset_variant) + builder = BUILDER_CLASS(config_name=dataset_variant) document_converters = builder.document_converters if dataset_variant == "default": From 
3beaeecb0105c1cc179c7d8c1276a08cf9e53473 Mon Sep 17 00:00:00 2001 From: Ruangrin L <88072261+idalr@users.noreply.github.com> Date: Tue, 2 Jan 2024 23:33:45 +0100 Subject: [PATCH 04/14] converted to single dataset_variant --- dataset_builders/pie/aae2/aae2.py | 14 +- tests/dataset_builders/pie/test_aae2.py | 781 +++++++++++------------- 2 files changed, 371 insertions(+), 424 deletions(-) diff --git a/dataset_builders/pie/aae2/aae2.py b/dataset_builders/pie/aae2/aae2.py index fa44ea63..5b43a075 100644 --- a/dataset_builders/pie/aae2/aae2.py +++ b/dataset_builders/pie/aae2/aae2.py @@ -142,24 +142,24 @@ class ArgumentAnnotatedEssaysV2(BratBuilder): # we need to add None to the list of dataset variants to support the default dataset variant BASE_BUILDER_KWARGS_DICT = { dataset_variant: {"url": URL, "split_paths": SPLIT_PATHS} - for dataset_variant in ["default", "merge_fragmented_spans", None] + for dataset_variant in ["default", None] } BUILDER_CONFIGS = [ - ArgumentAnnotatedEssaysV2Config(name="default", conversion_method="connect_first"), ArgumentAnnotatedEssaysV2Config( - name="merge_fragmented_spans", + name=BratBuilder.DEFAULT_CONFIG_NAME, merge_fragmented_spans=True, conversion_method="connect_first", ), ] + DOCUMENT_TYPES = { + BratBuilder.DEFAULT_CONFIG_NAME: BratDocumentWithMergedSpans, + } + @property def document_converters(self) -> DocumentConvertersType: - if self.config.name == "default": - # we do not support any auto-conversion for the default BratDocument for now - return {} - elif self.config.name == "merge_fragmented_spans": + if self.config.name == "default" or None: return { TextDocumentWithLabeledSpansAndBinaryRelations: Pipeline( **get_common_pipeline_steps(conversion_method=self.config.conversion_method) diff --git a/tests/dataset_builders/pie/test_aae2.py b/tests/dataset_builders/pie/test_aae2.py index 08d41c1a..cf31c863 100644 --- a/tests/dataset_builders/pie/test_aae2.py +++ b/tests/dataset_builders/pie/test_aae2.py @@ -1,8 +1,9 @@ -from 
typing import List, Optional, Union +from typing import List import pytest from datasets import disable_caching from pie_modules.document.processing import tokenize_document +from pytorch_ie.core import Document from pytorch_ie.documents import ( TextDocumentWithLabeledSpansAndBinaryRelations, TextDocumentWithLabeledSpansBinaryRelationsAndLabeledPartitions, @@ -14,7 +15,7 @@ convert_aae2_claim_attributions_to_relations, ) from pie_datasets import DatasetDict -from pie_datasets.builders.brat import BratDocument, BratDocumentWithMergedSpans +from pie_datasets.builders.brat import BratDocumentWithMergedSpans from tests.dataset_builders.common import ( PIE_BASE_PATH, TestTokenDocumentWithLabeledSpansAndBinaryRelations, @@ -44,23 +45,25 @@ def test_dataset(dataset): assert {name: len(ds) for name, ds in dataset.items()} == SPLIT_SIZES -def test_no_fragmented_spans(dataset, dataset_variant): - if dataset_variant == "default": - for split, docs in dataset.items(): - for doc in docs: - # test the number of slices of the LabeledMultiSpan annotations - assert all([len(span.slices) == 1 for span in doc.spans]) +@pytest.fixture(scope="module") +def builder(dataset_variant) -> BUILDER_CLASS: + return BUILDER_CLASS(config_name=dataset_variant) + + +def test_builder(builder, dataset_variant): + assert builder is not None + assert builder.config_id == dataset_variant + assert builder.dataset_name == DATASET_NAME + assert builder.document_type == BratDocumentWithMergedSpans @pytest.fixture(scope="module") -def document(dataset, dataset_variant) -> Union[BratDocument, BratDocumentWithMergedSpans]: +def document(dataset) -> BratDocumentWithMergedSpans: result = dataset["train"][0] - if dataset_variant == "default": - assert isinstance(result, BratDocument) - elif dataset_variant == "merge_fragmented_spans": - assert isinstance(result, BratDocumentWithMergedSpans) - else: - raise ValueError(f"Unknown dataset variant: {dataset_variant}") + # we can not assert the real document type 
because it may come from a dataset loading script + # downloaded to a temporary directory and thus have a different type object, although it is + # semantically the same + assert isinstance(result, Document) return result @@ -68,17 +71,9 @@ def test_document(document, dataset_variant): assert document is not None assert document.id == "essay001" - # check the annotation - if dataset_variant == "default": - span_texts_labels_tuples = [ - (" ".join([document.text[start:end] for start, end in span.slices]), span.label) - for span in document.spans - ] - elif dataset_variant == "merge_fragmented_spans": - span_texts_labels_tuples = [(str(span), span.label) for span in document.spans] - # check spans assert len(document.spans) == 11 + span_texts_labels_tuples = [(str(span), span.label) for span in document.spans] assert span_texts_labels_tuples[0] == ( "we should attach more importance to cooperation during primary education", "MajorClaim", @@ -152,21 +147,8 @@ def test_document(document, dataset_variant): @pytest.fixture(scope="module") def dataset_of_text_documents_with_labeled_spans_and_binary_relations( dataset, dataset_variant -) -> Optional[DatasetDict]: +) -> DatasetDict: if dataset_variant == "default": - with pytest.raises(ValueError) as excinfo: - dataset.to_document_type(TextDocumentWithLabeledSpansAndBinaryRelations) - assert ( - str(excinfo.value) - == "No valid key (either subclass or superclass) was found for the document type " - "'' in the " - "document_converters of the dataset. Available keys: set(). Consider adding a respective " - "converter to the dataset with dataset.register_document_converter(my_converter_method) " - "where my_converter_method should accept " - "as input and return ''." 
- ) - converted_dataset = None - elif dataset_variant == "merge_fragmented_spans": converted_dataset = dataset.to_document_type( TextDocumentWithLabeledSpansAndBinaryRelations ) @@ -177,7 +159,7 @@ def dataset_of_text_documents_with_labeled_spans_and_binary_relations( @pytest.mark.parametrize("method", ["connect_first", "connect_all"]) def test_convert_aae2_claim_attributions_to_relations_all(document, method): - if dataset_variant == "merge_fragmented_spans": + if dataset_variant == "default" or None: converted_doc = convert_aae2_claim_attributions_to_relations(document, method) converted_binary_relations = converted_doc.binary_relations if method == "connect_first": @@ -191,154 +173,136 @@ def test_convert_aae2_claim_attributions_to_relations_all(document, method): def test_dataset_of_text_documents_with_labeled_spans_and_binary_relations( dataset_of_text_documents_with_labeled_spans_and_binary_relations, ): - if dataset_of_text_documents_with_labeled_spans_and_binary_relations is not None: - # Check that the conversion is correct and the data makes sense - # get a document to check - doc = dataset_of_text_documents_with_labeled_spans_and_binary_relations["train"][0] - assert isinstance(doc, TextDocumentWithLabeledSpansAndBinaryRelations) - - entities = doc.labeled_spans - assert len(entities) == 11 - # sort the entities by their start position and convert them to tuples - sorted_entity_tuples = [ - (str(ent), ent.label) for ent in sorted(doc.labeled_spans, key=lambda ent: ent.start) - ] - assert sorted_entity_tuples[0] == ( - "we should attach more importance to cooperation during primary education", - "MajorClaim", - ) - assert sorted_entity_tuples[1] == ( - "through cooperation, children can learn about interpersonal skills which are significant in the future life " - "of all students", - "Claim", - ) - assert sorted_entity_tuples[2] == ( - "What we acquired from team work is not only how to achieve the same goal with others but more importantly, " - "how to 
get along with others", - "Premise", - ) - assert sorted_entity_tuples[3] == ( - "During the process of cooperation, children can learn about how to listen to opinions of others, how to " - "communicate with others, how to think comprehensively, and even how to compromise with other team members " - "when conflicts occurred", - "Premise", - ) - assert sorted_entity_tuples[4] == ( - "All of these skills help them to get on well with other people and will benefit them for the whole life", - "Premise", - ) - assert sorted_entity_tuples[5] == ( - "the significance of competition is that how to become more excellence to gain the victory", - "Premise", - ) - assert sorted_entity_tuples[6] == ("competition makes the society more effective", "Claim") - assert sorted_entity_tuples[7] == ( - "when we consider about the question that how to win the game, we always find that we need the cooperation", - "Premise", - ) - assert sorted_entity_tuples[8] == ( - "Take Olympic games which is a form of competition for instance, it is hard to imagine how an athlete could " - "win the game without the training of his or her coach, and the help of other professional staffs such as " - "the people who take care of his diet, and those who are in charge of the medical care", - "Premise", - ) - assert sorted_entity_tuples[9] == ( - "without the cooperation, there would be no victory of competition", - "Claim", - ) - assert sorted_entity_tuples[10] == ( - "a more cooperative attitudes towards life is more profitable in one's success", - "MajorClaim", - ) + # Check that the conversion is correct and the data makes sense + # get a document to check + doc = dataset_of_text_documents_with_labeled_spans_and_binary_relations["train"][0] + assert isinstance(doc, TextDocumentWithLabeledSpansAndBinaryRelations) + + # check the entities + entities = doc.labeled_spans + assert len(entities) == 11 + # sort the entities by their start position and convert them to tuples + sorted_entity_tuples = [ + 
(str(ent), ent.label) for ent in sorted(doc.labeled_spans, key=lambda ent: ent.start) + ] + assert sorted_entity_tuples[0] == ( + "we should attach more importance to cooperation during primary education", + "MajorClaim", + ) + assert sorted_entity_tuples[1] == ( + "through cooperation, children can learn about interpersonal skills which are significant in the future life " + "of all students", + "Claim", + ) + assert sorted_entity_tuples[2] == ( + "What we acquired from team work is not only how to achieve the same goal with others but more importantly, " + "how to get along with others", + "Premise", + ) + assert sorted_entity_tuples[3] == ( + "During the process of cooperation, children can learn about how to listen to opinions of others, how to " + "communicate with others, how to think comprehensively, and even how to compromise with other team members " + "when conflicts occurred", + "Premise", + ) + assert sorted_entity_tuples[4] == ( + "All of these skills help them to get on well with other people and will benefit them for the whole life", + "Premise", + ) + assert sorted_entity_tuples[5] == ( + "the significance of competition is that how to become more excellence to gain the victory", + "Premise", + ) + assert sorted_entity_tuples[6] == ("competition makes the society more effective", "Claim") + assert sorted_entity_tuples[7] == ( + "when we consider about the question that how to win the game, we always find that we need the cooperation", + "Premise", + ) + assert sorted_entity_tuples[8] == ( + "Take Olympic games which is a form of competition for instance, it is hard to imagine how an athlete could " + "win the game without the training of his or her coach, and the help of other professional staffs such as " + "the people who take care of his diet, and those who are in charge of the medical care", + "Premise", + ) + assert sorted_entity_tuples[9] == ( + "without the cooperation, there would be no victory of competition", + "Claim", + ) + assert 
sorted_entity_tuples[10] == ( + "a more cooperative attitudes towards life is more profitable in one's success", + "MajorClaim", + ) - # check the relations - # for conversion_method="connect_first" - assert len(doc.binary_relations) == 10 - relation_tuples = [ - (str(rel.head), rel.label, str(rel.tail)) for rel in doc.binary_relations - ] - assert relation_tuples[0] == ( - "What we acquired from team work is not only how to achieve the same goal with others but more importantly, " - "how to get along with others", - "supports", - "through cooperation, children can learn about interpersonal skills which are significant in the future life " - "of all students", - ) - assert relation_tuples[1] == ( - "During the process of cooperation, children can learn about how to listen to opinions of others, how to " - "communicate with others, how to think comprehensively, and even how to compromise with other team members " - "when conflicts occurred", - "supports", - "through cooperation, children can learn about interpersonal skills which are significant in the future life " - "of all students", - ) - assert relation_tuples[2] == ( - "All of these skills help them to get on well with other people and will benefit them for the whole life", - "supports", - "through cooperation, children can learn about interpersonal skills which are significant in the future life " - "of all students", - ) - assert relation_tuples[3] == ( - "Take Olympic games which is a form of competition for instance, it is hard to imagine how an athlete could " - "win the game without the training of his or her coach, and the help of other professional staffs such as " - "the people who take care of his diet, and those who are in charge of the medical care", - "supports", - "without the cooperation, there would be no victory of competition", - ) - assert relation_tuples[4] == ( - "when we consider about the question that how to win the game, we always find that we need the cooperation", - "supports", - 
"without the cooperation, there would be no victory of competition", - ) - assert relation_tuples[5] == ( - "the significance of competition is that how to become more excellence to gain the victory", - "supports", - "competition makes the society more effective", - ) - assert relation_tuples[6] == ( - "through cooperation, children can learn about interpersonal skills which are significant in the future " - "life of all students", - "supports", - "we should attach more importance to cooperation during primary education", - ) - assert relation_tuples[7] == ( - "competition makes the society more effective", - "attacks", - "we should attach more importance to cooperation during primary education", - ) - assert relation_tuples[8] == ( - "without the cooperation, there would be no victory of competition", - "supports", - "we should attach more importance to cooperation during primary education", - ) - assert relation_tuples[9] == ( - "a more cooperative attitudes towards life is more profitable in one's success", - "semantically_same", - "we should attach more importance to cooperation during primary education", - ) + # check the relations + # for conversion_method="connect_first" + assert len(doc.binary_relations) == 10 + relation_tuples = [(str(rel.head), rel.label, str(rel.tail)) for rel in doc.binary_relations] + assert relation_tuples[0] == ( + "What we acquired from team work is not only how to achieve the same goal with others but more importantly, " + "how to get along with others", + "supports", + "through cooperation, children can learn about interpersonal skills which are significant in the future life " + "of all students", + ) + assert relation_tuples[1] == ( + "During the process of cooperation, children can learn about how to listen to opinions of others, how to " + "communicate with others, how to think comprehensively, and even how to compromise with other team members " + "when conflicts occurred", + "supports", + "through cooperation, children can 
learn about interpersonal skills which are significant in the future life " + "of all students", + ) + assert relation_tuples[2] == ( + "All of these skills help them to get on well with other people and will benefit them for the whole life", + "supports", + "through cooperation, children can learn about interpersonal skills which are significant in the future life " + "of all students", + ) + assert relation_tuples[3] == ( + "Take Olympic games which is a form of competition for instance, it is hard to imagine how an athlete could " + "win the game without the training of his or her coach, and the help of other professional staffs such as " + "the people who take care of his diet, and those who are in charge of the medical care", + "supports", + "without the cooperation, there would be no victory of competition", + ) + assert relation_tuples[4] == ( + "when we consider about the question that how to win the game, we always find that we need the cooperation", + "supports", + "without the cooperation, there would be no victory of competition", + ) + assert relation_tuples[5] == ( + "the significance of competition is that how to become more excellence to gain the victory", + "supports", + "competition makes the society more effective", + ) + assert relation_tuples[6] == ( + "through cooperation, children can learn about interpersonal skills which are significant in the future " + "life of all students", + "supports", + "we should attach more importance to cooperation during primary education", + ) + assert relation_tuples[7] == ( + "competition makes the society more effective", + "attacks", + "we should attach more importance to cooperation during primary education", + ) + assert relation_tuples[8] == ( + "without the cooperation, there would be no victory of competition", + "supports", + "we should attach more importance to cooperation during primary education", + ) + assert relation_tuples[9] == ( + "a more cooperative attitudes towards life is more profitable in 
one's success", + "semantically_same", + "we should attach more importance to cooperation during primary education", + ) @pytest.fixture(scope="module") def dataset_of_text_documents_with_labeled_spans_binary_relations_and_labeled_partitions( dataset, dataset_variant -) -> Optional[DatasetDict]: +) -> DatasetDict: if dataset_variant == "default": - with pytest.raises(ValueError) as excinfo: - dataset.to_document_type( - TextDocumentWithLabeledSpansBinaryRelationsAndLabeledPartitions - ) - assert ( - str(excinfo.value) - == "No valid key (either subclass or superclass) was found for the document type " - "'' " - "in the document_converters of the dataset. Available keys: set(). Consider adding a respective " - "converter to the dataset with dataset.register_document_converter(my_converter_method) where " - "my_converter_method should accept as input and " - "return ''." - ) - converted_dataset = None - elif dataset_variant == "merge_fragmented_spans": converted_dataset = dataset.to_document_type( TextDocumentWithLabeledSpansBinaryRelationsAndLabeledPartitions ) @@ -351,72 +315,68 @@ def test_dataset_of_text_documents_with_labeled_spans_binary_relations_and_label dataset_of_text_documents_with_labeled_spans_binary_relations_and_labeled_partitions, dataset_of_text_documents_with_labeled_spans_and_binary_relations, ): - if ( - dataset_of_text_documents_with_labeled_spans_binary_relations_and_labeled_partitions - is not None - ): - # Check that the conversion is correct and the data makes sense - # get a document to check - doc_without_partitions = dataset_of_text_documents_with_labeled_spans_and_binary_relations[ + # Check that the conversion is correct and the data makes sense + # get a document to check + doc_without_partitions = dataset_of_text_documents_with_labeled_spans_and_binary_relations[ + "train" + ][0] + doc_with_partitions = ( + dataset_of_text_documents_with_labeled_spans_binary_relations_and_labeled_partitions[ "train" ][0] - doc_with_partitions = ( - 
dataset_of_text_documents_with_labeled_spans_binary_relations_and_labeled_partitions[ - "train" - ][0] - ) - assert isinstance(doc_without_partitions, TextDocumentWithLabeledSpansAndBinaryRelations) - assert isinstance( - doc_with_partitions, TextDocumentWithLabeledSpansBinaryRelationsAndLabeledPartitions - ) + ) + assert isinstance(doc_without_partitions, TextDocumentWithLabeledSpansAndBinaryRelations) + assert isinstance( + doc_with_partitions, TextDocumentWithLabeledSpansBinaryRelationsAndLabeledPartitions + ) - partitions = doc_with_partitions.labeled_partitions - assert len(partitions) == 5 - assert [partition.label == "partition" for partition in partitions] - assert str(partitions[0]) == "Should students be taught to compete or to cooperate?" - assert ( - str(partitions[1]) - == "It is always said that competition can effectively promote the development of economy. In order to " - "survive in the competition, companies continue to improve their products and service, and as a result, " - "the whole society prospers. However, when we discuss the issue of competition or cooperation, what " - "we are concerned about is not the whole society, but the development of an individual's whole life. " - "From this point of view, I firmly believe that we should attach more importance to cooperation during " - "primary education." - ) - assert ( - str(partitions[2]) - == "First of all, through cooperation, children can learn about interpersonal skills which are " - "significant in the future life of all students. What we acquired from team work is not only how to " - "achieve the same goal with others but more importantly, how to get along with others. During the " - "process of cooperation, children can learn about how to listen to opinions of others, how to " - "communicate with others, how to think comprehensively, and even how to compromise with other team " - "members when conflicts occurred. 
All of these skills help them to get on well with other people and " - "will benefit them for the whole life." - ) - assert ( - str(partitions[3]) - == "On the other hand, the significance of competition is that how to become more excellence to gain the " - "victory. Hence it is always said that competition makes the society more effective. However, when we " - "consider about the question that how to win the game, we always find that we need the cooperation. " - "The greater our goal is, the more competition we need. Take Olympic games which is a form of " - "competition for instance, it is hard to imagine how an athlete could win the game without the " - "training of his or her coach, and the help of other professional staffs such as the people who take " - "care of his diet, and those who are in charge of the medical care. The winner is the athlete but the " - "success belongs to the whole team. Therefore without the cooperation, there would be no victory of " - "competition." - ) - assert ( - str(partitions[4]) - == "Consequently, no matter from the view of individual development or the relationship between " - "competition and cooperation we can receive the same conclusion that a more cooperative attitudes " - "towards life is more profitable in one's success." - ) + partitions = doc_with_partitions.labeled_partitions + assert len(partitions) == 5 + assert [partition.label == "partition" for partition in partitions] + assert str(partitions[0]) == "Should students be taught to compete or to cooperate?" + assert ( + str(partitions[1]) + == "It is always said that competition can effectively promote the development of economy. In order to " + "survive in the competition, companies continue to improve their products and service, and as a result, " + "the whole society prospers. However, when we discuss the issue of competition or cooperation, what " + "we are concerned about is not the whole society, but the development of an individual's whole life. 
" + "From this point of view, I firmly believe that we should attach more importance to cooperation during " + "primary education." + ) + assert ( + str(partitions[2]) + == "First of all, through cooperation, children can learn about interpersonal skills which are " + "significant in the future life of all students. What we acquired from team work is not only how to " + "achieve the same goal with others but more importantly, how to get along with others. During the " + "process of cooperation, children can learn about how to listen to opinions of others, how to " + "communicate with others, how to think comprehensively, and even how to compromise with other team " + "members when conflicts occurred. All of these skills help them to get on well with other people and " + "will benefit them for the whole life." + ) + assert ( + str(partitions[3]) + == "On the other hand, the significance of competition is that how to become more excellence to gain the " + "victory. Hence it is always said that competition makes the society more effective. However, when we " + "consider about the question that how to win the game, we always find that we need the cooperation. " + "The greater our goal is, the more competition we need. Take Olympic games which is a form of " + "competition for instance, it is hard to imagine how an athlete could win the game without the " + "training of his or her coach, and the help of other professional staffs such as the people who take " + "care of his diet, and those who are in charge of the medical care. The winner is the athlete but the " + "success belongs to the whole team. Therefore without the cooperation, there would be no victory of " + "competition." + ) + assert ( + str(partitions[4]) + == "Consequently, no matter from the view of individual development or the relationship between " + "competition and cooperation we can receive the same conclusion that a more cooperative attitudes " + "towards life is more profitable in one's success." 
+ ) - # check the entities - assert doc_with_partitions.labeled_spans == doc_without_partitions.labeled_spans + # check the entities + assert doc_with_partitions.labeled_spans == doc_without_partitions.labeled_spans - # check the relations - assert doc_with_partitions.binary_relations == doc_without_partitions.binary_relations + # check the relations + assert doc_with_partitions.binary_relations == doc_without_partitions.binary_relations @pytest.fixture(scope="module") @@ -427,7 +387,7 @@ def tokenizer() -> PreTrainedTokenizer: @pytest.fixture(scope="module") def tokenized_documents_with_labeled_spans_and_binary_relations( dataset_of_text_documents_with_labeled_spans_and_binary_relations, tokenizer -) -> Optional[List[TestTokenDocumentWithLabeledSpansAndBinaryRelations]]: +) -> List[TestTokenDocumentWithLabeledSpansAndBinaryRelations]: if dataset_of_text_documents_with_labeled_spans_and_binary_relations is None: return None @@ -449,203 +409,192 @@ def tokenized_documents_with_labeled_spans_and_binary_relations( def test_tokenized_documents_with_labeled_spans_and_binary_relations( tokenized_documents_with_labeled_spans_and_binary_relations, ): - if tokenized_documents_with_labeled_spans_and_binary_relations is not None: - docs = tokenized_documents_with_labeled_spans_and_binary_relations - # check that the tokenization was fine - assert len(docs) == 1 - doc = docs[0] - assert len(doc.labeled_spans) == 11 - assert len(doc.binary_relations) == 10 - assert len(doc.tokens) == 427 - # Check the first ten tokens - assert doc.tokens[:10] == ( - "[CLS]", - "should", - "students", - "be", - "taught", - "to", - "compete", - "or", - "to", - "cooperate", - ) - # sort the entities by their start position - sorted_entities = sorted(doc.labeled_spans, key=lambda ent: ent.start) - assert ( - str(sorted_entities[0]) - == "('we', 'should', 'attach', 'more', 'importance', 'to', 'cooperation', 'during', 'primary', 'education')" - ) - assert ( - str(sorted_entities[1]) - == "('through', 
'cooperation', ',', 'children', 'can', 'learn', 'about', 'inter', '##personal', 'skills', " - "'which', 'are', 'significant', 'in', 'the', 'future', 'life', 'of', 'all', 'students')" - ) - assert ( - str(sorted_entities[2]) - == "('what', 'we', 'acquired', 'from', 'team', 'work', 'is', 'not', 'only', 'how', 'to', 'achieve', 'the', " - "'same', 'goal', 'with', 'others', 'but', 'more', 'importantly', ',', 'how', 'to', 'get', 'along', " - "'with', 'others')" - ) - assert ( - str(sorted_entities[3]) - == "('during', 'the', 'process', 'of', 'cooperation', ',', 'children', 'can', 'learn', 'about', 'how', 'to', " - "'listen', 'to', 'opinions', 'of', 'others', ',', 'how', 'to', 'communicate', 'with', 'others', ',', " - "'how', 'to', 'think', 'comprehensive', '##ly', ',', 'and', 'even', 'how', 'to', 'compromise', 'with', " - "'other', 'team', 'members', 'when', 'conflicts', 'occurred')" - ) - assert ( - str(sorted_entities[4]) - == "('all', 'of', 'these', 'skills', 'help', 'them', 'to', 'get', 'on', 'well', 'with', 'other', 'people', " - "'and', 'will', 'benefit', 'them', 'for', 'the', 'whole', 'life')" - ) - assert ( - str(sorted_entities[5]) - == "('the', 'significance', 'of', 'competition', 'is', 'that', 'how', 'to', 'become', 'more', 'excellence', " - "'to', 'gain', 'the', 'victory')" - ) - assert ( - str(sorted_entities[6]) - == "('competition', 'makes', 'the', 'society', 'more', 'effective')" - ) - assert ( - str(sorted_entities[7]) - == "('when', 'we', 'consider', 'about', 'the', 'question', 'that', 'how', 'to', 'win', 'the', 'game', ',', " - "'we', 'always', 'find', 'that', 'we', 'need', 'the', 'cooperation')" - ) - assert ( - str(sorted_entities[8]) - == "('take', 'olympic', 'games', 'which', 'is', 'a', 'form', 'of', 'competition', 'for', 'instance', ',', " - "'it', 'is', 'hard', 'to', 'imagine', 'how', 'an', 'athlete', 'could', 'win', 'the', 'game', 'without', " - "'the', 'training', 'of', 'his', 'or', 'her', 'coach', ',', 'and', 'the', 'help', 'of', 'other', " - 
"'professional', 'staff', '##s', 'such', 'as', 'the', 'people', 'who', 'take', 'care', 'of', 'his', " - "'diet', ',', 'and', 'those', 'who', 'are', 'in', 'charge', 'of', 'the', 'medical', 'care')" - ) - assert ( - str(sorted_entities[9]) - == "('without', 'the', 'cooperation', ',', 'there', 'would', 'be', 'no', 'victory', 'of', 'competition')" - ) - assert ( - str(sorted_entities[10]) - == "('a', 'more', 'cooperative', 'attitudes', 'towards', 'life', 'is', 'more', 'profitable', " - "'in', 'one', \"'\", 's', 'success')" - ) + docs = tokenized_documents_with_labeled_spans_and_binary_relations + # check that the tokenization was fine + assert len(docs) == 1 + doc = docs[0] + assert len(doc.labeled_spans) == 11 + assert len(doc.binary_relations) == 10 + assert len(doc.tokens) == 427 + # Check the first ten tokens + assert doc.tokens[:10] == ( + "[CLS]", + "should", + "students", + "be", + "taught", + "to", + "compete", + "or", + "to", + "cooperate", + ) + # sort the entities by their start position + sorted_entities = sorted(doc.labeled_spans, key=lambda ent: ent.start) + assert ( + str(sorted_entities[0]) + == "('we', 'should', 'attach', 'more', 'importance', 'to', 'cooperation', 'during', 'primary', 'education')" + ) + assert ( + str(sorted_entities[1]) + == "('through', 'cooperation', ',', 'children', 'can', 'learn', 'about', 'inter', '##personal', 'skills', " + "'which', 'are', 'significant', 'in', 'the', 'future', 'life', 'of', 'all', 'students')" + ) + assert ( + str(sorted_entities[2]) + == "('what', 'we', 'acquired', 'from', 'team', 'work', 'is', 'not', 'only', 'how', 'to', 'achieve', 'the', " + "'same', 'goal', 'with', 'others', 'but', 'more', 'importantly', ',', 'how', 'to', 'get', 'along', " + "'with', 'others')" + ) + assert ( + str(sorted_entities[3]) + == "('during', 'the', 'process', 'of', 'cooperation', ',', 'children', 'can', 'learn', 'about', 'how', 'to', " + "'listen', 'to', 'opinions', 'of', 'others', ',', 'how', 'to', 'communicate', 'with', 
'others', ',', " + "'how', 'to', 'think', 'comprehensive', '##ly', ',', 'and', 'even', 'how', 'to', 'compromise', 'with', " + "'other', 'team', 'members', 'when', 'conflicts', 'occurred')" + ) + assert ( + str(sorted_entities[4]) + == "('all', 'of', 'these', 'skills', 'help', 'them', 'to', 'get', 'on', 'well', 'with', 'other', 'people', " + "'and', 'will', 'benefit', 'them', 'for', 'the', 'whole', 'life')" + ) + assert ( + str(sorted_entities[5]) + == "('the', 'significance', 'of', 'competition', 'is', 'that', 'how', 'to', 'become', 'more', 'excellence', " + "'to', 'gain', 'the', 'victory')" + ) + assert ( + str(sorted_entities[6]) + == "('competition', 'makes', 'the', 'society', 'more', 'effective')" + ) + assert ( + str(sorted_entities[7]) + == "('when', 'we', 'consider', 'about', 'the', 'question', 'that', 'how', 'to', 'win', 'the', 'game', ',', " + "'we', 'always', 'find', 'that', 'we', 'need', 'the', 'cooperation')" + ) + assert ( + str(sorted_entities[8]) + == "('take', 'olympic', 'games', 'which', 'is', 'a', 'form', 'of', 'competition', 'for', 'instance', ',', " + "'it', 'is', 'hard', 'to', 'imagine', 'how', 'an', 'athlete', 'could', 'win', 'the', 'game', 'without', " + "'the', 'training', 'of', 'his', 'or', 'her', 'coach', ',', 'and', 'the', 'help', 'of', 'other', " + "'professional', 'staff', '##s', 'such', 'as', 'the', 'people', 'who', 'take', 'care', 'of', 'his', " + "'diet', ',', 'and', 'those', 'who', 'are', 'in', 'charge', 'of', 'the', 'medical', 'care')" + ) + assert ( + str(sorted_entities[9]) + == "('without', 'the', 'cooperation', ',', 'there', 'would', 'be', 'no', 'victory', 'of', 'competition')" + ) + assert ( + str(sorted_entities[10]) + == "('a', 'more', 'cooperative', 'attitudes', 'towards', 'life', 'is', 'more', 'profitable', " + "'in', 'one', \"'\", 's', 'success')" + ) def test_tokenized_documents_with_entities_and_relations_all( dataset_of_text_documents_with_labeled_spans_and_binary_relations, tokenizer, dataset_variant ): - if 
dataset_of_text_documents_with_labeled_spans_and_binary_relations is not None: - for ( - split, - docs, - ) in dataset_of_text_documents_with_labeled_spans_and_binary_relations.items(): - for doc in docs: - # Note, that this is a list of documents, because the document may be split into chunks - # if the input text is too long. - tokenized_docs = tokenize_document( - doc, - tokenizer=tokenizer, - return_overflowing_tokens=True, - result_document_type=TestTokenDocumentWithLabeledSpansAndBinaryRelations, - strict_span_conversion=True, - verbose=True, - ) - # we just ensure that we get at least one tokenized document - assert tokenized_docs is not None - assert len(tokenized_docs) > 0 + for ( + split, + docs, + ) in dataset_of_text_documents_with_labeled_spans_and_binary_relations.items(): + for doc in docs: + # Note, that this is a list of documents, because the document may be split into chunks + # if the input text is too long. + tokenized_docs = tokenize_document( + doc, + tokenizer=tokenizer, + return_overflowing_tokens=True, + result_document_type=TestTokenDocumentWithLabeledSpansAndBinaryRelations, + strict_span_conversion=True, + verbose=True, + ) + # we just ensure that we get at least one tokenized document + assert tokenized_docs is not None + assert len(tokenized_docs) > 0 @pytest.fixture(scope="module") def tokenized_documents_with_labeled_spans_binary_relations_and_labeled_partitions( dataset_of_text_documents_with_labeled_spans_binary_relations_and_labeled_partitions, tokenizer ) -> List[TestTokenDocumentWithLabeledSpansBinaryRelationsAndLabeledPartitions]: - if ( - dataset_of_text_documents_with_labeled_spans_binary_relations_and_labeled_partitions - is not None - ): - # get a document to check - doc = dataset_of_text_documents_with_labeled_spans_binary_relations_and_labeled_partitions[ - "train" - ][0] - # Note, that this is a list of documents, because the document may be split into chunks - # if the input text is too long. 
- tokenized_docs = tokenize_document( - doc, - tokenizer=tokenizer, - partition_layer="labeled_partitions", - return_overflowing_tokens=True, - result_document_type=TestTokenDocumentWithLabeledSpansBinaryRelationsAndLabeledPartitions, - strict_span_conversion=False, - verbose=True, - ) - return tokenized_docs + # get a document to check + doc = dataset_of_text_documents_with_labeled_spans_binary_relations_and_labeled_partitions[ + "train" + ][0] + # Note, that this is a list of documents, because the document may be split into chunks + # if the input text is too long. + tokenized_docs = tokenize_document( + doc, + tokenizer=tokenizer, + partition_layer="labeled_partitions", + return_overflowing_tokens=True, + result_document_type=TestTokenDocumentWithLabeledSpansBinaryRelationsAndLabeledPartitions, + strict_span_conversion=False, + verbose=True, + ) + return tokenized_docs def test_tokenized_documents_with_labeled_spans_binary_relations_and_labeled_partitions( tokenized_documents_with_labeled_spans_binary_relations_and_labeled_partitions, tokenized_documents_with_labeled_spans_and_binary_relations, ): - if tokenized_documents_with_labeled_spans_binary_relations_and_labeled_partitions is not None: - docs_with_partitions = ( - tokenized_documents_with_labeled_spans_binary_relations_and_labeled_partitions - ) + docs_with_partitions = ( + tokenized_documents_with_labeled_spans_binary_relations_and_labeled_partitions + ) - # check that the tokenization was fine - assert len(docs_with_partitions) == 5 - doc_with_partitions = docs_with_partitions[0] - assert len(doc_with_partitions.labeled_partitions) == 1 - assert len(doc_with_partitions.labeled_spans) == 0 - assert len(doc_with_partitions.binary_relations) == 0 - assert doc_with_partitions.tokens == ( - "[CLS]", - "should", - "students", - "be", - "taught", - "to", - "compete", - "or", - "to", - "cooperate", - "?", - "[SEP]", - ) + # check that the tokenization was fine + assert len(docs_with_partitions) == 5 + 
doc_with_partitions = docs_with_partitions[0] + assert len(doc_with_partitions.labeled_partitions) == 1 + assert len(doc_with_partitions.labeled_spans) == 0 + assert len(doc_with_partitions.binary_relations) == 0 + assert doc_with_partitions.tokens == ( + "[CLS]", + "should", + "students", + "be", + "taught", + "to", + "compete", + "or", + "to", + "cooperate", + "?", + "[SEP]", + ) def test_tokenized_documents_with_entities_relations_and_partitions_all( dataset_of_text_documents_with_labeled_spans_binary_relations_and_labeled_partitions, tokenizer ): - if ( - dataset_of_text_documents_with_labeled_spans_binary_relations_and_labeled_partitions - is not None + for ( + split, + docs, + ) in ( + dataset_of_text_documents_with_labeled_spans_binary_relations_and_labeled_partitions.items() ): - for ( - split, - docs, - ) in ( - dataset_of_text_documents_with_labeled_spans_binary_relations_and_labeled_partitions.items() - ): - for doc in docs: - # Note, that this is a list of documents, because the document may be split into chunks - # if the input text is too long. - tokenized_docs = tokenize_document( - doc, - tokenizer=tokenizer, - partition_layer="labeled_partitions", - return_overflowing_tokens=True, - result_document_type=TestTokenDocumentWithLabeledSpansBinaryRelationsAndLabeledPartitions, - strict_span_conversion=False, - verbose=True, - ) - # we just ensure that we get at least one tokenized document - assert tokenized_docs is not None - assert len(tokenized_docs) > 0 - for tokenized_doc in tokenized_docs: - assert tokenized_doc.labeled_partitions is not None - # We use the partitions to partition the input, so each tokenized - # document should have exactly one partition annotation. - assert len(tokenized_doc.labeled_partitions) == 1 + for doc in docs: + # Note, that this is a list of documents, because the document may be split into chunks + # if the input text is too long. 
+ tokenized_docs = tokenize_document( + doc, + tokenizer=tokenizer, + partition_layer="labeled_partitions", + return_overflowing_tokens=True, + result_document_type=TestTokenDocumentWithLabeledSpansBinaryRelationsAndLabeledPartitions, + strict_span_conversion=False, + verbose=True, + ) + # we just ensure that we get at least one tokenized document + assert tokenized_docs is not None + assert len(tokenized_docs) > 0 + for tokenized_doc in tokenized_docs: + assert tokenized_doc.labeled_partitions is not None + # We use the partitions to partition the input, so each tokenized + # document should have exactly one partition annotation. + assert len(tokenized_doc.labeled_partitions) == 1 def test_document_converters(dataset_variant): @@ -653,8 +602,6 @@ def test_document_converters(dataset_variant): document_converters = builder.document_converters if dataset_variant == "default": - assert document_converters == {} - elif dataset_variant == "merge_fragmented_spans": assert len(document_converters) == 2 assert set(document_converters) == { TextDocumentWithLabeledSpansAndBinaryRelations, From f71e46efafa2d861ffb2cf8867846fb84c9890b8 Mon Sep 17 00:00:00 2001 From: Ruangrin L <88072261+idalr@users.noreply.github.com> Date: Tue, 2 Jan 2024 23:39:30 +0100 Subject: [PATCH 05/14] edit pie/readme --- dataset_builders/pie/aae2/README.md | 23 +++++++++-------------- 1 file changed, 9 insertions(+), 14 deletions(-) diff --git a/dataset_builders/pie/aae2/README.md b/dataset_builders/pie/aae2/README.md index 120b3a21..1ef207c2 100644 --- a/dataset_builders/pie/aae2/README.md +++ b/dataset_builders/pie/aae2/README.md @@ -21,9 +21,9 @@ The language in the dataset is English (persuasive essays). ### Dataset Variants -See [PIE-Brat Dataset Variants](https://huggingface.co/datasets/pie/brat#dataset-variants). +The `aae2` dataset comes in a single version (`default`) with `BratDocumentWithMergedSpans` as document type. 
Note that this is in contrast to the base brat dataset, where the document type for the `default` variant is `BratDocument`. The reason is that the AAE2 dataset has already been published with only single-fragment spans. Without any need to merge fragments, the document type `BratDocumentWithMergedSpans` is easier to handle for most of the task modules. -## Data Schema +### Data Schema See [PIE-Brat Data Schema](https://huggingface.co/datasets/pie/brat#data-schema). @@ -35,28 +35,23 @@ from pie_datasets import load_dataset, builders # load default version datasets = load_dataset("pie/aae2") doc = datasets["train"][0] -assert isinstance(doc, builders.brat.BratDocument) - -# load version with merged span fragments -dataset_merged_spans = load_dataset("pie/aae2", name="merge_fragmented_spans") -doc_merged_spans = dataset_merged_spans["train"][0] -assert isinstance(doc_merged_spans, builders.brat.BratDocumentWithMergedSpans) +assert isinstance(doc, builders.brat.BratDocumentWithMergedSpans) ``` -## Document Converters +### Document Converters The dataset provides document converters for the following target document types: - `pytorch_ie.documents.TextDocumentWithLabeledSpansAndBinaryRelations` - - `LabeledSpans`, converted from `BratDocument`'s `spans` + - `LabeledSpans`, converted from `BratDocumentWithMergedSpans`'s `spans` - labels: `MajorClaim`, `Claim`, `Premise` - - `BinaryRelations`, converted from `BratDocument`'s `relations` + - `BinaryRelations`, converted from `BratDocumentWithMergedSpans`'s `relations` - labels: `support`, `attack`, `semantically_same` - there are two conversion methods that convert `Claim`'s attributes to their relations to `MajorClaim` (see [Relation Conversions](#relation-conversions) below, for more details) - `pytorch_ie.documents.TextDocumentWithLabeledSpansBinaryRelationsAndLabeledPartitions` - - `LabeledSpans`, as above - `BinaryRelations`, as above - - `LabeledPartitions`, partitioned `BratDocument`'s `text`, according to the
paragraph, using regex. + - `LabeledPartitions`, partitioned `BratDocumentWithMergedSpans`'s `text`, according to the paragraph, using regex. - every partition is labeled as `labeled_partition` See [here](https://github.com/ChristophAlt/pytorch-ie/blob/main/src/pytorch_ie/documents.py) for the document type @@ -105,7 +100,7 @@ See further description in Stab & Gurevych 2017, p.627 and the [annotation guide #### Relation Conversions -When converting from `BratDocument(WithMergedSpan)` to `TextDocumentWithLabeledSpansAndBinaryRelations` and `TextDocumentWithLabeledSpansBinaryRelationsAndLabeledPartitions`, +When converting from `BratDocumentWithMergedSpan` to `TextDocumentWithLabeledSpansAndBinaryRelations` and `TextDocumentWithLabeledSpansBinaryRelationsAndLabeledPartitions`, we apply a relation-conversion methods to build relations between `Claim`'s and `MajorClaim`'s, based on the annotated `Claim`'s attribution. The two conversion methods are: @@ -196,7 +191,7 @@ Three non-native speakers; one of the three being an expert annotator. ### Social Impact of Dataset "\[Computational Argumentation\] have -broad application potential in various areas such as legal decision support (Mochales-Palau and Moens 2009), information retrieval (Carstens and Toni 2015), policy making (Sardianos et al. 2015), and debating technologies (Levy et al. 2014; Rinott et al. +broad application potential in various areas such as legal decision support (Mochales-Palau and Moens 2009), information retrieval (Carstens and Toni 2015), policy making (Sardianos et al. 2015), and debating technologies (Levy et al. 2014; Rinott et al. 2015)." (p. 
619) ### Discussion of Biases From b9af84f0f73f7f6450fbaac0b51bbc81ab9b7669 Mon Sep 17 00:00:00 2001 From: Ruangrin L <88072261+idalr@users.noreply.github.com> Date: Fri, 5 Jan 2024 22:08:25 +0100 Subject: [PATCH 06/14] changed partition label name and edited readme.md --- dataset_builders/pie/aae2/README.md | 58 ++++++++++++------------- dataset_builders/pie/aae2/aae2.py | 1 + tests/dataset_builders/pie/test_aae2.py | 2 +- 3 files changed, 29 insertions(+), 32 deletions(-) diff --git a/dataset_builders/pie/aae2/README.md b/dataset_builders/pie/aae2/README.md index 1ef207c2..b85e1c43 100644 --- a/dataset_builders/pie/aae2/README.md +++ b/dataset_builders/pie/aae2/README.md @@ -38,25 +38,6 @@ doc = datasets["train"][0] assert isinstance(doc, builders.brat.BratDocumentWithMergedSpans) ``` -### Document Converters - -The dataset provides document converters for the following target document types: - -- `pytorch_ie.documents.TextDocumentWithLabeledSpansAndBinaryRelations` - - `LabeledSpans`, converted from `BratDocumentWithMergedSpans`'s `spans` - - labels: `MajorClaim`, `Claim`, `Premise` - - `BinaryRelations`, converted from `BratDocumentWithMergedSpans`'s `relations` - - labels: `support`, `attack`, `semantically_same` - - there are two conversion methods that convert `Claim`'s attributes to their relations to `MajorClaim` (see [Relation Conversions](#relation-conversions) below, for more details) -- `pytorch_ie.documents.TextDocumentWithLabeledSpansBinaryRelationsAndLabeledPartitions` - - - `LabeledSpans`, as above - - `BinaryRelations`, as above - - `LabeledPartitions`, partitioned `BratDocumentWithMergedSpans`'s `text`, according to the paragraph, using regex. - - every partition is labeled as `labeled_partition` - -See [here](https://github.com/ChristophAlt/pytorch-ie/blob/main/src/pytorch_ie/documents.py) for the document type -definitions. - ### Data Splits | Statistics | Train | Test | @@ -98,18 +79,37 @@ See further statistics in Stab & Gurevych (2017), p. 
650, Table A.1. See further description in Stab & Gurevych 2017, p.627 and the [annotation guideline](https://github.com/ArneBinder/pie-datasets/blob/db94035602610cefca2b1678aa2fe4455c96155d/data/datasets/ArgumentAnnotatedEssays-2.0/guideline.pdf). -#### Relation Conversions +### Document Converters -When converting from `BratDocumentWithMergedSpan` to `TextDocumentWithLabeledSpansAndBinaryRelations` and `TextDocumentWithLabeledSpansBinaryRelationsAndLabeledPartitions`, -we apply a relation-conversion methods to build relations between `Claim`'s and `MajorClaim`'s, based on the annotated `Claim`'s attribution. +The dataset provides document converters for the following target document types: + +- `pytorch_ie.documents.TextDocumentWithLabeledSpansAndBinaryRelations` + - `LabeledSpans`, converted from `BratDocumentWithMergedSpans`'s `spans` + - labels: `MajorClaim`, `Claim`, `Premise` + - `BinaryRelations`, converted from `BratDocumentWithMergedSpans`'s `relations` + - labels: `support`, `attack`, `semantically_same` + - there are two conversion methods that convert `Claim`'s attributes to their relations to `MajorClaim` (see the label-count changes after this relation conversion [here below](#label-counts-after-document-converter)): + - `connect_first` (default setting): + - build a `Support` or `Attack` relation from each `Claim` to the first `MajorClaim`, and + - build a `semantically_same` relation between following `MajorClaim` to the first `MajorClaim` + - `connect_all` + - build a `Support` or `Attack` relation from each `Claim` to every `MajorClaim` + - no relations between each `MajorClaim` +- `pytorch_ie.documents.TextDocumentWithLabeledSpansBinaryRelationsAndLabeledPartitions` + - - `LabeledSpans`, as above + - `BinaryRelations`, as above + - `LabeledPartitions`, partitioned `BratDocumentWithMergedSpans`'s `text`, according to the paragraph, using regex. 
+ - every partition is labeled as `paragraph` -The two conversion methods are: +See [here](https://github.com/ChristophAlt/pytorch-ie/blob/main/src/pytorch_ie/documents.py) for the document type +definitions. -1. `connect_first` (default): - - build a `Support` or `Attack` relation from each `Claim` to the first `MajorClaim`, and - - build a `semantically_same` relation between following `MajorClaim` to the first `MajorClaim` +#### Label Counts after Document Converter -The relation counts for this conversion method is as follows: +When converting from `BratDocumentWithMergedSpan` to `TextDocumentWithLabeledSpansAndBinaryRelations` and `TextDocumentWithLabeledSpansBinaryRelationsAndLabeledPartitions`, +we apply a relation-conversion methods that change the number of label count for the relations, as follows: + +1. `connect_first` (default): | Relations | Count | Percentage | | -------------------------- | ----: | ---------: | @@ -118,10 +118,6 @@ The relation counts for this conversion method is as follows: | other: `semantically_same` | 349 | 6.2 % | 2. 
`connect_all` - - build a `Support` or `Attack` relation from each `Claim` to every `MajorClaim` - - no relations between each `MajorClaim` - -The relation counts for this conversion method is as follows: | Relations | Count | Percentage | | ------------------ | ----: | ---------: | diff --git a/dataset_builders/pie/aae2/aae2.py b/dataset_builders/pie/aae2/aae2.py index 5b43a075..d4c9d8f1 100644 --- a/dataset_builders/pie/aae2/aae2.py +++ b/dataset_builders/pie/aae2/aae2.py @@ -171,6 +171,7 @@ def document_converters(self) -> DocumentConvertersType: ), add_partitions=RegexPartitioner( partition_layer_name="labeled_partitions", + default_partition_label="paragraph", pattern="\n", strip_whitespace=True, verbose=False, diff --git a/tests/dataset_builders/pie/test_aae2.py b/tests/dataset_builders/pie/test_aae2.py index cf31c863..b4fd67c3 100644 --- a/tests/dataset_builders/pie/test_aae2.py +++ b/tests/dataset_builders/pie/test_aae2.py @@ -332,7 +332,7 @@ def test_dataset_of_text_documents_with_labeled_spans_binary_relations_and_label partitions = doc_with_partitions.labeled_partitions assert len(partitions) == 5 - assert [partition.label == "partition" for partition in partitions] + assert all([partition.label == "paragraph" for partition in partitions]) assert str(partitions[0]) == "Should students be taught to compete or to cooperate?" 
assert ( str(partitions[1]) From 491c1378edfabc4fa5103e9389656302cd8e6c0a Mon Sep 17 00:00:00 2001 From: Ruangrin L <88072261+idalr@users.noreply.github.com> Date: Fri, 5 Jan 2024 22:18:12 +0100 Subject: [PATCH 07/14] minor changes --- dataset_builders/pie/aae2/requirements.txt | 4 ++-- tests/dataset_builders/pie/test_aae2.py | 4 ++-- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/dataset_builders/pie/aae2/requirements.txt b/dataset_builders/pie/aae2/requirements.txt index 0c3196c2..3cac4ea1 100644 --- a/dataset_builders/pie/aae2/requirements.txt +++ b/dataset_builders/pie/aae2/requirements.txt @@ -1,2 +1,2 @@ -pie-datasets>=0.6.0,<0.9.0 -pie-modules>=0.8.0,<0.9.0 +pie-datasets>=0.8.0,<0.9.0 +pie-modules>=0.8.3,<0.9.0 diff --git a/tests/dataset_builders/pie/test_aae2.py b/tests/dataset_builders/pie/test_aae2.py index b4fd67c3..07feaee2 100644 --- a/tests/dataset_builders/pie/test_aae2.py +++ b/tests/dataset_builders/pie/test_aae2.py @@ -530,7 +530,7 @@ def tokenized_documents_with_labeled_spans_binary_relations_and_labeled_partitio partition_layer="labeled_partitions", return_overflowing_tokens=True, result_document_type=TestTokenDocumentWithLabeledSpansBinaryRelationsAndLabeledPartitions, - strict_span_conversion=False, + strict_span_conversion=True, verbose=True, ) return tokenized_docs @@ -584,7 +584,7 @@ def test_tokenized_documents_with_entities_relations_and_partitions_all( partition_layer="labeled_partitions", return_overflowing_tokens=True, result_document_type=TestTokenDocumentWithLabeledSpansBinaryRelationsAndLabeledPartitions, - strict_span_conversion=False, + strict_span_conversion=True, verbose=True, ) # we just ensure that we get at least one tokenized document From 7fc08972849b70c669ca7f2869c05f23e377d7d1 Mon Sep 17 00:00:00 2001 From: Ruangrin L <88072261+idalr@users.noreply.github.com> Date: Fri, 5 Jan 2024 23:16:47 +0100 Subject: [PATCH 08/14] edited test_convert_aae2_claim_attributions_to_relations --- 
tests/dataset_builders/pie/test_aae2.py | 57 ++++++++++++++++++++----- 1 file changed, 46 insertions(+), 11 deletions(-) diff --git a/tests/dataset_builders/pie/test_aae2.py b/tests/dataset_builders/pie/test_aae2.py index 07feaee2..01bce1b7 100644 --- a/tests/dataset_builders/pie/test_aae2.py +++ b/tests/dataset_builders/pie/test_aae2.py @@ -3,6 +3,7 @@ import pytest from datasets import disable_caching from pie_modules.document.processing import tokenize_document +from pytorch_ie.annotations import BinaryRelation, LabeledSpan from pytorch_ie.core import Document from pytorch_ie.documents import ( TextDocumentWithLabeledSpansAndBinaryRelations, @@ -15,7 +16,7 @@ convert_aae2_claim_attributions_to_relations, ) from pie_datasets import DatasetDict -from pie_datasets.builders.brat import BratDocumentWithMergedSpans +from pie_datasets.builders.brat import BratAttribute, BratDocumentWithMergedSpans from tests.dataset_builders.common import ( PIE_BASE_PATH, TestTokenDocumentWithLabeledSpansAndBinaryRelations, @@ -158,16 +159,50 @@ def dataset_of_text_documents_with_labeled_spans_and_binary_relations( @pytest.mark.parametrize("method", ["connect_first", "connect_all"]) -def test_convert_aae2_claim_attributions_to_relations_all(document, method): - if dataset_variant == "default" or None: - converted_doc = convert_aae2_claim_attributions_to_relations(document, method) - converted_binary_relations = converted_doc.binary_relations - if method == "connect_first": - assert len(converted_binary_relations) == 10 - elif method == "connect_all": - assert len(converted_binary_relations) == 12 - else: - raise ValueError(f"Unknown method: {method}") +def test_convert_aae2_claim_attributions_to_relations(method): + # create sample document for testing + sample_doc = BratDocumentWithMergedSpans( + text="This is an example claim. This is the first major claim. " + "This is the second major claim." 
+ ) + claim = LabeledSpan(start=0, end=25, label="Claim") + first_majorclaim = LabeledSpan(start=26, end=56, label="MajorClaim") + second_majorclaim = LabeledSpan(start=57, end=88, label="MajorClaim") + sample_doc.spans.extend([claim, first_majorclaim, second_majorclaim]) + # sanity check (works only after labeled spans were added to the document) + assert str(claim) == "This is an example claim." + assert str(first_majorclaim) == "This is the first major claim." + assert str(second_majorclaim) == "This is the second major claim." + # create claim attribute + claim_attribute = BratAttribute(annotation=claim, label="Stance", value="For") + sample_doc.span_attributes.append(claim_attribute) + + # check results + converted_doc = convert_aae2_claim_attributions_to_relations(sample_doc, method) + converted_binary_relations_tuples = [ + (str(relation.head), relation.label, str(relation.tail)) + for relation in converted_doc.binary_relations + ] + assert len(converted_binary_relations_tuples) == 2 + assert converted_binary_relations_tuples[0] == ( + "This is an example claim.", + "supports", + "This is the first major claim.", + ) + if method == "connect_first": + assert converted_binary_relations_tuples[1] == ( + "This is the second major claim.", + "semantically_same", + "This is the first major claim.", + ) + elif method == "connect_all": + assert converted_binary_relations_tuples[1] == ( + "This is an example claim.", + "supports", + "This is the second major claim.", + ) + else: + raise ValueError(f"Unknown method: {method}") def test_dataset_of_text_documents_with_labeled_spans_and_binary_relations( From 5e55793d3f10d7700dffb0839684fa56d3511e0c Mon Sep 17 00:00:00 2001 From: Ruangrin L <88072261+idalr@users.noreply.github.com> Date: Fri, 5 Jan 2024 23:35:02 +0100 Subject: [PATCH 09/14] make pre-commit happy --- tests/dataset_builders/pie/test_aae2.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tests/dataset_builders/pie/test_aae2.py 
b/tests/dataset_builders/pie/test_aae2.py index 01bce1b7..9f3f7498 100644 --- a/tests/dataset_builders/pie/test_aae2.py +++ b/tests/dataset_builders/pie/test_aae2.py @@ -163,7 +163,7 @@ def test_convert_aae2_claim_attributions_to_relations(method): # create sample document for testing sample_doc = BratDocumentWithMergedSpans( text="This is an example claim. This is the first major claim. " - "This is the second major claim." + "This is the second major claim." ) claim = LabeledSpan(start=0, end=25, label="Claim") first_majorclaim = LabeledSpan(start=26, end=56, label="MajorClaim") From 86bbe264bd423655ea276cb6d8409542162a7887 Mon Sep 17 00:00:00 2001 From: Ruangrin L <88072261+idalr@users.noreply.github.com> Date: Mon, 8 Jan 2024 15:40:08 +0100 Subject: [PATCH 10/14] with partitions in `test_tokenized_document...`, set `strict_span_conversion` to 'False' --- tests/dataset_builders/pie/test_aae2.py | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/tests/dataset_builders/pie/test_aae2.py b/tests/dataset_builders/pie/test_aae2.py index 9f3f7498..2b7e98bb 100644 --- a/tests/dataset_builders/pie/test_aae2.py +++ b/tests/dataset_builders/pie/test_aae2.py @@ -565,7 +565,10 @@ def tokenized_documents_with_labeled_spans_binary_relations_and_labeled_partitio partition_layer="labeled_partitions", return_overflowing_tokens=True, result_document_type=TestTokenDocumentWithLabeledSpansBinaryRelationsAndLabeledPartitions, - strict_span_conversion=True, + # We set strict_span_conversion to False + # because we added relations between Claims and MajorClaims from different paragraphs + # and those relations are lost in annotations + strict_span_conversion=False, verbose=True, ) return tokenized_docs @@ -619,7 +622,10 @@ def test_tokenized_documents_with_entities_relations_and_partitions_all( partition_layer="labeled_partitions", return_overflowing_tokens=True, result_document_type=TestTokenDocumentWithLabeledSpansBinaryRelationsAndLabeledPartitions, - 
strict_span_conversion=True, + # We set strict_span_conversion to False + # because we added relations between Claims and MajorClaims from different paragraphs + # and those relations are lost in annotations + strict_span_conversion=False, verbose=True, ) # we just ensure that we get at least one tokenized document From 1332fa2e44bf16cbd11096b7db8e94b763aaaa21 Mon Sep 17 00:00:00 2001 From: Arne Binder Date: Wed, 10 Jan 2024 17:18:14 +0100 Subject: [PATCH 11/14] several fixes and additions --- dataset_builders/pie/aae2/README.md | 56 ++++++++++++++--------------- 1 file changed, 28 insertions(+), 28 deletions(-) diff --git a/dataset_builders/pie/aae2/README.md b/dataset_builders/pie/aae2/README.md index b85e1c43..c954ab8b 100644 --- a/dataset_builders/pie/aae2/README.md +++ b/dataset_builders/pie/aae2/README.md @@ -6,7 +6,7 @@ Therefore, the `aae2` dataset as described here follows the data structure from ### Dataset Summary -Argument Annotated Essays Corpus (AAEC) ([Stab and Gurevych, 2017](https://aclanthology.org/J17-3005.pdf)) contains student essays. A stance for a controversial theme is expressed by a `MajorClaim` component as well as `Claim` components, and `Premise` components justify or refute the claims. `Attack` and `Support` labels are defined as relations. The span covers a statement, *which can stand in isolation as a complete sentence*, according to the AAEC annotation guidelines. All components are annotated with minimum boundaries of a clause or sentence excluding so-called "shell" language such as *On the other hand* and *Hence*. (Morio et al., 2022, p. 642) +Argument Annotated Essays Corpus (AAEC) ([Stab and Gurevych, 2017](https://aclanthology.org/J17-3005.pdf)) contains student essays. A stance for a controversial theme is expressed by a `MajorClaim` component as well as `Claim` components, and `Premise` components justify or refute the claims. `attacks` and `supports` labels are defined as relations. 
The span covers a statement, *which can stand in isolation as a complete sentence*, according to the AAEC annotation guidelines. All components are annotated with minimum boundaries of a clause or sentence excluding so-called "shell" language such as *On the other hand* and *Hence*. (Morio et al., 2022, p. 642) There are two types of data: essay-level and paragraph-level ([Eger et al., 2017](https://aclanthology.org/P17-1002/)). In other words, a tree structure is complete within each paragraph, and there was no `Premise` that link to another `Premise` or `Claim` in a different paragraph, as seen in **Example** below. Therefore, it is possible to train a model on a paragraph-level which is also less memory-exhaustive (Eger et al., 2017, p. 16).
@@ -44,7 +44,7 @@ assert isinstance(doc, builders.brat.BratDocumentWithMergedSpans)
 | ---------------------------------------------------------------- | -------------------------: | -----------------------: |
 | No. of document | 322 | 80 |
 | Components<br/>- `MajorClaim`<br/>- `Claim`<br/>- `Premise` | <br/>598<br/>1202<br/>3023 | <br/>153<br/>304<br/>809 |
-| Relations\*<br/>- `Support`<br/>- `Attack` | <br/>3820<br/>405 | <br/>1021<br/>92 |
+| Relations\*<br/>- `supports`<br/>- `attacks` | <br/>3820<br/>405 | <br/>1021<br/>92 |
 
 \* included all relations between claims and premises and all claim attributions.
 
@@ -60,18 +60,18 @@ See further statistics in Stab & Gurevych (2017), p. 650, Table A.1. | `Claim` | 1506 | 24.7 % | | `Premise` | 3832 | 62.9 % | -- `MajorClaim` is the root node of the argumentation structure and represents the author’s standpoint on the topic. Essay bodies either support or attack the author’s standpoint expressed in the major claim. -- `Claim` constitutes the central component of each argument. Each one has at least one premise and take the values "for" or "against" +- `MajorClaim` is the root node of the argumentation structure and represents the author’s standpoint on the topic. Essay bodies either support or attack the author’s standpoint expressed in the major claim. The major claim can be mentioned multiple times in a single document. +- `Claim` constitutes the central component of each argument. Each one has at least one premise and takes stance attribute values "for" or "against" with regarding the major claim. - `Premise` is the reasons of the argument; either linked to claim or another premise. -**Note that** relations between `MajorClaim` and `Claim` were not annotated; however, each claim is annotated with `Attribute`: `for` or `against` - which indicates the relation between itself and `MajorClaim`. In addition, when two non-related `claim` 's appear in one paragraph, there is also no relations to one another. +**Note that** relations between `MajorClaim` and `Claim` were not annotated; however, each claim is annotated with an `Attribute` annotation with value `for` or `against` - which indicates the relation between itself and `MajorClaim`. In addition, when two non-related `Claim` 's appear in one paragraph, there is also no relations to one another.
#### Relations -| Relations | Count | Percentage | -| ------------------ | ----: | ---------: | -| support: `Support` | 3613 | 94.3 % | -| attack: `Attack` | 219 | 5.7 % | +| Relations | Count | Percentage | +| ------------------- | ----: | ---------: | +| support: `supports` | 3613 | 94.3 % | +| attack: `attacks` | 219 | 5.7 % | - "Each premise `p` has one **outgoing relation** (i.e., there is a relation that has p as source component) and none or several **incoming relations** (i.e., there can be a relation with `p` as target component)." - "A `Claim` can exhibit several **incoming relations** but no **outgoing relation**." (S&G, 2017, p. 68) @@ -83,46 +83,46 @@ See further description in Stab & Gurevych 2017, p.627 and the [annotation guide The dataset provides document converters for the following target document types: -- `pytorch_ie.documents.TextDocumentWithLabeledSpansAndBinaryRelations` - - `LabeledSpans`, converted from `BratDocumentWithMergedSpans`'s `spans` +- `pytorch_ie.documents.TextDocumentWithLabeledSpansAndBinaryRelations` with layers: + - `labeled_spans`: `LabeledSpan` annotations, converted from `BratDocumentWithMergedSpans`'s `spans` - labels: `MajorClaim`, `Claim`, `Premise` - - `BinaryRelations`, converted from `BratDocumentWithMergedSpans`'s `relations` - - labels: `support`, `attack`, `semantically_same` - - there are two conversion methods that convert `Claim`'s attributes to their relations to `MajorClaim` (see the label-count changes after this relation conversion [here below](#label-counts-after-document-converter)): + - `binary_relations`: `BinaryRelation` annotations, converted from `BratDocumentWithMergedSpans`'s `relations` + - there are two conversion methods that convert `Claim` attributes to their relations to `MajorClaim` (also see the label-count changes after this relation conversion [here below](#label-counts-after-document-converter)): - `connect_first` (default setting): - - build a `Support` or `Attack` relation from each 
`Claim` to the first `MajorClaim`, and + - build a `supports` or `attacks` relation from each `Claim` to the first `MajorClaim` depending on the `Claim`'s attribute (`for` or `against`), and - build a `semantically_same` relation between following `MajorClaim` to the first `MajorClaim` - `connect_all` - - build a `Support` or `Attack` relation from each `Claim` to every `MajorClaim` + - build a `supports` or `attacks` relation from each `Claim` to every `MajorClaim` - no relations between each `MajorClaim` -- `pytorch_ie.documents.TextDocumentWithLabeledSpansBinaryRelationsAndLabeledPartitions` - - - `LabeledSpans`, as above - - `BinaryRelations`, as above - - `LabeledPartitions`, partitioned `BratDocumentWithMergedSpans`'s `text`, according to the paragraph, using regex. + - labels: `supports`, `attack`, and `semantically_same` if `connect_first` +- `pytorch_ie.documents.TextDocumentWithLabeledSpansBinaryRelationsAndLabeledPartitions` with layers: + - `labeled_spans`, as above + - `binary_relations`, as above + - `labeled_partitions`, `LabeledSpan` annotations, created from splitting `BratDocumentWithMergedSpans`'s `text` at new lines (`\n`). - every partition is labeled as `paragraph` See [here](https://github.com/ChristophAlt/pytorch-ie/blob/main/src/pytorch_ie/documents.py) for the document type definitions. -#### Label Counts after Document Converter +#### Label Statistics after Document Conversion When converting from `BratDocumentWithMergedSpan` to `TextDocumentWithLabeledSpansAndBinaryRelations` and `TextDocumentWithLabeledSpansBinaryRelationsAndLabeledPartitions`, -we apply a relation-conversion methods that change the number of label count for the relations, as follows: +we apply a relation-conversion method (see above) that changes the label counts for the relations, as follows: 1. 
`connect_first` (default): | Relations | Count | Percentage | | -------------------------- | ----: | ---------: | -| support: `Support` | 4841 | 85.1 % | -| attack: `Attack` | 497 | 8.7 % | +| support: `supports` | 4841 | 85.1 % | +| attack: `attacks` | 497 | 8.7 % | | other: `semantically_same` | 349 | 6.2 % | 2. `connect_all` -| Relations | Count | Percentage | -| ------------------ | ----: | ---------: | -| support: `Support` | 5958 | 89.3 % | -| attack: `Attack` | 715 | 10.7 % | +| Relations | Count | Percentage | +| ------------------- | ----: | ---------: | +| support: `supports` | 5958 | 89.3 % | +| attack: `attacks` | 715 | 10.7 % | ## Dataset Creation From 6f664e11909804b2a6525ca82475a6ad216dc35f Mon Sep 17 00:00:00 2001 From: Arne Binder Date: Wed, 10 Jan 2024 17:31:00 +0100 Subject: [PATCH 12/14] minor improvement --- dataset_builders/pie/aae2/aae2.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/dataset_builders/pie/aae2/aae2.py b/dataset_builders/pie/aae2/aae2.py index d4c9d8f1..83e81cb5 100644 --- a/dataset_builders/pie/aae2/aae2.py +++ b/dataset_builders/pie/aae2/aae2.py @@ -142,7 +142,7 @@ class ArgumentAnnotatedEssaysV2(BratBuilder): # we need to add None to the list of dataset variants to support the default dataset variant BASE_BUILDER_KWARGS_DICT = { dataset_variant: {"url": URL, "split_paths": SPLIT_PATHS} - for dataset_variant in ["default", None] + for dataset_variant in [BratBuilder.DEFAULT_CONFIG_NAME, None] } BUILDER_CONFIGS = [ From a77b1d7ca549c41b3e76a90d5a6d049ed522dacd Mon Sep 17 00:00:00 2001 From: Arne Binder Date: Wed, 10 Jan 2024 17:41:12 +0100 Subject: [PATCH 13/14] improve dataset summary --- dataset_builders/pie/aae2/README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/dataset_builders/pie/aae2/README.md b/dataset_builders/pie/aae2/README.md index c954ab8b..9a7a93d6 100644 --- a/dataset_builders/pie/aae2/README.md +++ b/dataset_builders/pie/aae2/README.md @@ -6,9 +6,9 @@ Therefore, 
the `aae2` dataset as described here follows the data structure from ### Dataset Summary -Argument Annotated Essays Corpus (AAEC) ([Stab and Gurevych, 2017](https://aclanthology.org/J17-3005.pdf)) contains student essays. A stance for a controversial theme is expressed by a `MajorClaim` component as well as `Claim` components, and `Premise` components justify or refute the claims. `attacks` and `supports` labels are defined as relations. The span covers a statement, *which can stand in isolation as a complete sentence*, according to the AAEC annotation guidelines. All components are annotated with minimum boundaries of a clause or sentence excluding so-called "shell" language such as *On the other hand* and *Hence*. (Morio et al., 2022, p. 642) +Argument Annotated Essays Corpus (AAEC) ([Stab and Gurevych, 2017](https://aclanthology.org/J17-3005.pdf)) contains student essays. A stance for a controversial theme is expressed by a major claim component as well as claim components, and premise components justify or refute the claims. Attack and support labels are defined as relations. The span covers a statement, *which can stand in isolation as a complete sentence*, according to the AAEC annotation guidelines. All components are annotated with minimum boundaries of a clause or sentence excluding so-called "shell" language such as *On the other hand* and *Hence*. (Morio et al., 2022, p. 642) -There are two types of data: essay-level and paragraph-level ([Eger et al., 2017](https://aclanthology.org/P17-1002/)). In other words, a tree structure is complete within each paragraph, and there was no `Premise` that link to another `Premise` or `Claim` in a different paragraph, as seen in **Example** below. Therefore, it is possible to train a model on a paragraph-level which is also less memory-exhaustive (Eger et al., 2017, p. 16). +There is no premise that links to another premise or claim in a different paragraph. 
That means, an argumentation tree structure is complete within each paragraph. Therefore, it is possible to train a model on the full documents or just at the paragraph-level which is usually less memory-exhaustive (Eger et al., 2017, p. 16). ### Supported Tasks and Leaderboards From 1211ccb49a8d411eaddb8e17e169f1bbdb2d2dd8 Mon Sep 17 00:00:00 2001 From: Arne Binder Date: Wed, 10 Jan 2024 17:47:18 +0100 Subject: [PATCH 14/14] fix again --- dataset_builders/pie/aae2/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/dataset_builders/pie/aae2/README.md b/dataset_builders/pie/aae2/README.md index 9a7a93d6..48b80c0b 100644 --- a/dataset_builders/pie/aae2/README.md +++ b/dataset_builders/pie/aae2/README.md @@ -94,7 +94,7 @@ The dataset provides document converters for the following target document types - `connect_all` - build a `supports` or `attacks` relation from each `Claim` to every `MajorClaim` - no relations between each `MajorClaim` - - labels: `supports`, `attack`, and `semantically_same` if `connect_first` + - labels: `supports`, `attacks`, and `semantically_same` if `connect_first` - `pytorch_ie.documents.TextDocumentWithLabeledSpansBinaryRelationsAndLabeledPartitions` with layers: - `labeled_spans`, as above - `binary_relations`, as above