Collecting statistics from AM dataset #100

idalr · 2024-01-25T17:29:29Z

In this PR, we collect various statistics from our argument-mining datasets, i.e., aae2, abstrct, argmicro, cdcp, sciarg, and scidtb_argmin. We then present the numbers and, where applicable, histograms in markdown files, broken down by the datasets' pre-defined splits.

The statistics help us understand and analyze the datasets quantitatively, especially when they are used for model training and evaluation.

Currently available statistics:

- Relation argument (outer) token distance
- Token length (token)
- Span length token per label (requires pie-document-level #196)

After this PR is completed, these statistics will be moved to the respective dataset cards in huggingface/pie

move contents to the respective dataset card

codecov · 2024-01-25T17:35:42Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.52%. Comparing base (65ed8a1) to head (189c155).
Report is 1 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #100      +/-   ##
==========================================
+ Coverage   92.50%   98.52%   +6.01%     
==========================================
  Files          10        6       -4     
  Lines         921      407     -514     
==========================================
- Hits          852      401     -451     
+ Misses         69        6      -63


ArneBinder · 2024-01-31T14:35:50Z

per-label relation argument token distances for sciarg (calculated with https://github.com/ArneBinder/pie-document-level/pull/203):

Relation argument (outer) token distances (split: train, 40 documents)

| label | len | max | mean | min | std |
| --- | --- | --- | --- | --- | --- |
| ALL | 15640 | 2864 | 30.524 | 3 | 45.351 |
| contradicts | 1392 | 238 | 32.565 | 6 | 19.771 |
| parts_of_same | 2594 | 374 | 28.18 | 3 | 26.845 |
| semantically_same | 84 | 2864 | 206.333 | 11 | 492.268 |
| supports | 11570 | 407 | 29.527 | 4 | 24.189 |
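For readers who want to see how the columns above relate to the raw distances, here is a minimal sketch. It is not the code from the linked PR: the "outer" distance definition (leftmost start to rightmost end of the two argument spans), the use of the population standard deviation, and the toy spans are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def outer_token_distance(head, tail):
    """Assumed definition of the "outer" distance between two token spans.

    Each span is a (start, end) pair of token offsets with an exclusive end.
    The distance runs from the leftmost start to the rightmost end.
    """
    return max(head[1], tail[1]) - min(head[0], tail[0])


def summarize(values):
    """Return the same columns as the table: len, max, mean, min, std.

    Assumption: std is the population standard deviation (pstdev).
    """
    return {
        "len": len(values),
        "max": max(values),
        "mean": round(mean(values), 3),
        "min": min(values),
        "std": round(pstdev(values), 3),
    }


# Hypothetical toy relations as (head_span, tail_span) pairs.
relations = [((0, 5), (7, 12)), ((20, 24), (10, 15)), ((30, 33), (33, 40))]
distances = [outer_token_distance(head, tail) for head, tail in relations]
summary = summarize(distances)
print(distances)
print(summary)
```

Grouping the distances by relation label before calling `summarize` would give the per-label rows.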

histogram:

idalr · 2024-02-09T10:15:15Z

Further suggestions on statistic collection

relation density
- introduced in Morio et al. (2022); computed as N_relation / (N_component - 1)
- e.g. cdcp is much less dense than the other corpora, i.e., many of its components are not linked to any other component (Morio et al., 2022, p.643, T2)
- similarly, sciarg has many standalone claims, unsupported claims, and unconnected subgraphs (Lauscher et al., 2018, p.43, T4)
- the diameter reported in Lauscher et al. also indicates how many outgoing and incoming relations each component has, which could be useful for graph-based corpora
- importance
  - could indicate how easy or difficult it is to predict links between components (relation identification)
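The density formula above can be sketched as follows. This is only an illustration: the function name is hypothetical, and whether the counts should be taken per document or over a whole corpus is left open here.

```python
def relation_density(n_relations, n_components):
    """Relation density as N_relation / (N_component - 1).

    The value is undefined for fewer than two components, so we raise.
    """
    if n_components < 2:
        raise ValueError("relation density needs at least two components")
    return n_relations / (n_components - 1)


# Toy numbers (not taken from any dataset):
# a tree over 4 components has 3 relations, so the density is 1.0 ...
tree_density = relation_density(n_relations=3, n_components=4)
# ... while 5 components with only 2 relations are sparser.
sparse_density = relation_density(n_relations=2, n_components=5)
print(tree_density, sparse_density)
```

A density of 1.0 corresponds to as many relations as a tree over the components would have; lower values mean that some components are left unlinked.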
non-crossing
- in Morio et al. (2022), this refers to whether other components lie between two components that are connected by a relation, i.e., the distance between the components of an argumentative relation
- e.g. abstrct has more crossing relations than the other corpora, indicating a more complicated argument structure
- importance
  - indicates how complicated the structure of a document is and how easily its arguments can be followed
  - relations between components that are far apart might be harder to identify and predict
  - Peldszus & Stede (2016 preprint, T2-T3) also provide argmicro statistics on where the major claim/objections are positioned in the argumentative paragraph
  - P&S also report the attachment distance, a finer measure of how far the heads are from their tails
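A rough sketch of how such a statistic could be collected, following the description in this comment rather than the exact definitions in the cited papers. It assumes components are indexed by their position in the text and that a relation is a (head_index, tail_index) pair; all names and the toy data are hypothetical.

```python
def component_distance(head_idx, tail_idx):
    """Number of component positions between head and tail (1 = adjacent)."""
    return abs(head_idx - tail_idx)


def has_intervening_component(head_idx, tail_idx):
    """True if at least one other component lies between head and tail."""
    return component_distance(head_idx, tail_idx) > 1


# Toy document with 5 components (indices 0..4) and 3 relations.
relations = [(1, 0), (2, 0), (4, 3)]
distances = [component_distance(h, t) for h, t in relations]
n_with_gap = sum(has_intervening_component(h, t) for h, t in relations)
share_with_gap = n_with_gap / len(relations)
print(distances, share_with_gap)
```

Reporting the share of relations with intervening components per split would make the corpora comparable in the same way as the token distance statistics.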
tree/graph depth
- Lauscher et al. (2018, p.43, T4) report the 'Max In-Degree' of sciarg
- this indicates the depth of the argumentation tree/graph and thus the structure of the arguments

ArneBinder

This looks good! Can you now move the content to the respective dataset cards?

Two remarks:

regarding the "Remark on statistics collection" paragraph:
Please add the respective dataset config to that section, e.g.

"In the commands, we used the following dataset config configs/dataset/dataset_base.yaml:

_target_: src.utils.execute_pipeline
input:
  _target_: pie_datasets.DatasetDict.load_dataset
  path: pie/sciarg
  revision: 982d5682ba414ee13cf92cb93ec18fc8e78e2b81

"

Note that sometimes a local file is used as input, e.g. for cdcp there are these lines in the config:

  base_dataset_kwargs:
    data_dir: data/datasets/cdcp_acl17.zip

These files are not available in https://github.com/ArneBinder/pytorch-ie-hydra-template-1. However, I think the datasets also load when these parameters are left out. Please verify that this really works by loading the dataset once without the respective parameter, e.g. by calling (without base_dataset_kwargs at all):

pie_datasets.DatasetDict.load_dataset(path="pie/cdcp", revision="001722894bdca6df6a472d0d186a3af103e392c5")

For the relation argument distance stats, the code is not yet merged. Please add a reference to the respective PR and state that its branch needs to be checked out instead of the main branch.

ArneBinder

Looks good, just two small remarks, see below.

dataset_builders/pie/abstrct/README.md


idalr force-pushed the collecting_statistics branch from 867c26e to 4e3be74 Compare January 29, 2024 15:21

idalr requested a review from ArneBinder February 15, 2024 14:19

ArneBinder requested changes Feb 19, 2024


idalr force-pushed the collecting_statistics branch from afaa04c to d4167ce Compare February 23, 2024 11:05

idalr added 22 commits November 7, 2024 11:53

added statistics: Relation_argument_outer_token_distance.md

b4631db

minor fix

afba426

micro fix

4733900

micro fix

bac5acf

micro fix

a443e86

added commands and collapsed all histograms

27f80e7

minor edits

0d266ed

added text_length_tokens.md and

839a801

minor edit

micro fix

372485d

add span_lengths_tokens.md

acb951a

edited texts in markdowns

b5efb8b

micro fix

f28a0df

added relation_argument_outer_token_distance_per_label.md

88f6012

micro fix

e1e5706

deleted ~~relation_argument_outer_token_distance.md~~

4ffab8d

moved collected statistics content to pie/abstrct/readme.md

bd36737

minor fix

43fd083

edited text in abstrct/readme

9aff110

minor edit abstrct/readme

19d31a5

removed abstrct histograms from statistics folder

7a11771

moved content and files to aae2.readme, and removed from statistics f…

e943371

…older

minor fix

51e84bc

idalr and others added 20 commits November 7, 2024 11:53

minor edit

0c8efdf

moved content and files to argmicro/readme.md

01598ec

moved content and files to cdcp/readme.md

3a6d0f3

minor fix

6105b61

moved content and files to scidtb_argmin/readme.md

53afeab

minor fix

4477c28

moved content and files to sciarg/readme.md

075737c

deleted AM_statistics folder

d87bb85

minor edit

c55db07

fixed typos

2d04194

minor adjustments

768eecf

minor adjustments

10f21bd

improve usage examples

27f9215

add usage example to argmicro

19ecf02

fix argmicro usage example

5d0c88b

add cdcp usage example

d5157cb

add usage example to scidtb_argmin minor and fix

e72f2d4

add details to document converter

045496d

improve usage example for sciarg

f44870f

move usage example for sciarg

189c155

ArneBinder force-pushed the collecting_statistics branch from f41a720 to 189c155 Compare November 7, 2024 14:24

ArneBinder approved these changes Nov 7, 2024


ArneBinder merged commit e68f60d into main Nov 7, 2024
10 checks passed

ArneBinder deleted the collecting_statistics branch November 7, 2024 14:37
