Collecting statistics from AM dataset #100

Merged
merged 42 commits into from
Nov 7, 2024

idalr
Collaborator

@idalr idalr commented Jan 25, 2024

In this PR, we collect various statistics from our argument-mining datasets, i.e., aae2, abstrct, argmicro, cdcp, sciarg, and scidtb_argmin. We then present the numbers and, where applicable, histograms for each predefined split in markdown files.

These statistics will help us understand and analyze the datasets quantitatively, especially with respect to model training and evaluation.

Currently available statistics:

  • Relation argument (outer) token distance
  • Token length (token)
  • Span length token per label (requires pie-document-level #196)

Once this PR is completed, these statistics will be moved to the respective dataset cards in huggingface/pie:

  • move contents to the respective dataset card


codecov bot commented Jan 25, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.52%. Comparing base (65ed8a1) to head (189c155).
Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #100      +/-   ##
==========================================
+ Coverage   92.50%   98.52%   +6.01%     
==========================================
  Files          10        6       -4     
  Lines         921      407     -514     
==========================================
- Hits          852      401     -451     
+ Misses         69        6      -63     


@idalr idalr force-pushed the collecting_statistics branch from 867c26e to 4e3be74 Compare January 29, 2024 15:21
@ArneBinder
Owner

Per-label relation argument token distances for sciarg (calculated with https://github.com/ArneBinder/pie-document-level/pull/203):

Relation argument (outer) token distances (split: train, 40 documents)

| label | len | max | mean | min | std |
| --- | --- | --- | --- | --- | --- |
| ALL | 15640 | 2864 | 30.524 | 3 | 45.351 |
| contradicts | 1392 | 238 | 32.565 | 6 | 19.771 |
| parts_of_same | 2594 | 374 | 28.18 | 3 | 26.845 |
| semantically_same | 84 | 2864 | 206.333 | 11 | 492.268 |
| supports | 11570 | 407 | 29.527 | 4 | 24.189 |

histogram:
[Image: Bildschirmfoto vom 2024-01-31 15-34-42 (screenshot)]
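The columns reported above (len, max, mean, min, std) can be reproduced in spirit with a few lines of plain Python. The following is only a minimal sketch, not the code from the referenced PR: it assumes that the "outer" distance spans from the earliest start to the latest end of the two argument spans, that spans are `(start, end)` token offsets with exclusive end, and that std is the population standard deviation. The helper names `outer_distance` and `summarize` are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def outer_distance(head, tail):
    """Outer token distance between two (start, end) spans, end exclusive.

    Assumption: "outer" means from the earliest start to the latest end.
    """
    return max(head[1], tail[1]) - min(head[0], tail[0])


def summarize(relations):
    """Aggregate distances per label and over all relations ("ALL").

    relations: iterable of (label, head_span, tail_span) tuples.
    """
    by_label = defaultdict(list)
    for label, head, tail in relations:
        distance = outer_distance(head, tail)
        by_label[label].append(distance)
        by_label["ALL"].append(distance)
    return {
        label: {
            "len": len(values),
            "max": max(values),
            "mean": round(mean(values), 3),
            "min": min(values),
            # Assumption: population standard deviation.
            "std": round(pstdev(values), 3),
        }
        for label, values in by_label.items()
    }


# Toy data, not taken from sciarg.
example = [
    ("supports", (0, 5), (7, 12)),
    ("supports", (20, 24), (10, 15)),
    ("contradicts", (3, 6), (30, 40)),
]
stats = summarize(example)
```

Because each distance is also appended to the "ALL" bucket, the "len" of "ALL" equals the sum of the per-label counts, which matches the table above (1392 + 2594 + 84 + 11570 = 15640).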

@idalr
Collaborator Author

idalr commented Feb 9, 2024

Further suggestions for statistics collection

  • relation density
    • introduced in Morio et al. (2022); computed as N_relation / (N_component - 1)
    • e.g., cdcp was much less dense than other corpora, i.e., it had many components that were not linked to any other component (Morio et al., 2022, p. 643, T2)
    • similarly, sciarg also had many standalone claims, unsupported claims, and unconnected subgraphs (Lauscher et al., 2018, p. 43, T4)
    • Diameter in Lauscher et al. also indicated how many outgoing and incoming relations there were for each component, which could be useful for graph-based corpora.
    • importance
      • could indicate how easy or difficult it is to predict links between components or to identify relations
  • non-crossing
    • in Morio et al. (2022), this refers to whether other components lie between two components that have a relation; cf. argumentative relation (component) distance
    • e.g., abstrct had more crossing than other corpora, indicating that its argument structure was more complicated
    • importance
      • it indicates how complicated the document structure is and how easily the arguments in the document can be followed
      • relations between components that are farther apart might be harder to identify and predict
      • Peldszus & Stede (2016, preprint, T2-T3) also provided argmicro statistics on where the major claim and the objections were positioned in the argumentative paragraph.
      • P&S also reported attachment distance, a finer measure of how far the heads are from their tails.
  • tree/graph depth
    • Lauscher et al. (2018, p.43, T4) reported 'Max In-Degree' of sciarg
    • This indicates the depth of the argumentation tree/graph and thus the structure of the arguments
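To make the first two suggestions concrete, here is a minimal sketch of how they could be computed per document. It uses the relation density formula quoted above, N_relation / (N_component - 1), and, for the crossing aspect, simply counts the components located strictly between the two arguments of a relation, which is one possible reading of the description above rather than the exact definition from Morio et al. (2022). The function names and the `(start, end)` span representation (end exclusive) are assumptions for illustration only.

```python
def relation_density(n_relations: int, n_components: int) -> float:
    """Relation density as N_relation / (N_component - 1)."""
    if n_components < 2:
        raise ValueError("relation density needs at least two components")
    return n_relations / (n_components - 1)


def components_between(spans, head, tail):
    """Count components lying strictly between the head and tail spans.

    spans: all component spans of the document as (start, end), end exclusive.
    Assumption: a relation "crosses" other components if this count is > 0.
    """
    left, right = sorted([head, tail])
    return sum(1 for s in spans if s[0] >= left[1] and s[1] <= right[0])


# Toy document with four components and two relations.
spans = [(0, 5), (6, 10), (12, 20), (22, 30)]
relations = [((0, 5), (6, 10)), ((0, 5), (22, 30))]

density = relation_density(len(relations), len(spans))
between = [components_between(spans, head, tail) for head, tail in relations]
```

In this toy document, the first relation links adjacent components, while the second one spans the two components in between, so only the second would count as crossing under the assumption stated above.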

@idalr idalr requested a review from ArneBinder February 15, 2024 14:19
Owner

@ArneBinder ArneBinder left a comment


This looks good! Can you now move the content to the respective dataset cards?

Two remarks:

  1. Regarding the "Remark on statistics collection" paragraph:
    Please add the respective dataset config to that section, e.g.

"In the commands, we used the following dataset config configs/dataset/dataset_base.yaml:

_target_: src.utils.execute_pipeline
input:
  _target_: pie_datasets.DatasetDict.load_dataset
  path: pie/sciarg
  revision: 982d5682ba414ee13cf92cb93ec18fc8e78e2b81

"

Note that sometimes a local file is used as input, e.g. for cdcp there are these lines in the config:

  base_dataset_kwargs:
    data_dir: data/datasets/cdcp_acl17.zip

These files are not available in https://github.com/ArneBinder/pytorch-ie-hydra-template-1. However, I think the configs also work when leaving them out. Please verify that it really works without them by loading the dataset once without the respective parameter, e.g. by calling (without base_dataset_kwargs at all):

pie_datasets.DatasetDict.load_dataset(path="pie/cdcp", revision="001722894bdca6df6a472d0d186a3af103e392c5")
  2. For the relation argument distance stats, the code is not yet merged. Please add a reference to the respective PR and state that this branch needs to be checked out instead of the main branch.

Owner

@ArneBinder ArneBinder left a comment


Looks good, just two small remarks, see below.

dataset_builders/pie/abstrct/README.md (two outdated review threads, resolved)
@idalr idalr force-pushed the collecting_statistics branch from afaa04c to d4167ce Compare February 23, 2024 11:05
@ArneBinder ArneBinder force-pushed the collecting_statistics branch from f41a720 to 189c155 Compare November 7, 2024 14:24
@ArneBinder ArneBinder merged commit e68f60d into main Nov 7, 2024
10 checks passed
@ArneBinder ArneBinder deleted the collecting_statistics branch November 7, 2024 14:37