Collecting statistics from AM dataset #100
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #100      +/-   ##
==========================================
+ Coverage   92.50%   98.52%   +6.01%
==========================================
  Files          10        6       -4
  Lines         921      407     -514
==========================================
- Hits          852      401     -451
+ Misses         69        6      -63
This looks good! Can you now move the content to the respective dataset cards?
Two remarks:
- Regarding the "Remark on statistics collection" paragraph:
Please add the respective dataset config to that section, e.g.
"In the commands, we used the following dataset config (configs/dataset/dataset_base.yaml):
_target_: src.utils.execute_pipeline
input:
  _target_: pie_datasets.DatasetDict.load_dataset
  path: pie/sciarg
  revision: 982d5682ba414ee13cf92cb93ec18fc8e78e2b81
"
Note that sometimes a local file is used as input, e.g. for cdcp there are these lines in the config:
base_dataset_kwargs:
  data_dir: data/datasets/cdcp_acl17.zip
These files are not available in https://github.com/ArneBinder/pytorch-ie-hydra-template-1. However, I think loading also works when they are left out. Please verify that it really works without them by loading the dataset once without the respective parameter, e.g. by calling (without base_dataset_kwargs at all):
pie_datasets.DatasetDict.load_dataset(path="pie/cdcp", revision="001722894bdca6df6a472d0d186a3af103e392c5")
- For the relation argument distance stats, the code is not yet merged. Please add a reference to the respective PR and say that this branch needs to be checked out instead of the main branch.
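The check suggested in the first remark could be scripted roughly as follows. This is only a sketch: the helper name `loads_without_local_data` and its `loader` parameter are made up for illustration, and the only library call used is the `pie_datasets.DatasetDict.load_dataset(path=..., revision=...)` call quoted above. The loader is injectable so that the helper can be tried out without network access.

```python
from typing import Any, Callable


def _default_loader(**kwargs: Any) -> Any:
    # Imported lazily so that this module can be imported (and the helper
    # tested with a stand-in loader) even if pie_datasets is not installed.
    from pie_datasets import DatasetDict

    return DatasetDict.load_dataset(**kwargs)


def loads_without_local_data(
    path: str,
    revision: str,
    loader: Callable[..., Any] = _default_loader,
) -> bool:
    """Return True if the dataset loads when no base_dataset_kwargs are passed.

    Hypothetical helper: it only forwards `path` and `revision`, i.e. it
    deliberately omits `base_dataset_kwargs` (and thus any local `data_dir`).
    """
    try:
        loader(path=path, revision=revision)
    except Exception as exc:  # broad on purpose: any failure counts as "does not work"
        print(f"{path}: loading without local files failed: {exc!r}")
        return False
    print(f"{path}: loaded without local files")
    return True
```

For cdcp, this would be called as `loads_without_local_data("pie/cdcp", "001722894bdca6df6a472d0d186a3af103e392c5")`. If it returns `True`, the `base_dataset_kwargs` lines can presumably be dropped from the documented config.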
Looks good, just two small remarks, see below.
In this PR, we collect various statistics from our argument-mining datasets, i.e., aae2, abstrct, argmicro, cdcp, sciarg, and scidtb_argmin. We then present the numbers and histograms (where applicable), grouped by their pre-defined splits, in markdown files. The statistics help us understand and analyze the datasets quantitatively, especially when they are used for model training and evaluation.
Currently available statistics:
After this PR is completed, these statistics will be moved to the respective dataset cards in huggingface/pie.
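To illustrate what a per-split summary in a markdown file can look like, here is a minimal, library-free sketch. It is not the code used in this PR: the function names are hypothetical, the set of columns (len, mean, std, min, max) is an assumption, and the input is any mapping from split name to a list of numeric values (for example, token counts per document).

```python
from statistics import mean, pstdev
from typing import Dict, List, Sequence


def summarize_split(values: Sequence[float]) -> Dict[str, float]:
    """Summary statistics for one split (population standard deviation)."""
    if not values:
        raise ValueError("cannot summarize an empty split")
    return {
        "len": len(values),
        "mean": mean(values),
        "std": pstdev(values),
        "min": min(values),
        "max": max(values),
    }


def to_markdown_table(per_split: Dict[str, Sequence[float]]) -> str:
    """Render one markdown table row per split, in the given split order."""
    lines: List[str] = [
        "| split | len | mean | std | min | max |",
        "| :---- | --: | ---: | --: | --: | --: |",
    ]
    for split, values in per_split.items():
        s = summarize_split(values)
        lines.append(
            f"| {split} | {s['len']} | {s['mean']:.2f} | {s['std']:.2f} "
            f"| {s['min']} | {s['max']} |"
        )
    return "\n".join(lines)
```

Calling `to_markdown_table({"train": [...], "test": [...]})` returns a string that can be pasted into a dataset card. Histograms would need a separate plotting or text-rendering step that is not covered here.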