Size of weakly labeled data in paper/Annotate.ipynb and the number of sentences of 2021 PubMed Baseline don't match #1
Hi, the entire PubMed dataset is very large. The data used in our paper contains 11M/15M samples; it was subsampled so that it could be loaded into the memory of our machine. In general, I believe you can use larger data to achieve better performance. I wish I could share the exact data we used, but due to the policy we cannot directly redistribute it.
Thank you for your helpful comments :)
Thanks again!
The way I generated the data was to first use the notebook to generate all the data, and then randomly subsample it by 50%.
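For reference, a minimal sketch of that kind of 50% subsampling (the file names and the one-sample-per-line layout are assumptions for illustration, not the exact notebook code):

```python
import random

random.seed(42)  # fix the seed so the subsample is reproducible

# Assumed layout: one weakly labeled sample per line.
with open("all_samples.txt") as f:
    samples = f.readlines()

# Keep a random 50% of the generated samples.
subsample = random.sample(samples, k=len(samples) // 2)

with open("subsampled_samples.txt", "w") as f:
    f.writelines(subsample)
```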
I think I misunderstood the meaning of the term in your paper. Following your advice, I extracted the correct number of sentences (`all_samples`). Thank you for the kind reply :)
Hi @soochem, do you get the same number of sentences as needle? I also use pubmed2021, but …
@zhiyuanpeng Hello, I just used the 1015 text files of the PubMed 2021 baseline by using …
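(For anyone double-checking the count: a small sketch of counting sentences across the downloaded files. The `pubmed/` directory and the one-sentence-per-line format are assumptions, so adjust the paths to your setup.)

```python
from pathlib import Path

# Assumed layout: the extracted baseline text files live under pubmed/,
# with one sentence per line.
total = sum(
    sum(1 for _ in path.open(encoding="utf-8"))
    for path in sorted(Path("pubmed").glob("*.txt"))
)
print(f"total sentences: {total}")
```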
Hello,
I have tried to annotate all PubMed sentences and struggled with the large number of sentences and a memory issue.
I changed the `download_pubmed.sh` code to retrieve the 2021 PubMed baseline, since we could not find the 2020 baseline. We then ran into this issue: we retrieved the 2021 PubMed baseline, which includes only files 1-1015, and we assume the data was accumulated from 2020, so we guessed our retrieved data would have the same number of lines as yours.
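(A hedged sketch of that kind of download step, written in Python rather than the original shell script: the URL and the `pubmed21n####.xml.gz` naming follow NCBI's usual baseline convention, but the baseline directory is overwritten each year, so the exact path and file range should be checked against the current FTP listing.)

```python
import urllib.request

BASE = "https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/"

# 2021 baseline files are named pubmed21n0001.xml.gz, pubmed21n0002.xml.gz, ...
# Files 1-1015 were retrieved here; verify the real upper bound on the FTP listing.
for i in range(1, 1016):
    name = f"pubmed21n{i:04d}.xml.gz"
    urllib.request.urlretrieve(BASE + name, name)
    print(f"downloaded {name}")
```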
But we get quite a large number of lines for `all_text` and `(un)labeled_lines`, compared to the outputs shown in your Annotate notebook.
Could you please give me some advice about the difference in the number of PubMed sentences, and the expected effect of using this larger number of sentences?
Thank you in advance.