
Dataset #9

Open
xiyan524 opened this issue Feb 17, 2019 · 19 comments

Comments
@xiyan524

Thanks for your excellent work.

Would you mind providing the XSum dataset directly, just like the CNN/Daily Mail dataset we are familiar with? I believe it would save time and be more convenient for experiments.

I'd appreciate any help you could give. Thanks~

@shashiongithub
Collaborator

Could you drop me an email and tell me what problems you are having with the download?

@xiyan524
Author

Thanks a lot~ My email is [email protected]

Actually, I haven't run into any problems yet, but I'm pressed to do some experiments, so having the dataset directly would be more helpful. I will try to get the dataset myself when I have time.

@xiyan524
Author

And I have a question about some parameters in your model. As I understand it, some parameters, such as t_d (the topic distribution of the document D), are obtained from pre-trained LDA. I am curious whether this vector is also trained during the training process. In other words, is the vector from the pre-trained LDA just an initial value for training, or a fixed value that will not be changed? thx~

@chenbaicheng

@xiyan524 Have you started training on Chinese data? I'm stuck and don't know where to start. As I understand it, the author said using fastText or BERT would give better results, which should refer to the word embeddings. Where is the code where the author feeds in the word embeddings?
Have you found it 〒▽〒? I've already figured out how to generate Chinese word embeddings with fastText and BERT.

@xiyan524
Author

@chenbaicheng Sorry, I haven't used the model proposed in the paper; I'm just interested in the XSum dataset.

@chenbaicheng

@xiyan524 Thanks.

@shashiongithub
Collaborator

"vector from pre-trained LDA is just a initial value for training or a fixed value which will not be changed? ..." Yes, pre-trained LDA vectors are fixed during training. It varies for different documents and for different words in every document.

@xiyan524
Author

@shashiongithub I got it. thx

@artidoro

Hello @shashiongithub, I am also having trouble downloading the dataset. After rerunning the script more than 75 times, I still have 11 articles that cannot be downloaded. I would like to make a fair comparison with your results using exactly the same train/test split.

To facilitate further research, experimentation, and development with this dataset, could you make it available directly?

@shashiongithub
Collaborator

Here is the dataset:

http://kinloch.inf.ed.ac.uk/public/XSUM-EMNLP18-Summary-Data-Original.tar.gz

Please use the train, development and test ids from GitHub to split it into subsets.
Let me know if you have any questions.
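(In case it helps, a minimal sketch of splitting the extracted articles by those ids. It assumes the split file from this repo, XSum-TRAINING-DEV-TEST-SPLIT-90-5-5.json, maps split names to lists of BBC ids (check the key names in your copy), that each article is a <bbcid>.summary file, and a hypothetical output layout.)

import json
import shutil
from pathlib import Path

data_dir = Path('/path/to/bbc-summary-data')             # extracted from the tarball
split_file = 'XSum-TRAINING-DEV-TEST-SPLIT-90-5-5.json'  # from the XSum repo

with open(split_file) as f:
    splits = json.load(f)  # e.g. {"train": [...], "validation": [...], "test": [...]}

for split_name, bbc_ids in splits.items():
    out_dir = Path(f'/path/to/xsum-{split_name}')        # hypothetical output layout
    out_dir.mkdir(parents=True, exist_ok=True)
    missing = 0
    for bbc_id in bbc_ids:
        src = data_dir / f'{bbc_id}.summary'
        if src.exists():
            shutil.copy(src, out_dir / src.name)
        else:
            missing += 1                                 # a handful of articles may be absent
    print(split_name, len(bbc_ids), 'ids,', missing, 'missing')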

@isabelcachola

I downloaded the tar file above, and it is in a different format than the script scripts/xsum-preprocessing-convs2s.py expects. Can you please share instructions for how to convert the data in the tar file to what this script expects? Thanks.

@isabelcachola

For anyone who is trying to format the data in the link above, this is what I did to get it in the right format.

First, I used the following quick script to reformat the data

from os import listdir
from os.path import isfile, join
import re
from tqdm import tqdm

bbc_dir = '/path/to/bbc-summary-data'                                # *.summary files extracted from the tarball
out_dir = '/path/to/XSum/XSum-Dataset/xsum-extracts-from-downloads'  # create this directory first; the repo scripts expect *.data files here

bbc_files = [f for f in listdir(bbc_dir) if isfile(join(bbc_dir, f))]

for fname in tqdm(bbc_files):
    # write each article to <bbcid>.data, renaming the [SN] section markers to [XSUM]
    # (the replacement string leaves stray backslashes, which are stripped later)
    with open(join(out_dir, f'{fname.split(".")[0]}.data'), 'w') as f_out:
        text_in = open(join(bbc_dir, fname)).read()
        text_out = re.sub(r'\[SN\]', r'\[XSUM\]', text_in)
        f_out.write(text_out)

From here, you can follow the instructions in the dataset README starting at the section, Postprocessing: Sentence Segmentation, Tokenization, Lemmatization and Final preparation.

As a side note, I am using a different version of the Stanford CoreNLP Toolkit (stanford-corenlp-full-2018-10-05), so I had to change this for loop in scripts/process-corenlp-xml-data.py to the following:

      # -LSB- / -RSB- are CoreNLP's tokens for "[" and "]", so these strings match the
      # [XSUM]URL[XSUM], [XSUM]FIRST-SENTENCE[XSUM] and [XSUM]RESTBODY[XSUM] markers
      for doc_sent, doc_sentlemma in zip(doc_sentences, doc_sentlemmas):
        # strip the stray "\ " tokens left over from the marker-renaming script above
        clean_doc_sent = re.sub(r'\\ ', '', doc_sent)
        if "-LSB- XSUM -RSB- URL -LSB- XSUM -RSB-" in clean_doc_sent:
          modeFlag = "URL"
          allcovered += 1
        elif "-LSB- XSUM -RSB- FIRST-SENTENCE -LSB- XSUM -RSB-" in clean_doc_sent:
          modeFlag = "INTRODUCTION"
          allcovered += 1
        elif "-LSB- XSUM -RSB- RESTBODY -LSB- XSUM -RSB-" in clean_doc_sent:
          modeFlag = "RestBody"
          allcovered += 1
        else:
          if modeFlag == "RestBody":
            restbodydata.append(doc_sent)
            restbodylemmadata.append(doc_sentlemma)
          if modeFlag == "INTRODUCTION":
            summarydata.append(doc_sent)

@matt9704

matt9704 commented Jul 8, 2020

(quoting isabelcachola's reformatting script above)

Hi! Thanks for providing the code. I'm wondering which decoder you used for the text files? When I use the same code as you provided, I get the following error:

line 14, in
    text_in = open(join(bbc_dir, fname)).read()

UnicodeDecodeError: 'gbk' codec can't decode byte 0xa6 in position 12896: illegal multibyte sequence
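(The 'gbk' codec in that traceback is just the platform's default encoding; the downloaded files appear to be UTF-8, so a likely fix, assuming UTF-8 input, is to open the source files with an explicit encoding:)

# open the source file with an explicit encoding instead of the platform default
text_in = open(join(bbc_dir, fname), encoding='utf-8').read()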

@fajri91

fajri91 commented Jul 13, 2020

Hi, I can't access the link.
Can you please fix it?

Here is the dataset:

http://kinloch.inf.ed.ac.uk/public/XSUM-EMNLP18-Summary-Data-Original.tar.gz

Please use train, development and test ids from github to split into subsets.
Let me know if you have any questions.

@shashiongithub
Collaborator

Shay suggested trying this: http://bollin.inf.ed.ac.uk/public/direct/XSUM-EMNLP18-Summary-Data-Original.tar.gz

@Ricardokevins

Here is the dataset:

http://kinloch.inf.ed.ac.uk/public/XSUM-EMNLP18-Summary-Data-Original.tar.gz

Please use train, development and test ids from github to split into subsets. Let me know if you have any questions.

Hello, thanks a lot for sharing the data.
After downloading and unzipping it, I get a folder of *.summary files, each containing URL, TITLE, FIRST-SENTENCE and RESTBODY sections, which is different from the expected format. According to the README, what should I do next? Use the Stanford CoreNLP toolkit?
It seems that xsum-preprocessing-convs2s requires two kinds of files (*.document and *.summary), which is different from the provided data.
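(The route intended by the README is the marker renaming and CoreNLP postprocessing described earlier in this thread. If you only need untokenized document/summary pairs, here is a rough sketch that splits each [SN]-delimited *.summary file into a *.document file (RESTBODY) and a *.summary file (FIRST-SENTENCE); the output path is hypothetical and no tokenization is applied.)

import re
from pathlib import Path

src_dir = Path('/path/to/bbc-summary-data')   # *.summary files from the tarball
out_dir = Path('/path/to/plain-text-pairs')   # hypothetical output directory
out_dir.mkdir(parents=True, exist_ok=True)

# each file is divided into sections marked [SN]URL[SN], [SN]TITLE[SN],
# [SN]FIRST-SENTENCE[SN] and [SN]RESTBODY[SN]
section_re = re.compile(r'\[SN\](URL|TITLE|FIRST-SENTENCE|RESTBODY)\[SN\]')

for path in src_dir.glob('*.summary'):
    pieces = section_re.split(path.read_text(encoding='utf-8'))[1:]
    sections = dict(zip(pieces[0::2], (p.strip() for p in pieces[1::2])))
    (out_dir / f'{path.stem}.summary').write_text(sections.get('FIRST-SENTENCE', '') + '\n', encoding='utf-8')
    (out_dir / f'{path.stem}.document').write_text(sections.get('RESTBODY', '') + '\n', encoding='utf-8')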

@sriram487

Hi, I just have one question: what is the total number of instances? I got 237002 after preprocessing the files downloaded from the bollin.inf.ed.ac.uk link above. Is it the same in your case? The Hugging Face website reports around 226000 instances.

@sriram487

(quoting isabelcachola's reformatting instructions above)

Hello, I used process-corenlp-xml-data.py to process the bbcid.data.xml files, but I got an error saying some information is missing: /stanfordOutput/bbcid.data.xml

It would be great if you could help me with this issue. Thanks.

@BaohaoLiao

If anyone still has problems with:

  1. downloading and splitting XSum
  2. evaluating fine-tuned BART on XSum

you might want to check my reproduction repository: https://github.com/BaohaoLiao/NLP-reproduction
