Code to obtain the training data for the ACL 2018 paper "Neural Document Summarization by Jointly Learning to Score and Select Sentences"

magic282/cnndm_acl18

Data processing for NeuSum

This repo contains the code that generates the training data (CNN / Daily Mail) needed by NeuSum.

  1. Preprocess the CNN/DM dataset using abisee's scripts: https://github.com/abisee/cnn-dailymail

  2. Convert their output to the format shown in the sample_data folder. The file format is:

    1. File train.txt.src holds the input documents, one document per line. Each line contains the document's tokenized sentences, delimited by ##SENT##.
    2. File train.txt.tgt holds the summaries, one per line, aligned with train.txt.src. Each line contains the tokenized summary sentences of the corresponding document, delimited by ##SENT##.
  3. Use find_oracle.py to search for the best sentences to extract. The arguments of its main function are: document_file, summary_file and output_path.

  4. Next, build the ROUGE score gain file using get_mmr_regression_gain.py. Usage is shown at the script's entry point.
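
To make the file format in step 2 concrete, here is a minimal sketch of reading and writing ##SENT##-delimited lines. The helper names (split_sentences, join_sentences, read_parallel) are hypothetical and not part of this repo, and whether the delimiter is surrounded by spaces is an assumption to check against the sample_data folder; the reader below tolerates both.

```python
SENT_DELIM = "##SENT##"


def split_sentences(line):
    """Split one line of a .src or .tgt file into its tokenized sentences."""
    # strip() tolerates optional spaces around the delimiter (an assumption).
    return [s.strip() for s in line.rstrip("\n").split(SENT_DELIM) if s.strip()]


def join_sentences(sentences):
    """Join tokenized sentences into one delimited line (no spaces around the delimiter)."""
    return SENT_DELIM.join(s.strip() for s in sentences)


def read_parallel(src_path, tgt_path):
    """Yield (document_sentences, summary_sentences) pairs, one pair per line."""
    with open(src_path, encoding="utf-8") as src, open(tgt_path, encoding="utf-8") as tgt:
        for src_line, tgt_line in zip(src, tgt):
            yield split_sentences(src_line), split_sentences(tgt_line)
```

Line i of train.txt.src and line i of train.txt.tgt describe the same document, so the two files have to be read in lockstep, as read_parallel does.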

Note

The oracle search is a brute-force search, which can be slow in some cases. Running it in parallel is therefore recommended (this is what I did in my experiments).
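
For intuition about why a brute-force search gets slow, here is a rough sketch of an exhaustive oracle search. It is not the code in find_oracle.py: the unigram-overlap F1 below is a simple stand-in for ROUGE, and the function names and the max_sentences cap are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def unigram_f1(candidate_tokens, reference_tokens):
    """Stand-in score: unigram-overlap F1 (not a real ROUGE implementation)."""
    if not candidate_tokens or not reference_tokens:
        return 0.0
    overlap = sum((Counter(candidate_tokens) & Counter(reference_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate_tokens)
    recall = overlap / len(reference_tokens)
    return 2 * precision * recall / (precision + recall)


def brute_force_oracle(doc_sentences, summary_sentences, max_sentences=3):
    """Score every subset of up to max_sentences sentences; return (best indices, best score)."""
    reference = " ".join(summary_sentences).split()
    best_ids, best_score = (), 0.0
    for k in range(1, max_sentences + 1):
        # The number of subsets of size k grows combinatorially with document length.
        for ids in combinations(range(len(doc_sentences)), k):
            candidate = " ".join(doc_sentences[i] for i in ids).split()
            score = unigram_f1(candidate, reference)
            if score > best_score:
                best_ids, best_score = ids, score
    return best_ids, best_score
```

With n sentences and a cap of k, the loop scores C(n,1) + ... + C(n,k) subsets per document, which is why long documents dominate the running time.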

I recently modified find_oracle.py slightly to use multiprocessing, which makes it easier to run in parallel. Please check out find_oracle_para.py.
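
Since each document is processed independently, the work splits naturally across processes. The sketch below shows the general pattern with multiprocessing.Pool; it is not the code in find_oracle_para.py. The toy worker (best_single_sentence) and the wrapper (process_in_parallel) are hypothetical names, and the worker only picks one sentence by token overlap to keep the example short.

```python
from multiprocessing import Pool


def best_single_sentence(pair):
    """Toy per-document worker: index of the sentence sharing the most tokens with the summary."""
    doc_sentences, summary_sentences = pair
    reference = set(" ".join(summary_sentences).split())
    overlaps = [len(reference & set(s.split())) for s in doc_sentences]
    # Returns -1 for an empty document; ties go to the earliest sentence.
    return max(range(len(doc_sentences)), key=overlaps.__getitem__) if doc_sentences else -1


def process_in_parallel(pairs, worker=best_single_sentence, processes=4, chunksize=64):
    """Apply worker to every (document, summary) pair across processes, preserving input order."""
    with Pool(processes=processes) as pool:
        # pool.map returns results in the same order as the input pairs.
        return pool.map(worker, pairs, chunksize=chunksize)
```

Call process_in_parallel from under an `if __name__ == "__main__":` guard, which multiprocessing requires on platforms that start workers by spawning a fresh interpreter. A larger chunksize reduces inter-process overhead when individual documents are cheap to process.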
