This repository contains the code used for creating the dataset in the paper:
SumREN: Summarizing Reported Speech about Events in News. Revanth Gangi Reddy, Heba Elfardy, Hou Pong Chan, Kevin Small, and Heng Ji. AAAI 2023.
Please follow the steps below for installation:
conda create --name sumren python=3.8.15
conda activate sumren
pip install -r requirements.txt
The gold training data was scraped from the Wayback Machine. You can create the training data by running the following script from the parent directory of the data folder:
python expand_train.py
This generates expanded_train.json, which contains the gold training data with the news article text and gold summaries.
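As a quick sanity check after the script finishes, the sketch below loads the generated file and reports its top-level shape. It assumes that expanded_train.json is a single JSON document (not JSON Lines) and makes no assumption about field names, since the schema is not described here. The helper name summarize_json is hypothetical.

```python
import json
from pathlib import Path


def summarize_json(path):
    """Return (top-level type name, number of entries) for a JSON file.

    Assumption: the file holds one JSON document, not JSON Lines.
    """
    with Path(path).open(encoding="utf-8") as f:
        data = json.load(f)
    # Lists and dicts have a meaningful length; anything else counts as one entry.
    size = len(data) if isinstance(data, (list, dict)) else 1
    return type(data).__name__, size


if __name__ == "__main__":
    target = Path("expanded_train.json")
    if target.exists():
        kind, size = summarize_json(target)
        print(f"{target}: top-level {kind} with {size} entries")
    else:
        print(f"{target} not found; run expand_train.py first")
```

An empty or unexpectedly small result suggests that the scrape did not complete and the script should be run again.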
The evaluation data, which comprises the dev and test sets, contains articles from 2017 to 2021 obtained from CC-News. Getting the news corpus for the dev and test sets involves two steps: first downloading the CC-News dump for these years, and then extracting the news articles for the URLs in the eval data.
Note that the CC-News corpus requires considerable storage space (up to 25 TB), so we suggest that you run the scripts below on a cloud provider.
We also recommend downloading each year's data into a separate directory or volume, since it might not be possible to create a single storage volume of up to 25 TB.
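Before starting a download, it may help to confirm how much free space each target volume has. The following is a minimal sketch using the standard library. The per-year size of CC-News is not stated here, so the required_tb value is something you would supply yourself, and the helper names are hypothetical.

```python
import shutil

# Decimal terabyte. Assumption: the 25 TB figure above is counted this way.
TB = 10**12


def free_terabytes(path):
    """Free space, in terabytes, on the volume that contains `path`."""
    return shutil.disk_usage(path).free / TB


def has_room(path, required_tb):
    """True if the volume containing `path` has at least `required_tb` TB free."""
    return free_terabytes(path) >= required_tb


if __name__ == "__main__":
    # "." stands in for one of your per-year output directories.
    print(f"Free space here: {free_terabytes('.'):.2f} TB")
```

Running this once per output directory shows whether the years need to be spread across more volumes.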
Installing and configuring AWS CLI
Before starting, you will need to install the AWS CLI to be able to download CC-News from S3.
To do so, please follow the instructions here: https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html
Using the S3 bucket to download CC-News requires your AWS CLI to be authenticated.
Please follow: https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-quickstart.html#cli-configure-quickstart-config
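A small preflight check can catch a missing or unauthenticated CLI before a long download begins. The sketch below looks for the aws executable on the PATH and then runs aws sts get-caller-identity, which succeeds only when valid credentials are configured. This check is a suggestion and not part of the repository's scripts; the function names are hypothetical.

```python
import shutil
import subprocess


def aws_cli_path():
    """Path of the `aws` executable, or None if it is not on the PATH."""
    return shutil.which("aws")


def aws_credentials_ok(timeout=30):
    """True if `aws sts get-caller-identity` exits successfully."""
    if aws_cli_path() is None:
        return False
    try:
        result = subprocess.run(
            ["aws", "sts", "get-caller-identity"],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


if __name__ == "__main__":
    print("AWS CLI found at:", aws_cli_path())
```

Calling aws_credentials_ok() contacts AWS, so it needs network access as well as configured credentials.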
Downloading CC-News from S3
Run the script download_cc.sh with the corresponding output directory for each year:
bash download_cc.sh 2017 <output_dir_for_2017>
bash download_cc.sh 2018 <output_dir_for_2018>
bash download_cc.sh 2019 <output_dir_for_2019>
bash download_cc.sh 2020 <output_dir_for_2020>
bash download_cc.sh 2021 <output_dir_for_2021>
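The five invocations above can also be driven from one loop. This is a sketch only: it assumes download_cc.sh is in the current directory and takes the year followed by the output directory, as shown above. The /mnt/cc_news_<year> paths are hypothetical placeholders for your own per-year volumes, and dry_run defaults to True so that nothing is downloaded until you opt in.

```python
import subprocess

YEARS = (2017, 2018, 2019, 2020, 2021)


def download_commands(output_dirs):
    """Build one download_cc.sh command per year.

    `output_dirs` maps each year to its output directory.
    """
    return [
        ["bash", "download_cc.sh", str(year), str(output_dirs[year])]
        for year in YEARS
    ]


def run_all(commands, dry_run=True):
    """Print each command and, unless dry_run is set, execute it."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops at the first failing download.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical paths: replace with your own per-year directories.
    dirs = {year: f"/mnt/cc_news_{year}" for year in YEARS}
    run_all(download_commands(dirs), dry_run=True)
```

Running the years sequentially keeps the logs easy to read; given the data volume, you may prefer to start each year in its own session instead.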
Mapping URLs in SumREN evaluation to CC-News
To extract the news articles from CC-News that correspond to the URLs in SumREN, run the script below for each year.
(Note: Please make sure to run the scripts below from the parent directory of the data folder.)
bash map_cc.sh <dir_for_cc_download_2017> <out_dir>
bash map_cc.sh <dir_for_cc_download_2018> <out_dir>
bash map_cc.sh <dir_for_cc_download_2019> <out_dir>
bash map_cc.sh <dir_for_cc_download_2020> <out_dir>
bash map_cc.sh <dir_for_cc_download_2021> <out_dir>
python expand_eval.py
This script generates expanded_dev.json and expanded_test.json, which comprise the dev and test sets, respectively, with the news article text.
If expand_eval.py reports that some files are missing for a particular year, the CC-News download for that year is incomplete.
To resolve this, please re-run download_cc.sh and map_cc.sh for the year with the missing files.
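The recovery procedure can be sketched as a helper that builds the commands to re-run for the affected years only. The exact format in which expand_eval.py reports missing files is not described here, so this sketch expects you to pass in the affected years by hand. Re-running expand_eval.py at the end is an assumption about the natural next step, and the function name and paths are hypothetical.

```python
def rerun_commands(missing_years, cc_dirs, out_dir):
    """Commands to repeat the download and mapping for the given years.

    `cc_dirs` maps each year to its CC-News download directory;
    `out_dir` is the output directory passed to map_cc.sh.
    """
    years = sorted(set(missing_years))
    if not years:
        return []
    commands = []
    for year in years:
        cc_dir = str(cc_dirs[year])
        commands.append(["bash", "download_cc.sh", str(year), cc_dir])
        commands.append(["bash", "map_cc.sh", cc_dir, str(out_dir)])
    # Assumption: regenerate the dev and test files once the gaps are filled.
    commands.append(["python", "expand_eval.py"])
    return commands


if __name__ == "__main__":
    # Hypothetical example: only 2019 was reported as incomplete.
    for cmd in rerun_commands([2019], {2019: "/mnt/cc_news_2019"}, "/mnt/mapped"):
        print(" ".join(cmd))
```

Printing the commands first, and running them only after checking the paths, avoids repeating a multi-terabyte download by mistake.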
The code is licensed under the license here and the data is licensed under the license here.