Code for the paper "Finding Memo: Extractive Memorization in Constrained Sequence Generation Tasks" by Vikas Raunak and Arul Menezes.
This repo provides:
- Data, models, and code to replicate the results in the paper
- Scripts to train and run the experiments on your own dataset
- Pointers for modifying the underlying algorithms
If you find our code or paper useful, please cite the paper:
```
@inproceedings{raunak-etal-finding-memo,
    title = "Finding Memo: Extractive Memorization in Constrained Sequence Generation Tasks",
    author = "Raunak, Vikas and Menezes, Arul",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2022",
    publisher = "Association for Computational Linguistics",
    year = "2022",
}
```
The code is tested in a conda environment with `python=3.8`.

```bash
pip install -r requirements.txt
wget https://www.dropbox.com/s/dxnlwziqzr9iz7i/data.zip
unzip data.zip
bash run_experiment.sh
```
The directory and file paths can be adjusted for your own setup.

```bash
bash train_spm.sh
bash train.sh
```
The distance threshold can be set in `src/compute_dist.py`.
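As a sketch of the idea behind the distance check (assuming the threshold compares a normalized edit distance between the translations of an original and a perturbed source; the helper names below are illustrative, not the repo's actual API):

```python
def edit_distance(a, b):
    """Classic single-row dynamic-programming Levenshtein distance
    over token (or character) sequences."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # deletion
                        dp[j - 1] + 1,                      # insertion
                        prev + (a[i - 1] != b[j - 1]))      # substitution
            prev = cur
    return dp[n]

def is_memorized(orig_translation, perturbed_translation, threshold=0.1):
    """Flag extractive memorization when the two translations are nearly
    identical, i.e. their normalized edit distance falls below the threshold.
    The default threshold value here is a placeholder."""
    a, b = orig_translation.split(), perturbed_translation.split()
    denom = max(len(a), len(b)) or 1
    return edit_distance(a, b) / denom < threshold
```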
The CJKT flag can be set to `true` in `src/parse_memorized.py` when working with CJKT languages on the source side.
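The likely effect of such a flag is a switch in tokenization granularity, since whitespace does not delimit words in CJKT text. A minimal sketch (the function name and behavior are an assumption, not the repo's implementation):

```python
def tokenize_source(text, cjkt=False):
    """Split on whitespace by default; with the CJKT flag enabled,
    fall back to character-level tokens (whitespace is dropped)."""
    if cjkt:
        return [ch for ch in text if not ch.isspace()]
    return text.split()
```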
Pipelines for different masked language models (BERT, RoBERTa, Multilingual BERT) are defined in `src/substitutions.py` and consumed in `src/get_substitutions.py`.
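The consuming side can be pictured as a masking loop that perturbs one source position at a time. In this sketch the MLM is abstracted as a plain callable so the snippet stays self-contained; the function names and the `top_k` parameter are illustrative, not the repo's API:

```python
MASK = "[MASK]"

def get_substitutions(tokens, fill_mask, top_k=3):
    """For each position, mask the token and ask the MLM for replacement
    candidates, yielding perturbed source sentences.
    `fill_mask` maps a masked token list to a ranked list of fillers."""
    perturbed = []
    for i in range(len(tokens)):
        masked = tokens[:i] + [MASK] + tokens[i + 1:]
        for cand in fill_mask(masked)[:top_k]:
            if cand != tokens[i]:  # skip the identity substitution
                perturbed.append(tokens[:i] + [cand] + tokens[i + 1:])
    return perturbed
```

In the repo the `fill_mask` role would be played by one of the MLM pipelines defined in `src/substitutions.py`.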
The finetuning data can be obtained by running `scripts/augment.sh`.
The recovery symbol is set in `scripts/augment.sh` via the `symbol` variable.
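A rough sketch of what such augmentation could look like, assuming the recovery symbol is prepended to the memorized source sentences when building finetuning pairs (the symbol value, names, and pairing direction here are all illustrative assumptions):

```python
SYMBOL = "<rec>"  # placeholder; the real value is the `symbol` variable in scripts/augment.sh

def augment(memorized_pairs, corrected_targets, symbol=SYMBOL):
    """Build finetuning pairs: prefix each memorized source with the
    recovery symbol and pair it with a corrected target translation."""
    return [(f"{symbol} {src}", tgt)
            for (src, _), tgt in zip(memorized_pairs, corrected_targets)]
```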
A couple of examples on Microsoft Bing Translator and Google Translate are presented at this link.
Please leave issues for any questions about the paper or the code.
The above code release README format is borrowed from https://github.com/Alrope123/rethinking-demonstrations.