source truncation size in summarization task #3

Open · XinyuHua opened this issue Sep 16, 2019 · 1 comment

@XinyuHua
Hi,

According to the README file, the following truncation setup is recommended for the summarization (cnndm) task:
-src_seq_length_trunc 400

However, on the training data the average/median source length is 925/841 tokens, and more than 90% of the examples are longer than 400 BPE tokens. Would it be problematic to throw away the rest of the text, or is this simply an efficiency consideration? Thanks!
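
For context, here is a minimal sketch of how these length statistics can be checked; `train.txt.src` is a hypothetical path for a BPE-tokenized source file with one example per line:

```python
import statistics

# Hypothetical path: BPE-tokenized training sources, one example per line.
SRC_PATH = "train.txt.src"
TRUNC = 400

lengths = []
with open(SRC_PATH, encoding="utf-8") as f:
    for line in f:
        # Tokens are assumed to be whitespace-separated BPE pieces.
        lengths.append(len(line.split()))

over = sum(1 for n in lengths if n > TRUNC)
print(f"mean length:      {statistics.mean(lengths):.1f}")
print(f"median length:    {statistics.median(lengths)}")
print(f"over {TRUNC} tokens: {over / len(lengths):.1%}")
```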

@zackziegler95
Collaborator

Hi,

This is a preprocessing choice we inherited from previous summarization work with OpenNMT, which found that the first 400 tokens are often enough to compose a good summary. That work was largely conducted with LSTMs, though, so performance might improve measurably if this truncation were increased.
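
For what it's worth, the truncation itself just keeps the leading tokens of each source at preprocessing time. A sketch of the effect (not the library's actual code path):

```python
def truncate_source(tokens, max_len=400):
    """Keep only the first max_len tokens, mirroring -src_seq_length_trunc."""
    return tokens[:max_len]

# Example: a 925-token article is cut down to its first 400 BPE pieces.
article = ["tok"] * 925
assert len(truncate_source(article)) == 400
```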
