source truncation size in summarization task #3

Open · XinyuHua opened this issue Sep 16, 2019 · 1 comment

@XinyuHua
Hi,

According to the README file, the following truncation setup is recommended for the summarization (cnndm) task:
-src_seq_length_trunc 400

However, on the training data the average/median source length is 925/841 tokens, and more than 90% of the examples are longer than 400 BPE tokens. Would it be problematic to throw away the rest of the text, or is this simply an efficiency consideration? Thanks!
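
For context, here is a minimal sketch of how these length statistics can be checked; `train.txt.src` is a hypothetical path for a BPE-tokenized source file with one example per line:

```python
import statistics

# Hypothetical path: BPE-tokenized training sources, one example per line.
SRC_PATH = "train.txt.src"
TRUNC = 400

lengths = []
with open(SRC_PATH, encoding="utf-8") as f:
    for line in f:
        # Tokens are assumed to be whitespace-separated BPE pieces.
        lengths.append(len(line.split()))

over = sum(1 for n in lengths if n > TRUNC)
print(f"mean length:      {statistics.mean(lengths):.1f}")
print(f"median length:    {statistics.median(lengths)}")
print(f"over {TRUNC} tokens: {over / len(lengths):.1%}")
```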

@zackziegler95
Collaborator

Hi,

This is a preprocessing choice we inherited from previous summarization work with OpenNMT, which found that the first 400 tokens are often enough to compose a good summary. That work was largely conducted with LSTMs, though, so performance might improve measurably if this truncation were increased.
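
For what it's worth, the truncation itself just keeps the leading tokens of each source at preprocessing time. A sketch of the effect (not the library's actual code path):

```python
def truncate_source(tokens, max_len=400):
    """Keep only the first max_len tokens, mirroring -src_seq_length_trunc."""
    return tokens[:max_len]

# Example: a 925-token article is cut down to its first 400 BPE pieces.
article = ["tok"] * 925
assert len(truncate_source(article)) == 400
```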
