-
Hello everyone, I did some work to reproduce BERT pre-training from scratch; my code is mainly based on transformers, and I got the benchmark results below. To train faster, I chose max_seq_len=128 for all 1 million steps. Is that the main cause of the drop in the benchmark? Did Gluon train with max_seq_len=512 for all steps? Looking forward to your reply, thanks!
-
Hi @Jetcodery. Yes, I think the reduced sequence length would decrease performance. In the original experiment for reproducing BERT we trained with sequence length 512. Nowadays, many people train BERT in two stages: a first stage at length 128, followed by a second stage at length 512. The two-stage training appears to close the performance gap.
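
To make the two-stage idea concrete, here is a minimal sketch of such a schedule using the Hugging Face transformers and datasets libraries. Everything beyond the 128 → 512 length switch is an assumption for illustration: the wikitext-2 stand-in corpus, the batch size, learning rate, and the 900k/100k step split are placeholders, not the exact recipe used in the Gluon reproduction.

```python
# Minimal two-stage BERT pre-training sketch (assumed setup, not the exact
# recipe used for the Gluon numbers). Corpus, hyperparameters, and step
# split are illustrative placeholders.
from datasets import load_dataset
from transformers import (
    BertConfig,
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM(BertConfig())   # randomly initialized, trained from scratch
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
raw = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")  # stand-in corpus


def run_stage(max_seq_len, max_steps, output_dir):
    # Re-tokenize the corpus at the sequence length used for this stage.
    dataset = raw.map(
        lambda batch: tokenizer(
            batch["text"], truncation=True, max_length=max_seq_len, padding="max_length"
        ),
        batched=True,
        remove_columns=["text"],
    )
    args = TrainingArguments(
        output_dir=output_dir,
        max_steps=max_steps,
        per_device_train_batch_size=32,
        learning_rate=1e-4,
        warmup_steps=10_000,
    )
    # The same `model` object is passed to both stages, so stage 2 continues
    # training the weights produced by stage 1.
    Trainer(model=model, args=args, data_collator=collator, train_dataset=dataset).train()


# Stage 1: most of the steps at length 128, where attention is cheap.
run_stage(max_seq_len=128, max_steps=900_000, output_dir="stage1_len128")
# Stage 2: a shorter run at 512 so the model adapts to the full position range.
run_stage(max_seq_len=512, max_steps=100_000, output_dir="stage2_len512")
```

The key point is that both stages update the same model: the shorter second stage at length 512 lets the model learn the position embeddings beyond index 128 before it is fine-tuned on downstream tasks.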