This is a submission to Babel Hack: https://babel.tilda.ws
We use language models as initialization for the Transformer network to improve MT results on limited parallel data.
Our model achieves up to a +2 BLEU improvement on the 20k dataset.
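As a minimal sketch of the initialization idea (PyTorch, toy dimensions; the actual architecture and pretrained weights are described in the paper): a small decoder-style language model is trained first, and its embedding and self-attention weights are copied into the MT model in place of random initialization. All names and sizes below are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Toy dimensions for illustration only
VOCAB, D_MODEL, N_HEAD, N_LAYERS = 1000, 64, 4, 2

class TinyLM(nn.Module):
    """Stand-in for a language model pretrained on monolingual data."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, N_HEAD, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, N_LAYERS)
        self.head = nn.Linear(D_MODEL, VOCAB)

    def forward(self, x):
        return self.head(self.blocks(self.embed(x)))

class MTEncoder(nn.Module):
    """MT encoder whose weights start from the pretrained LM, not random init."""
    def __init__(self, lm: TinyLM):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, N_HEAD, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, N_LAYERS)
        # Copy the LM's embedding and self-attention weights as initialization
        self.embed.load_state_dict(lm.embed.state_dict())
        self.blocks.load_state_dict(lm.blocks.state_dict())

    def forward(self, x):
        return self.blocks(self.embed(x))

lm = TinyLM()          # pretend this was pretrained on monolingual text
enc = MTEncoder(lm)    # MT training then fine-tunes these weights
tokens = torch.randint(0, VOCAB, (2, 8))
out = enc(tokens)
print(out.shape)  # torch.Size([2, 8, 64])
```

The encoder is then fine-tuned on the limited parallel data, so the scarce supervised signal only has to adapt already-useful representations rather than learn them from scratch.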
[TODO]
- Add links to our paper
- Add links to connected research / papers