This repository is the result of our participation in the shared task. We went through the full process of building, analyzing, and improving a neural machine translation system.
Poster: link
The shared task was for the Estonian-English language pair. It involved working with ~19 million sentence pairs.
Shared task main page: link
Shared task on course page: link
The sections below summarize the key milestones we went through.
- Our baseline system was a default OpenNMT-py model with 2 layers of 500 LSTM hidden units for both the encoder and the decoder, using a 30k BPE vocabulary (see the BPE sketch below).
- As a result, we got 21.95 BLEU points on the shared dev set.
More details: report1
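For context, a 30k BPE vocabulary like ours can be learned and applied with the subword-nmt package roughly as follows. This is a minimal sketch, not our actual preprocessing script; the file names are placeholders.

```python
# Minimal sketch of 30k BPE preprocessing with the subword-nmt package.
# File names are placeholders, not our actual data paths.
from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE

# Learn 30 000 merge operations on the tokenized training corpus.
with open("train.et-en.tok", encoding="utf-8") as corpus, \
     open("bpe.codes", "w", encoding="utf-8") as codes_out:
    learn_bpe(corpus, codes_out, num_symbols=30000)

# Apply the learned codes to every sentence before training and translation.
with open("bpe.codes", encoding="utf-8") as codes_in:
    bpe = BPE(codes_in)

print(bpe.process_line("The largest forest owners can ensure a continuous process ."))
```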
- We manually analyzed 60 baseline translations.
- Our main observation was that a lot of sentences lacked fluency. Often, in a long sentence, a part of it lacked fluency or was completely nonsensical.
- Take a look at motivating example 1, produced by the baseline system:
- Human: "The biggest forest owners ( state , local governments and some private forestry companies , owning thousands of hectares of forest areas ) can ensure a continuous process of production throughout the long forest management cycle ."
- Baseline: "The largest forest owners ( the country , local authorities and some of the private sector companies to whom thousands of hectares of forest land ) can be guaranteed throughout the long term management cycle ."
- Example 2:
- Human: "The European Union is set up with the aim of ending the frequent and bloody wars between neighbours , which culminated in the Second World War ."
- Baseline: "The European Union was created to end the frequent bloody wars of the neighbours , which became the Second World War ."
More details: report2
- In order to address the translation issues found in our manual evaluation, we used Amazon's Sockeye library to train a system with context gates and coverage-based attention instead of standard attention (illustrated in the sketch below), again with a 30k BPE vocabulary. For translation we used a beam size of 10.
- The trained system gave us 22.89 BLEU points on the shared dev set, which is a small increase over the baseline.
More details: report3 and report4
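For intuition, coverage attention keeps track of how much attention mass each source position has already received and feeds that back into the attention scores, which discourages the model from translating the same span twice or skipping spans. The NumPy sketch below is purely illustrative; the weight names and shapes are made up for the example and it is not Sockeye's actual implementation.

```python
import numpy as np

def coverage_attention_step(enc_states, dec_state, coverage, W_h, W_s, w_c, v):
    """One decoding step of additive attention with a coverage term.

    enc_states: (T, H) encoder states, dec_state: (H,) decoder state,
    coverage:   (T,)   attention mass accumulated over previous steps.
    """
    # The coverage term enters the score, discouraging the model from
    # re-attending to source positions that are already "covered".
    scores = np.tanh(enc_states @ W_h + dec_state @ W_s + np.outer(coverage, w_c)) @ v
    attn = np.exp(scores - scores.max())
    attn /= attn.sum()                      # softmax over source positions
    context = attn @ enc_states             # weighted source summary for this step
    coverage = coverage + attn              # remember where attention has been
    return context, attn, coverage
```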
- Generally speaking, the majority of sentences are fluent and meaning-preserving. Long sentences in particular are translated much better than by the baseline model.
- Let's look at example 1, where the fluency was greatly improved:
- Final system: "The largest forest owners ( country , local authorities and some private forestry companies with thousands of hectares of forest areas ) can ensure a continuous production process throughout the long forest management cycle ."
- As you can see, this sentence is completely fluent and adequate. It is a great improvement over the baseline model.
- In example 2 the fluency was also greatly improved:
- Final system: "The European Union was set up to put an end to the frequent bloody wars between neighbours , the culmination of which became the Second World War ."
- Here you can see that although the sentence structure is changed, it is completely fluent and adequate.
Do not forget to check out our poster: Poster
We also tried replacing all dots except the last one with a special symbol, and experimented with various beam sizes.
The dot replacement gave 22.29 BLEU points on the shared dev set and actually helped with the translations. Below is a translation produced with this approach, followed by a sketch of the preprocessing step.
- Baseline: This part of our website will find information on how Parliament will organise its work through the various committees .
- Dot-model: This section of our website will find information on how Parliament operates its work through a system of various committees , and the work of the European Parliament is therefore important because decisions on new European laws are jointly made by the Parliament and the Council of Ministers .
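The dot replacement itself is a simple preprocessing step. A minimal sketch of the idea is shown below; the special symbol and function name are placeholders, not our exact implementation.

```python
DOT_SYMBOL = "<dot>"  # placeholder for the special symbol we substituted

def replace_inner_dots(sentence: str) -> str:
    """Replace every full stop except the last one with a special token."""
    tokens = sentence.split()
    dot_positions = [i for i, t in enumerate(tokens) if t == "."]
    last = dot_positions[-1] if dot_positions else -1
    return " ".join(DOT_SYMBOL if (t == "." and i != last) else t
                    for i, t in enumerate(tokens))

print(replace_inner_dots("He came . He saw . He left ."))
# -> He came <dot> He saw <dot> He left .
```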
We also tried a 40k vocabulary for the coverage + context-gates model to tackle words that were not translated correctly (a bigger vocabulary should cover more words); however, the results got worse according to both manual evaluation and BLEU, which dropped to 21.31.
Finally, we tried different beam sizes for translation. A bigger beam size gave slightly better results according to manual evaluation, and BLEU also increased slightly.
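Varying the beam size only changes the decoding call. Assuming Sockeye's translate CLI (flag names as in Sockeye 1.x; the model directory and file paths are placeholders), the comparison could be scripted roughly like this:

```python
import subprocess

# Assumed Sockeye 1.x translate CLI; model and data paths are placeholders.
for beam in (5, 10, 15):
    subprocess.run(
        ["python", "-m", "sockeye.translate",
         "--models", "model",
         "--beam-size", str(beam),
         "--input", "dev.et.bpe",
         "--output", f"dev.en.beam{beam}"],
        check=True,
    )
```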
Lastly, we wanted to try hyperparameter tuning, but the model did not converge. There were too many hyperparameters to tune to really find out which values work well for which parameters. Furthermore, we wanted to try POS tags and ensembling multiple models.
- On the final test set we got a BLEU score of 25.66. The translations were mostly quite fluent and adequate; nevertheless, sometimes the meaning got lost, some words were repeated, or there were mistranslations. Example: "China has just refused the sale of human organs and restricting the possibility of obtaining sirens from foreigners." (ID: 250). Instead of "sirens" it should say "transplants"; otherwise it is a great translation.
- We had issues with models taking rather long to train, especially with OpenNMT; Sockeye was much faster. Queue times were sometimes really long, especially towards the end of the semester.
- We learnt that training a great model requires a lot of analysing, experimenting, and evaluating.
Project board: link