Morphologically Motivated Input Variations and Data Augmentation in Turkish-English Neural Machine Translation

This repository contains scripts to segment Turkish text into morphologically motivated subwords and training scripts for Turkish-English neural machine translation with the Marian toolkit.

For the installation of the Marian toolkit, visit their website. For evaluation and pre-processing (truecasing, cleaning, etc.) Moses scripts have been used.

See the article Morphologically Motivated Input Variations and Data Augmentation in Turkish-English Neural Machine Translation for details.

Morphologically motivated input variations

Morphemes
Allomorphs
Morphological tags
Multi-source

NMT scripts:

Training set is a combination of the SETimes corpus, and augmented monolingual data (WMT News Crawl). The WMT 16 test set has been used as validation, and WMT17 and WMT18 test sets have been used for testing.

Citation:

If you use the segmentation or training scripts, please cite the paper:

@article{10.1145/3571073,
author = {Yirmibe\c{s}o\u{g}lu, Zeynep and G\"{u}ng\"{o}r, Tunga},
title = {Morphologically Motivated Input Variations and Data Augmentation in Turkish-English Neural Machine Translation},
year = {2022},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
issn = {2375-4699},
url = {https://doi.org/10.1145/3571073},
doi = {10.1145/3571073},
note = {Just Accepted},
journal = {ACM Trans. Asian Low-Resour. Lang. Inf. Process.},
month = {nov},
keywords = {attention, low-resource, neural machine translation, transformer, data augmentation, encoder-decoder, morphology, word segmentation}
}

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
segmentation		segmentation
train		train
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Morphologically Motivated Input Variations and Data Augmentation in Turkish-English Neural Machine Translation