This repo contains the source code and other details for an attention-based neural machine translation model implemented in PyTorch. The model translates Korean into English.
- Weekly Report : check here :)
Weekly reports from February 2020 onward can be found there.
- BLEU (Bilingual Evaluation Understudy) score
BLEU | BLEU1 | BLEU2 | BLEU3 | BLEU4 |
---|---|---|---|---|
33.55 | 64.6 | 40.0 | 27.5 | 19.4 |
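
As a rough check on how the columns relate, the sketch below assumes that BLEU1 to BLEU4 are the modified n-gram precisions printed by `tools/multi-bleu.perl`, and that the overall score is their geometric mean multiplied by a brevity penalty (the standard BLEU definition). The function names are illustrative and not part of this repo.

```python
import math

def geometric_mean(precisions):
    """Geometric mean of the n-gram precisions (all values must be > 0)."""
    return math.exp(sum(math.log(p) for p in precisions) / len(precisions))

def implied_brevity_penalty(bleu, precisions):
    """Brevity penalty implied by BLEU = BP * geometric_mean(precisions)."""
    return bleu / geometric_mean(precisions)

# Values copied from the table above (percentages).
bleu = 33.55
precisions = [64.6, 40.0, 27.5, 19.4]  # assumed to be 1- to 4-gram precisions

geo = geometric_mean(precisions)
bp = implied_brevity_penalty(bleu, precisions)
print(f"geometric mean of BLEU1..4: {geo:.2f}")
print(f"implied brevity penalty:    {bp:.3f}")
```

Under these assumptions the geometric mean comes out slightly above the reported 33.55, so the implied brevity penalty is a little below 1. That would mean the predictions are, in total, somewhat shorter than the references.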
- Translation Sentence
차를 마시러 공원에 가던 차 안에서 나는 그녀에게 차였다.
> I was dumped by her in a car on the way to the park to drink tea .
사과의 의미로 사과를 먹으며 사과했다.
> I apologize while eating an apple for the meaning of an apology .
내가 그린 기린 그림은 긴 기린 그림이냐, 그냥 그린 기린 그림이냐?
> Is the giraffe I drew a long giraffe picture or just a giraffe picture ?
- Preprocess
  - Delete sentences with a length of 149 or more (Korean) and 387 or more (English), based on space.
  - Delete sentences containing some special characters.
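
The two filtering rules above can be sketched as follows. This is a minimal illustration, not the code in `preprocess.py`. It assumes that length means the number of characters including spaces, that a pair is dropped when either side reaches its limit, and that `SPECIAL_CHARS` (a hypothetical set) stands in for the special characters the text leaves unspecified.

```python
MAX_KO_LEN = 149  # Korean sentences of this length or more are dropped
MAX_EN_LEN = 387  # English sentences of this length or more are dropped

# Hypothetical character set; the README does not list the actual characters.
SPECIAL_CHARS = set("@#^*{}<>|\\")

def keep_pair(ko, en):
    """Return True if a (Korean, English) sentence pair passes both filters."""
    # Assumption: length is the character count, spaces included.
    if len(ko) >= MAX_KO_LEN or len(en) >= MAX_EN_LEN:
        return False
    # Drop the pair if either side contains one of the special characters.
    if any(ch in SPECIAL_CHARS for ch in ko + en):
        return False
    return True

def filter_pairs(pairs):
    """Keep only the sentence pairs that pass keep_pair."""
    return [(ko, en) for ko, en in pairs if keep_pair(ko, en)]

pairs = [
    ("사과의 의미로 사과를 먹으며 사과했다.", "I apologized while eating an apple."),
    ("가" * 149, "This pair is too long on the Korean side."),
    ("안녕하세요 <b>", "Hello"),
]
kept = filter_pairs(pairs)
print(f"kept {len(kept)} of {len(pairs)} pairs")
```

If the length is instead meant to be counted in space-separated tokens, replacing `len(ko)` with `len(ko.split())` (and likewise for `en`) would be the only change needed.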
- Configuration
Dataset | Sentences | Download |
---|---|---|
Written + Spoken | 920,000 | AI-Hub (Korean-English corpus AI data), Tatoeba (Korean - English) |
!python preprocess.py
The source text file (`src`) and target text file (`tgt`) are tokenized with Mecab + SentencePiece.
!python train.py
To continue training the model, add `--train_from (model path)/model.pt` to the end of the command.
!python translate.py -model data/model/model.pt -src data/src-test.txt -tgt data/tgt-test.txt -replace_unk -verbose -gpu 0
!perl tools/multi-bleu.perl data/tgt-test.txt < data/pred.txt
!python gui.py
You have to change two values under `translate_opts` in `opts.py`: `--src` from "data/src-test.txt" to "data/demo/KoreanTokenInput.txt", and `--output` from "data/pred.txt" to "data/demo/EnglishTokenOutput.txt".
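
For orientation, the change might look like the sketch below. It assumes that `translate_opts` registers its options on an argparse-style parser with a `default` value for each. The real `opts.py` may be organized differently, so treat the function body as hypothetical; only the option names and paths are taken from the text above.

```python
import argparse

def translate_opts(parser):
    """Hypothetical stand-in for translate_opts in opts.py."""
    group = parser.add_argument_group("Data")
    # Previously: default="data/src-test.txt"
    group.add_argument(
        "--src",
        default="data/demo/KoreanTokenInput.txt",
        help="Tokenized Korean input file to translate",
    )
    # Previously: default="data/pred.txt"
    group.add_argument(
        "--output",
        default="data/demo/EnglishTokenOutput.txt",
        help="File where the tokenized English output is written",
    )

parser = argparse.ArgumentParser(description="translate options (sketch)")
translate_opts(parser)

# With no command-line arguments, the new defaults are used.
opt = parser.parse_args([])
print(opt.src)
print(opt.output)
```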