This is an implementation of the Transformer model described in Vaswani et al., "Attention Is All You Need".
- Quick Start: prerequisites and usage on the provided machine translation datasets.
- Run Your Customized Experiments: hands-on tutorial of data preparation, configuration, and model training/testing.
Run the following command to install the necessary packages for the example:

```bash
pip install -r requirements.txt
```
Two example datasets are provided:
- IWSLT'15 EN-VI for English-Vietnamese translation
- WMT'14 EN-DE for English-German translation
Download and pre-process the IWSLT'15 EN-VI data with the following commands:

```bash
sh scripts/iwslt15_en_vi.sh
sh preprocess_data.sh spm en vi
```

By default, the downloaded dataset is in `./data/en_vi`.

As with the official implementation, `spm` (sentencepiece) encoding is used to encode the raw text during data pre-processing. The encoded data is by default in `./temp/run_en_vi_spm`.
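If you want to sanity-check the learned subword model, the sentencepiece Python package used by the preprocessing script can be loaded directly. The snippet below is a minimal sketch: the exact `.model` filename under `./temp/run_en_vi_spm` is an assumption, so point it at whatever file `preprocess_data.sh` actually produced.

```python
# Minimal sketch for inspecting the sentencepiece model built by preprocessing.
# NOTE: the model path is an assumption; use the .model file that
# preprocess_data.sh actually wrote under ./temp/run_en_vi_spm.
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("temp/run_en_vi_spm/spm.model")  # assumed filename

print(sp.GetPieceSize())                  # subword vocabulary size
print(sp.EncodeAsPieces("How are you?"))  # raw text -> subword pieces
```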
For the WMT'14 EN-DE data, download and pre-process with:

```bash
sh scripts/wmt14_en_de.sh
sh preprocess_data.sh bpe en de
```

Note that this is a large dataset and preprocessing requires large amounts of memory.

By default, the downloaded dataset is in `./data/en_de`. Note that for this dataset, `bpe` (byte pair encoding) is used instead. The encoded data is by default in `./temp/run_en_de_bpe`.
Train the model with the command:

```bash
python transformer_main.py \
    --run-mode=train_and_evaluate \
    --config-model=config_model \
    --config-data=config_iwslt15
```
- Specify `--output-dir` to dump model checkpoints and training logs to a desired directory. By default it is set to `./outputs`.
- Specifying `--output-dir` will also restore the latest model checkpoint under that directory, if any checkpoint exists.
- Specify `--config-data=config_wmt14` to train on the WMT'14 data.
- You can also specify `--load-checkpoint` to load a previously trained checkpoint from `output_dir` (a quick way to inspect a checkpoint file is sketched below).
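If you just want to peek inside a saved checkpoint before passing it to `--load-checkpoint`, plain PyTorch is enough. This is a hedged sketch: the filename is copied from the sample training log further down, and the exact contents depend on how `transformer_main.py` saves its state.

```python
# Sketch only: open a checkpoint file from --output-dir and list its contents.
# The filename is an example; use whichever .pt file exists under ./outputs.
import torch

ckpt = torch.load("outputs/1565921750.7879117.pt", map_location="cpu")
if isinstance(ckpt, dict):
    print(list(ckpt.keys()))  # top-level entries saved by the training script
else:
    print(type(ckpt))
```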
To evaluate a model checkpoint without training, first load the checkpoint and generate samples:

```bash
python transformer_main.py \
    --run-mode=test \
    --config-data=config_iwslt15 \
    --output-dir=./outputs
```
The latest checkpoint in `./outputs` is used. Generated samples are in the file `./outputs/test.output.hyp`, and reference sentences are in the file `./outputs/test.output.ref`. The script shows the cased BLEU score as provided by the `tx.evals.file_bleu` function.

Alternatively, you can also compute the BLEU score on the raw sentences using the `bleu_main.py` script:

```bash
python bleu_main.py --reference=data/en_vi/test.vi --translation=temp/test.output.hyp
```
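If you prefer an independent cross-check, a rough BLEU score can also be computed with the third-party `sacrebleu` package (not part of this example; `pip install sacrebleu`). The file paths below mirror the command above, and the score may differ slightly from `bleu_main.py` because the tokenization differs.

```python
# Rough BLEU cross-check with sacrebleu; not part of this example.
import sacrebleu

with open("data/en_vi/test.vi", encoding="utf-8") as f:
    refs = [line.strip() for line in f]
with open("temp/test.output.hyp", encoding="utf-8") as f:
    hyps = [line.strip() for line in f]

# corpus_bleu takes a list of hypotheses and a list of reference streams.
print(sacrebleu.corpus_bleu(hyps, [refs]).score)
```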
- On IWSLT'15, the implementation achieves around `BLEU_cased=29.05` and `BLEU_uncased=29.94` (reported by `bleu_main.py`), which are comparable to the `base_single_gpu` results of the official implementation (`28.12` and `28.97`, respectively, as reported here).
- On WMT'14, the implementation achieves around `BLEU_cased=25.02` following the setting in `config_wmt14.py` (setting: `base_single_gpu`, `batch_size=3072`). It takes more than 18 hours to finish training 250k steps. You can modify `max_train_epoch` in `config_wmt14.py` to adjust the training time.
Example training log:

```
INFO 2019-08-15 22:04:15 : Begin running with train_and_evaluate mode
WARNING 2019-08-15 22:04:15 : Specified checkpoint directory 'outputs' exists, previous checkpoints might be erased
INFO 2019-08-15 22:04:15 : Training started
INFO 2019-08-15 22:04:15 : Model architecture:
ModelWrapper(
  (model): Transformer(
    ...
  )
)
2019-08-15 22:05:51 : Epoch 1 @ 500it (13.0%, 172.63ex/s), lr = 2.184e-05, loss = 7.497
2019-08-15 22:07:27 : Epoch 1 @ 1000it (26.0%, 172.91ex/s), lr = 4.367e-05, loss = 6.784
2019-08-15 22:09:03 : Epoch 1 @ 1500it (39.0%, 172.52ex/s), lr = 6.551e-05, loss = 6.365
2019-08-15 22:10:40 : Epoch 1 @ 2000it (51.9%, 172.03ex/s), lr = 8.735e-05, loss = 5.847
2019-08-15 22:15:50 : Epoch 1, valid BLEU = 2.075
INFO 2019-08-15 22:15:54 : Current checkpoint saved to outputs/1565921750.7879117.pt
```
Using an NVIDIA GTX 1080Ti, the model usually converges within 5 hours (~15 epochs) on IWSLT'15.
Here is a hands-on tutorial on running Transformer with your own customized dataset.
Create a data directory and put the raw data in it. To be compatible with the data preprocessing in the next step, you may follow the convention below:
- The data directory should be named `data/${src}_${tgt}/`. Taking the data downloaded with `scripts/iwslt15_en_vi.sh` as an example, the data directory is `data/en_vi`.
- The raw data should consist of 6 files containing the source and target sentences of the training/dev/test sets, respectively. In the `iwslt15_en_vi` example, `data/en_vi/train.en` contains the source sentences of the training set, with one sentence per line. The other files are `train.vi`, `dev.en`, `dev.vi`, `test.en`, and `test.vi` (see the sanity-check sketch after this list).
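As a quick sanity check of the naming convention, the hypothetical helper below just verifies that the six expected files exist; it is not part of the repository.

```python
# Hypothetical helper: check that data/${src}_${tgt}/ contains the six raw files.
import os

def check_raw_data(src: str, tgt: str, data_dir: str = None) -> None:
    data_dir = data_dir or f"data/{src}_{tgt}"
    for split in ("train", "dev", "test"):
        for lang in (src, tgt):
            path = os.path.join(data_dir, f"{split}.{lang}")
            print(f"{path}: {'ok' if os.path.isfile(path) else 'MISSING'}")

check_raw_data("en", "vi")  # e.g. the iwslt15_en_vi layout
```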
To obtain the processed dataset, run

```bash
preprocess_data.sh ${encoder} ${src} ${tgt} ${vocab_size} ${max_seq_length}
```

where
- The `encoder` parameter can be `bpe` (byte pair encoding), `spm` (sentencepiece encoding), or `raw` (no subword encoding).
- `vocab_size` is optional. The default is 32000.
  - At this point, this parameter is used only when `encoder` is set to `bpe` or `spm`. For `raw` encoding, you have to truncate the vocabulary yourself.
  - For `spm` encoding, the preprocessing may fail (due to the Python sentencepiece module) if `vocab_size` is too large, so you may want to try a smaller `vocab_size` if that happens.
- `max_seq_length` is optional. The default is 70.
In the `iwslt15_en_vi` example, the command is `sh preprocess_data.sh spm en vi`.

By default, the preprocessed data are dumped under `temp/run_${src}_${tgt}_${encoder}`. In the `iwslt15_en_vi` example, the directory is `temp/run_en_vi_spm`.
If you choose the `raw` encoding method, note that:

- By default, the word embedding layer is built from the combination of the source vocabulary and the target vocabulary. For example, if the source vocabulary is of size 3K and the target vocabulary is of size 3K with no overlap between the two, then the final vocabulary used in the model is of size 6K.
- By default, the final output layer of the Transformer decoder (hidden_state -> logits) shares its parameters with the word embedding layer (illustrated in the sketch below).
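The weight sharing mentioned above is the common embedding/output tying trick. The sketch below only illustrates the idea in plain PyTorch; it is not the actual model code of this example, and the sizes are made up.

```python
# Illustration of output-layer / embedding weight tying (not the example's code).
import torch
import torch.nn as nn

vocab_size, d_model = 6000, 512          # made-up sizes for illustration
embedding = nn.Embedding(vocab_size, d_model)

def output_logits(hidden_state: torch.Tensor) -> torch.Tensor:
    # (batch, seq_len, d_model) -> (batch, seq_len, vocab_size),
    # reusing the embedding matrix as the output projection.
    return hidden_state @ embedding.weight.t()

print(output_logits(torch.randn(2, 7, d_model)).shape)  # torch.Size([2, 7, 6000])
```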
Customize the Python configuration files to configure the model and data. Please refer to the example configuration files `config_model.py` for model configuration and `config_iwslt15.py` for data configuration.
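A data configuration is just a Python module whose attributes `transformer_main.py` reads. The fragment below is only an illustrative sketch: apart from `encoding`, `max_train_epoch`, and `batch_size`, which are mentioned elsewhere in this README, the attribute names are assumptions, so copy the real field names from `config_iwslt15.py`.

```python
# custom_config_data.py -- illustrative sketch only; field names other than
# encoding/max_train_epoch/batch_size are assumptions. Start from
# config_iwslt15.py and keep its attribute names.
input_dir = "temp/run_en_vi_spm"  # assumed: where preprocess_data.sh dumped the data
encoding = "spm"                  # must match the preprocessing step
max_train_epoch = 15
batch_size = 2048
```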
Train the model with the following command:
```bash
python transformer_main.py \
    --run-mode=train_and_evaluate \
    --config-model=<custom_config_model> \
    --config-data=<custom_config_data>
```
where the model and data configuration files are `custom_config_model.py` and `custom_config_data.py`, respectively.

Outputs such as model checkpoints are by default under `outputs/`.
Test with the following command:
```bash
python transformer_main.py \
    --run-mode=test \
    --config-data=<custom_config_data> \
    --output-dir=./outputs
```
Generated samples on the test set are in `outputs/test.output.hyp`, and reference sentences are in `outputs/test.output.ref`. If you've used `bpe` or `spm` encoding in the data preprocessing step, make sure to set `encoding` in the data configuration to the appropriate encoding type; the generated output will be decoded using the specified encoding.
Finally, to evaluate the BLEU score against the ground truth on the test set:
```bash
python bleu_main.py --reference=<your_reference_file> --translation=temp/test.output.hyp.final
```
For the `iwslt15_en_vi` example, use `--reference=data/en_vi/test.vi`.