Notes on Training

Training with Large Minibatches

If you are training on a large parallel corpus, you may be able to improve translation quality by training with larger minibatches (for smaller training sets, large minibatches are generally not recommended, as they are likely to lead to overfitting). Nematus provides two features to support the use of large minibatches: synchronous multi-GPU training and gradient aggregation.

Synchronous Multi-GPU Training

Synchronous multi-GPU training splits a minibatch (more or less) equally among the available GPUs, each of which runs a complete replica of the model. The training is 'synchronous' in the sense that the resulting gradients from each sub-batch are collected and averaged before being applied in a single update. The maximum possible minibatch size scales linearly with the number of available GPUs, so if you were able to train with a minibatch size of 80 sentences on a single GPU, then you should be able to use a minibatch size of 320 with four GPUs.

In addition to enabling larger batch sizes, multi-GPU training can potentially speed up training by reducing the number of updates needed to reach convergence. And since model parameters and data must compete for limited GPU memory, it allows the use of larger models by reducing the per-device data footprint.

You don't need to do anything special to enable multi-GPU training: Nematus automatically uses all of the GPUs that are visible to TensorFlow (which you can control via the CUDA_VISIBLE_DEVICES environment variable).
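
For example, following the numbers above, making two GPUs visible should let you double the single-GPU minibatch size of 80 sentences (the batch size here is purely illustrative):

CUDA_VISIBLE_DEVICES=0,1 python train.py \
    --batch_size 160 \
    ...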

Gradient Aggregation

The second feature for large minibatch training is 'gradient aggregation.' Like multi-GPU training, this splits a minibatch into a number of (roughly) equal-sized sub-batches. Each sub-batch is processed in turn, and the resulting gradients are accumulated; once all sub-batches have been processed, the accumulated gradients are applied in a single update. Adding gradient aggregation steps increases the maximum possible minibatch size, so using three gradient aggregation steps instead of one allows minibatches that are three times as large.
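
The idea is roughly as follows (this is a toy sketch, not Nematus's actual implementation; compute_gradient is a hypothetical function that returns the gradient of the loss on one sub-batch):

import numpy as np

def aggregated_update(params, minibatch, num_steps, compute_gradient, lr=0.1):
    # Split the minibatch into num_steps roughly equal-sized sub-batches.
    sub_batches = np.array_split(minibatch, num_steps)
    accumulated = np.zeros_like(params)
    for sub_batch in sub_batches:
        # Gradients are accumulated across the sub-batches...
        accumulated += compute_gradient(params, sub_batch)
    # ...then averaged and applied in a single (plain SGD) update.
    return params - lr * accumulated / num_steps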

There are two ways to enable gradient aggregation. The first is to set the number of steps manually using the --gradient_aggregation_steps option. The second is to set a maximum per-device sub-batch size, specified as either a number of sentences (--max_sentences_per_device) or a number of tokens (--max_tokens_per_device); the number of aggregation steps is then determined automatically. For example, if you set --max_sentences_per_device=80 and --batch_size=300 (with one visible GPU), then the minibatch will be processed in four steps, since ceil(300 / 80) = 4.
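
On the command line, the two approaches might look like this (the batch sizes are illustrative):

# Set the number of aggregation steps explicitly (three steps of 80 sentences).
python train.py \
    --batch_size 240 \
    --gradient_aggregation_steps 3 \
    ...

# Cap the per-device sub-batch size; ceil(300 / 80) = 4 steps on one GPU.
CUDA_VISIBLE_DEVICES=0 python train.py \
    --batch_size 300 \
    --max_sentences_per_device 80 \
    ...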

Gradient aggregation and multi-GPU training can be combined. In the following example, a batch size of 25,000 tokens can be used despite the capacity of a single device being a more modest 4,096 tokens.

CUDA_VISIBLE_DEVICES=0,1,2,3 python train.py \
    --token_batch_size 25000 \
    --max_tokens_per_device 4096 \
    ...
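
With four visible GPUs, each device receives roughly 25,000 / 4 = 6,250 tokens per minibatch, so each should process its share in ceil(6,250 / 4,096) = 2 aggregation steps.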