The above benchmark was run on 128 servers with 4 Pascal GPUs each, connected by a RoCE-capable 25 Gbit/s network. Horovod achieves 90% scaling efficiency for both Inception V3 and ResNet-101, and 68% scaling efficiency for VGG-16.
To reproduce the benchmarks:
- Install Horovod using the instructions provided on the Horovod on GPU page.
- Clone https://github.com/tensorflow/benchmarks
$ git clone https://github.com/tensorflow/benchmarks
$ cd benchmarks
- Run the benchmark. Examples below are for Open MPI.
$ horovodrun -np 16 -H server1:4,server2:4,server3:4,server4:4 \
python scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
--model resnet101 \
--batch_size 64 \
--variable_update horovod
- At the end of the run, you will see the number of images processed per second:
total images/sec: 1656.82
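To turn the reported throughput into a scaling-efficiency figure like the ones quoted above, compare it against a perfect linear scale-up of a single-GPU run (for example, the same command with -np 1). The sketch below is illustrative only; the single-GPU baseline is a placeholder you would measure yourself:

    # Hypothetical scaling-efficiency check. The single-GPU throughput below is
    # a placeholder, not a real measurement; substitute your own -np 1 result.
    single_gpu_images_per_sec = 110.0   # placeholder baseline
    num_gpus = 16                       # -np 16 in the example above
    total_images_per_sec = 1656.82      # reported by tf_cnn_benchmarks

    efficiency = total_images_per_sec / (num_gpus * single_gpu_images_per_sec)
    print(f"scaling efficiency: {efficiency:.0%}")  # ~94% with these placeholder numbers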
The benchmark instructions above are for the synthetic data benchmark.
To run the benchmark on real data, you need to download the ImageNet dataset and convert it using the TFRecord preprocessing script.
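Before launching a multi-node run, it can be worth sanity-checking the converted files. The sketch below is illustrative only: the train-* shard naming is an assumption (adjust the path and pattern to your layout), and it assumes TensorFlow 2.x with eager execution:

    # Optional sanity check of the converted ImageNet TFRecords.
    import glob
    import tensorflow as tf

    shards = sorted(glob.glob("/path/to/imagenet/tfrecords/train-*"))
    print(f"found {len(shards)} training shards")

    # Read one serialized example to confirm the files parse as TFRecords.
    for raw_record in tf.data.TFRecordDataset(shards[:1]).take(1):
        print(f"first record: {len(raw_record.numpy())} bytes")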
Now, simply add --data_dir /path/to/imagenet/tfrecords --data_name imagenet --num_batches=2000
to your training command:
$ horovodrun -np 16 -H server1:4,server2:4,server3:4,server4:4 \
python scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
--model resnet101 \
--batch_size 64 \
--variable_update horovod \
--data_dir /path/to/imagenet/tfrecords \
--data_name imagenet \
--num_batches=2000
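For context, --variable_update horovod tells tf_cnn_benchmarks to distribute training through Horovod's allreduce rather than parameter servers. Roughly, this corresponds to the standard Horovod TensorFlow 1.x pattern sketched below; it is an illustrative sketch assuming TensorFlow 1.x, not the benchmark's actual code:

    # Sketch of the Horovod pattern implied by --variable_update horovod.
    import tensorflow as tf
    import horovod.tensorflow as hvd

    hvd.init()

    # Pin each process to a single GPU (one process per GPU).
    config = tf.ConfigProto()
    config.gpu_options.visible_device_list = str(hvd.local_rank())

    # Scale the learning rate by the number of workers and wrap the optimizer
    # so gradients are averaged across workers with allreduce.
    opt = tf.train.MomentumOptimizer(learning_rate=0.01 * hvd.size(), momentum=0.9)
    opt = hvd.DistributedOptimizer(opt)

    # Broadcast initial variables from rank 0 so all workers start in sync;
    # config and hooks would then be passed to tf.train.MonitoredTrainingSession.
    hooks = [hvd.BroadcastGlobalVariablesHook(0)]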
Horovod also comes with out-of-the-box benchmarking support for TensorFlow v1, TensorFlow v2, and PyTorch.
These benchmarks allow you to measure Horovod's performance and scalability in your environment, as well as try advanced Horovod features like gradient compression:
$ horovodrun -np 4 -H server1:2,server2:2 \
python tensorflow2_synthetic_benchmark.py --fp16-allreduce
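The --fp16-allreduce flag compresses gradient tensors to float16 for the allreduce and decompresses them afterwards, reducing network traffic. In your own training script, the same behavior is enabled through Horovod's compression argument; below is a minimal TensorFlow 2 sketch (the tiny model and random data are placeholders, not the benchmark's actual code):

    # Minimal sketch of fp16 gradient compression with Horovod in TensorFlow 2.
    import tensorflow as tf
    import horovod.tensorflow as hvd

    hvd.init()

    model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    opt = tf.optimizers.SGD(0.01 * hvd.size())

    @tf.function
    def training_step(images, labels):
        with tf.GradientTape() as tape:
            loss = loss_fn(labels, model(images, training=True))
        # Average gradients across workers; tensors are cast to float16 for the
        # allreduce and back to float32 afterwards.
        tape = hvd.DistributedGradientTape(tape, compression=hvd.Compression.fp16)
        grads = tape.gradient(loss, model.trainable_variables)
        opt.apply_gradients(zip(grads, model.trainable_variables))
        return loss

    # One step with placeholder random data, then sync initial model state from
    # rank 0 (optimizer state should be broadcast similarly).
    training_step(tf.random.normal((32, 8)), tf.zeros((32,), tf.int64))
    hvd.broadcast_variables(model.variables, root_rank=0)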
When diagnosing performance issues, we recommend running these synthetic benchmarks first to ensure that the issues are not originating from the training script itself.