This tutorial will introduce you to the CPU performance considerations for language translation and how to use Intel® Optimizations for TensorFlow Serving to improve inference time on CPUs. This tutorial uses a pre-trained Transformer-LT model for translating English to German and a sample of English news excerpts from the WMT14 dataset. We provide sample code that you can use to get your optimized TensorFlow model server and gRPC client up and running quickly. In this tutorial using Transformer-LT, you will measure inference performance in two situations:
- Online inference, where batch_size=1. Here the measured metric is latency, so a lower number means better runtime performance.
- Batch inference, where batch_size>1. Here the measured metric is throughput, so a higher number means better runtime performance.
This tutorial assumes you have already:
- Installed TensorFlow Serving
- Read and understood the General Best Practices, especially the sections on performance-related settings
- Run an example end-to-end using a gRPC client, such as the one in the Installation Guide
Note: We use gRPC in this tutorial and offer another tutorial that illustrates the use of the REST API if you are interested in that protocol.
The Transformer-LT model is a popular solution for language translation. It is based on an encoder-decoder architecture with an added attention mechanism. The encoder encodes the source sentence into a meaningful fixed-length vector, and the decoder extracts the context from that vector to produce the translation. Both the encoder and decoder operate on inputs and outputs that take the form of time sequences.
In a traditional encoder/decoder model, each element in the context vector is treated equally, but this is typically not the ideal solution. For instance, when you translate the phrase “I travel by train” from English into Chinese, the word “I” has a greater influence than other words when producing its counterpart in Chinese. Thus, the attention mechanism was introduced to differentiate contributions of each element in the source sequence to their counterpart in the destination sequence, through the use of a hidden matrix. This matrix contains weights of each element in the source sequence when producing elements in the destination sequence.
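To make the idea of attention weights concrete, here is a minimal sketch of scaled dot-product attention, the basic weighting operation used by Transformer models. It is illustrative only (plain NumPy, made-up shapes) and not the code of the Transformer-LT model itself:

```python
import numpy as np

def scaled_dot_product_attention(queries, keys, values):
    """Toy scaled dot-product attention.

    queries: (target_len, d)       one row per destination-sequence position
    keys, values: (source_len, d)  one row per source-sequence position
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)              # (target_len, source_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # softmax: each row sums to 1
    # weights[i, j] is how much source position j contributes to destination position i
    return weights @ values                             # (target_len, d)

# Made-up example: 3 destination positions attending over 4 source positions
rng = np.random.default_rng(0)
q = rng.normal(size=(3, 8))
k = rng.normal(size=(4, 8))
v = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(q, k, v).shape)      # prints (3, 8)
```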
Intel® oneAPI Deep Neural Network Library (Intel® oneDNN) offers significant performance improvements for many neural network operations. Tuning TensorFlow Serving to take full advantage of your hardware for language translation inference involves:
- Running a TensorFlow Serving docker container configured for performance given your hardware resources
- Running a gRPC client to verify prediction accuracy and measure online and batch inference performance
- Experimenting with the TensorFlow Serving settings on your own to further optimize for your model and use case
- Clone this repository: Clone the intelai/models repository into your home directory.

  ```
  cd ~
  git clone https://github.com/IntelAI/models.git
  ```
- Clone the tensorflow/models repository: Tokenization of the input data requires utility functions in the tensorflow/models repository.

  ```
  cd ~
  mkdir tensorflow-models
  cd tensorflow-models
  git clone https://github.com/tensorflow/models.git
  cd models
  ```

  Now add the required directory to the PYTHONPATH variable:

  ```
  export PYTHONPATH=$PYTHONPATH:$(pwd)/official/nlp/transformer
  ```
- Set up the client environment: We will use a Python virtual environment to install the required packages for this tutorial.

  - If you do not have pip or virtualenv, you will need to get them first:

    ```
    sudo apt-get install -y python python-pip virtualenv
    ```

  - Create and activate the python virtual environment in your home directory and install the pandas and tensorflow-serving-api packages:

    ```
    cd ~
    virtualenv -p python3 lt_venv
    source lt_venv/bin/activate
    pip install pandas tensorflow-serving-api
    ```
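  Optionally, you can confirm that the client packages installed correctly with a quick import check. This is just a minimal smoke test; the tensorflow-serving-api package provides the tensorflow_serving.apis modules used later by the gRPC client:

  ```python
  # Optional smoke test for the client environment.
  import pandas
  from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

  print("pandas", pandas.__version__)
  print("tensorflow-serving-api gRPC modules imported OK")
  ```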
- Download the pre-trained model and test data: Download and extract the packaged pre-trained model and dataset transformer_lt_official_fp32_pretrained_model.tar.gz (refer to the model README to get the latest location of this archive).

  ```
  wget https://storage.googleapis.com/intel-optimized-tensorflow/models/v1_8/transformer_lt_official_fp32_pretrained_model.tar.gz
  tar -xzvf transformer_lt_official_fp32_pretrained_model.tar.gz
  ```

  After extraction, you should see the following folders and files in the transformer_lt_official_fp32_pretrained_model directory:

  ```
  ls -l transformer_lt_official_fp32_pretrained_model/*
  ```

  Console output:

  ```
  transformer_lt_official_fp32_pretrained_model/data:
  total 1064
  -rw-r--r--. 1 <user> <group>    359898 Feb 20 16:05 newstest2014.en
  -rw-r--r--. 1 <user> <group>    399406 Feb 20 16:05 newstest2014.de
  -rw-r--r--. 1 <user> <group>    324025 Mar 15 17:31 vocab.txt

  transformer_lt_official_fp32_pretrained_model/graph:
  total 241540
  -rwx------. 1 <user> <group> 247333269 Mar 15 17:29 fp32_graphdef.pb
  ```

  - newstest2014.en: Input file with English text
  - newstest2014.de: German translation of the input file for measuring accuracy
  - vocab.txt: Dictionary of vocabulary
  - fp32_graphdef.pb: Pre-trained model
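  If you want a quick look at the data, the .en and .de files are plain text. The sketch below assumes they are line-aligned, i.e. line N of newstest2014.de is the reference translation of line N of newstest2014.en:

  ```python
  # Illustrative peek at the parallel test data (assumes line-aligned .en/.de files).
  data_dir = "transformer_lt_official_fp32_pretrained_model/data"
  with open(f"{data_dir}/newstest2014.en") as en_file, \
       open(f"{data_dir}/newstest2014.de") as de_file:
      for _, en_line, de_line in zip(range(3), en_file, de_file):
          print("EN:", en_line.strip())
          print("DE:", de_line.strip())
  ```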
- Create a SavedModel: Using the conversion script transformer_graph_to_saved_model.py, convert the pre-trained model graph to a SavedModel.

  ```
  cd ~/models/benchmarks/language_translation/tensorflow_serving/transformer_lt_official/inference/fp32
  python transformer_graph_to_saved_model.py --import_path ~/transformer_lt_official_fp32_pretrained_model/graph/fp32_graphdef.pb
  ```

  This will create a /tmp/1/ directory with a saved_model.pb file in it. This is the file we will serve from TensorFlow Serving. The transformer_graph_to_saved_model.py script attaches a signature definition to the model in order to make it compatible with TensorFlow Serving. Take a look at the script and its flags/options for more information.
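  For a sense of what attaching a signature definition involves, here is a simplified sketch (not the actual conversion script) that wraps a frozen GraphDef in a SavedModel with a serving signature using the TensorFlow 1.x compatibility API. The input/output tensor names and the export path are placeholders; the real script uses the names defined in the Transformer-LT graph:

  ```python
  # Simplified sketch: wrap a frozen GraphDef in a SavedModel with a serving signature.
  # The tensor names below are placeholders, not the Transformer-LT graph's real names.
  import tensorflow.compat.v1 as tf

  tf.disable_eager_execution()

  graph_def = tf.GraphDef()
  with tf.gfile.GFile("fp32_graphdef.pb", "rb") as f:             # example path
      graph_def.ParseFromString(f.read())

  with tf.Session() as sess:
      tf.import_graph_def(graph_def, name="")
      inputs = sess.graph.get_tensor_by_name("input_tokens:0")    # placeholder name
      outputs = sess.graph.get_tensor_by_name("output_tokens:0")  # placeholder name

      builder = tf.saved_model.builder.SavedModelBuilder("/tmp/example_savedmodel/1")
      signature = tf.saved_model.signature_def_utils.predict_signature_def(
          inputs={"input": inputs}, outputs={"output": outputs})
      builder.add_meta_graph_and_variables(
          sess, [tf.saved_model.tag_constants.SERVING],
          signature_def_map={"serving_default": signature})
      builder.save()
  ```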
- Discover the number of physical cores: Compute num_physical_cores by executing the lscpu command and multiplying Core(s) per socket by Socket(s). For example, for a machine with Core(s) per socket: 28 and Socket(s): 2, num_physical_cores = 28 * 2 = 56. To compute num_physical_cores with bash commands:

  ```
  cores_per_socket=`lscpu | grep "Core(s) per socket" | cut -d':' -f2 | xargs`
  num_sockets=`lscpu | grep "Socket(s)" | cut -d':' -f2 | xargs`
  num_physical_cores=$((cores_per_socket * num_sockets))
  echo $num_physical_cores
  ```
- Recommended Settings: To optimize overall performance, start with the following settings from the General Best Practices. Experimenting with these settings can improve performance even further, so if you have strict performance requirements, try different values on your own hardware and model.

  | Options | Recommendations |
  | ------- | --------------- |
  | TENSORFLOW_INTER_OP_PARALLELISM | 2 |
  | TENSORFLOW_INTRA_OP_PARALLELISM | Number of physical cores |
  | OMP_NUM_THREADS | Number of physical cores |
  | Batch Size | 64 |
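  For context, TENSORFLOW_INTER_OP_PARALLELISM and TENSORFLOW_INTRA_OP_PARALLELISM control TensorFlow's inter-op and intra-op thread pools inside the model server. The sketch below shows the equivalent knobs in TensorFlow's Python API, which can be useful when tuning a standalone TensorFlow script; it is illustrative only, and the NUM_PHYSICAL_CORES environment variable is just an example:

  ```python
  # Illustrative only: the equivalent thread-pool settings in the TensorFlow Python API.
  # The serving container is configured through the environment variables above instead.
  import os
  import tensorflow as tf

  num_physical_cores = int(os.environ.get("NUM_PHYSICAL_CORES", "56"))  # example value

  tf.config.threading.set_inter_op_parallelism_threads(2)
  tf.config.threading.set_intra_op_parallelism_threads(num_physical_cores)

  print("inter-op threads:", tf.config.threading.get_inter_op_parallelism_threads())
  print("intra-op threads:", tf.config.threading.get_intra_op_parallelism_threads())
  ```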
- Start the server: We can now start up the TensorFlow model server. Using -d (for "detached") runs the container as a background process.

  ```
  cd ~
  docker run \
      --name=tfserving \
      -d \
      -p 8500:8500 \
      -v "/tmp:/models/transformer_lt_official" \
      -e MODEL_NAME=transformer_lt_official \
      -e OMP_NUM_THREADS=$num_physical_cores \
      -e TENSORFLOW_INTER_OP_PARALLELISM=2 \
      -e TENSORFLOW_INTRA_OP_PARALLELISM=$num_physical_cores \
      intel/intel-optimized-tensorflow-serving:2.2.0
  ```

  You can make sure the container is running using the docker ps command.

  Note: After running some basic tests, you may wish to constrain the inference server to a single socket. Docker has many runtime flags that allow you to control the container's access to the host system's CPUs, memory, and other resources.

  - See our Best Practices document for information and examples
  - See the Docker documentation on this topic for more options and definitions
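  For example, on a two-socket machine where lscpu reports that cores 0-27 belong to socket 0, a single-socket run might look like the following. The core range, memory node, and thread counts are examples tied to that assumed topology; check your own system before using them:

  ```
  docker run \
      --name=tfserving \
      -d \
      --cpuset-cpus="0-27" \
      --cpuset-mems="0" \
      -p 8500:8500 \
      -v "/tmp:/models/transformer_lt_official" \
      -e MODEL_NAME=transformer_lt_official \
      -e OMP_NUM_THREADS=28 \
      -e TENSORFLOW_INTER_OP_PARALLELISM=2 \
      -e TENSORFLOW_INTRA_OP_PARALLELISM=28 \
      intel/intel-optimized-tensorflow-serving:2.2.0
  ```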
- Online and batch performance: Run the transformer_benchmark.py python script, which can measure both online and batch performance.

  If you are not already there, go to the model's benchmarks directory:

  ```
  cd ~/models/benchmarks/language_translation/tensorflow_serving/transformer_lt_official/inference/fp32
  ```

  Online Inference (batch_size=1):

  ```
  python transformer_benchmark.py \
      -d ~/transformer_lt_official_fp32_pretrained_model/data/newstest2014.en \
      -v ~/transformer_lt_official_fp32_pretrained_model/data/vocab.txt \
      -b 1
  ```

  Batch Inference (batch_size=64):

  ```
  python transformer_benchmark.py \
      -d ~/transformer_lt_official_fp32_pretrained_model/data/newstest2014.en \
      -v ~/transformer_lt_official_fp32_pretrained_model/data/vocab.txt \
      -b 64
  ```

  Note: If you want an output file of translated sentences, set the -o flag to an output file name of your choice. If this option is set, the script will take a significantly longer time to finish.
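  Under the hood, the benchmark script tokenizes the input sentences and times gRPC Predict calls to the server. The sketch below shows the general shape of such a timing loop; it is illustrative only. In particular, the signature and tensor names and the token IDs are placeholders, and transformer_benchmark.py handles the real tokenization using the tensorflow/models utilities:

  ```python
  # Illustrative latency-timing loop against TensorFlow Serving over gRPC.
  # The signature/tensor names and token IDs below are placeholders.
  import time

  import grpc
  import tensorflow as tf
  from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

  channel = grpc.insecure_channel("localhost:8500")
  stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

  request = predict_pb2.PredictRequest()
  request.model_spec.name = "transformer_lt_official"
  request.model_spec.signature_name = "serving_default"
  # One already-tokenized sentence (fake token IDs), i.e. batch_size=1:
  request.inputs["input"].CopyFrom(
      tf.make_tensor_proto([[13, 542, 7, 89, 1]], dtype=tf.int64))

  latencies = []
  for _ in range(10):
      start = time.time()
      stub.Predict(request, 60.0)  # 60-second timeout
      latencies.append(time.time() - start)

  print("average latency: %.4f seconds" % (sum(latencies) / len(latencies)))
  ```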
- Clean up:

  - After you are finished sending requests to the server, you can stop the container running in the background. To restart a container with the same name, you first need to stop and remove the existing container. To view your running containers, run docker ps.

    ```
    docker rm -f tfserving
    ```

  - Deactivate your virtual environment with deactivate.
You have now seen an end-to-end example of serving a language translation model for inference using TensorFlow Serving, and learned:
- How to create a SavedModel from a Transformer-LT TensorFlow model graph
- How to choose good values for the performance-related runtime parameters exposed by the docker run command
- How to test online and batch inference metrics using a gRPC client
With this knowledge and the example code provided, you should be able to get started serving your own custom language translation model with good performance. If desired, you can also investigate different combinations of settings to see whether further performance improvements are possible.