- Tested on ZCU102 with Vitis-AI 1.4
The Xilinx DPU can accelerate the execution of many of the operations and layers commonly found in convolutional neural networks, but occasionally we need to execute models that contain fully custom layers. One such layer is the sampling function of a convolutional variational autoencoder. The DPU can accelerate the convolutional encoder and decoder but not the statistical sampling layer - that layer must be executed in software on a CPU. This tutorial uses the variational autoencoder as an example of how to approach this situation.
An autoencoder is an artificial neural network that learns how to efficiently compress and encode data into a lower-dimensional representation, and also how to reconstruct that encoded representation back into something as close as possible to the original input. This is a form of representation learning. An autoencoder consists of three main parts: the encoder, the decoder, and between them a 'bottleneck' or 'latent space' which is the encoded version of the input. The encoder and decoder can be built from MLPs, CNNs or LSTMs - in this tutorial we use a CNN-based encoder and decoder. A variational autoencoder maps the input to a latent space described by a normal distribution: the encoder produces the mean and log-variance of that distribution, and a sample drawn from it is passed to the decoder.
Figure 1: Variational Autoencoder architecture
The variational autoencoder model is defined in the vae.py Python script. The encoder section is a series of 2D convolution layers (with batchnorm and ReLU activations) that reduce the dimensions of the input feature map. The final feature map is flattened and then passed to two dense/FC layers. The dense/FC layer outputs are encoder_mu and encoder_log_variance, from which the sampled latent space (encoder_z) is generated.
The custom layer in our model samples the latent space using the well-known 'reparameterization trick' which overcomes problems related to backpropagation:
```python
import tensorflow as tf
from tensorflow.keras import layers

class Sampling(layers.Layer):
    """Uses (encoder_mu, encoder_log_variance) to sample encoder_z, the vector encoding a digit."""
    def call(self, inputs):
        encoder_mu, encoder_log_variance = inputs
        batch = tf.shape(encoder_mu)[0]
        dim = tf.shape(encoder_mu)[1]
        epsilon = tf.keras.backend.random_normal(shape=(batch, dim))
        return encoder_mu + tf.exp(0.5 * encoder_log_variance) * epsilon
```
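For reference, a minimal sketch of how the encoder outputs described above could be wired to this layer in Keras (the latent dimension and layer names here are illustrative assumptions, not the exact vae.py definitions):

```python
from tensorflow.keras import layers

def encoder_head(conv_features, latent_dim=2):
    # flatten the final conv feature map, then produce the two dense outputs
    x = layers.Flatten()(conv_features)
    encoder_mu = layers.Dense(latent_dim, name='encoder_mu')(x)
    encoder_log_variance = layers.Dense(latent_dim, name='encoder_log_variance')(x)
    # sample the latent space with the custom layer
    encoder_z = Sampling()([encoder_mu, encoder_log_variance])
    return encoder_mu, encoder_log_variance, encoder_z
```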
The host machine has several requirements that need to be met before we begin. You will need:
- An x86 host machine with a supported OS and either the CPU or GPU versions of the Vitis-AI docker installed - see System Requirements.
- The host machine will require Docker to be installed and the Vitis-AI CPU or GPU docker image to be built - see Getting Started.
- A GPU card suitable for training is recommended, but the training in this tutorial is quite simple and a CPU can be used.
- If you plan to use the ZCU102 evaluation board, it should be prepared with the board image as per the Step2: Setup the Target instructions. Hints on how to connect the various cables to the ZCU102 are also available here.
- For the Alveo U50, follow the Setup Alveo Accelerator Card instructions.
For more details, refer to the latest version of the Vitis AI User Guide (UG1414).
This tutorial assumes the user is familiar with Python3, TensorFlow and has some knowledge of machine learning principles.
- Copy this repository by doing either of the following:
  - Download the repository as a ZIP file to the host machine, and then unzip the archive.
  - From a terminal, use the `git clone` command.
- Open a linux terminal and `cd` to the <path_to_autoencoder_design>/files folder.
- Start either the Vitis AI GPU or CPU docker (we recommend using the GPU docker if possible):
```shell
# navigate to files folder
cd <path_to_autoencoder_design>/files

# to start GPU docker container
./docker_run.sh xilinx/vitis-ai-gpu:latest

# ..or if you wish to use CPU docker container
./docker_run.sh xilinx/vitis-ai-cpu:latest
```
The docker container will start and after accepting the license agreement, you should see something like this in the terminal:
```shell
==========================================
__ ___ _ _ _____
\ \ / (_) | (_) /\ |_ _|
\ \ / / _| |_ _ ___ ______ / \ | |
\ \/ / | | __| / __|______/ /\ \ | |
\ / | | |_| \__ \ / ____ \ _| |_
\/ |_|\__|_|___/ /_/ \_\_____|
==========================================
Docker Image Version: 1.4.776
Build Date: 2021-06-22
VAI_ROOT: /opt/vitis_ai
For TensorFlow 1.15 Workflows do:
conda activate vitis-ai-tensorflow
For Caffe Workflows do:
conda activate vitis-ai-caffe
For Neptune Workflows do:
conda activate vitis-ai-neptune
For PyTorch Workflows do:
conda activate vitis-ai-pytorch
For TensorFlow 2.3 Workflows do:
conda activate vitis-ai-tensorflow2
For Darknet Optimizer Workflows do:
conda activate vitis-ai-optimizer_darknet
For Caffe Optimizer Workflows do:
conda activate vitis-ai-optimizer_caffe
For TensorFlow 1.15 Workflows do:
conda activate vitis-ai-optimizer_tensorflow
For LSTM Workflows do:
conda activate vitis-ai-lstm
Vitis-AI /workspace >
```
💡 If you get a "Permission Denied" error when starting the docker container, it is almost certainly because the docker_run.sh script is not set to be executable. You can fix this by running the following command:
```shell
chmod +x docker_run.sh
```
Activate the TensorFlow2 conda environment with `conda activate vitis-ai-tensorflow2` and you should see the prompt change to indicate that the environment is active:

```shell
Vitis-AI /workspace > conda activate vitis-ai-tensorflow2
(vitis-ai-tensorflow2) Vitis-AI /workspace >
```
The remainder of this README describes each step needed to implement the tutorial - each command must be run from within the Vitis-AI Docker container started in the previous section.
A shell script called run_all.sh is also provided - this contains all the commands needed to run the complete flow:
```shell
source run_all.sh
```
To run the training and evaluation of the floating-point model:
```shell
python -u train.py -p 2>&1 | tee build/logs/train.log
```
We will use the MNIST dataset as a simple example of image denoising. The dataset download and preprocessing is done by the mnist_download() function defined in utils.py. The training and test data is downloaded using the built-in download function of the tf.keras API:
```python
def mnist_download():
    (x_train, _), (x_test, _) = mnist.load_data()
```
..and then we scale the pixel data from the range 0:255 to the range 0:1 by dividing by 255. We add the channel dimension to each image (they are downloaded as (28,28) and we require them to be (28,28,1)). Then random noise is added to create a noisy training set and a noisy test set:
```python
    # scale to (0,1)
    x_train = (x_train/255.0).astype(np.float32)
    x_test = (x_test/255.0).astype(np.float32)

    # add channel dimension
    x_train = x_train.reshape(x_train.shape[0],28,28,1)
    x_test = x_test.reshape(x_test.shape[0],28,28,1)

    # add noise
    noise = np.random.normal(loc=0.2, scale=0.3, size=x_train.shape)
    x_train_noisy = np.clip(x_train + noise, 0, 1)
    noise = np.random.normal(loc=0.2, scale=0.3, size=x_test.shape)
    x_test_noisy = np.clip(x_test + noise, 0, 1)

    return x_train, x_test, x_train_noisy, x_test_noisy
```
In train.py, we create the training and test datasets:
```python
x_train, x_test, x_train_noisy, x_test_noisy = mnist_download()
train_dataset = input_fn((x_train_noisy, x_train), batchsize, True)
test_dataset = input_fn((x_test_noisy, x_test), batchsize, False)
predict_dataset = input_fn((x_test_noisy), batchsize, False)
```
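The input_fn helper is defined in utils.py and is not reproduced in this README; a minimal sketch of what such a tf.data pipeline might look like (an assumption, matching only the call signature above) is:

```python
import tensorflow as tf

def input_fn(data, batchsize, is_training):
    # data is either a (noisy, clean) tuple or a single array of noisy images
    dataset = tf.data.Dataset.from_tensor_slices(data)
    if is_training:
        dataset = dataset.shuffle(buffer_size=10000)
    dataset = dataset.batch(batchsize)
    return dataset.prefetch(tf.data.experimental.AUTOTUNE)
```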
Note how the train and test datasets are composed of noisy images, which will be the input to the variational autoencoder model, and clean images, which are the ground truths - the autoencoder will learn how to generate clean images from noisy ones.
During training, we use a loss function which quantifies the difference between the learned distribution and a standard normal distribution using Kullback-Leibler divergence (KL divergence). The second component of the loss function is a reconstruction loss - often mean-squared error (MSE) or cross-entropy is used here. The total loss is the sum of KL divergence and reconstruction loss.
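As a reference, a minimal sketch of such a combined loss (an assumption of the general form - the actual loss_func in train.py may differ in weighting or in the choice of reconstruction term) could look like:

```python
import tensorflow as tf

def vae_loss(y_true, y_pred, encoder_mu, encoder_log_variance):
    # reconstruction loss: mean squared error between clean target and prediction
    reconstruction = tf.reduce_mean(tf.square(y_true - y_pred), axis=[1, 2, 3])
    # KL divergence between the learned distribution and a standard normal
    kl = -0.5 * tf.reduce_sum(
        1.0 + encoder_log_variance - tf.square(encoder_mu) - tf.exp(encoder_log_variance),
        axis=1)
    # total loss is the sum of the two components
    return tf.reduce_mean(reconstruction + kl)
```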
The trained checkpoint will be saved at the end of each epoch if the mean squared error improves. At the end of training, we can optionally make a set of predictions using the test dataset. To make the predictions, we first reload the best checkpoint, including the custom layer:
```python
with custom_object_scope({'Sampling': Sampling}):
    model = load_model(float_model, compile=False, custom_objects={'Sampling': Sampling})
    model.compile(loss=lambda y_true, y_predict: loss_func(y_true, y_predict, encoder_mu, encoder_log_variance))
    predictions = model.predict(predict_dataset, verbose=1)
```
..the predictions are returned as numpy arrays, so we can then save the predictions and inputs as PNG images:
```python
for i in range(20):
    cv2.imwrite(pred_dir+'/pred_'+str(i)+'.png', predictions[i] * 255.0)
    cv2.imwrite(pred_dir+'/input_'+str(i)+'.png', x_test_noisy[i] * 255.0)

print('Inputs and Predictions saved as images in ./' + pred_dir)
```
Here are some samples so that they can be directly compared:
Figure 2: Noisy inputs and predicted outputs of the trained autoencoder
To run the generation of the quantized model:
```shell
python -u quantize.py -p 2>&1 | tee build/logs/quant.log
```
The Xilinx DPU family of ML accelerators executes models and networks whose parameters are in integer format, so we must convert the trained floating-point checkpoint into a fixed-point integer checkpoint - this process is known as quantization.
The quantize.py script loads the trained floating-point checkpoint, creates a quantizer object and then a quantized model:
```python
with custom_object_scope({'Sampling': Sampling}):
    # load trained floating-point model
    float_model = load_model(float_model, compile=False, custom_objects={'Sampling': Sampling})
    # quantizer with custom strategy for Sampling layer
    quantizer = vitis_quantize.VitisQuantizer(float_model)
    quantized_model = quantizer.quantize_model(calib_dataset=calib_dataset)
```
The quantized model is saved and will be the input to the compiler phase:
```python
quantized_model.save(quant_model)
```
If the appropriate command line option is provided, quantize.py will run a number of predictions using the quantized model and save the results as image files:
```python
if (predict):
    # remake predictions folder
    shutil.rmtree(pred_dir, ignore_errors=True)
    os.makedirs(pred_dir)

    predict_dataset = input_fn((x_test_noisy), batchsize, False)
    predictions = quantized_model.predict(predict_dataset, verbose=0)

    # scale pixel values back up to range 0:255 then save as PNG
    for i in range(20):
        cv2.imwrite(pred_dir+'/pred_'+str(i)+'.png', predictions[i] * 255.0)
        cv2.imwrite(pred_dir+'/input_'+str(i)+'.png', x_test_noisy[i] * 255.0)
```
To compile the quantized model into xmodel format, run the compile.sh script with one of the target boards as a command line argument, for example:
```shell
source compile.sh zcu102
source compile.sh u50
source compile.sh vck190
```
The compile.sh shell script will compile the quantized model and create an .xmodel file which contains the instructions and data to be executed by the DPU.
If we look at the compiler report log (./logs/compile_.log) we can see one line that indicates how the compiled model has been divided into subgraphs:
```shell
[UNILOG][INFO] Total device subgraph number 6, DPU subgraph number 2
```
..so we can tell from this line that our compiled xmodel contains a total of six subgraphs of which two are DPU subgraphs (and hence executed on the DPU accelerator). The other four subgraphs are either CPU subgraphs or User subgraphs.
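If you want to inspect this partitioning programmatically rather than from the log, a short sketch using the xir Python bindings (available inside the Vitis-AI docker and on the target board image) could look like this - the subgraph names it prints will match those shown in the PNG described below:

```python
# Sketch: list each subgraph of the compiled xmodel and the device it runs on.
import xir

g = xir.Graph.deserialize('build/compiled_model_zcu102/autoenc.xmodel')
for sg in g.get_root_subgraph().toposort_child_subgraph():
    device = sg.get_attr('device') if sg.has_attr('device') else 'unknown'
    print(sg.get_name(), ':', device)
```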
To establish how the different subgraphs are connected together, we can generate a PNG image of the subgraphs using the following command:
```shell
xir png build/compiled_model_zcu102/autoenc.xmodel build/autoenc_zcu102.png
```
When you open the PNG file, you will see numerous colored boxes with connection lines between them. Boxes with green outlines are User subgraphs and appear only at the inputs:
..in this case, we have a single input called 'quant_input_3' with shape (1, 28, 28, 1). The shape is in NHWC format. This tensor input corresponds to the encoder input.
The 'quant_input_3' tensor feeds into another subgraph called 'subgraph_quant_conv2d' that contains only boxes with blue outlines - these will be executed on the DPU and hence 'subgraph_quant_conv2d' is a DPU subgraph. There are two output tensors 'quant_dense_1_fix' and 'quant_dense_fix' which correspond to the 'encoder_log_variance' and 'encoder_mu' outputs from the encoder block.
'quant_dense_1_fix' and 'quant_dense_fix' feed into another subgraph called 'subgraph_quant_dense_1_fix_opt_mode8_elemmul'. This subgraph contains boxes with red outlines and hence is a CPU subgraph. This subgraph implements the custom sampling layer:
Figure 4: CPU subgraph - custom sampling layer
The next subgraph is 'subgraph_quant_conv2d_transpose' containing blue boxes indicating a DPU subgraph - this is our decoder.
The final subgraph, 'subgraph_activation_8' is another CPU subgraph and contains the sigmoid activation function.
The complete compiled architecture can be summarized like this:
Figure 5: All subgraphs and connections
Our application code must do the following things:
- Pre-process the images as done in training (i.e. scale the pixel values to the range 0:1)
- For each DPU subgraph..
  - Create a DPU runner
- For each CPU subgraph..
  - Provide code that will be executed by the CPU to implement the CPU subgraph functions
- Create one or more Python threads that execute the DPU and CPU subgraphs in the correct sequence. In each thread we must also..
  - initialize an input buffer for each input tensor of each DPU subgraph
  - initialize an output buffer for each output tensor of each DPU subgraph
The application code is contained in the application/app_mt.py Python script. Let's look in detail at each of these steps...
Each image will need to be pre-processed in exactly the same way that was done during training and quantization. The images are read as grayscale format and hence have shape (28,28). We need to add the channel dimension so that they have shape (28,28,1) before scaling each pixel to the range 0, 1.0.
```python
import cv2
import numpy as np

def preprocess_fn(image_path):
    image = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    image = np.reshape(image, [image.shape[0], image.shape[1], 1])
    image = (image/255.0).astype(np.float32)
    return image
```
The `preprocess_fn` function is used in the main application function (`app`) to create a list of pre-processed images:
```python
img = []
for i in range(runTotal):
    path = os.path.join(image_dir, listimage[i])
    img.append(preprocess_fn(path))
```
In this tutorial, we have two DPU subgraphs which correspond to the encoder and decoder sections of our variational autoencoder.
In the main application function (`app`) we first load the .xmodel file (which was generated by the quantize and compile tools of Vitis-AI) and then get a topologically sorted list of DPU subgraphs:
```python
g = xir.Graph.deserialize(model)
subgraphs = get_child_subgraph_dpu(g)
```
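The `get_child_subgraph_dpu` helper is not reproduced above; it is typically written along the lines of the standard Vitis-AI examples - a sketch:

```python
def get_child_subgraph_dpu(graph):
    # topologically sorted child subgraphs of the root, filtered to DPU only
    root_subgraph = graph.get_root_subgraph()
    child_subgraphs = root_subgraph.toposort_child_subgraph()
    return [cs for cs in child_subgraphs
            if cs.has_attr('device') and cs.get_attr('device').upper() == 'DPU']
```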
Then we create a DPU runner for each DPU subgraph. Because the list of DPU subgraphs (the `subgraphs` variable in the above code snippet) is topologically sorted, we know that `subgraphs[0]` is the encoder and `subgraphs[1]` is the decoder:
```python
all_dpu_runners = []
for i in range(threads):
    all_dpu_runners.append([vart.Runner.create_runner(subgraphs[0], "run"),
                            vart.Runner.create_runner(subgraphs[1], "run")])
```
We have two CPU subgraphs - one for the custom sampling layer..
```python
def sampling_layer(encoder_mu, encoder_log_variance):
    '''
    Sampling layer
    '''
    batch = encoder_mu.shape[0]
    dim = encoder_mu.shape[1]
    epsilon = np.random.normal(size=(batch, dim))
    sample = encoder_mu + (np.exp(0.5 * encoder_log_variance) * epsilon)
    return sample
```
and one for the final sigmoid activation:
```python
def sigmoid(x):
    '''
    calculate sigmoid
    '''
    pos = x >= 0
    neg = np.invert(pos)
    result = np.empty_like(x)
    result[pos] = 1 / (1 + np.exp(-x[pos]))
    result[neg] = np.exp(x[neg]) / (np.exp(x[neg]) + 1)
    return result
```
The function called by each thread (`runThread`) will set up input and output buffers for each of the two DPU runners and also establish the batchsize from the encoder's input tensor shape:
```python
'''
Set up encoder DPU runner buffers & I/O mapping dictionary
'''
encoder_dict, encoder_inbuffer, encoder_outbuffer = init_dpu_runner(encoder_dpu_runner)

# batchsize
batchSize = encoder_dict['quant_input_3'].shape[0]

'''
Set up decoder DPU runner buffers
'''
decoder_dict, decoder_inbuffer, decoder_outbuffer = init_dpu_runner(decoder_dpu_runner)
```
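`init_dpu_runner` is a helper defined in app_mt.py; a sketch of what it might do (an assumption based on how it is used above) is to allocate one numpy buffer per DPU input and output tensor and return a name-to-buffer dictionary alongside the buffer lists:

```python
import numpy as np

def init_dpu_runner(dpu_runner):
    # allocate C-ordered float32 buffers sized from the runner's tensor shapes
    io_dict, input_buffers, output_buffers = {}, [], []
    for tensor in dpu_runner.get_input_tensors():
        buf = np.empty(tuple(tensor.dims), dtype=np.float32, order='C')
        io_dict[tensor.name] = buf
        input_buffers.append(buf)
    for tensor in dpu_runner.get_output_tensors():
        buf = np.empty(tuple(tensor.dims), dtype=np.float32, order='C')
        io_dict[tensor.name] = buf
        output_buffers.append(buf)
    return io_dict, input_buffers, output_buffers
```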
Then, for each batch of pre-processed images, it will run the encoder DPU runner, the code for the sampling layer CPU subgraph, the decoder DPU runner and finally the code for the sigmoid CPU subgraph. The predicted images are written into a global list called `predictions_buffer`.
Once all the threads have completed execution, `predictions_buffer` will contain a list of predicted image outputs in numpy array format. The final step is to convert them into PNG image files and write them into a folder:
```python
'''
post-processing - save output images
'''
# make folder for saving predictions
os.makedirs(pred_dir, exist_ok=True)

for i in range(len(predictions_buffer)):
    cv2.imwrite(os.path.join(pred_dir, 'pred_'+str(i)+'.png'), predictions_buffer[i]*255.0)

print('Predicted images saved to', './'+pred_dir)
```
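For reference, the per-batch sequence described above (encoder on the DPU, sampling on the CPU, decoder on the DPU, sigmoid on the CPU) might be implemented along these lines - a simplified sketch in which the decoder tensor names and the `run_one_batch` helper are hypothetical, not taken from app_mt.py:

```python
def run_one_batch(encoder_runner, decoder_runner,
                  encoder_dict, encoder_inbuffer, encoder_outbuffer,
                  decoder_dict, decoder_inbuffer, decoder_outbuffer, batch):
    # 1. encoder subgraph on the DPU
    encoder_dict['quant_input_3'][:] = batch
    job_id = encoder_runner.execute_async(encoder_inbuffer, encoder_outbuffer)
    encoder_runner.wait(job_id)
    # 2. custom sampling layer on the CPU (encoder_mu, encoder_log_variance)
    z = sampling_layer(encoder_dict['quant_dense_fix'],
                       encoder_dict['quant_dense_1_fix'])
    # 3. decoder subgraph on the DPU ('decoder_input' is a hypothetical name)
    decoder_dict['decoder_input'][:] = z
    job_id = decoder_runner.execute_async(decoder_inbuffer, decoder_outbuffer)
    decoder_runner.wait(job_id)
    # 4. final sigmoid activation on the CPU ('decoder_output' is hypothetical)
    return sigmoid(decoder_dict['decoder_output'])
```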
To prepare the images, xmodel and application code for copying to the selected target, run any or all of the following commands:
```shell
python -u make_target.py -m build/compiled_model_zcu102/autoenc.xmodel -td build/target_zcu102 2>&1 | tee build/logs/target_zcu102.log
python -u make_target.py -m build/compiled_model_u50/autoenc.xmodel -td build/target_u50 2>&1 | tee build/logs/target_u50.log
python -u make_target.py -m build/compiled_model_vck190/autoenc.xmodel -td build/target_vck190 2>&1 | tee build/logs/target_vck190.log
```
The `make_target.py` script will do the following:

- Create a set of noisy test images as image files and copy them to the target folder (a sketch of this step is shown below).
  - The number of images is set by the `--num_images` command line argument, which defaults to 2000, and the file format by the `--image_format` argument.
- Copy the compiled model to the target folder.
- Copy the Python application code to the target folder.
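The first step above might look something like the following sketch, which reuses the same noise parameters as mnist_download() - an illustrative assumption, not the exact make_target.py code (the helper name write_noisy_images is hypothetical):

```python
import os
import cv2
import numpy as np
from tensorflow.keras.datasets import mnist

def write_noisy_images(target_dir, num_images=2000, fmt='png'):
    # download the MNIST test set and scale pixels to range 0:1
    (_, _), (x_test, _) = mnist.load_data()
    x_test = (x_test / 255.0).astype(np.float32)
    # add the same Gaussian noise used during training
    noise = np.random.normal(loc=0.2, scale=0.3, size=x_test.shape)
    x_noisy = np.clip(x_test + noise, 0, 1)
    # write the noisy images into the target folder's images directory
    img_dir = os.path.join(target_dir, 'images')
    os.makedirs(img_dir, exist_ok=True)
    for i in range(min(num_images, x_noisy.shape[0])):
        cv2.imwrite(os.path.join(img_dir, 'input_' + str(i) + '.' + fmt),
                    x_noisy[i] * 255.0)
```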
The entire `target_zcu102` folder will be copied to the ZCU102's SD card. Copy it to the /home/root folder of the flashed SD card; this can be done in one of several ways:

- Direct copy to SD card:
  - If the host machine has an SD card slot, insert the flashed SD card and, when it is recognised, you will see two volumes, BOOT and ROOTFS. Navigate into ROOTFS and then into the /home folder. Make the ./root folder writeable by issuing the command `sudo chmod -R 777 root` and then copy the entire `target_zcu102` folder from the host machine into the /home/root folder of the SD card.
  - Unmount both the BOOT and ROOTFS volumes from the host machine and then eject the SD card from the host machine.
- With the scp command:
  - If the target evaluation board is connected to the same network as the host machine, the `target_zcu102` folder can be copied using scp.
  - The command will be something like `scp -r ./build/target_zcu102 root@192.168.1.227:~/.` assuming that the target board IP address is 192.168.1.227 - adjust this as appropriate for your system.
  - If prompted for a password, enter 'root'.
With the `target_zcu102` folder copied to the SD card and the evaluation board booted, you can issue the command to launch the application - note that this is done on the target evaluation board, not the host machine, so it requires a connection to the board such as a serial connection to the UART or an SSH connection via Ethernet.

The application can be started by navigating into the `target_zcu102` folder on the evaluation board and then issuing the command `python3 app_mt.py`. The application will start and after a few seconds will show the throughput in frames/sec, like this:
```shell
root@xilinx-zcu102-2021_1:~/target_zcu102# python3 app_mt.py
------------------------------------
Command line options:
--image_dir : images
--pred_dir : predictions
--threads : 1
--model : autoenc.xmodel
------------------------------------
Pre-processing 2000 images...
------------------------------------
Found 2 DPU subgraphs
Starting 1 threads...
------------------------------------
Throughput=646.65 fps, total frames = 2000, time=3.0929 seconds
Predicted images saved to ./predictions
```
The predicted images will be written to a folder called 'predictions' on the SD card.
The throughput can be improved by increasing the number of threads with the `--threads` option:
```shell
root@xilinx-zcu102-2021_1:~/target_zcu102# python3 app_mt.py -t 3
------------------------------------
Command line options:
--image_dir : images
--pred_dir : predictions
--threads : 3
--model : autoenc.xmodel
------------------------------------
Pre-processing 2000 images...
------------------------------------
Found 2 DPU subgraphs
Starting 3 threads...
------------------------------------
Throughput=1290.60 fps, total frames = 2000, time=1.5497 seconds
Predicted images saved to ./predictions
```
- Python code for sampling layer taken from Keras example
Command line arguments for train.py:

| Argument | Default | Description |
|---|---|---|
| --float_model or -m | float_model/f_model.h5 | Full path of floating-point model |
| --batchsize or -b | 100 | Batchsize used in training and validation - adjust for memory capacity of your GPU(s) |
| --epochs or -e | 40 | Number of training epochs |
| --learnrate or -lr | 0.0001 | Learning rate for optimizer |
| --predict or -p | False | Will enable predictions if specified |
| --pred_dir or -pd | float_predict | Full path of folder for saving predictions |
Command line arguments for quantize.py:

| Argument | Default | Description |
|---|---|---|
| --float_model or -m | float_model/f_model.h5 | Full path of floating-point model |
| --quant_model or -q | quant_model/q_model.h5 | Full path of quantized model |
| --batchsize or -b | 100 | Batchsize used in training and validation - adjust for memory capacity of your GPU(s) |
| --predict or -p | False | Will enable predictions if specified |
| --pred_dir or -pd | quant_predict | Full path of folder for saving predictions |
Command line arguments for make_target.py:

| Argument | Default | Description |
|---|---|---|
| --target_dir or -td | target | Full path of target folder |
| --image_format or -f | png | Image file format - valid choices are png, jpg, bmp |
| --num_images or -n | 2000 | Number of images to create |
| --app_dir or -a | application | Full path of application code folder |
| --model or -m | compiled_model/autoenc.xmodel | Full path of compiled model |