See the Vitis™ Development Environment and the Vitis AI™ Development Environment on xilinx.com.

- Embedded system designed with the Vitis 2020.2 environment.
- Tested in hardware on the VCK190PP board with the XVDPU TRD platform.

02 December 2021
This repository contains the pre- and post-processing kernels to be used in Machine Learning (ML) jointly with the Deep Learning Processor Unit (DPU for short), in order to accelerate in the Programmable Logic (PL for short) the same tasks that would otherwise be executed by the ARM host CPU of the FPGA target device. Off-loading those two tasks from the ARM CPU improves the overall system performance in terms of frames per second (fps).
The two accelerators were tested using data coming from the Semantic Segmentation CNN of the VAI-KERAS-FCN8-SEMSEG tutorial, where the CNN was retrained with larger image sizes such as 1920x832. The accelerators are general enough to be reused, or easily adapted with few changes, for other Deep Learning applications such as Object Detection or Image Classification.
At the moment we are targeting the VCK190 Pre-Production (PP) board with the so-called XVDPU TRD platform, which contains a DPU designed with 96 AI Engine cores (out of the 400 available) besides other PL resources (BRAMs, URAMs, FFs, LUTs, DSPs).
The two accelerators do not use any core from the AI Engine array of the Versal ACAP, so that they can later be ported more easily to MPSoC devices as well. Their design is done with Vitis High Level Synthesis (HLS for short in the rest of this document) within the Vitis suite.
The application running on the host ARM CPU uses XRT APIs.
This tutorial can also be seen as a complete example of how to use the WAA flow with Vitis 2020.2 targeting the VCK190 PP board.
There are two major commands that basically run everything explained in the following sections:
cd VDPU-PRE-POST-PLACC/files # you are supposed to be here
# whole section 2
source ./run_hls_projects.sh
# whole section 4
cd makefile_flow
source ./run_makefile_flow.sh
Everything shown in this project was done on an Ubuntu 10.04.7 desktop PC with the Vitis 2020.2 suite. This project was never tried on a Windows OS PC.
In case you get some strange errors during the execution of the scripts, you have to pre-process -just once- all the *.sh, *.tcl, *.h and *.cpp files with the dos2unix utility.
In that case run the following commands from your Ubuntu host PC (out of the Vitis AI docker images):
#sudo apt-get install dos2unix
cd <WRK_DIR> #your working directory
for file in $(find . -name "*.sh" ); do dos2unix ${file}; done
for file in $(find . -name "*.tcl"); do dos2unix ${file}; done
for file in $(find . -name "*.h" ); do dos2unix ${file}; done
for file in $(find . -name "*.c*" ); do dos2unix ${file}; done
For each accelerator there are two project folders named hls and vitis, respectively with the source files adopted in the standalone HLS design and in the final Vitis system design.
For each accelerator the files are the same in the two subfolders, the only difference being that the vitis folder also requires the ARM host code with XRT APIs, which is not needed by the hls folder. Therefore, the file dpupreproc_defines.h must have the line #define ARM_HOST commented out when used in the kernel subproject, whereas that line must be uncommented when the file is used in the host code, as shown in the host-side dpupreproc_defines.h (this is the only difference between the two files that have the same name and are placed in different folders).
The same concept is valid also for the post-processing kernel and its related folders hls and vitis, respectively for the source files adopted in the standalone HLS design and in the final Vitis system design.
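As a quick illustration (an excerpt sketch, not the full header), the only intended difference between the two copies of dpupreproc_defines.h is whether the ARM_HOST macro is active:

```cpp
// Copy of dpupreproc_defines.h used by the HLS kernel project:
//#define ARM_HOST      // keep this line commented out for the kernel build

// Copy of dpupreproc_defines.h used by the host code with XRT APIs:
#define ARM_HOST        // keep this line active for the host build
```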
In order to avoid a proliferation of files with the same name, soft links are used for the files that are shared between the standalone HLS project and the Vitis project. Run the following command before reading the rest of this document:
cd VDPU-PRE-POST-PLACC/files
bash -x ./prepare_files
In ML, the pre-processing job changes the statistics of the data used for training the CNN in order to facilitate that training. There are many ways to do this pre-processing; the two most popular methods are explained below with Python code fragments, respectively the "Caffe" and the "TensorFlow" mode (my own terminology, to explain it with simple words):
. . .
if (TensorFlow_preproc): # TensorFlow mode
    _B_MEAN = 127.5
    _G_MEAN = 127.5
    _R_MEAN = 127.5
    MEANS  = [_B_MEAN, _G_MEAN, _R_MEAN]
    SCALES = [0.007843137, 0.007843137, 0.007843137]  # 1.0/127.5
else: # Caffe mode
    _B_MEAN = 104.0
    _G_MEAN = 117.0
    _R_MEAN = 123.0
    MEANS  = [_B_MEAN, _G_MEAN, _R_MEAN]
    SCALES = [1.0, 1.0, 1.0]
. . .
def preprocess_one_image_fn(image_path, pre_fix_scale, width, height):
    means  = MEANS
    scales = SCALES
    image = cv2.imread(image_path)
    image = cv2.resize(image, (width, height))
    B, G, R = cv2.split(image)
    B = (B - means[0]) * scales[0] * pre_fix_scale
    G = (G - means[1]) * scales[1] * pre_fix_scale
    R = (R - means[2]) * scales[2] * pre_fix_scale
    image = cv2.merge([R, G, B])
    image = image.astype(np.int8)
    return image
On the one hand, in Caffe the input image R G B pixels are normally manipulated by subtracting the R G B mean values (MEANS) of all the training dataset images, so the output data is of type signed char (in C/C++) or int8 (Python numpy), with a possible range from -128 to +127, being 8-bit.
On the other hand, in TensorFlow the pixels are normally normalized into the interval from -1.0 to 1.0.
During the CNN training phase the pre-processing works on floating point data, but in real life the DPU works with int8 data after quantization with the Vitis AI tools. Therefore, in the application running in real time on the target device, you have to scale the data with the pre_fix_scale parameter, which comes from a query to the DPU before starting the ML prediction (inference) task itself, with Python code similar to this:
input_fixpos = all_dpu_runners[0].get_input_tensors()[0].get_attr("fix_point")
pre_fix_scale = 2**input_fixpos
In conclusion, before starting its job, the image pre-processing module requires 6 floating point input parameters:
float MEANS[3];
float SCALES[3];
plus the scaling factor, which could be either
float pre_fix_scale;
or alternatively
int input_fixpos;
this last one being a value from 1 to 7, because it represents the exponent i of a power of 2, that is 2^i.
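As a quick recap of the arithmetic that the kernel has to implement, here is a minimal C++ sketch of the per-pixel operation; the function and variable names are illustrative and are not necessarily those used in dpupreproc_vhls.cpp, and the final clamp is only a defensive addition (it is not present in the Python fragment above):

```cpp
#include <cstdint>

// Illustrative per-pixel pre-processing: subtract the mean, apply the dataset
// scale and the DPU fix-point scale, then cast to int8 for the DPU.
static inline int8_t preprocess_pixel(uint8_t pixel, float mean, float scale, int input_fixpos)
{
    const float pre_fix_scale = static_cast<float>(1 << input_fixpos);  // 2^input_fixpos
    float v = (static_cast<float>(pixel) - mean) * scale * pre_fix_scale;
    // defensive clamp to the signed 8-bit range
    if (v >  127.0f) v =  127.0f;
    if (v < -128.0f) v = -128.0f;
    return static_cast<int8_t>(v);
}
```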
In the HLS TestBench (TB) all those parameters are fixed in the dpupreproc_defines.h file, to test the functionality of the core.
The input image used in the self-checking TB was taken from the test dataset of the VAI-KERAS-FCN8HDTV-SEMSEG CNN.
After having set up the Vitis environment, just launch the commands
cd VDPU-PRE-POST-PLACC/files # you are supposed to be here
cd preproc/hls
vitis_hls -f hls_script.tcl
and the whole HLS flow will run through its steps: CSIM, SYN, coSIM and IMP. See the related screenshots in Figures 1, 2, 3 and 4.
Figure 1. Pre-processing CSIM step with Vitis HLS
Figure 2. Pre-processing SYN step with Vitis HLS
Figure 3. Pre-processing coSIM step with Vitis HLS
Figure 4. Pre-processing IMP step with Vitis HLS
Note that the file dpupreproc_defines.h must have the line #define ARM_HOST commented out.
As you see from figure 4, after Place-And-Route, the accelerator consumes the following resources: 4294 LUT, 7042 FF, 2 BRAM and 13 DSP from the Versal 1902 device with a minimum clock period of 2.8ns, which corresponds to 356MHz maximum clock frequency.
Figure 3 reports the cycle accurate simulation (coSIM step): considering the number of clock cycles needed to process the whole image before sending it back to DDR memory, the latency of this kernel is 1198260 (cycles) x 2.8ns (clock period) = 3.359ms.
Even assuming a longer clock period of 5ns (corresponding to 200MHz clock frequency) the latency would become 5.99ms.
Note that this latency is the time to process the entire frame (1920x832x3 pixels), because this is the way Vitis HLS works if you want to run a functional cycle accurate simulation (known as "coSIM") of the accelerator. By itself, however, this core has a real latency of only a few dozen clock cycles. Such effective latency could be exploited either by using AXI4-Stream interfaces (which are not accepted by the DPU core, which cannot work in streaming mode) instead of full MAXI4 interfaces, or by adding a ping-pong buffer of a few image lines between the pre-processing accelerator and the external DDR memory.
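Purely as an illustration of that streaming alternative (a hypothetical interface sketch, not code shipped in this repository), the kernel top level could expose AXI4-Stream ports like this:

```cpp
#include <ap_int.h>
#include <hls_stream.h>

// Hypothetical AXI4-Stream variant of the pre-processing top level: pixels enter
// and leave as a stream, so the few-dozen-cycle core latency is exposed instead of
// being hidden behind whole-frame transfers through the MAXI4 ports.
void hls_dpupreproc_axis(hls::stream<ap_uint<32> > &s_in,
                         hls::stream<ap_uint<32> > &s_out)
{
#pragma HLS INTERFACE axis      port=s_in
#pragma HLS INTERFACE axis      port=s_out
#pragma HLS INTERFACE s_axilite port=return
    // ... pixel-processing loop with II=1 would go here ...
}
```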
In ML, the post-processing job has to present the "features map" generated by the CNN in a form that can be understood by human beings; in the case of Semantic Segmentation, this requires determining which pixel of the image belongs to which class.
In this application there are 12 effective classes out of a maximum of 28 classes per pixel, so the output tensor generated by the DPU is a 3D volume with half the horizontal and vertical size of the input images -that is, 1920/2 and 832/2 respectively- and 28 channels.
For each set of 28 values related to one pixel, the post-processing task first computes the Softmax classifier and then searches for its maximum value and related index: the index of this maximum value represents the object class (coded with a number from 0 to 27) with the highest probability of being predicted by the CNN. This can be illustrated by looking at the C/C++ code of the file dpupostproc_ref.cpp:
void ref_SoftMax(signed char *inp_data, float *out_data, float post_scale_factor, unsigned char size)
{
  float result[MAX_NUM_OF_CLASSES];
  float sum = 0.0f;
  for (int i = 0; i < size; i++) {
    int addr = 128 + inp_data[i];
    assert( (addr >= 0) & (addr <= 255) );
    float x = addr * post_scale_factor;
    result[i] = expf(x);
    sum += result[i];
  }
  float div = 1.0f / sum;
  for (int i = 0; i < size; i++)
    out_data[i] = result[i] * div;
}

void ref_ArgMax(float *inp_data, unsigned char *out_max, unsigned char *out_index, unsigned char size)
{
  unsigned char max = 0, index = 0;
  for (int i = 0; i < size; i++) {
    float val = inp_data[i];
    val = val * 255.0f;
    int i_val = (int) val;
    assert( (i_val <= 255) & (i_val >= 0) );
    unsigned char u_val = i_val;
    if (u_val > max) {
      max = u_val;
      index = i;
    }
  }
  *out_index = index;
  *out_max = max;
}

void ref_dpupostproc(signed char *inp_data, unsigned char *out_max,
                     unsigned char *out_index, float post_scale_factor,
                     unsigned short int height, unsigned short int width)
{
  unsigned short int rows = height;
  unsigned short int cols = width;
  unsigned short int size = MAX_NUM_OF_CLASSES;
  float softmax[MAX_NUM_OF_CLASSES];
  signed char ch_vect[MAX_NUM_OF_CLASSES];
  unsigned char index, max;
  for (int r = 0; r < rows; r++) {
    for (int c = 0; c < cols; c++) {
      for (int cl = 0; cl < size; cl++) {
        signed char tmp_data = inp_data[r*POST_MAX_WIDTH*MAX_NUM_OF_CLASSES + c*MAX_NUM_OF_CLASSES + cl];
        ch_vect[cl] = tmp_data;
      }
      ref_SoftMax(ch_vect, softmax, post_scale_factor, size);
      ref_ArgMax(softmax, &max, &index, size);
      out_max[  r*POST_MAX_WIDTH + c] = (unsigned char) max;
      out_index[r*POST_MAX_WIDTH + c] = index;
    }
  }
}
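For orientation, here is a minimal usage sketch of the reference function on one frame; the 960x416 size comes from this tutorial (half of 1920x832, 28 channels), while the helper function name and the output_fixpos = 2 example are assumptions for illustration only:

```cpp
#include <vector>

// Hypothetical call of ref_dpupostproc() on one 960x416x28 DPU output tensor.
// POST_MAX_WIDTH and MAX_NUM_OF_CLASSES come from dpupostproc_defines.h;
// the post_scale_factor value below is just an example (output_fixpos = 2).
void run_ref_postproc_once(signed char *dpu_out_tensor)
{
    const unsigned short height = 416, width = 960;
    std::vector<unsigned char> out_max(height * POST_MAX_WIDTH);
    std::vector<unsigned char> out_index(height * POST_MAX_WIDTH);
    const float post_scale_factor = 1.0f / (1 << 2);

    ref_dpupostproc(dpu_out_tensor, out_max.data(), out_index.data(),
                    post_scale_factor, height, width);
}
```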
As already done for the pre-processing, also in this case the data generated by the DPU needs to be scaled before being fed into the SoftMax classifier. This is done with the post_scale_factor parameter, which comes from a query to the DPU at run time, with Python code similar to this:
output_fixpos = outputTensors[0].get_attr("fix_point")
post_scale_fact = 1 / (2**output_fixpos)
Note that output_fixpos is a value from 1 to 7, because it represents the exponent i of a power of 2, that is 2^i.
The SoftMax function is computed with a Look-Up Table (LUT); since there are 7 possible output_fixpos values, the file luts.h basically contains 7 different LUTs, one for each value.
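Conceptually (illustrative names; the real tables live in luts.h), the LUT replaces the expf() call of the reference code with an indexed read:

```cpp
// Conceptual LUT-based replacement of expf(addr * post_scale_factor):
// one 256-entry table per possible output_fixpos value (1..7).
// exp_lut is an illustrative name, not necessarily the one used in luts.h.
float softmax_numerator(signed char dpu_value, int output_fixpos,
                        const float exp_lut[7][256])
{
    int addr = 128 + dpu_value;   // map the int8 range [-128,127] to [0,255]
    return exp_lut[output_fixpos - 1][addr];
}
```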
In the HLS TB this parameter is fixed in the dpupostproc_defines.h file, to test the functionality of the core.
The input data used in the HLS self-checking TB were obtained by running the CNN xmodel generated in the VAI-KERAS-FCN8HDTV-SEMSEG tutorial directly on the VCK190 board at run time; they were saved as npy (Python numpy) files, then converted into mat (MATLAB) files and finally into .txt text files.
Note: as an alternative architectural choice to save BRAMs, the ARM CPU could compute the whole Look-Up Table and send it to the post-processor.
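That alternative could look, for example, like the small routine below running on the ARM CPU (a sketch that mirrors the arithmetic of ref_SoftMax; the function name is illustrative):

```cpp
#include <cmath>

// Sketch: the ARM CPU fills one 256-entry exponential table for the current
// post_scale_factor and sends it to the post-processing kernel, so the kernel
// no longer needs to store all 7 pre-computed tables in BRAM.
void build_exp_lut(float post_scale_factor, float lut[256])
{
    for (int addr = 0; addr < 256; addr++)
        lut[addr] = expf(addr * post_scale_factor);
}
```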
After having set up the Vitis environment, just launch the commands
cd VDPU-PRE-POST-PLACC/files # you are supposed to be here
cd postproc/hls
vitis_hls -f hls_script.tcl
and the whole HLS flow will run through its steps: CSIM, SYN, coSIM and IMP. See the related screenshots in Figures 5, 6, 7 and 8.
Figure 5. Post-processing CSIM step with Vitis HLS
Figure 6. Post-processing SYN step with Vitis HLS
Figure 7. Post-processing coSIM step with Vitis HLS
Figure 8. Post-processing IMP step with Vitis HLS
Note that the file dpupostproc_defines.h must have the line #define ARM_HOST commented out.
As you see from Figure 8, after Place-And-Route, the accelerator consumes the following resources: 14347 LUT, 17395 FF, 38 BRAM and 58 DSP from the Versal 1902 device with a minimum clock period of 2.891ns, which corresponds to 345MHz maximum clock frequency.
Figure 7 reports the cycle accurate simulation (coSIM step): considering the number of clock cycles needed to process the whole image before sending it back to DDR memory, the latency of this kernel is 1722479 (cycles) x 2.981ns (clock period) = 5.134ms.
Even assuming a longer clock period of 5ns (corresponding to 200MHz clock frequency) the latency would become 8.61ms.
Note that this latency is the time to process the entire frame (960x416x28 values), because this is the way Vitis HLS works if you want to run a functional cycle accurate simulation (known as "coSIM") of the accelerator. By itself, however, this core has a real latency of only a few dozen clock cycles. Such effective latency could be exploited either by using AXI4-Stream interfaces (which are not accepted by the DPU core, which cannot work in streaming mode) instead of full MAXI4 interfaces, or by adding a ping-pong buffer of a few image lines between the post-processing accelerator and the external DDR memory.
This section explains how to build the embedded system project with the Vitis GUI, now that you have developed the two accelerator kernels as standalone HLS projects. You must have available the following platform and petalinux folders/files related to the XVDPU TRD platform design:
# TRD platform file
ZF_VDPU_TRD/platform/vck190_dpu/vck190_dpu.xpfm
# Sysroot path
ZF_VDPU_TRD/petalinux/xilinx-vck190-base-trd/images/linux/sdk/sysroots/aarch64-xilinx-linux/
# Root FS file
ZF_VDPU_TRD/petalinux/xilinx-vck190-base-trd/images/linux/rootfs.ext4
# Linux Kernel Image file
ZF_VDPU_TRD/petalinux/xilinx-vck190-base-trd/images/linux/Image
Since the DPU core is not yet in this design, the two PL accelerators work with pre-defined scaling factors. In the real application, the information about such scaling factors should be obtained by querying the fix_point attributes of the input and output tensors of the CNN subgraph running on the DPU.
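For reference, here is a hedged C++ sketch of that query with the VART/XIR APIs of Vitis AI (assuming a vart::Runner pointer named runner; only the fix_point attribute name is taken from this tutorial, everything else is illustrative):

```cpp
#include <vart/runner.hpp>
#include <xir/tensor/tensor.hpp>

// Sketch: read the fix_point attribute of the DPU input and output tensors and
// derive the scaling factors needed by the pre- and post-processing kernels.
void get_dpu_scaling_factors(vart::Runner *runner,
                             float &pre_fix_scale, float &post_scale_factor)
{
    auto in_tensors  = runner->get_input_tensors();
    auto out_tensors = runner->get_output_tensors();
    const int input_fixpos  = in_tensors[0]->get_attr<int>("fix_point");
    const int output_fixpos = out_tensors[0]->get_attr<int>("fix_point");
    pre_fix_scale     = static_cast<float>(1 << input_fixpos);         // 2^input_fixpos
    post_scale_factor = 1.0f / static_cast<float>(1 << output_fixpos); // 1 / 2^output_fixpos
}
```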
This section contains the instructions to create an embedded system design in which both the Pre- and Post-processing kernels are working in parallel (of course on different data).
This step was done after having created an embedded system with only one kernel at a time and then functionally tested such standalone kernel.
Then the host code was written in a way to encapsulate the code related to each kernel so that they could work in parallel without any interference.
If you look at the host_preproc_xrt.cpp and host_postproc_xrt.cpp files, you will note that the main() routine is guarded by #ifndef TWO_KERNELS.
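For orientation, here is a heavily trimmed sketch of that structure (XRT native C++ API; the kernel and xclbin names are those used elsewhere in this tutorial, everything else is illustrative):

```cpp
#include <xrt/xrt_device.h>
#include <xrt/xrt_kernel.h>

#ifndef TWO_KERNELS
// Standalone build: this main() is compiled only when TWO_KERNELS is NOT defined,
// mirroring the structure of host_preproc_xrt.cpp and host_postproc_xrt.cpp.
int main(int /*argc*/, char * /*argv*/[])
{
    xrt::device device(0);                                    // open the Versal device
    auto uuid = device.load_xclbin("binary_container_1.xclbin");
    xrt::kernel krnl(device, uuid, "hls_dpupreproc_m_axi");   // one of the two PL kernels
    // ... allocate xrt::bo buffers, set arguments, launch the kernel, check results ...
    return 0;
}
#endif // TWO_KERNELS
```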
Since the instructions to create the two standalone projects are basically the same, for the sake of conciseness those steps are described here only once.
1. From the Vitis GUI create a new application project and select the vck190_dpu.xpfm file associated with the XVDPU TRD platform design, as illustrated in Figure 9.
2. Select the ARM Cortex-A72 as application domain and fill in the appropriate Sysroot path, Root FS and Linux Kernel Image tags with the above-mentioned files, see also Figure 10.
3. Select "AI Engine System Design Examples -> Empty Application" as design template, see Figure 11. Then set the Active build configuration to Hardware.
4. Delete the subproject "two_kernels_kernels", as illustrated in Figure 12.
5. With the mouse, click on the "File" menu, select "New -> Hw Kernel project" and give it a name such as "postproc"; this will be the subsystem of the post-processing accelerator (or kernel). Make sure you have selected "two_kernels" as the "system project name". See Figure 13.
6. Now import the following three source files for this accelerator: dpupostproc_vhls.cpp, dpupostproc_defines.h (this one with #define ARM_HOST commented out) and lut_exp.h. See Figure 14.
7. With the mouse, click on the file postproc.prj and select the top level function hls_dpupostproc_m_axi, which is the name of the accelerator in this Vitis flow. Right-click on the "postproc" kernel in the project Explorer (on the left) and select "Build". Make sure that you have first set the Active build configuration to Hardware. See Figure 15.
8. Similarly to what was done in steps 5 and 6, now create the pre-processing kernel. Again, with the mouse, click on the "File" menu, select "New -> Hw Kernel project" and give it a name such as "preproc"; this will be the subsystem of the pre-processing accelerator. Now add the source files dpupreproc_vhls.cpp and dpupreproc_defines.h (this last one with #define ARM_HOST commented out).
9. Similarly to what was done in step 7, click on the file preproc.prj and select the top level function hls_dpupreproc_m_axi, which is the name of the accelerator in this Vitis flow. Right-click on the "preproc" kernel in the project Explorer (on the left) and select "Build". Make sure that you have first set the Active build configuration to Hardware. See Figure 16.
10. Now import all the necessary files for the host application from preproc/vitis/host and postproc/vitis/host. At the end of the process you will have what is illustrated in Figure 17. You also need to add the file host_main.cpp.
11. Now set the C/C++ Build Settings for the host application. Right-click on "two_kernels[linux on psv_cortexa72]" in the project Explorer, select "C/C++ Build -> Settings -> Dialect" and choose ISO C++1y. Add the TWO_KERNELS macro in the Preprocessor settings. See Figure 18.
12. Still in the C/C++ Build Settings, remove the OpenCL library and add the XRT xrt_coreutil library. See Figure 19.
13. Now right-click on "two_kernel_system[vck190_dpu]" and launch the "Build" action. You now have to wait several minutes, depending on your host PC. The sd_card to boot the Linux OS on the VCK190PP board, together with the binary_container_1.xclbin bitstream to program the device, will be created at the end of this process.
14. Prepare a new SD card to boot the VCK190PP by writing the file sd_card.img with a utility like Win32DiskImager (on my Windows10 OS laptop). See Figure 20.
NOTE: Most of the above actions could be skipped by opening the Vitis GUI and importing the Vitis archive two_kernels_system.ide.zip. The only problem is that you have to manually adapt the TRD platform file, the Sysroot path, the Root FS file and the Linux Kernel Image file to make it work correctly.
Figure 9. Vitis GUI flow: selecting the platform file
Figure 10. Vitis GUI flow: setting the domain specific tags
Figure 11. Vitis GUI flow: design template selection
Figure 12. Vitis GUI flow: remove the "two_kernels_kernels" subproject
Figure 13. Vitis GUI flow: add the "postproc" kernel subproject
Figure 14. Vitis GUI flow: add the source code files to the "postproc" kernel
Figure 15. Vitis GUI flow: build the "postproc" kernel
Figure 16. Vitis GUI flow: build the "preproc" kernel
Figure 17. Vitis GUI flow: import the source code for the host application
Figure 18. Vitis GUI flow: ISO C++1y dialect set to compile the host application (top) and preprocessor defines (bottom)
Figure 19. Vitis GUI flow: adjust the libraries to compile the host application, by removing xilinxopencl (top) and adding xrt_coreutil (bottom)
Figure 20. Write the sd card to boot the VCK190PP target board
1. Now turn on and boot your VCK190PP target board and open a PuTTY terminal on your host PC to communicate over UART directly with the target board. As illustrated in the right part of Figure 21, set the board IP address (for example 192.168.1.200) by running the following command in the PuTTY terminal:
ifconfig eth0 192.168.1.200
2. To test the two PL kernels at runtime on the target board, you have to transfer their input data. Use a file-transfer (scp-based) utility like FileZilla and copy the data folders data_post and data_pre from your host PC to the /mnt/ folder of the target board. See the left part of Figure 21.
3. Right-click on "two kernels system[vck190-dpu]" in the project Explorer, select "Debug Configurations" and then double-click on "System Project Debug", as shown in Figure 22.
4. Then set the "Linux Agent" for the debug server using the same IP address of item 1 above, as illustrated in Figure 23.
5. Run the debugger. You should see the positive results reported in Figure 24.
Figure 21. Debug flow: file transfer of accelerator I/O data between host PC and target board
Figure 22. Debug flow: setting the Debug Configurations
Figure 23. Debug flow: set the Linux agent between host and target
Figure 24. Debug flow: test ended successfully
When adding the DPU software application to the PL pre- and post-processing accelerators, you have to temporarily leave the Vitis GUI-based flow and use the Makefile-based flow.
Assuming you have properly set up the Vitis environment, the complete software application with the cascade of the three kernels (pre-processing, DPU, post-processing) can be compiled with the Vitis Makefile-based flow of the host_apps folder, by launching the following commands:
cd VDPU-PRE-POST-PLACC/files # you are supposed to be here
cd makefile_flow
bash -x ./run_makefile_flow.sh
These commands compile the host applications with a Makefile flow for the standalone pre-processing (preproc folder, application named host_preproc_xrt), the standalone post-processing (postproc folder, application named host_postproc_xrt) and the cascade "pre-processing -> DPU -> post-processing" (pre2post folder, application named pre2post).
Note that in the run_makefile_flow.sh script the following environment variables need to be set correctly:
#change the following two directories according to your needs
export VDPU_PRE_POST_PL_ACC=/media/danieleb/DATA/ZF/new_VDPU-PRE-POST-PL-ACC/files
export DB_FATHER_PATH=/media/danieleb/DATA/ZF/ZF_ProAI-main/NEW_ZF_PACKAGE_FINAL
You can create an archive and copy it to your VCK190 target board with the scp utility (assuming your board has IP address VCK190_IP_ADDRESS):
#from HOST PC
cd VDPU-PRE-POST-PLACC/files # you are supposed to be here
cd makefile_flow
# -h to replace softlinks with real files
tar -hcvf host_apps.tar ./host_apps
# transfer archive from host to target
scp host_apps.tar root@VCK190_IP_ADDRESS:~/
Then you can work on the UART terminal of your target board with the following commands:
#FROM TARGET BOARD
tar -xvf host_apps.tar
cd host_apps
bash -x ./run_all_acc.sh | tee logfile_host_apps_vck190p.txt
You should see something similar to what is reported in the [logfile_host_apps_vck190p.txt](files/makefile_flow/img/logfile_host_apps_vck190p.txt) file.
Each host application generates an output that perfectly matches the reference:
- the standalone preproc PL kernel generates the testing_0_1920x832_out.bmp image, which is bit-by-bit equal to the testing_0_1920x832_ref.bmp image produced by the software task running on the ARM CPU as reference;
- the standalone postproc PL kernel generates the pl_hls_index.bin binary file, which is bit-by-bit equal to the arm_ref_index.bin binary file produced by the software task running on the ARM CPU as reference;
- the processing chain pre2post, composed of the cascade of the preproc, dpu and postproc kernels, produces the outputs of Figures 25 and 26; the output files post_uint8_out_idx.bin (PL pre-processing, DPU and PL post-processing kernels) and post_uint8_ref_idx.bin (DPU and ARM software post-processing task) perfectly match each other.
Figure 25. Pre-processing output data, represented as an image.
Figure 26. Post-processing output data, represented as an image. On the left the input image, on the right the output segmented image.
Besides running the host applications directly on the target board as commands, you can use the Vitis GUI and debug the application one step at a time, as follows:
cd VDPU-PRE-POST-PLACC/files # you are supposed to be here
cd makefile_flow/host_apps/pre2post/src/
vitis -debug -flow embedded -os linux -host-exe-file ../../../../host_apps/makefile_flow/pre2post/pre2post -program-args "/home/root/pre2post/model/fcn8.xmodel /home/root/pre2post/data_pre2post/dataset1/img_test/ 1 1 1" -host VCK190_IP_ADDRESS -target-work-dir /home/root/pre2post
Be careful not to make any mistakes with the directory names and levels, either on the host or on the target: if you make a mistake, the GUI might not pop up correctly.
You should see something similar to what is illustrated in Figure 27:
Figure 27. Vitis GUI to debug the application with pre- and post-processing and DPU kernels
When the semantic segmentation CNN is executed with a single thread on the system composed of the DPU and the pre- and post-processing PL accelerators, you can observe the following throughput performance in terms of average fps (frames per second):
- pre-processing task: 37fps by the PL accelerator vs. 3fps by the ARM CPU software task
- post-processing task: 78fps by the PL accelerator vs. the ARM CPU software task
- DPU task: 51fps.
Note that the latency of the PL accelerators could be further reduced by making them work in a purely streaming dataflow mode.
Copyright© 2021 Xilinx