This is the final project for the Special Course on Computer Architecture, in which an FPGA and a GPU are used to accelerate a simple CNN, LeNet-5. HLS and CUDA are used for the FPGA and the GPU, respectively.
Hardware info:
- CPU: Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
  - latency = 1.59 ms/inference
- GPU: GeForce GTX 970
  - latency = 3.3 ms (with a batch size of 100)
- FPGA: Xilinx Kintex UltraScale
  - latency = 0.54 ms/inference
For real-time applications or small-batch inference, the FPGA is the fastest.
- Run `vivado_hls -f build_prj.tcl` in this repo, and you'll have a synthesizable project.
- Train the model by running `keras_lenet.py`, which generates two files: `saved_model.json` and `saved_weights.h5`.
- Clone this repo, which is a fork of the great project hls4ml.
- `cd hls4ml/keras-to-hls`, create a new directory `lenet5-model-files`, and put the generated `saved_model.json` and `saved_weights.h5` there.
- Run `python keras-to-hls.py -c keras-config.yml`, which generates the C source files for the HLS project.
- Build the HLS project by running `vivado_hls -f build_prj.tcl` (30–40 min).
- A lot of 1's will be printed out, each of which means a correct inference; a 0 means an incorrect inference. More information can be found in the hls4ml repo. Note, however, that the generated project without manual optimization can be non-synthesizable or have very poor performance.
- Testbench: `lenet5_test.cpp`
- Test data: `test_images/` (generated by the script `test_images/save_images.py`)
- Wrap inner loops into functions, such as the elementwise multiplication of a weight filter with a block of the image. In the 2D convolution, the innermost two loops are wrapped into one function, so that the filtering logic can be reused and pipelined; see the sketch below.
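A minimal sketch of the idea (the function and template parameter names here are illustrative, not the repo's exact code):

```cpp
// Multiply one filter with one image patch and accumulate the products.
// Wrapping these innermost loops in a function lets HLS pipeline this
// unit once and reuse it for every output position.
template<class data_T, class CONFIG_T>
typename CONFIG_T::accum_t filter_mult(
    data_T window[CONFIG_T::filt_height * CONFIG_T::filt_width],
    typename CONFIG_T::weight_t filt[CONFIG_T::filt_height * CONFIG_T::filt_width])
{
    #pragma HLS PIPELINE
    typename CONFIG_T::accum_t acc = 0;
    for (int k = 0; k < CONFIG_T::filt_height * CONFIG_T::filt_width; k++) {
        acc += window[k] * filt[k];
    }
    return acc;
}
```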
- Pipeline and partition at the same time:

```cpp
for (int oh = 0; oh < CONFIG_T::out_height; oh++) {
    for (int ow = 0; ow < CONFIG_T::out_width; ow++) {
        #pragma HLS PIPELINE
        for (int ff = 0; ff < CONFIG_T::n_filt; ff++) {
            acc[oh * CONFIG_T::out_width + ow][ff] = biases[ff]; // partition acc in dim 2 and biases in dim 1
        }
    }
}
```
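The matching partition pragmas go where the arrays are declared; a sketch, assuming `acc` is a local array and `biases` a function argument:

```cpp
// Partitioning dim 2 of acc and dim 1 of biases exposes all n_filt
// elements in parallel, so the unrolled ff loop has no port conflicts.
typename CONFIG_T::accum_t acc[CONFIG_T::out_height * CONFIG_T::out_width][CONFIG_T::n_filt];
#pragma HLS ARRAY_PARTITION variable=acc complete dim=2
#pragma HLS ARRAY_PARTITION variable=biases complete dim=1
```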
- Use a temporary variable to store the accumulation result to reduce memory accesses, as sketched below.
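A sketch using the same assumed names as above: accumulate in a scalar register and write the result back to the `acc` array once.

```cpp
typename CONFIG_T::accum_t tmp = acc[oh * CONFIG_T::out_width + ow][ff];
for (int k = 0; k < CONFIG_T::filt_height * CONFIG_T::filt_width; k++) {
    tmp += window[k] * filt[k];  // accumulate in a register, not in the array
}
acc[oh * CONFIG_T::out_width + ow][ff] = tmp;  // single write-back
```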
- For fully-connected layers (matrix-vector multiplication), pipeline the first-level loop:

```cpp
// keras (tf backend) uses transposed weight matrices
typename CONFIG_T::data_T cache;
Product1: for (int ii = 0; ii < CONFIG_T::n_in; ii++) {
    #pragma HLS PIPELINE
    cache = data[ii];
    Product2: for (int jj = 0; jj < CONFIG_T::n_out; jj++) {
        int index = ii * CONFIG_T::n_out + jj;
        typename CONFIG_T::accum_t mult = cache * weights[index];
        acc[jj] += mult;
    }
}
```
- Remove unnecessary flattening and unflattening; match the dimensions of the output of one layer and the input of the next, as in the sketch below.
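An illustrative sketch (the layer function names and config types are assumptions, not the repo's exact API):

```cpp
// Keep the conv output in its natural layout and feed it directly into
// pooling, instead of flattening to 1D and reshaping back in between.
conv_2d<data_T, conv2_config>(pool1_out, conv2_out, conv2_weights, conv2_biases);
pool_2d<data_T, pool2_config>(conv2_out, pool2_out);  // same layout, no reshape
```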
- Don't declare local arrays too deep in the loop structure; this can result in large resource consumption.
The original accuracy is 98.89%; with 16-bit fixed point (excluding the softmax layer), the accuracy drops to 98.87%. The softmax layer introduces a further tiny drop in accuracy.
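For reference, fixed-point types in Vivado HLS are declared as below; the 16-bit width is from the text, while the 6 integer bits are an assumption (hls4ml's default split):

```cpp
#include "ap_fixed.h"

// ap_fixed<W, I>: W total bits, I integer bits, W - I fraction bits.
typedef ap_fixed<16, 6> model_default_t;  // assumed integer/fraction split
```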
- xcku095-ffvb2104-1-c
  - latency = interval = 54064 cycles × 10 ns/cycle ≈ 0.54 ms
  - resource usage
    - BRAM 49%
    - DSP 96%
    - FF 12%
    - LUT 46%
- xcku115-flvb2104-2-e:
The code for the GPU is located under `gpu`. Under the `gpu` directory, run `make cpu` or `make gpu` to build the CPU or GPU version of the code. The CPU/GPU coding style is very different from HLS, especially in the convolution part, as can be seen from this project.
The most expensive part, i.e. the second convolutional layer, is accelerated on the GPU by replacing the three outermost loops with a 3D grid of threads; a minimal sketch follows.
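A CUDA sketch of this mapping (the kernel name, array layouts, and dimensions are illustrative assumptions, not the repo's exact code):

```cuda
// Each thread computes one output element of the second conv layer; the
// three outermost loops (filter, output row, output column) become the
// z/y/x dimensions of the grid.
__global__ void conv2_kernel(const float* in,  /* [C_IN][IN_H][IN_W]     */
                             const float* w,   /* [N_FILT][C_IN][K][K]   */
                             const float* b,   /* [N_FILT]               */
                             float* out,       /* [N_FILT][OUT_H][OUT_W] */
                             int c_in, int in_h, int in_w,
                             int n_filt, int k, int out_h, int out_w) {
    int ow = blockIdx.x * blockDim.x + threadIdx.x;  // output column
    int oh = blockIdx.y * blockDim.y + threadIdx.y;  // output row
    int f  = blockIdx.z * blockDim.z + threadIdx.z;  // output filter
    if (ow >= out_w || oh >= out_h || f >= n_filt) return;

    float acc = b[f];
    for (int c = 0; c < c_in; c++)
        for (int kh = 0; kh < k; kh++)
            for (int kw = 0; kw < k; kw++)
                acc += in[(c * in_h + (oh + kh)) * in_w + (ow + kw)] *
                       w[((f * c_in + c) * k + kh) * k + kw];
    out[(f * out_h + oh) * out_w + ow] = acc;
}
```

A launch then tiles the output volume, e.g. `dim3 block(8, 8, 4); dim3 grid((out_w + 7) / 8, (out_h + 7) / 8, (n_filt + 3) / 4);`.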
Except for the second conv layer, everything else stays on the host. Since the memory allocation and kernel launches induce a large overhead, the overall latency averaged over 100 inferences is slightly worse than on the CPU (3.50 vs 1.59 ms/inference). However, comparing the kernel alone, the GPU is about 20 times faster than the CPU (0.042 ms vs 0.9 ms). This advantage only manifests itself when a large amount of data uses the same parameters and can be prefetched into GPU memory.
It is also possible to implement the CUDA kernel as C++ templates, as explained in this blog post.
For comparison, the evaluation is also done with Keras + TF:
The reduction in latency flattens out mainly due to the limited memory on my laptop; if the CPU/GPU memory were large enough, the framework implementation would show even better scalability.
Note that the CPU and GPU used here differ from those above: an Intel i7-7500 and a GeForce 940MX, since this evaluation was run on my own laptop.