FADO: Floorplan-Aware Directive Optimization for HLS Designs on Multi-Die FPGAs
Thanks for using our FADO framework! FADO is developed by the Reconfiguration Computing System Lab @ HKUST and is to appear as a regular paper (oral) at the 2023 ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA 2023).
For personal use (not for redistribution), you can refer to the pre-print:
- in this repo as fado.pdf
- on arXiv: https://arxiv.org/abs/2212.11582
Linfeng Du, Tingyuan Liang, Sharad Sinha, Zhiyao Xie, and Wei Zhang. 2022. FADO: Floorplan-Aware Directive Optimization for High-Level Synthesis Designs on Multi-Die FPGAs. In Proceedings of the 2023 ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA ’23), February 12–14, 2023, Monterey, CA, USA. ACM, New York, NY, USA, 11 pages.
- Step 0: System check (only the versions we verified are listed; other versions are not guaranteed to work; a quick command-line check is sketched after this list):
  - Ubuntu OS: 20.04.4 LTS / 20.04.5 LTS
  - Linux kernel: 5.4.0-050400-generic / 5.14.0-1054-oem
  - Vitis/Vitis_HLS/Vivado 2020.2
  - ≥ 64 GB DDR4 for back-end implementation using Vitis, as suggested by Xilinx document UG1301
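
  A minimal sketch for checking the items above from the terminal. These are standard Linux/Xilinx commands, nothing specific to this repo:

  ```bash
  # Quick host check against the verified configuration listed above.
  lsb_release -d     # Ubuntu release (20.04.4 / 20.04.5 LTS verified)
  uname -r           # Linux kernel (5.4.0-050400-generic / 5.14.0-1054-oem verified)
  free -h            # total memory; >= 64 GB DDR4 suggested for the Vitis back end
  vivado -version    # Vitis/Vitis_HLS/Vivado 2020.2 verified (requires the Vivado settings to be sourced)
  ```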
- Step 1: apt packages: either a single command, `bash step1-install-apt-packages.sh`, or separate `sudo apt install <package>` commands for the following (a combined one-liner follows this list):
  - faketime
  - iverilog
  - swig (prerequisite for `pip install oapackage`)
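
  The separate installs above collapse into a single command:

  ```bash
  sudo apt install faketime iverilog swig
  ```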
- Step 2: Python 3.9 packages: either a single command, `pip install -r step2-pip-requirements.txt`, or separate `pip install <package>` commands for the following (a combined one-liner follows this list):
  - OApackage==2.7.1 (for plotting the Pareto front)
  - matplotlib==3.5.1
  - defaultlist==1.0.0
  - graphviz==0.20
  - anytree==2.8.0
  - pyverilog==1.3.0
  - mip==1.14.0
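
  The pinned versions above collapse into a single command:

  ```bash
  pip install OApackage==2.7.1 matplotlib==3.5.1 defaultlist==1.0.0 \
      graphviz==0.20 anytree==2.8.0 pyverilog==1.3.0 mip==1.14.0
  ```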
- Step 3: Packages for the Alveo U250 board (please note that the packages in (2) and (3) of our environment are outdated and no longer hosted on the current Xilinx website; please use our archives to reproduce the same experiment environment):
  - (1) Download the Xilinx Runtime (https://www.xilinx.com/products/boards-and-kits/alveo/u250.html#gettingStarted) and install it with `sudo apt install ./xrt*.deb`
  - (2) Download the U250 Deployment Platform (deployment_archive) and install it with `sudo apt-get install ./xilinx-u250-xdma-201830.2-2580015_18.04.deb`
  - (3) Download the U250 Development Platform (development_archive) and install it with `sudo apt-get install ./xilinx-u250-xdma-201830.2-dev-2580015_18.04.deb`
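
  An optional sanity check (our own suggestion, not part of the original install flow) to confirm the runtime and the two platform packages are registered after installation:

  ```bash
  # Should list xrt plus the xilinx-u250-xdma deployment and development packages.
  dpkg -l | grep -i -E "xrt|xilinx-u250"
  ```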
For artifact evaluation, if you come across any difficulty with the environment or the experiments, or if you need us to provide a remote environment for you, please do not hesitate to contact Linfeng Du @ [email protected]. We will get back to you as soon as possible (most likely within 24 hours).
To reproduce the results shown in the FADO paper (specifically, the last two rows of Table 5 and the whole of Table 6), we designed the following three experiments. For each experiment, please find below:
- the working directory and the corresponding data entry in Table 5 and/or Table 6
- the command used in the terminal
- an explanation of the generated results and the output log
- an uncertainty analysis: whether you can reproduce the same or very close results as reported in the paper -- some experiments reproduce the same results, while others may vary because of the sources of uncertainty shown in the workflow figure below.
- Uncertainty 1: the initial "AutoBridge Floorplanner" using the MILP solver can produce different initial solutions
- Uncertainty 2: iteratively calling the "AutoBridge Floorplanner" can add further uncertainty to the resulting QoR
- Uncertainty 3: randomness during back-end placement and routing (P&R)
- About runtime, it can also vary because of:
  - CPU performance differences
  - operating-system process scheduling
  - the random convergence time of the MILP solver
- Notice: although we could set random seeds to keep the solver's behavior stable, doing so could limit the optimality of the generated results. Instead, we ran each experiment multiple times and reported the most common latency and resource observations in our paper.
- Experiment 1: ...
  - Command: `python main.py 2 9` (a batch-run sketch follows the flag explanations below)
    - "2" for the AutoBridge Floorplanner
    - "9" for various choices of directive optimization (and iterative floorplan legalization)
  - Working directories:
    - `./benchmarks/.*/latency_fp_do`
      - Corresponding data entry in the paper: Table 5, "Initial FP -> Iterative DO" (the second line)
      - Uncertainty analysis (factor: Uncertainty 1):
        - latency: almost always the same
        - resource: almost always the same
        - runtime: could vary
    - `./benchmarks/.*/latency_ab`
      - Corresponding data entry in the paper: Table 5, "Iterative (DO + AutoBridge FP)" (the third line)
      - Uncertainty analysis (factors: Uncertainty 1, Uncertainty 2):
        - latency: almost always the same
        - resource: almost always the same
        - runtime: could vary, especially because of the MILP solver's convergence randomness
    - `./benchmarks/.*/latency_fado`
      - Corresponding data entries in the paper:
        - Table 5, "Original (no directive)" (the first line)
        - Table 5, "Iterative (DO + Incr FP) (Ours)" (the fourth line)
        - Table 6 (the whole table)
      - Uncertainty analysis (factor: Uncertainty 1):
        - latency: almost always the same
        - resource: almost always the same
        - runtime: could vary
        - in particular, the "mttkrp_cov" benchmark can show larger randomness because its final utilization is very close to the upper limit of the resources available on the FPGA. Besides the most common results reported in our paper, other common results include:
              ======== DSE Stages (Table 6) MTTKRP*2+COV*2 ========
              Stage 0: Online
              Resource: 57.10%, Latency (thousand cycles): 160062.3
              Stage 1: Online+Offline
              Resource: 57.10%, Latency (thousand cycles): 160062.3
              Stage 2: Online+Offline+Ahead
              Resource: 63.45%, Latency (thousand cycles): 101763.6
              Stage 3: Online+Offline+Ahead+Back
              Resource: 64.39%, Latency (thousand cycles): 101755.4

          or

              ======== DSE Stages (Table 6) MTTKRP*2+COV*2 ========
              Stage 0: Online
              Resource: 63.15%, Latency (thousand cycles): 163241.1
              Stage 1: Online+Offline
              Resource: 64.67%, Latency (thousand cycles): 153927.2
              Stage 2: Online+Offline+Ahead
              Resource: 63.26%, Latency (thousand cycles): 129184.0
              Stage 3: Online+Offline+Ahead+Back
              Resource: 63.25%, Latency (thousand cycles): 128104.0
  - Output log:
    - in `./benchmarks/.*/output/latency_resource_runtime.log` (a sketch for collecting all logs at once follows the example below)
    - Example log of the test `./benchmarks/cnn_2mm/latency_fado`:

          Iterative (DO + Incr FP) (Our FADO) directive search result (Table 5):
          Runtime (s): 1.7685
          Latency (thousand cycles): 91.164
          Resource: 55%
          ============ DSE Stages (Table 6) ============
          Original (no directive):
          Resource: 28.27%, Latency (thousand cycles): 8933.0
          Stage 0: Online
          Resource: 28.27%, Latency (thousand cycles): 734.6
          Stage 1: Online+Offline
          Resource: 40.12%, Latency (thousand cycles): 131.8
          Stage 2: Online+Offline+Ahead
          Resource: 55.01%, Latency (thousand cycles): 91.4
          Stage 3: Online+Offline+Ahead+Back
          Resource: 54.56%, Latency (thousand cycles): 91.2
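
    A sketch (ours, not part of the repo) for collecting the headline numbers from every generated log in one go; the field names match the example log above:

    ```bash
    # Print the runtime, latency, and resource lines from all benchmarks' logs.
    grep -H -E "Runtime \(s\)|Latency \(thousand cycles\)|Resource" \
        ./benchmarks/*/output/latency_resource_runtime.log
    ```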
  - Explanation:
    - Experiment 1 is designed for you to get almost the same latency and resource, and a roughly proportional runtime, for every test case as reported in our paper.
- Experiment 2:
  - Command: `python main.py 3 4` (a batch-run sketch follows the flag explanations below)
    - "3" for exporting the RTL design and packing the XO
    - "4" for running the Vitis flow (v++)
  - Working directories:
    - `./benchmarks/.*/freq_fp_do`
      - Corresponding data entry in the paper: Table 5, "Initial FP -> Iterative DO" (the second line)
      - Uncertainty analysis (factor: Uncertainty 3):
        - frequency: almost always the same
    - `./benchmarks/.*/freq_ab`
      - Corresponding data entry in the paper: Table 5, "Iterative (DO + AutoBridge FP)" (the third line)
      - Uncertainty analysis (factor: Uncertainty 3):
        - frequency: almost always the same
    - `./benchmarks/.*/freq_fado`
      - Corresponding data entry in the paper: Table 5, "Iterative (DO + Incr FP) (Ours)" (the fourth line)
      - Uncertainty analysis (factor: Uncertainty 3):
        - frequency: almost always the same
  - Output:
    - Please check the post-implementation Fmax using the script `./script/get_freq.py`, e.g., starting from the current base directory (a loop over all benchmarks is sketched after the example output below):

          cd ./benchmarks/cnn_2mm/freq_fado/
          python ../../../script/get_freq.py .

    - Example output in the terminal:

          Usage:
              python get_freq.py $(realpath [benchmark base])
          Relative path: ./vitis_run/top_xilinx_u250_xdma_201830_2.temp/reports
          Full vitis report path: ./vitis_run/top_xilinx_u250_xdma_201830_2.temp/reports/link/imp
          Timing report found: ./vitis_run/top_xilinx_u250_xdma_201830_2.temp/reports/link/imp/impl_1_xilinx_u250_xdma_201830_2_bb_locked_timing_summary_postroute_physopted.rpt
          Fmax: 274.10
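
    A sketch (ours, not part of the repo) for checking Fmax across every FADO frequency run, following the same invocation pattern as the single-benchmark example above:

    ```bash
    # Report the post-route Fmax for each benchmark's freq_fado run.
    for dir in ./benchmarks/*/freq_fado; do
        echo "==== $dir ===="
        (cd "$dir" && python ../../../script/get_freq.py .)
    done
    ```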
  - Explanation:
    - Experiment 2 is designed for you to get almost the same frequency for every test case as reported in the paper.
- Experiment 3:
  - Commands: `python main.py 2 9`, then `python main.py 3 4` (see the sketch below)
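
    The full flow is simply the two previous experiments run back to back from the same working directory (same assumption about the location of `main.py` as above):

    ```bash
    # Experiment 1 step: directive optimization with iterative floorplan legalization.
    python main.py 2 9
    # Experiment 2 step: export the RTL, pack the XO, and run the Vitis (v++) flow.
    python main.py 3 4
    ```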
  - Working directories:
    - `./benchmarks/.*/all_ab`
      - Corresponding data entry in the paper: Table 5, "Iterative (DO + AutoBridge FP)" (the third line)
    - `./benchmarks/.*/freq_fado`
      - Corresponding data entry in the paper: Table 5, "Iterative (DO + Incr FP) (Ours)" (the fourth line)
  - Output:
    - Latency, resource, and runtime in `./benchmarks/.*/output/latency_resource_runtime.log`
    - Fmax using the script `./script/get_freq.py`
  - Explanation:
    - Experiment 3 is designed for you to test the functionality of FADO's whole workflow.
    - Since all of the uncertainties mentioned above are involved in this test, the QoR could vary a bit more than in the previous experiments.