This repo received all four Artifact Evaluation badges (Available, Functional, Reusable, Replicated) from FPT2022.
To address slow FPGA compilation, researchers have proposed running separate compilations for smaller design components using Partial Reconfiguration (PR) [Xiao/FPT2019, Xiao/ASPLOS2022]. Unlike these previous works, this work provides variable-sized pages that are hierarchically recombined from multiple smaller pages depending on the size of the user operators. This unique capability not only accelerates FPGA compilation but also relieves users of the burden of fitting operators into fixed-sized pages. For more details, please refer to our FPT2022 paper.
The starting code is forked from PLD repository [Xiao/ASPLOS2022]. The main differences are:
- static design generation using Hierarchical Partial Reconfiguration (a.k.a. Nested DFX), thereby providing variable-sized PR pages
- synchronization after the synthesis jobs for automatic page assignment
The framework was developed on Ubuntu 20.04 (kernel 5.4.0) with Vitis 2021.1 and a Xilinx ZCU102 evaluation board.
If you installed Vitis under /tools/Xilinx, set Xilinx_dir in ./common/configure/configure.xml as below.
<spec name = "Xilinx_dir" value = "/tools/Xilinx/Vitis/2021.1/settings64.sh" />
The ZYNQMP common image file can be downloaded from the Vitis Embedded Platforms page. Place the image in a directory of your choice (e.g. /opt/platforms/), and adjust the configuration in ./common/configure/zcu102/configure.xml as below.
<spec name = "sdk_dir" value = "/opt/platforms/xilinx-zynqmp-common-v2021.1/ir/environment-setup-cortexa72-cortexa53-xilinx-linux" />
You can create the ZCU102 Base DFX platform from the Vitis Embedded Platform Source repo (2021.1 branch). We slightly modified the floorplanning of the ZCU102 Base DFX platform to reserve more area for the dynamic region. This can be done by replacing this file with our modified .xdc file. You can follow the instructions to generate the ZCU102 DFX platform. For instance,
cd ./Xilinx_Official_Platforms/xilinx_zcu102_base_dfx/
source /PETALINUX_DIR/petalinux/2021.1/settings.sh
make all
Once you have successfully generated the ZCU102 DFX platform, place it in a directory of your choice (e.g. /opt/platforms/), and adjust the configurations in ./common/configure/zcu102/configure.xml as below.
<spec name = "PLATFORM_REPO_PATHS" value= "/opt/platforms/xilinx_zcu102_base_dfx_202110_1" />
<spec name = "ROOTFS" value = "/opt/platforms/xilinx_zcu102_base_dfx_202110_1/sw/xilinx_zcu102_base_dfx_202110_1/xrt/filesystem" />
<spec name = "PLATFORM" value = "xilinx_zcu102_base_dfx_202110_1" />
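Because several absolute paths are configured by hand across these configure.xml files, a quick sanity check can catch a typo before a multi-hour build. The helper below is a hypothetical sketch (not part of the repo); the paths are just the examples used above.

```shell
#!/bin/sh
# Hypothetical helper (not part of the repo): verify that the hand-configured
# paths from configure.xml exist before kicking off a long build.
check_path() {
  if [ -e "$1" ]; then
    echo "OK: $1"
  else
    echo "MISSING: $1"
  fi
}

check_path /tools/Xilinx/Vitis/2021.1/settings64.sh
check_path /opt/platforms/xilinx_zcu102_base_dfx_202110_1
```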
As stated in the Xilinx user guide for DFX, Nested DFX does not allow more than one RP to be subdivided until the first RP has a placed and routed design. This means we need a series of subdivisions, each followed by place and route. Therefore, we first subdivide the single RP from the ZCU102 Base DFX platform into 7 child RPs: p2 (double page), p4, p8, p12, p16, p20 (quad pages), and p_NoC (a pblock for the NoC).
After the first subdivision, we have a routed design that looks like below.
Then, we subdivide p2 and place/route. The routed design after this step looks like below.
Open this design and subdivide p4 (quad page) into two double pages (p4_p0 and p4_p1). Place and route the design.
The subdivisions followed by place/route continue until we subdivide all the large pages into single pages. The final static design looks like below.
From the final routed design, we recombine the child pblocks to generate 'intermediate bitstreams' and abstract shells for each page. An intermediate bitstream is a bitstream such as p4_p1_subdivide.bit. When you want to load a bitstream onto a single page, p4_p1_p0 or p4_p1_p1, you first need to load the associated parent recombined bitstreams to properly set up the context.
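The parent-first load order can be sketched as a small shell function. This is an illustrative sketch only; the *_subdivide.bit naming follows the p4_p1_subdivide.bit example above, and the actual loading would be done by a tool such as fpgautil on the board.

```shell
#!/bin/sh
# Hypothetical sketch: given a page name such as p4_p1_p0, print the bitstream
# load order -- parent recombined ("intermediate") bitstreams first, then the
# page's own partial bitstream.
load_order() {
  page=$1
  chain=""
  while [ "${page%_*}" != "$page" ]; do
    page=${page%_*}
    chain="$page $chain"            # prepend so outermost parents come first
  done
  for p in $chain; do
    echo "${p}_subdivide.bit"       # intermediate (recombined) bitstream
  done
  echo "$1.bit"                     # finally the single page's bitstream
}

load_order p4_p1_p0
```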
It is important that all the bitstreams and abstract shells are generated from the same routed design; this way, partial bitstreams generated with abstract shells are compatible with each other.
Finally, our framework generates all the utilization reports, excludes the blocked resources, and outputs a file that contains the information on each PR page's available resources.
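The last step above (turning utilization reports into per-page resource availability) could be sketched as below. The report layout and file name here are simplified stand-ins for a real Vivado report_utilization output, not the framework's actual format.

```shell
#!/bin/sh
# Hypothetical sketch: pull the "Used" column for CLB LUTs out of a per-page
# utilization report. The table below imitates a Vivado report_utilization
# table but is not the real layout.
cat > page_util.rpt <<'EOF'
| Site Type     | Used | Fixed | Available | Util% |
| CLB LUTs      | 1280 |     0 |     27600 |  4.64 |
| CLB Registers | 2100 |     0 |     55200 |  3.80 |
EOF

# Column 3 (between the 2nd and 3rd '|') holds the Used count.
awk -F'|' '/CLB LUTs/ { gsub(/ /, "", $3); print $3 }' page_util.rpt
```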
To create a static design (overlay) for the PR pages, simply run the command below in your /<PROJECT_DIR>/:
make overlay -j$(nproc)
Note that this process can take more than 4 hours depending on the system CPU/RAM because of the sequential characteristic of the Xilinx Nested DFX technology.
When generating an overlay, you will likely encounter an ERROR: [DRC RTSTAT-5] Partial antennas, like below. We consider this a potential bug in Vivado.
In this case, cd to /<PROJECT_DIR>/workspace/F001_overlay/ydma/zcu102/zcu102_dfx_manual/ and open the Vivado GUI with
vivado &
In the Tcl console, manually copy and paste the contents of the scripts that encountered the errors, as shown below.
With the given floorplanning (*.xdc files), the scripts that cause this error are:
/<PROJECT_DIR>/workspace/F001_overlay/ydma/zcu102/zcu102_dfx_manual/tcl/nested/pr_recombine_dynamic_region.tcl
/<PROJECT_DIR>/workspace/F001_overlay/ydma/zcu102/zcu102_dfx_manual/tcl/nested/pr_recombine_p8.tcl
Once you have manually generated dynamic_region.bit and p8.dcp, cd to the /<PROJECT_DIR>/workspace/F001_overlay/ydma/zcu102/zcu102_dfx_manual/ directory and continue the Makefile by entering
make all -j$(nproc)
Then, in the same directory, run the rest of the commands in /<PROJECT_DIR>/workspace/F001_overlay/run.sh that had not yet run. For instance, copy and paste the lines below into the terminal.
./shell/run_xclbin.sh
cd ../../../
cp -r ./ydma/zcu102/package ./ydma/zcu102/zcu102_dfx_manual/overlay_p23/
cp ./ydma/zcu102/zcu102_dfx_manual/overlay_p23/*.xclbin ./ydma/zcu102/zcu102_dfx_manual/overlay_p23/package/sd_card
mv parse_ovly_util.py ./ydma/zcu102/zcu102_dfx_manual/overlay_p23/
mv get_blocked_resources.py ./ydma/zcu102/zcu102_dfx_manual/overlay_p23/
cd ./ydma/zcu102/zcu102_dfx_manual/overlay_p23/ && python get_blocked_resources.py
python parse_ovly_util.py
This concludes overlay generation and creates the /<PROJECT_DIR>/workspace/F001_overlay/ directory.
You are now ready to separately compile operators in parallel on PR pages of different sizes!
If you are interested in Nested DFX, please take a look at Setting PR Hierarchy in Vivado.
cd to /<PROJECT_DIR>/ and, in the Makefile, set prj_name to the benchmark of your choice. Then run:
make all -j$(nproc)
This will run HLS and Vivado synthesis for each operator in parallel. The flow synchronizes after synthesis and, based on the resource utilization estimates, assigns an appropriate page (single, double, or quad) to each operator and launches the implementations to generate partial bitstreams.
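The page-assignment rule after synthesis can be illustrated as "pick the smallest page whose budget fits the operator's utilization estimate". The sketch below is hypothetical; the LUT capacity numbers are illustrative, not the framework's real per-page budgets.

```shell
#!/bin/sh
# Hypothetical sketch of the page-assignment step: choose the smallest page
# type whose (assumed) LUT budget fits the operator's synthesis estimate.
assign_page() {
  luts=$1
  if   [ "$luts" -le 10000 ]; then echo single   # fits one single page
  elif [ "$luts" -le 20000 ]; then echo double   # needs two recombined pages
  elif [ "$luts" -le 40000 ]; then echo quad     # needs four recombined pages
  else
    echo "operator too large for any page" >&2
    return 1
  fi
}

assign_page 8500
assign_page 31000
```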
For the Optical Flow (96, mix) benchmark, 10 operators are compiled separately in parallel. Note that 9 operators are mapped onto single pages, and one operator is mapped onto a quad page (the bottom right).
After all parallel compile jobs are done, run the command below in /<PROJECT_DIR>/ to print the compile time measurements:
make report
The HLS, synthesis, and implementation steps for each operator are measured, and the times are recorded in log files. After all runs have finished, we parse these log files to report the compile time. For a comparison against the monolithic Vitis flow, refer to Rosetta Benchmark on ZCU102.
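A minimal sketch of this kind of log parsing is shown below. The CSV format (step, start epoch, end epoch) and the file name op3_time.log are assumptions for illustration, not the repo's actual log layout.

```shell
#!/bin/sh
# Hypothetical sketch of "make report"-style parsing: each compile job logs
# per-step start/end timestamps (seconds), and we print per-step durations
# plus a total. The log format is an assumption, not the repo's.
cat > op3_time.log <<'EOF'
hls,1000,1360
synth,1360,1900
impl,1900,3700
EOF

awk -F, '{ total += $3 - $2; printf "%s: %d s\n", $1, $3 - $2 }
         END { printf "total: %d s\n", total }' op3_time.log
```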
Once you have successfully generated the separate .xclbin files and the host executable in the /<PROJECT_DIR>/workspace/F005_bits_optical_flow512_96_final/sd_card/ directory:
- Use GParted to prepare an SD card to boot the ZCU102 board.
- Copy the boot files to BOOT:
cp /<PROJECT_DIR>/workspace/F001_overlay/ydma/zcu102/zcu102_dfx_manual/overlay_p23/package/sd_card/* /media/<YOUR_ACCOUNT>/BOOT/
- Copy the rootfs files to rootfs. For instance:
sudo tar -zxvf /opt/platforms/xilinx_zcu102_base_dfx_202110_1/sw/xilinx_zcu102_base_dfx_202110_1/xrt/filesystem/rootfs.tar.gz -C /media/<YOUR_ACCOUNT>/rootfs/
- Safely unplug the SD card from the workstation and slide it into the ZCU102. Power on the device.
- You can refer to this post to set up the IP addresses for the workstation and the ZCU102.
- scp the generated .xclbin files and the host executable, as below. Note that for the optical flow benchmark, you also need to scp the current directory. The spam filter benchmark needs data, and the digit recognition benchmark needs 196data. Add -i ~/.ssh/id_rsa_zcu102 if you use a key to ssh to the board.
scp -r ./workspace/F005_bits_optical_flow_96_final/sd_card/* <user>@<board_ip>:/media/sd-mmcblk0p1/
- ssh to the ZCU102 and cd to /media/sd-mmcblk0p1/. Run the application:
./run_app.sh
As the application latency, we include the time to set the kernel arguments, the time to transfer data to/from the device, and the kernel execution time. It is 8808 us for Optical Flow (96, mix), as shown below.
If you take a close look at our code,
- we are 'unnecessarily' generating duplicate page_double_subdivide_*.dcp or page_quad_subdivide_*.dcp checkpoints in /<PROJECT_DIR>/workspace/F001_overlay/ydma/zcu102/zcu102_dfx_manual/overlay_p23/subdivide
- and the Verilog source files for the designs above (e.g. /<PROJECT_DIR>/workspace/F001_overlay/ydma/zcu102/zcu102_dfx_manual/p_d_s_p2.v) contain a dummy register.
In the course of generating the single pages, we read the same synthesized quad-page and double-page checkpoints. We found that Vivado sometimes does not recognize the hierarchy of the pages. For instance, p4 is the parent pblock of p4_p0 and p4_p1, and p4_p0 is in turn the parent pblock of p4_p0_p0 and p4_p0_p1, yet Vivado occasionally fails to recognize this relationship.
We managed to resolve this issue by creating a separate synthesized checkpoint, such as page_double_subdivide_*.dcp or page_quad_subdivide_*.dcp, for each double page and quad page to be subdivided.
Furthermore, although our double page consists of two single pages and nothing else other than some routing to those single pages, we place a dummy register in it. With these measures, Vivado appears to understand the PR hierarchy correctly.
It is true that the Xilinx user guide and PR tutorial contain no example design with multiple parent RPs. If you can share your experience, it would be really helpful!