The methodology for developing optimized accelerated applications consists of two major phases: architecting the application, and developing the accelerator to meet your desired performance goals.
- In the first phase, you make key decisions about the application architecture by determining which software functions should be accelerated onto FPGA kernels, how much parallelism can be achieved, and how to deliver it in code.
- In the second phase, you implement kernels by structuring the source code, and applying the necessary compiler options and pragmas to create the kernel architecture needed to achieve the performance target.
You begin this tutorial with a baseline application, and profile it to examine the potential for hardware acceleration. The tutorial application performs a 2D convolution of an RGBA video and a set of filter coefficients using ffmpeg, a popular multimedia framework that can play, transcode, mux, demux, and filter many audio/video formats. Then, you perform various optimizations on both the host program and kernel side. In this tutorial, you will work with the following optimization techniques:
- Memory transfer optimizations
- Fixed point data type adoption
- Dataflow and streams
- Loop optimizations
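As a rough illustration of the computation being accelerated, the following is a minimal sketch of a 2D convolution over a single 8-bit channel of one frame. It is not the tutorial's source code: the function name `convolve_channel`, the zero-padding at the frame borders, and the row-major layout are assumptions made for this example. An RGBA frame would apply the same loop nest to each channel.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not from the tutorial sources): filters one 8-bit
// channel of a width x height frame with a square ksize x ksize filter.
// Assumptions: row-major storage, odd ksize, pixels outside the frame
// count as zero, and the filter is applied without flipping, as is
// common in image filtering.
std::vector<std::uint8_t> convolve_channel(const std::vector<std::uint8_t>& src,
                                           int width, int height,
                                           const std::vector<float>& coeffs,
                                           int ksize) {
    std::vector<std::uint8_t> dst(src.size(), 0);
    const int half = ksize / 2;
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float acc = 0.0f;
            // Accumulate the weighted neighborhood around (x, y).
            for (int ky = 0; ky < ksize; ++ky) {
                for (int kx = 0; kx < ksize; ++kx) {
                    const int sy = y + ky - half;
                    const int sx = x + kx - half;
                    if (sy < 0 || sy >= height || sx < 0 || sx >= width) {
                        continue;  // zero padding at the borders
                    }
                    acc += coeffs[ky * ksize + kx] * src[sy * width + sx];
                }
            }
            // Saturate to the 8-bit range, then round to nearest.
            const float clamped = std::min(255.0f, std::max(0.0f, acc));
            dst[y * width + x] = static_cast<std::uint8_t>(clamped + 0.5f);
        }
    }
    return dst;
}
```

Every output pixel requires ksize × ksize multiply-accumulate operations, and the four nested loops are independent across pixels, which is what makes this kind of workload a candidate for the memory, data type, dataflow, and loop optimizations listed above.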
This tutorial follows the guidance in the SDAccel Methodology Guide (UG1346) on migrating a CPU-based application to an optimized FPGA-accelerated design. For a deeper understanding, review that material as you work through this tutorial.
This tutorial requires the ffmpeg framework to be installed on the machine where the steps will be executed. You can install it by running the following commands.
- For CentOS:
  sudo yum localinstall --nogpgcheck https://download1.rpmfusion.org/free/el/rpmfusion-free-release-7.noarch.rpm
  sudo yum install ffmpeg
- For Ubuntu:
  sudo apt update
  sudo apt install ffmpeg
  ffmpeg -version
The labs in this tutorial use:
- BASH Linux shell commands.
- 2019.1 SDx release and the xilinx_u200_xdma_201830_1 platform. If necessary, the tutorial can be ported to other versions and platforms.
- A Makefile that is detailed and contains many steps and variables. For a discussion of the Makefile structure and contents, refer to Understanding the Makefile.
IMPORTANT:
- Before running any of the examples, make sure you have installed the Xilinx Runtime (XRT) and the SDAccel development environment as described in the SDAccel Development Environment Release Notes, Installation, and Licensing Guide (UG1238).
- If you run applications on the Alveo™ card, ensure the card and software drivers have been correctly installed by following the instructions in the Getting Started with Alveo Data Center Accelerator Cards Guide (UG1301).
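- To access the reference files, enter the following in a terminal:
  git clone http://github.com/Xilinx/SDAccel-Tutorials
- Navigate to SDAccel-Tutorials-master/docs/convolution-tutorial. (A plain git clone names the top-level directory SDAccel-Tutorials; the -master suffix appears when the repository is downloaded as a ZIP archive.)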
The following labs walk through best practices for taking an existing application and optimizing it as an FPGA-accelerated application. The tutorial is divided into labs that demonstrate the methodology step by step. Complete each lab before proceeding to the next.
- Evaluating the Original Application: In this lab, the original C-based application processes an input video to generate the convolution output video. This lab also discusses setting realistic performance goals for an accelerated application.
- Creating an SDAccel Application from the C Application: Convert the original C code into a host program and hardware kernel that is called by the host using the OpenCL™ API.
- Optimizing Memory Transfers: Learn methods for optimizing the hardware kernel for improved memory access. You will learn how to use a local cache to make efficient use of the FPGA bandwidth.
- Optimizing Using Fixed Point Data Types: Learn how data types affect design performance.
- Optimizing with Dataflow: Improve the compute efficiency of your kernel, applying dataflow and streaming to improve the data-path in your kernel.
- Using Out-of-Order Queues and Multiple Compute Units: Modify the OpenCL API calls in the host program to allow for out-of-order task execution, and increase parallelism on the accelerator by synthesizing multiple kernels to perform the work.
- Running the Accelerator in Hardware: All the previous steps have been run in Hardware Emulation mode. Here you run the application on the acceleration hardware.
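To make the fixed point lab more concrete, here is a minimal sketch of how floating point filter coefficients might be converted to a fixed point representation and applied with integer arithmetic only. It is an illustration under stated assumptions, not the tutorial's implementation: the helper names `to_fixed` and `weighted_sum_fixed` and the choice of 8 fractional bits (`kFracBits`) are hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumption for this sketch: coefficients use 8 fractional bits.
constexpr int kFracBits = 8;

// Hypothetical helper: scale a floating point coefficient by 2^kFracBits
// and round to the nearest integer.
std::int16_t to_fixed(float c) {
    return static_cast<std::int16_t>(std::lround(c * (1 << kFracBits)));
}

// Hypothetical helper: integer-only weighted sum of pixels, saturated to
// the 8-bit range. coeffs_q must hold one fixed point value per pixel.
std::uint8_t weighted_sum_fixed(const std::vector<std::uint8_t>& pixels,
                                const std::vector<std::int16_t>& coeffs_q) {
    std::int32_t acc = 0;
    for (std::size_t i = 0; i < pixels.size(); ++i) {
        acc += static_cast<std::int32_t>(coeffs_q[i]) * pixels[i];
    }
    if (acc < 0) {
        return 0;  // any negative sum saturates to zero
    }
    // Add half of one unit before shifting so the result rounds to nearest.
    acc = (acc + (1 << (kFracBits - 1))) >> kFracBits;
    if (acc > 255) {
        acc = 255;
    }
    return static_cast<std::uint8_t>(acc);
}
```

For example, with coefficients 0.5 and 0.25 stored as 128 and 64, the pixels 100 and 200 accumulate to 25600, which rounds and shifts back to 100. Integer multiply-accumulate logic of this kind generally needs fewer FPGA resources than floating point, at the cost of quantization error that depends on the number of fractional bits chosen.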
Copyright © 2019 Xilinx