OpenCL-FPGA-examples

These examples are used and discussed in the tutorial "M02 OpenCL design flows for Intel and Xilinx FPGAs - common optimization strategies, design patterns and vendor-specific differences" (https://www.date-conference.com/conference/tutorial-m02) at the DATE 19 Conference.

Target compilers

  • Intel FPGA SDK for OpenCL 18.1.1, aocx
  • Xilinx SDx 18.3, SDAccel feature with OpenCL, xocc

Makefile

The Makefile allows easy generation of reports and FPGA binaries using the following targets:

make reportIntel-<design_name>
make reportXilinx-<design_name>
make buildIntel-<design_name>
make buildXilinx-<design_name>

Design files

macros.h

Common header file that enables portable use of pipes and channels across both compilers.

Example 1: vector scale

  • vscale1_vec.cl: scales an input vector in chunks of 16 elements using the OpenCL float16 data type
  • vscale2_u.cl: applies automatic unrolling to achieve 16x parallelism; this requires a loop epilogue, which xocc does not generate
  • vscale3_u16.cl: applies automatic unrolling to achieve 16x parallelism without requiring a loop epilogue; functionally identical only if the size is a multiple of 16
  • vscale4_u16_epi.cl: applies automatic unrolling to achieve 16x parallelism, with a manually formulated loop epilogue
  • vscale5_short.cl: scales an input vector in chunks of 16 elements using the OpenCL short16 data type; demonstrates that short multiplications fit into a single DSP on Xilinx Kintex UltraScale
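
The difference between the variants with and without a loop epilogue can be illustrated with a minimal host-side sketch in plain C. This is not the repository's kernel code: the function names `vscale_blocks` and `vscale_epilogue` and the constant `UNROLL` are hypothetical. The sketch also assumes that a kernel without an epilogue simply skips the tail elements, which may differ from what the actual kernels do.

```c
#include <assert.h>
#include <stddef.h>

#define UNROLL 16 /* parallelism factor used throughout Example 1 */

/* Main loop only (hypothetical model): scales the largest prefix whose
 * length is a multiple of UNROLL and returns the number of elements
 * processed. Tail elements are left untouched in this sketch. */
size_t vscale_blocks(const float *x, float *y, float a, size_t n)
{
    assert(x != NULL && y != NULL);
    size_t full = n - (n % UNROLL);
    for (size_t i = 0; i < full; i += UNROLL) {
        /* Fixed trip count, so a compiler can fully unroll this loop. */
        for (size_t j = 0; j < UNROLL; j++) {
            y[i + j] = a * x[i + j];
        }
    }
    return full;
}

/* Main loop plus a manually written epilogue that handles the
 * remaining n % UNROLL elements one at a time. */
void vscale_epilogue(const float *x, float *y, float a, size_t n)
{
    size_t done = vscale_blocks(x, y, a, n);
    for (size_t i = done; i < n; i++) {
        y[i] = a * x[i];
    }
}
```

For a vector of 20 elements, `vscale_blocks` processes only the first 16, while `vscale_epilogue` covers all 20. For sizes that are multiples of 16 the two produce the same result.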

Example 2: SAXPY

  • SAXPY1.cl: direct implementation of the BLAS level 1 routine; requires one global write and two global read interfaces
  • SAXPY2_block.cl: processes the routine in blocks of 1024, with two read loops and one compute/write-back loop
  • SAXPY3_ivdep.cl: adds the ivdep pragma to the blockwise processing to demonstrate the formation of outer loop pipelining by aocx
  • SAXPY4_dataflow.cl: adds the dataflow attribute; the xocc design still suffers from a lack of gmem ports
  • SAXPY5_streaming.cl: separation into two kernels connected by pipes; proper pipelining is possible with both compilers
  • SAXPY6_streaming16.cl: separation into two kernels connected by pipes, using the float16 data type; asymptotic throughput of 16 elements per cycle with both compilers
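
The blockwise structure described for SAXPY2_block.cl (two read loops, then one compute/write-back loop per block of 1024) can be sketched in plain host-side C as follows. This is an illustrative model under stated assumptions, not the repository's kernel: the function name `saxpy_blocked`, the constant `BLOCK`, and the handling of a partial last block are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK 1024 /* block size named in the SAXPY2_block.cl description */

/* Computes y = a * x + y block by block, staging each block in local
 * buffers (which would correspond to on-chip memory in a kernel). */
void saxpy_blocked(float a, const float *x, float *y, size_t n)
{
    assert(x != NULL && y != NULL);
    float xbuf[BLOCK];
    float ybuf[BLOCK];

    for (size_t base = 0; base < n; base += BLOCK) {
        /* Assumption: the last block may be shorter than BLOCK. */
        size_t len = (n - base < BLOCK) ? (n - base) : BLOCK;

        /* Read loop 1: load a block of x. */
        for (size_t i = 0; i < len; i++) {
            xbuf[i] = x[base + i];
        }
        /* Read loop 2: load a block of y. */
        for (size_t i = 0; i < len; i++) {
            ybuf[i] = y[base + i];
        }
        /* Compute and write-back loop. */
        for (size_t i = 0; i < len; i++) {
            y[base + i] = a * xbuf[i] + ybuf[i];
        }
    }
}
```

Separating the reads from the compute/write-back loop gives each loop a single global memory access pattern, which is the starting point that the later ivdep, dataflow and streaming variants build on.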

Vendor Matrix Multiplication designs

The individual licenses stated in the files apply.