This repository contains the LINPACK for FPGA and its OpenCL kernels. Currently only the Intel FPGA SDK for OpenCL utility is supported.
CMake is used as the build system.
The targets below can be used to build the benchmark and its kernels, where VENDOR
can be
intel
or xilinx
:
Target | Description |
---|---|
Linpack_VENDOR |
Builds the host application linking with the Intel SDK |
Linpack_test_VENDOR |
Compile the tests and its dependencies linking with the Intel SDK |
Moreover, there are additional targets to generate kernel reports and bitstreams. The provided kernel is optimized for the Bittware 520N board equipped with Stratix 10. Only the LU facotrization without pivoting is implemented on FPGA and external channels are used to calculate the solution in a 2D torus of FPGAs.
The kernel targets are listed below. COMM_TYPE
can be IEC for Intel external channel (only available for vendor Intel) and PCIE for communication via PCIe and MPI.
Target | Description |
---|---|
hpl_torus_COMM_TYPE _VENDOR |
Synthesizes the kernel (takes several hours!) |
hpl_torus_COMM_TYPE reportVENDOR |
Just compile kernel and create reports |
hpl_torus_COMM_TYPE emulateVENDOR |
Create a n emulation kernel |
You can build for example the host application by running
mkdir build && cd build
cmake ..
make Linpack_intel
A whole emulation build can be done with
mkdir build && cd build
cmake ..
make Linpack_intel
This will compile the host code as well a unit test binary with emulation kernels for functionality testing.
You will find all executables and kernel files in the bin
folder of your build directory.
Next to the common configuration options given in the README of the benchmark suite you might want to specify the following additional options before build:
Name | Default | Description |
---|---|---|
DEFAULT_MATRIX_SIZE |
1024 | Width and heigth of the input matrix |
REGISTER_BLOCK_LOG |
3 | Size of the blocks that will be processed in registers (2^3=8 is the default) |
LOCAL_MEM_BLOCK_LOG |
5 | Size of the blocks that will be processed in local memory (2^3=8 is the default) |
DATA_TYPE |
float | Used data type. Can be float or double |
Moreover the environment variable INTELFPGAOCLSDKROOT
has to be set to the root
of the Intel FPGA SDK installation.
Additionally it is possible to set the used compiler and other build tools
in the CMakeCache.txt
located in the build directory after running cmake.
For execution of the benchmark run:
./Linpack_intel -f path_to_kernel.aocx
For more information on available input parameters run
./Linpack_intel -h
Implementation of the LINPACK benchmark proposed in the HPCC benchmark suite for FPGA.
Version: 2.6
MPI Version: 3.1
Config. Time: Thu Mar 31 11:19:26 UTC 2022
Git Commit: 2c3b1c6
Usage:
bin/Linpack_intel [OPTION...]
-f, --file arg Kernel file name
-n, arg Number of repetitions (default: 10)
-i, Use memory Interleaving
--skip-validation Skip the validation of the output data. This will
speed up execution and helps when working with special
data types.
--device arg Index of the device that has to be used. If not
given you will be asked which device to use if there are
multiple devices available. (default: -1)
--platform arg Index of the platform that has to be used. If not
given you will be asked which platform to use if there
are multiple platforms available. (default: -1)
-r, arg Number of used kernel replications (default: 5)
--comm-type arg Used communication type for inter-FPGA communication
(default: AUTO)
--test Only test given configuration and skip execution and
validation
-h, --help Print this help
-m, arg Global matrix size in number of blocks in one
dimension. Local matrix sizes will be determined by PQ
grid. (default: 1024)
-b, arg Log2 of the block size in number of values in one
dimension (default: 9)
-p, arg Width of the FPGA grid. The heigth (Q) will be
calculated from mpi_size / P. (default: 1)
--uniform Generate a uniform matrix instead of a diagonally
dominant. This has to be supported by the FPGA kernel!
--emulation Use kernel arguments for emulation. This may be
necessary to simulate persistent local memory on the FPGA
Available options for --comm-type
:
IEC
: Intel external channels are used by the kernels for communication.PCIE
: PCIe and MPI are used to exchange data between FPGAs over the CPU.
To execute the unit and integration tests for Intel devices run
./Linpack_test_intel -f KERNEL_FILE_NAME
in the bin
folder within the build directory.
It will run an emulation of the kernel and execute some functionality tests.
The host code will print the results of the execution to the standard output. The result summary looks similar to this:
norm. residual res. error mach. eps
4.35451e-03 5.96046e-07 1.19209e-07
Method best mean GFLOPS
total 1.12152e-01 s 1.16113e-01 s 2.13045e-04 GFLOP/s
GEFA 1.12152e-01 s 1.16113e-01 s 1.94784e-04 GFLOP/s
GESL 2.00000e-08 s 3.97000e-08 s 1.02400e+02 GFLOP/s
The first row contains data from the correctness check that is done once when executing the benchmark:
resid
: The maximum residual error when multiplying the result vector with the matrix and subtract by the expected result.norm. resid
: The normalized residual error based onresid
.machep
: machine epsilon that gives an upper bound for rounding errors due to the used floating point format.
The table below contains the performance measurements for the bechmark for the both routines GEFA and GESL. Only GEFA is implemented on FPGA, so only this result is significant for now. GESL measurement is currently disabled and does not show any valid results since execution is done on the host during validation! The columns of the table contain the following information:
best
: The best measured time for executing the benchmark in seconds.mean
: The arithmetic mean of all measured execution times in seconds.GFLOPS
: GFLOP/s achieved for the calculation using the best measured time.
The last row of the output will always contain Validation: SUCCESS!
, if the norm. residual is below 1.
This will be interpreted as successful validation.
In this case, the executable will return 0 as exit code, 1 otherwise.
The json output looks like the following.
{
"config_time": "Wed Dec 14 08:41:58 UTC 2022",
"device": "Intel(R) FPGA Emulation Device",
"environment": {
"LD_LIBRARY_PATH": "/opt/software/pc2/EB-SW/software/Python/3.9.5-GCCcore-10.3.0/lib:/opt/software/pc2/EB-SW/software/libffi/3.3-GCCcore-10.3.0/lib64:/opt/software/pc2/EB-SW/software/GMP/6.2.1-GCCcore-10.3.0/lib:/opt/software/pc2/EB-SW/software/SQLite/3.35.4-GCCcore-10.3.0/lib:/opt/software/pc2/EB-SW/software/Tcl/8.6.11-GCCcore-10.3.0/lib:/opt/software/pc2/EB-SW/software/libreadline/8.1-GCCcore-10.3.0/lib:/opt/software/pc2/EB-SW/software/libarchive/3.5.1-GCCcore-10.3.0/lib:/opt/software/pc2/EB-SW/software/cURL/7.76.0-GCCcore-10.3.0/lib:/opt/software/pc2/EB-SW/software/bzip2/1.0.8-GCCcore-10.3.0/lib:/opt/software/pc2/EB-SW/software/ncurses/6.2-GCCcore-10.3.0/lib:/opt/software/pc2/EB-SW/software/ScaLAPACK/2.1.0-gompi-2021a-fb/lib:/opt/software/pc2/EB-SW/software/FFTW/3.3.9-gompi-2021a/lib:/opt/software/pc2/EB-SW/software/FlexiBLAS/3.0.4-GCC-10.3.0/lib:/opt/software/pc2/EB-SW/software/OpenBLAS/0.3.15-GCC-10.3.0/lib:/opt/software/pc2/EB-SW/software/OpenMPI/4.1.1-GCC-10.3.0/lib:/opt/software/pc2/EB-SW/software/PMIx/3.2.3-GCCcore-10.3.0/lib:/opt/software/pc2/EB-SW/software/libfabric/1.12.1-GCCcore-10.3.0/lib:/opt/software/pc2/EB-SW/software/UCX/1.10.0-GCCcore-10.3.0/lib:/opt/software/pc2/EB-SW/software/libevent/2.1.12-GCCcore-10.3.0/lib:/opt/software/pc2/EB-SW/software/OpenSSL/1.1/lib:/opt/software/pc2/EB-SW/software/hwloc/2.4.1-GCCcore-10.3.0/lib:/opt/software/pc2/EB-SW/software/libpciaccess/0.16-GCCcore-10.3.0/lib:/opt/software/pc2/EB-SW/software/libxml2/2.9.10-GCCcore-10.3.0/lib:/opt/software/pc2/EB-SW/software/XZ/5.2.5-GCCcore-10.3.0/lib:/opt/software/pc2/EB-SW/software/numactl/2.0.14-GCCcore-10.3.0/lib:/opt/software/pc2/EB-SW/software/binutils/2.36.1-GCCcore-10.3.0/lib:/opt/software/pc2/EB-SW/software/zlib/1.2.11-GCCcore-10.3.0/lib:/opt/software/pc2/EB-SW/software/GCCcore/10.3.0/lib64:/opt/software/slurm/21.08.6/lib:/opt/software/FPGA/IntelFPGA/opencl_sdk/21.2.0/hld/host/linux64/lib:/opt/software/FPGA/IntelFPGA/opencl_sdk/20.4.0/hld/board/bittware_pcie/s10/linux64/lib"
},
"errors": {
"epsilon": 1.1920928955078125e-07,
"residual": 5.960464477539062e-07,
"residual_norm": 0.004354506590071576
},
"execution_time": "Wed Dec 14 09:20:49 UTC 2022",
"git_commit": "be1a4e9-dirty",
"mpi": {
"subversion": 1,
"version": 3
},
"name": "LINPACK",
"results": {
"gflops": {
"unit": "GFLOP/s",
"value": 0.0006047108051562395
},
"gflops_lu": {
"unit": "GFLOP/s",
"value": 0.0005528788702090362
},
"gflops_sl": {
"unit": "GFLOP/s",
"value": 68.26666666666668
},
"t_mean": {
"unit": "s",
"value": 0.041533081799999996
},
"t_min": {
"unit": "s",
"value": 0.039512
},
"tlu_mean": {
"unit": "s",
"value": 0.041533051599999996
},
"tlu_min": {
"unit": "s",
"value": 0.03951197
},
"tsl_mean": {
"unit": "s",
"value": 3.019999999999999e-08
},
"tsl_min": {
"unit": "s",
"value": 3e-08
}
},
"settings": {
"Block Size": 16,
"Communication Type": false,
"Data Type": false,
"Diagonally Dominant": true,
"Emulate": false,
"FPGA Torus": {
"P": 1,
"Q": 1
},
"Kernel File": false,
"Kernel Replications": 3,
"MPI Ranks": 1,
"Matrix Size": 32,
"Repetitions": 10,
"Test Mode": false
},
"timings": {
"gefa": [
{
"unit": "s",
"value": 0.040978706
},
{
"unit": "s",
"value": 0.041104108
},
{
"unit": "s",
"value": 0.040878394
},
{
"unit": "s",
"value": 0.040391036
},
{
"unit": "s",
"value": 0.044723132
},
{
"unit": "s",
"value": 0.03951197
},
{
"unit": "s",
"value": 0.043374308
},
{
"unit": "s",
"value": 0.04179909
},
{
"unit": "s",
"value": 0.041162129
},
{
"unit": "s",
"value": 0.041407643
}
],
"gesl": [
{
"unit": "s",
"value": 3e-08
},
{
"unit": "s",
"value": 3e-08
},
{
"unit": "s",
"value": 3e-08
},
{
"unit": "s",
"value": 3e-08
},
{
"unit": "s",
"value": 3e-08
},
{
"unit": "s",
"value": 3e-08
},
{
"unit": "s",
"value": 3.1e-08
},
{
"unit": "s",
"value": 3.1e-08
},
{
"unit": "s",
"value": 3e-08
},
{
"unit": "s",
"value": 3e-08
}
]
},
"validated": true,
"version": "2.6"
}