05_saxpy

| title | author | date |
|-------|--------|------|
| saxpy | Xin Wu (PC²) | 05.04.2020 |

`saxpy` performs the saxpy operation on the host as well as on the accelerator, and compares the performance (in MB/s) of the different implementations.
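As a rough illustration of what a figure in MB/s could mean here, the sketch below converts a measured run time into an effective bandwidth. It assumes that one saxpy call on vectors of `n` single-precision elements moves `3 * n` floats (read `x`, read `y`, write `y`) and that 1 MB is `1e6` bytes. Both are assumptions made for this example, and the helper name `saxpy_bandwidth_mbs` is hypothetical; the repository may count bytes differently.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not taken from the repository.
 * Effective bandwidth in MB/s for one saxpy call, assuming that
 * 3 * n floats are moved (read x, read y, write y) and 1 MB = 1e6 bytes. */
double saxpy_bandwidth_mbs(size_t n, double seconds)
{
    assert(seconds > 0.0);
    double bytes = 3.0 * (double)n * (double)sizeof(float);
    return bytes / seconds / 1.0e6;
}
```

Calling `saxpy_bandwidth_mbs(n, t)` with the vector length and the wall-clock time of one call (in seconds) gives a number that can be compared across implementations, as long as the same byte count is used for all of them.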

The saxpy operation is defined as:

$y := a * x + y$

where:

- `a` is a scalar.
- `x` and `y` are single-precision vectors, each with `n` elements.
- For testing, `n` is assumed to be $2^{26}$.

The following tables summarize only the most important points. For more details on the `ial`-th implementation, see the comments in `hsaxpy.c` (on host) and `asaxpy.c` (on accelerator).

- on host

| ial | Remarks |
|-----|---------|
| 0 | naive implementation |
| 1 | saxpy in MKL |
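A naive host implementation of $y := a * x + y$ is a single loop over the `n` elements. The following is a minimal sketch of that idea, not a copy of the code in `hsaxpy.c`; the function name `saxpy_naive` and its signature are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Naive host saxpy: y[i] = a * x[i] + y[i] for i in [0, n).
 * The name and signature are illustrative, not taken from hsaxpy.c. */
void saxpy_naive(size_t n, float a, const float *x, float *y)
{
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

For example, with `a = 2`, `x = {1, 2, 3, 4}` and `y = {10, 20, 30, 40}`, the call `saxpy_naive(4, 2.0f, x, y)` leaves `y = {12, 24, 36, 48}`.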

- on accelerator

| ial | Remarks |
|-----|---------|
| 0 | `<<<2^0, 2^0>>>`, TOO SLOW! not tested |
| 1 | `<<<2^0, 2^7>>>`, auto scheduling |
| 2 | `<<<2^7, 2^0>>>`, auto scheduling |
| 3 | `<<<2^7, 2^7>>>`, auto scheduling |
| 4 | `<<<2^16, 2^10>>>`, manual scheduling |
| 5 | `<<<2^15, 2^7>>>`, manual scheduling, 16x loop unrolling ($2^{15} \cdot 2^{7} \cdot 16 = 2^{26}$) |
| 6 | `<<<2^12, 2^7>>>`, auto scheduling, 16x loop unrolling |
| 7 | de-linearizing the vector gives slightly better performance than cuBLAS |
| 8 | `cublasSaxpy` in cuBLAS |
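The difference between "auto scheduling" and "manual scheduling" with unrolling can be sketched with OpenMP target directives. This sketch assumes that `<<<teams, threads>>>` denotes the number of teams and the number of threads per team of an OpenMP target region; that reading, the function names, and the exact clauses are assumptions made for illustration and are not taken from `asaxpy.c`. Without OpenMP support the pragmas are ignored and the loops run serially on the host.

```c
#include <assert.h>
#include <stddef.h>

/* "Auto scheduling" sketch: the OpenMP runtime distributes the iterations
 * over 2^7 teams with at most 2^7 threads each (compare ial = 3). */
void saxpy_target_auto(int n, float a, const float *x, float *y)
{
    assert(n > 0 && x != NULL && y != NULL);
    #pragma omp target teams distribute parallel for \
        num_teams(128) thread_limit(128) \
        map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}

/* "Manual scheduling" sketch with 16 elements per thread (compare ial = 5):
 * teams * threads * 16 == n, so n = 2^26 with 2^7 threads gives 2^15 teams.
 * The 16 updates are written as a short fixed-length loop here; a
 * hand-unrolled version would spell out the 16 statements. */
void saxpy_target_manual16(int n, float a, const float *x, float *y)
{
    const int nthreads = 128;              /* threads per team, 2^7 */
    const int unroll = 16;                 /* elements per thread   */
    const int chunk = nthreads * unroll;   /* elements per team     */
    assert(n > 0 && n % chunk == 0);
    assert(x != NULL && y != NULL);
    const int nteams = n / chunk;

    #pragma omp target teams distribute \
        num_teams(nteams) thread_limit(nthreads) \
        map(to: x[0:n]) map(tofrom: y[0:n])
    for (int team = 0; team < nteams; ++team) {
        #pragma omp parallel for
        for (int t = 0; t < nthreads; ++t) {
            const int base = team * chunk + t;
            /* Stride by nthreads so that neighboring threads touch
             * neighboring elements in each of the 16 steps. */
            for (int k = 0; k < unroll; ++k) {
                const int i = base + k * nthreads;
                y[i] = a * x[i] + y[i];
            }
        }
    }
}
```

Both functions compute the same result. In the manual variant the mapping from team and thread index to vector elements is fixed by the programmer, which is why the grid size has to satisfy $2^{15} \cdot 2^{7} \cdot 16 = 2^{26}$ for the test size.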

Build and test with: `autoreconf -i; ./configure; make; make check;`

make check has been tested on OCuLUS (with OpenCCS) and P53s (without OpenCCS).