title | author | date |
---|---|---|
saxpy | Xin Wu (PC²) | 05.04.2020 |

`saxpy` performs the `saxpy` operation on the host as well as on the accelerator. The performance (in MB/s) of the different implementations is also compared.

The `saxpy` operation is defined as:

$$y := a \cdot x + y$$

where:

- `a` is a scalar.
- `x` and `y` are single-precision vectors, each with `n` elements.
- For testing, `n` is assumed to be $2^{26}$.

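The definition above can be written as a plain C loop. The following is a minimal sketch for illustration only; the function name `saxpy_ref` and its signature are hypothetical and are not taken from `hsaxpy.c`.

```c
#include <assert.h>
#include <stddef.h>

/* Reference saxpy (hypothetical helper): y[i] = a * x[i] + y[i] for i in [0, n). */
void saxpy_ref(size_t n, float a, const float *x, float *y)
{
    /* Empty vectors are allowed; otherwise both pointers must be valid. */
    assert(n == 0 || (x != NULL && y != NULL));

    for (size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

Each element is read from `x` and `y` once and written to `y` once, so the operation is limited by memory traffic rather than by arithmetic.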
The following tables summarize only the most important points. For more details on the `ial`-th implementation, see the comments in `hsaxpy.c` (on host) and `asaxpy.c` (on accelerator).

- on host

ial | Remarks |
---|---|
0 | naive implementation |
1 | saxpy in MKL |

- on accelerator

ial | Remarks |
---|---|
0 | <<<2^0 , 2^0 >>>, TOO SLOW! not tested |
1 | <<<2^0 , 2^7 >>>, auto scheduling |
2 | <<<2^7 , 2^0 >>>, auto scheduling |
3 | <<<2^7 , 2^7 >>>, auto scheduling |
4 | <<<2^16, 2^10>>>, manual scheduling |
5 | <<<2^15, 2^7 >>>, manual scheduling, 16x loop unrolling (2^15 * 2^7 * 16 == 2^26) |
6 | <<<2^12, 2^7 >>>, auto scheduling, 16x loop unrolling |
7 | de-linearizing the vector gives slightly better performance than CUBLAS |
8 | cublasSaxpy in CUBLAS |
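
The identity 2^15 * 2^7 * 16 == 2^26 in row 5 says that the number of teams times the number of threads per team times the unrolling factor covers all `n` elements exactly once. The sketch below emulates such a launch serially on the host to make the index arithmetic visible. It is an assumption-laden illustration, not the code in `asaxpy.c`: the name `saxpy_unrolled`, the strided element-to-thread mapping, and the `UNROLL` macro are all hypothetical, and the 16 updates are written as a short fixed-count loop instead of being spelled out by hand.

```c
#include <assert.h>
#include <stddef.h>

#define UNROLL 16 /* elements handled per emulated thread (assumed mapping) */

/*
 * Serial emulation of a <<<nteams, nthreads>>> launch with manual
 * scheduling. Requires nteams * nthreads * UNROLL == n, so that every
 * element is updated exactly once.
 */
void saxpy_unrolled(size_t nteams, size_t nthreads, size_t n,
                    float a, const float *x, float *y)
{
    const size_t stride = nteams * nthreads; /* total number of threads */
    assert(stride * UNROLL == n);

    for (size_t team = 0; team < nteams; ++team) {
        for (size_t thread = 0; thread < nthreads; ++thread) {
            const size_t id = team * nthreads + thread; /* global thread id */

            /* Each thread updates UNROLL elements, one stride apart. */
            for (size_t k = 0; k < UNROLL; ++k) {
                const size_t i = id + k * stride;
                y[i] = a * x[i] + y[i];
            }
        }
    }
}
```

With `nteams = 2^15` and `nthreads = 2^7`, the assertion holds for `n = 2^26`. Neighbouring threads touch neighbouring elements in each of the 16 steps, which is one common way to keep accelerator memory accesses contiguous; whether the original implementation uses this layout is not stated in the table.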

`autoreconf -i; ./configure; make; make check;`

`make check` has been tested on OCuLUS (with OpenCCS) and P53s (without OpenCCS).

- `docs/html/index.html`: Source code documentation generated by Doxygen.
- `docs/UserManual.md`: User Manual.