- CPP_BLOCK 1 1 : plain C++ implementation (baseline)
- NEON 8 1 : NEON with loop unrolling factor 8, single thread
- NEON 8 8 : NEON with loop unrolling factor 8, 8 threads
- BLAS 1 1 : the combination of cblas_dgemv(), vDSP_vdivD(), and vDSP_vsbmD()

'BLAS 1 1' shows the best overall performance. 'NEON 8 8' also performs well.

- NEON 1 1 : NEON with no loop unrolling, single thread
- NEON 2 1 : NEON with loop unrolling factor 2, single thread
- NEON 4 1 : NEON with loop unrolling factor 4, single thread
- NEON 8 1 : NEON with loop unrolling factor 8, single thread

There is a clear benefit in using NEON intrinsics and explicit loop unrolling.
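
To illustrate what "NEON with loop unrolling factor 2" can look like, here is a minimal sketch. It is not the benchmarked kernel. It assumes a double-precision dot product as a stand-in (the inner loop of a row-major matrix-vector product), and the function name `dot_unrolled2` is hypothetical. On non-AArch64 targets the NEON part is compiled out and the scalar loop does all the work.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical stand-in kernel: dot product of two double vectors.
double dot_unrolled2(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size());
    const std::size_t n = a.size();
    const double* pa = a.data();
    const double* pb = b.data();
    std::size_t i = 0;
    double sum = 0.0;
#if defined(__aarch64__) && defined(__ARM_NEON)
    // Unrolling factor 2: two independent accumulators, each holding 2 doubles,
    // so consecutive fused multiply-adds do not wait on each other.
    float64x2_t acc0 = vdupq_n_f64(0.0);
    float64x2_t acc1 = vdupq_n_f64(0.0);
    for (; i + 4 <= n; i += 4) {
        acc0 = vfmaq_f64(acc0, vld1q_f64(pa + i), vld1q_f64(pb + i));
        acc1 = vfmaq_f64(acc1, vld1q_f64(pa + i + 2), vld1q_f64(pb + i + 2));
    }
    // Combine the accumulators and reduce the two lanes to a scalar.
    sum = vaddvq_f64(vaddq_f64(acc0, acc1));
#endif
    // Scalar tail (and the full loop when NEON is unavailable).
    for (; i < n; ++i) sum += pa[i] * pb[i];
    return sum;
}
```

Higher unrolling factors follow the same pattern with more accumulators per iteration. How many of them pay off depends on the available registers and the latency of the fused multiply-add on the target core.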

- NEON 8 1 : NEON with loop unrolling factor 8, single thread
- NEON 8 2 : NEON with loop unrolling factor 8, 2 threads
- NEON 8 4 : NEON with loop unrolling factor 8, 4 threads
- NEON 8 8 : NEON with loop unrolling factor 8, 8 threads

There is a clear benefit in using multiple threads. The overhead of synchronizing the threads is amortized at around size (512, 512).
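
The multithreaded variants can be pictured with the following minimal sketch. It is not the benchmarked code. It assumes a row-major matrix-vector product whose rows are split into contiguous blocks, one per `std::thread`, and the name `matvec_threaded` and its parameters are hypothetical. Creating and joining the threads is the fixed cost that has to be amortized over the work done per thread.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical sketch: y = A * x for a row-major rows x cols matrix,
// with the rows divided into contiguous blocks, one block per thread.
std::vector<double> matvec_threaded(const std::vector<double>& A,
                                    const std::vector<double>& x,
                                    std::size_t rows, std::size_t cols,
                                    std::size_t num_threads) {
    assert(A.size() == rows * cols && x.size() == cols);
    std::vector<double> y(rows, 0.0);
    // Use at least one thread and never more threads than rows.
    num_threads = std::max<std::size_t>(1, std::min(num_threads, rows));
    const std::size_t chunk = (rows + num_threads - 1) / num_threads;

    std::vector<std::thread> workers;
    workers.reserve(num_threads);
    for (std::size_t t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(rows, t * chunk);
        const std::size_t end = std::min(rows, begin + chunk);
        workers.emplace_back([&A, &x, &y, cols, begin, end] {
            for (std::size_t r = begin; r < end; ++r) {
                double sum = 0.0;
                for (std::size_t c = 0; c < cols; ++c) sum += A[r * cols + c] * x[c];
                y[r] = sum;  // each thread writes a disjoint range of y
            }
        });
    }
    // Joining is the synchronization point whose cost is paid on every call.
    for (auto& w : workers) w.join();
    return y;
}
```

For small matrices the thread start-up and join cost can exceed the time saved, which is consistent with the benefit only becoming visible at larger sizes.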