Improvement of GEMM and FFT
In addition to minor changes, bug fixes, and improvements, this release introduces the following major changes:
- Optimize FFT for Xilinx up to a size of LOG_FFT_SIZE=9 (2^9 = 512 points)
- Add kernel replication support for GEMM and FFT
- Remove git submodules and fetch dependencies using CMake instead. This eliminates the need to check out the git repository to build the benchmarks.
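
As an illustration of fetching a dependency at configure time instead of using a git submodule, the following is a minimal sketch that assumes CMake's standard FetchContent module (CMake 3.14 or newer). The dependency name, repository URL, tag, and the `example_test` target are assumptions chosen for illustration and are not necessarily what this project's build files use.

```cmake
cmake_minimum_required(VERSION 3.14)
project(fetch_example LANGUAGES CXX)

include(FetchContent)

# Hypothetical dependency: declared here instead of being tracked as a git submodule.
FetchContent_Declare(
  googletest
  GIT_REPOSITORY https://github.com/google/googletest.git
  GIT_TAG        release-1.10.0
)

# Downloads the sources during configuration (if not already present)
# and adds them to the build.
FetchContent_MakeAvailable(googletest)

# Hypothetical target that links against the fetched dependency.
add_executable(example_test example_test.cpp)
target_link_libraries(example_test PRIVATE gtest_main)
```

With a setup like this, a plain source archive is enough to configure and build, because the dependencies are downloaded by CMake rather than through `git submodule update`.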