GitHub - RIKEN-RCCS/EigenG-Batched

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 14 Commits
logs		logs
old_kernels		old_kernels
.gitattributes		.gitattributes
.gitignore		.gitignore
LICENCE.txt		LICENCE.txt
Makefile_cuda		Makefile_cuda
Makefile_hip		Makefile_hip
README		README
cusolver.hpp		cusolver.hpp
cusolver_evd.hpp		cusolver_evd.hpp
eigen_GPU_batch.cu		eigen_GPU_batch.cu
eigen_GPU_batch.h		eigen_GPU_batch.h
eigen_GPU_batch.hpp		eigen_GPU_batch.hpp
eigen_GPU_check.cu		eigen_GPU_check.cu
eigen_GPU_check.h		eigen_GPU_check.h
eigen_GPU_check.hpp		eigen_GPU_check.hpp
gpu_arch.h		gpu_arch.h
hhsy2tr.hpp		hhsy2tr.hpp
hhsy2tr_tiled.hpp		hhsy2tr_tiled.hpp
hhtr2sy.hpp		hhtr2sy.hpp
hhtr2sy_tiled.hpp		hhtr2sy_tiled.hpp
hipsolver.hpp		hipsolver.hpp
imtql2.hpp		imtql2.hpp
imtql2_tiled.hpp		imtql2_tiled.hpp
jac.cpp		jac.cpp
jac_tiled.hpp		jac_tiled.hpp
main.cpp		main.cpp
misc.hpp		misc.hpp
misc_gpu.hpp		misc_gpu.hpp
symdmt.cu		symdmt.cu
tql2.hpp		tql2.hpp
tql2_tiled.hpp		tql2_tiled.hpp
tred1_tiled.hpp		tred1_tiled.hpp

Repository files navigation

EigenG-Batched

0. Revision Histrory
  Version 1.6 (2024.11.20) ::
    Sorting option has been fixed (#define DO_SORT 1).
    Confirmed to perform on CUDA 12.6, Rocm 6.2, or priors.

  Version 1.5 (2024.01.01) ::
    CUDA 12.3, and Rocm 6.0 has been supported

  Version 1.4 (2023.07.22) ::
    Merge the CUDA version and the HIP version, Makefile_cuda and Makefile_hip

  Version 1.3 (2023.07.19) ::
    Modification in a virtual machine specification as sm_XXX in Makefile.

  Version 1.2 (2022.10.13) ::
    Repository name has been replaced from EigenG-Batch to EigenG-Batched.
    A bit revision in README and the LICENCE notice.
    
  Version 1.1 (2022.09.20) ::
    Avoidance of zero-division by IEE754 Minus-Zero.

  Version 1.0 (2022.08.15) ::
    First public release.

  Belows are Preliminary versions // not officially released
  Version 2022.06.16 ::
    New implementation of hypot function and warp-aligned symv kernel.
  Version 2022.06.05 ::
    Avoidance of the warp divergency.
  Version 2022.05.22 ::
    libeigenGbatch.a is supported.
    An API to get working buffer size is separately supported.
    A cooperative run on a non-warp unit is synchronized on a warp appropriately
  Version 2022.05.20 ::
    Zero-division fault is avoided.
    LETKF-like test matrix is preliminary included.
  Version 2022.05.16 ::
    Tiled_partition cooperative group version are included for the case N <= 32.
    small kernels are not used anymore.



1. Build, Use, and Note
  Edit 'Makefile' appropriately; especially, CUDA_PATH, and so on.
  Run 'make' command, and confirm libeigenGbatch.a is generated.
  When make your cuda executable, link libeigenGbatch.a.
  The obtained eigenpairs are not sorted by eigenvalue in default.
  If you need sorting, edit eigen_GPU_batch.cu as '#define DO_SORT 1'.
  Offload to multiple GPU's is available with appropriate use of device id's
  and CUDA streams.

  Source code examples are as follows.

  ** In the case of C++,
  #include "eigen_GPU_batch.hpp"
  void foo(...){
    float *a, *w, *work;
    size_t lwork;
    ...
    eigen_GPU_batch_BufferSize (L, nm, n, m, a, w, &lwork);
    cudaMalloc(&work, lwork);
    ...
    eigen_GPU_batch (L, nm, n, m, a, w, work, stream);
    ...
  }
  
  ** In the case of C,
  #include "eigen_GPU_batch.h"
  void foo(...){
    float *a, *w, *work;
    size_t lwork;
    ...
    eigen_GPU_batch_FP (L, nm, n, m, a, (float*)&lwork, NULL, NULL);
    cudaMalloc(&work, lwork);
    ...
    eigen_GPU_batch_FP (L, nm, n, m, a, w, work, stream);
    ...
  }
  


2. Main file Structure:
  root:- eigen_GPU_batch.cu  -- 1: and 2:
       + eigen_GPU_batch.hpp
       + eigen_GPU_batch.h
       + misc_gpu.hpp
       + misc.hpp
       + old_kernels/ -- 3:
       + fortran/ -- 4:

  1:- hhsy2tr.hpp
    + imtql2.hpp
    + tql2.hpp (optional)
    + hhtr2sy.hpp

  2:- tred1_tiled.hpp
    + hhsy2tr_tiled.hpp (optional)
    + imtql2_tiled.hpp
    + tql2_tiled.hpp (optional)
    + hhtr2sy_tiled.hpp

  3:(old or experimental implementations)
    + tred1.hpp
    + tred1_tiled.hpp
    + trbak1.hpp
    + trbak1_tiled.hpp
    + {tred1,tql2,trbak1,hhtr2sy}_small.hpp
    + imtql2.hpp

  4: fortran reference codes
    + fortran/*.f

  5: benchmark+test codes
    + main.*
    + eigen_GPU_check.*



3. Tunable parameters

hhsy2tr.hpp(hhsy2tr_tiled.hpp) and eigen_GPU_batch.cu
  const int BLK_J = 3;
  const int BLK_K = 3;
  const int BLK_M = 4; <= must be the same as the parameter 'mb' in eigen_GPU_batch.cu.

hhtr2sy.hpp
  const int BLK_I = 4 if T is float else 3
  const int BLK_J = 4 if T is float else 3
hhtr2sy_tiled.hpp
  const int BLK_I = (tile_size>=16)?4:((tile_size>=8)?3:2);
  const int BLK_J = (tile_size>=16)?4:((tile_size>=8)?3:2);

eigen_GPU_batch.hpp:
  eigen_GPU_batch_get_Launch_params
  paraeters numTB abd numTH affect to the performance strongly.
   :: Deafault numTB = deviceProp.multiProcessorCount*12;
   :: Deafault numTH = 32*8;
  Note numTH must be multiple of 32.

common:
  #pragma unroll
  #pragma unroll 1
   :: in previous version of 20220428, wrong unroll factor was specified.
   :: Some facts through several branch comparisons reveal the unroll factor is not strongly related to the performance. The one without unroll pragmas might show better performance.

Makefile:
  compiler option to the NVCC for ptx compiling
  --maxrregcount=128
   :: Since the batch kernel has a memory-bound property, even though the kernel looks categorized in BLAS3, it is significant to hide memory latency. Thus, it leads to better performance to maximize the number of active thread blocks, meaning high system usability with hiding a memory access behind multiple thread switching.



4. To Do

* Take advantage of TensorCore acceleration.
* Flexible interleave data format and thread scheduling.
* Wilkinson's combined implementation with SYR2 + SYMV like Kevd by Kudo.
* Divide and conqure method (probably not neccessary for batched version)
* Half precision version + Iterative refinment.
* Automation to validate the kernels by calling compute-sanitizer and so on.
* Any power of the input matrix, for example, Square root, etc.

5. Note for the HIP environment

* For the target ID table for AMD/HIP is available from
  https://llvm.org/docs/AMDGPUUsage.html
  RX480 : gfx803 obsolated in Rocm 6.0
  MI100 : gfx908
  MI250 : gfx90a
  MI300 : gfx90a
  Radeon (RDNA 2): gfx1030
  Radeon (RDNA 3): gfx1100,gfx1101,gfx1102
  Ryzen APU GXX: gfx90c


5. COPYRIGHT and LICENCING

Please find and read LICENCE.txt.