-
Notifications
You must be signed in to change notification settings - Fork 2
/
README
182 lines (141 loc) · 5.38 KB
/
README
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
EigenG-Batched
0. Revision Histrory
Version 1.6 (2024.11.20) ::
Sorting option has been fixed (#define DO_SORT 1).
Confirmed to perform on CUDA 12.6, Rocm 6.2, or priors.
Version 1.5 (2024.01.01) ::
CUDA 12.3, and Rocm 6.0 has been supported
Version 1.4 (2023.07.22) ::
Merge the CUDA version and the HIP version, Makefile_cuda and Makefile_hip
Version 1.3 (2023.07.19) ::
Modification in a virtual machine specification as sm_XXX in Makefile.
Version 1.2 (2022.10.13) ::
Repository name has been replaced from EigenG-Batch to EigenG-Batched.
A bit revision in README and the LICENCE notice.
Version 1.1 (2022.09.20) ::
Avoidance of zero-division by IEE754 Minus-Zero.
Version 1.0 (2022.08.15) ::
First public release.
Belows are Preliminary versions // not officially released
Version 2022.06.16 ::
New implementation of hypot function and warp-aligned symv kernel.
Version 2022.06.05 ::
Avoidance of the warp divergency.
Version 2022.05.22 ::
libeigenGbatch.a is supported.
An API to get working buffer size is separately supported.
A cooperative run on a non-warp unit is synchronized on a warp appropriately
Version 2022.05.20 ::
Zero-division fault is avoided.
LETKF-like test matrix is preliminary included.
Version 2022.05.16 ::
Tiled_partition cooperative group version are included for the case N <= 32.
small kernels are not used anymore.
1. Build, Use, and Note
Edit 'Makefile' appropriately; especially, CUDA_PATH, and so on.
Run 'make' command, and confirm libeigenGbatch.a is generated.
When make your cuda executable, link libeigenGbatch.a.
The obtained eigenpairs are not sorted by eigenvalue in default.
If you need sorting, edit eigen_GPU_batch.cu as '#define DO_SORT 1'.
Offload to multiple GPU's is available with appropriate use of device id's
and CUDA streams.
Source code examples are as follows.
** In the case of C++,
#include "eigen_GPU_batch.hpp"
void foo(...){
float *a, *w, *work;
size_t lwork;
...
eigen_GPU_batch_BufferSize (L, nm, n, m, a, w, &lwork);
cudaMalloc(&work, lwork);
...
eigen_GPU_batch (L, nm, n, m, a, w, work, stream);
...
}
** In the case of C,
#include "eigen_GPU_batch.h"
void foo(...){
float *a, *w, *work;
size_t lwork;
...
eigen_GPU_batch_FP (L, nm, n, m, a, (float*)&lwork, NULL, NULL);
cudaMalloc(&work, lwork);
...
eigen_GPU_batch_FP (L, nm, n, m, a, w, work, stream);
...
}
2. Main file Structure:
root:- eigen_GPU_batch.cu -- 1: and 2:
+ eigen_GPU_batch.hpp
+ eigen_GPU_batch.h
+ misc_gpu.hpp
+ misc.hpp
+ old_kernels/ -- 3:
+ fortran/ -- 4:
1:- hhsy2tr.hpp
+ imtql2.hpp
+ tql2.hpp (optional)
+ hhtr2sy.hpp
2:- tred1_tiled.hpp
+ hhsy2tr_tiled.hpp (optional)
+ imtql2_tiled.hpp
+ tql2_tiled.hpp (optional)
+ hhtr2sy_tiled.hpp
3:(old or experimental implementations)
+ tred1.hpp
+ tred1_tiled.hpp
+ trbak1.hpp
+ trbak1_tiled.hpp
+ {tred1,tql2,trbak1,hhtr2sy}_small.hpp
+ imtql2.hpp
4: fortran reference codes
+ fortran/*.f
5: benchmark+test codes
+ main.*
+ eigen_GPU_check.*
3. Tunable parameters
hhsy2tr.hpp(hhsy2tr_tiled.hpp) and eigen_GPU_batch.cu
const int BLK_J = 3;
const int BLK_K = 3;
const int BLK_M = 4; <= must be the same as the parameter 'mb' in eigen_GPU_batch.cu.
hhtr2sy.hpp
const int BLK_I = 4 if T is float else 3
const int BLK_J = 4 if T is float else 3
hhtr2sy_tiled.hpp
const int BLK_I = (tile_size>=16)?4:((tile_size>=8)?3:2);
const int BLK_J = (tile_size>=16)?4:((tile_size>=8)?3:2);
eigen_GPU_batch.hpp:
eigen_GPU_batch_get_Launch_params
paraeters numTB abd numTH affect to the performance strongly.
:: Deafault numTB = deviceProp.multiProcessorCount*12;
:: Deafault numTH = 32*8;
Note numTH must be multiple of 32.
common:
#pragma unroll
#pragma unroll 1
:: in previous version of 20220428, wrong unroll factor was specified.
:: Some facts through several branch comparisons reveal the unroll factor is not strongly related to the performance. The one without unroll pragmas might show better performance.
Makefile:
compiler option to the NVCC for ptx compiling
--maxrregcount=128
:: Since the batch kernel has a memory-bound property, even though the kernel looks categorized in BLAS3, it is significant to hide memory latency. Thus, it leads to better performance to maximize the number of active thread blocks, meaning high system usability with hiding a memory access behind multiple thread switching.
4. To Do
* Take advantage of TensorCore acceleration.
* Flexible interleave data format and thread scheduling.
* Wilkinson's combined implementation with SYR2 + SYMV like Kevd by Kudo.
* Divide and conqure method (probably not neccessary for batched version)
* Half precision version + Iterative refinment.
* Automation to validate the kernels by calling compute-sanitizer and so on.
* Any power of the input matrix, for example, Square root, etc.
5. Note for the HIP environment
* For the target ID table for AMD/HIP is available from
https://llvm.org/docs/AMDGPUUsage.html
RX480 : gfx803 obsolated in Rocm 6.0
MI100 : gfx908
MI250 : gfx90a
MI300 : gfx90a
Radeon (RDNA 2): gfx1030
Radeon (RDNA 3): gfx1100,gfx1101,gfx1102
Ryzen APU GXX: gfx90c
5. COPYRIGHT and LICENCING
Please find and read LICENCE.txt.