[Proposal] Integration of the rocMLIR Split-K GEMM implementation into MIGraphX #2858
ravil-mobile
started this conversation in
General
1. Relevant High-Level Details
The Split-K GEMM Scheme is designed to improve work balancing between CUs while computing GEMMs. The scheme can greatly improve time-to-solution for some GEMM configurations. In contrast to the Data Parallel GEMM Scheme, the Split-K algorithm uses several workgroups to compute each output tile; thus, each workgroup computes a partial result for an output tile. Therefore, the Split-K Scheme involves a post-action: a summation of partial results. Currently, rocMLIR uses atomicAdd instructions to accumulate partial results, which supports the FP32 and (partially) FP16 data formats. Thus, the rocMLIR implementation of the Split-K scheme is “lock”-free (“barrier”-free) – i.e., it cannot end up in a deadlock state. However, the implementation requires the output memory buffer to be zero-initialized. Otherwise, the result of a matrix multiplication is accumulated on top of the “garbage” data located in the output memory buffer.
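The accumulation scheme and the zero-initialization requirement can be illustrated with a minimal pure-Python sketch. This is not rocMLIR code: the function name `matmul_split_k` is hypothetical, the loop over splits stands in for concurrently running workgroups, and the `+=` stands in for an atomicAdd on the output buffer.

```python
def matmul_split_k(a, b, split_k, out):
    """Accumulate a @ b into `out`, splitting the K dimension into `split_k` chunks."""
    m, k, n = len(a), len(b), len(b[0])
    chunk = (k + split_k - 1) // split_k  # ceil-divide K among the splits
    for s in range(split_k):  # each iteration plays the role of one workgroup
        k_begin, k_end = s * chunk, min((s + 1) * chunk, k)
        for i in range(m):
            for j in range(n):
                partial = sum(a[i][p] * b[p][j] for p in range(k_begin, k_end))
                out[i][j] += partial  # stands in for atomicAdd on the output
    return out


a = [[1, 2, 3, 4], [5, 6, 7, 8]]
b = [[1, 0], [0, 1], [1, 0], [0, 1]]
reference = [[4, 6], [12, 14]]  # plain a @ b

# Zero-initialized output: the partial sums add up to the correct product.
zeroed = matmul_split_k(a, b, 2, [[0, 0], [0, 0]])

# Uninitialized ("garbage") output: the stale values leak into the result.
dirty = matmul_split_k(a, b, 2, [[100, 100], [100, 100]])
```

With a zeroed buffer the result equals `reference`; with the stale buffer every element is off by the value that was already in memory.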
Note that the rocMLIR implementation of the Split-K scheme prevents kernel fusion – i.e., a fusion of a GEMM kernel with subsequent element-wise operations. This is because there is no mechanism that workgroups could use to check whether the partial results for a specific output tile have been fully accumulated. In general, the problem can be resolved by adding a synchronization mechanism between workgroups. However, a naïve implementation of the synchronization always leaves a potential for a kernel to end up in a deadlock state. The best solution to this problem would involve Cooperative Groups (CG) – e.g., something similar to “cudaLaunchCooperativeKernel”. However, CG is not supported by ROCm at the time of writing.
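A small numeric sketch shows why a fused element-wise operation needs the fully accumulated value. The partial sums below are made-up numbers, and ReLU is just one example of a non-linear element-wise operation.

```python
def relu(x):
    return max(x, 0.0)


# Hypothetical partial sums produced by two workgroups for ONE output element.
partials = [-3.0, 5.0]

# Correct: finish the accumulation first, then apply the element-wise op.
after_reduction = relu(sum(partials))

# Incorrect: each workgroup applies the op to its own partial sum before
# accumulating, which is what a fused Split-K kernel would do if it could
# not tell whether the tile is complete.
per_partial = sum(relu(p) for p in partials)
```

Here `after_reduction` is 2.0 while `per_partial` is 5.0, so the element-wise operation cannot simply be applied by each workgroup to its own partial result.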
2. Integration Approach
Approach 1: rocMLIR becomes a sub-graph provider. The approach assumes that rocMLIR will be equipped with a run-time library which is going to execute (sequences of) generated kernels. The run-time will expose a C API to execute a generated sequence, taking an `mlirModule` and a `hipStream` as arguments. The advantage of this approach is that it completely hides all implementation details from users; the users won’t need to perform any pre-processing actions before executing a kernel, except for memory allocations (including auxiliary memory if required by a generated kernel). This solution is generic; it adds extensibility and flexibility to both rocMLIR and MIGraphX. By and large, both projects will be better prepared for any upcoming changes in software requirements. The disadvantage of this approach is that development, refactoring and integration will take a considerable amount of time.
Approach 2: Similar to Approach 1, rocMLIR becomes a sub-graph provider. However, in this case, rocMLIR won't have any run-time library for executing kernel sequences. Instead, rocMLIR is going to return 1) a sequence of pre-processing kernels (which must be executed beforehand) and 2) computational kernels (which must be executed one after another) for each requested operation (e.g., Split-K GEMM). The users (e.g., MIGraphX) will be responsible for scheduling pre-processing kernels. This approach has advantages similar to those of Approach 1. Additionally, such a solution can result in better performance if scheduling is intelligently implemented. However, the approach is going to add considerable programming complexity for both the rocMLIR and MIGraphX teams.
Approach 3: rocMLIR will stay a kernel provider and both teams (rocMLIR and MIGraphX) will focus solely on the enablement of the Split-K GEMM scheme. This approach is going to result in adding (presumably special) rocMLIR C APIs to query the pre-processing actions which MIGraphX will have to perform before executing generated kernels. The disadvantage: the proposed solution is not generic and may bloat the rocMLIR C API in the future. The advantage: the approach is going to require the least communication and programming effort from both teams.
Having analyzed advantages and disadvantages of the proposed approaches as well as the existing time constraints, the rocMLIR team selects Approach 3.
3. Detailed Overview of The Selected Approach
It is worth noting that rocMLIR acts as a kernel provider for MIGraphX – i.e., not a sub-graph provider. Some rocMLIR kernels (e.g., Split-K GEMM) may involve several pre-processing steps, which must be executed before invoking a generated kernel. Therefore, both rocMLIR and MIGraphX teams must agree upon several topics listed below.
3.1 Selection of the Split-K GEMM Kernels
The selection of the Split-K GEMM variant is going to be done via the rocMLIR `perfConfig` string (see the Glossary). The rocMLIR team is going to add a 9th parameter to the string which will indicate the Split-K factor – i.e., the number of splits along the K dimension. A value equal to 1 indicates that the Data Parallel GEMM scheme is in use. Values other than 1 indicate that the Split-K scheme is going to be used.
As before, MIGraphX requests rocMLIR to provide a set of performance configs for quick or exhaustive tuning. Internally, rocMLIR will try to employ some heuristics to limit the number of Split-K factors and, thus, speed up the tuning process, which is performed on the MIGraphX side.
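The extended string could be interpreted along the lines of the sketch below. It assumes the string is exactly a comma-separated list of integers in the order given in the Glossary; the actual rocMLIR string layout may differ. The names `parse_perf_config`, `uses_split_k` and `splitKFactor` are hypothetical, the example values are made up, and treating an 8-field string as Split-K factor 1 is an assumption of this sketch.

```python
# Field order taken from the Glossary entry for the perfConfig string.
FIELDS = (
    "MPerBlock", "NPerBlock", "KPerBlock", "MPerWave", "NPerWave",
    "KPack", "AThreadCopyMoreGemmK", "BThreadCopyMoreGemmKPack",
)


def parse_perf_config(text):
    """Parse a comma-separated perfConfig string into a dict of integers."""
    values = [int(v) for v in text.split(",")]
    if len(values) not in (len(FIELDS), len(FIELDS) + 1):
        raise ValueError(f"expected 8 or 9 integers, got {len(values)}")
    config = dict(zip(FIELDS, values))
    # Assumption: a legacy 8-field string means the Data Parallel scheme.
    config["splitKFactor"] = values[8] if len(values) == 9 else 1
    if config["splitKFactor"] < 1:
        raise ValueError("the Split-K factor must be at least 1")
    return config


def uses_split_k(config):
    """A factor of 1 selects the Data Parallel scheme, anything else Split-K."""
    return config["splitKFactor"] != 1


data_parallel = parse_perf_config("64,64,8,32,32,4,1,1,1")
split_k = parse_perf_config("64,64,8,32,32,4,1,1,4")
```

On the MIGraphX side, a check like `uses_split_k(split_k)` would be the point at which the pre-processing described in Sections 3.4 and 3.5 becomes necessary.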
3.2 MIGraphX Fusing Decision
[TODO] @pfultz2 and @causten, please, describe how MIGraphX is going to handle the fusion decision in this section. Based on our last discussion, @pfultz2 has a good idea how it can be done in MIGraphX.
3.3 Communication between rocMLIR and MIGraphX
3.3.1 An early warning about Split-K
Given the GEMM sizes and the number of CUs, rocMLIR should tell MIGraphX how likely the Split-K GEMM scheme is to be faster than the Data Parallel GEMM scheme: 1) always; 2) maybe; or 3) never. This information should be available as early as possible – i.e., before tuning.
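The three-valued answer could be modeled as an enumeration. The sketch below pairs it with a deliberately simple toy rule that compares the number of output tiles with the number of CUs; the rule, its thresholds, and the names `SplitKHint` and `split_k_hint` are all assumptions made for illustration and are not the heuristic rocMLIR uses or plans to use.

```python
import math
from enum import Enum


class SplitKHint(Enum):
    ALWAYS = "always"
    MAYBE = "maybe"
    NEVER = "never"


def split_k_hint(m, n, k, m_per_block, n_per_block, k_per_block, num_cus):
    """Toy rule: Split-K looks attractive when few output tiles exist per CU."""
    tiles = math.ceil(m / m_per_block) * math.ceil(n / n_per_block)
    k_iters = math.ceil(k / k_per_block)
    if k_iters < 2 or tiles >= num_cus:
        # Nothing to split along K, or the output tiles already cover all CUs.
        return SplitKHint.NEVER
    if tiles * 4 <= num_cus:
        # Arbitrary threshold: at least three quarters of the CUs would idle.
        return SplitKHint.ALWAYS
    return SplitKHint.MAYBE


# Made-up problem sizes on a hypothetical device with 104 CUs.
tall_k = split_k_hint(128, 128, 8192, 64, 64, 8, num_cus=104)
large_mn = split_k_hint(4096, 4096, 4096, 64, 64, 8, num_cus=104)
in_between = split_k_hint(512, 512, 4096, 64, 64, 8, num_cus=104)
```

Whatever the real criterion turns out to be, the point of the enumeration is that MIGraphX can branch on it before any tuning is done.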
3.3.2 Checking whether a module is fusible
MIGraphX needs to know whether a particular module, which may consist of a GEMM kernel and several pointwise operations, is fusible inside rocMLIR or not.
3.4 Initialization of the Output Memory Buffers
The memory initialization problem can be solved generically and, thus, the same solution can be applied to handle operations which may appear in the future – e.g., operations with multiple output values. rocMLIR is going to provide a C API which users must use to query which kernel arguments require memory initialization. rocMLIR expects MIGraphX to perform the query and the corresponding pre-processing actions. Otherwise, rocMLIR does not guarantee the correctness of numerical results.
Note, rocMLIR uses `double` (in the example above) because it is the widest supported data type. MIGraphX knows the data types of the corresponding kernel arguments; thus, it can perform appropriate data casting.
3.5 Handling Auxiliary Memory Buffers
Starting from the enablement of the Split-K GEMM, each rocMLIR operation may request a caller to allocate additional memory buffers. The caller is obliged to allocate and initialize all buffers specified by the generated rocMLIR module.
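The caller-side obligations from Sections 3.4 and 3.5 can be sketched as follows. This is a hypothetical model, not the rocMLIR C API: the `ArgRequirement` record, its fields, and `prepare_buffers` are invented for illustration, and Python lists stand in for device buffers. The fill value is carried as a Python float (a double), and a real caller would cast it to the element type of the argument.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ArgRequirement:
    """Hypothetical result of querying one kernel argument."""
    index: int                    # position in the kernel argument list
    num_elements: int             # buffer size in elements
    is_auxiliary: bool            # True if the caller must allocate it
    init_value: Optional[float]   # None means no initialization is required


def prepare_buffers(requirements, user_buffers):
    """Allocate auxiliary buffers and fill the buffers that need initialization."""
    prepared = dict(user_buffers)  # same list objects, new index -> buffer map
    for req in requirements:
        if req.is_auxiliary:
            # In this sketch, a freshly allocated auxiliary buffer is zeroed.
            prepared[req.index] = [0.0] * req.num_elements
        if req.init_value is not None:
            buf = prepared[req.index]
            buf[:] = [req.init_value] * len(buf)  # in-place fill, like a memset
    return prepared


# Made-up example: argument 2 is the GEMM output, argument 3 is auxiliary.
requirements = [
    ArgRequirement(index=2, num_elements=4, is_auxiliary=False, init_value=0.0),
    ArgRequirement(index=3, num_elements=2, is_auxiliary=True, init_value=None),
]
user_buffers = {0: [1.0] * 4, 1: [2.0] * 4, 2: [9.0, 9.0, 9.0, 9.0]}
prepared = prepare_buffers(requirements, user_buffers)
```

After the call, the output buffer at index 2 no longer holds stale values and the auxiliary buffer at index 3 exists, so the generated kernel can be launched.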
[TODO] @krzysz00 @manupak, could you please double check Section 3.4 & 3.5
4. Glossary
Data Parallel GEMM Scheme – see Algorithm 2 from [1]
Split-K GEMM Scheme (also known as the Fixed-Split Scheme) – see Algorithm 4 from [1]
GEMM configuration – a tuple of $(M, N, K)$
Output Tile – $MPerBlock \times NPerBlock$
rocMLIR perfConfig – a string with 8 integers separated by commas. The configuration denotes: $MPerBlock, NPerBlock, KPerBlock, MPerWave, NPerWave, KPack, AThreadCopyMoreGemmK, BThreadCopyMoreGemmKPack$
5. References
[1] - Osama, Muhammad, et al. "Stream-K: Work-Centric Parallel Decomposition for Dense Matrix-Matrix Multiplication on the GPU." Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming. 2023.