[Proposal] Integration of the rocMLIR Split-K GEMM implementation into MIGraphX #2858

ravil-mobile · 2024-03-06T10:29:13Z

ravil-mobile
Mar 6, 2024
Collaborator

1. Relevant High-Level Details

The Split-K GEMM Scheme is designed to improve work balancing between CUs while computing GEMMs. The scheme can greatly improve time-to-solution for some GEMM configurations. In contrast to the Data Parallel GEMM Scheme, the Split-K algorithm utilizes several workgroups for computing each output tile; thus, each workgroup computes only a partial result for an output tile. Therefore, the Split-K Scheme involves a post-action: a summation of partial results. Currently, rocMLIR utilizes atomicAdd instructions for accumulating partial results, which supports FP32 and (partially) FP16 data formats. Thus, the rocMLIR implementation of the Split-K scheme is “lock”-free (“barrier”-free) – i.e., it cannot end up in a deadlock state. However, the implementation demands the zero-initialization of the output memory buffer. Otherwise, the result of a matrix multiplication is accumulated onto the “garbage” data located in the output memory buffer.
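The accumulation behaviour described above can be illustrated with a minimal CPU sketch. It is not the rocMLIR kernel: the function name `splitKGemm`, the row-major layout, and the sequential loop over splits (standing in for concurrent workgroups and atomicAdd) are assumptions made purely for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU model of Split-K: the K dimension is cut into `splitK`
// chunks, and every chunk adds its partial product into C. The `+=` on C
// stands in for the atomicAdd performed by each workgroup on the GPU.
// C is NOT cleared here: the caller must zero-initialize it beforehand.
void splitKGemm(const std::vector<float>& A, // M x K, row-major
                const std::vector<float>& B, // K x N, row-major
                std::vector<float>& C,       // M x N, row-major
                std::size_t M, std::size_t N, std::size_t K,
                std::size_t splitK) {
  assert(splitK >= 1);
  assert(A.size() == M * K && B.size() == K * N && C.size() == M * N);
  const std::size_t chunk = (K + splitK - 1) / splitK; // ceil(K / splitK)
  for (std::size_t s = 0; s < splitK; ++s) {           // one "workgroup" per split
    const std::size_t kBegin = s * chunk;
    const std::size_t kEnd = std::min(K, kBegin + chunk);
    for (std::size_t i = 0; i < M; ++i) {
      for (std::size_t j = 0; j < N; ++j) {
        float partial = 0.0f;
        for (std::size_t k = kBegin; k < kEnd; ++k) {
          partial += A[i * K + k] * B[k * N + j];
        }
        C[i * N + j] += partial; // accumulate onto whatever C already holds
      }
    }
  }
}
```

Calling the function with a zero-filled C yields the ordinary GEMM result; calling it with a C that already holds other values yields that result plus the stale contents, which is why the output buffer has to be cleared first.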

Note that the rocMLIR implementation of the Split-K scheme prevents kernel fusion – i.e., a fusion of a GEMM kernel with subsequent element-wise operations. This results from the absence of any mechanism that workgroups can use to check whether partial results for a specific output tile have been fully accumulated or not. In general, the problem can be resolved by adding some synchronization mechanism between workgroups. However, a naïve implementation of the synchronization always leaves a potential for a kernel to end up in a deadlock state. The best solution to the aforementioned problem would involve Cooperative Groups (CG) – e.g., something similar to “cudaLaunchCooperativeKernel”. However, CG is not supported by ROCm at the time of writing.

2. Integration Approach

Approach 1: rocMLIR becomes a sub-graph provider. The approach assumes that rocMLIR will be equipped with a run-time library which is going to execute (sequences of) generated kernels. The run-time will expose a C API to execute a generated sequence, taking an mlirModule and a hipStream as arguments. The advantage of this approach is that it completely hides all implementation details from users; the users won’t need to perform any pre-processing actions before executing a kernel, except for memory allocations (including auxiliary memory if required by a generated kernel). This solution is generic; it adds extensibility and flexibility to both rocMLIR and MIGraphX. By and large, both projects will be better prepared for any upcoming changes in software requirements. The disadvantage of this approach is that it will take a considerable amount of time for the development, refactoring and integration.

Approach 2: Similar to Approach 1, rocMLIR becomes a sub-graph provider. However, in this case, rocMLIR won't have any run-time library for executing kernel sequences. Instead, rocMLIR is going to return 1) a sequence of pre-processing kernels (which must be executed beforehand) and 2) computational kernels (which must be executed one after another) for each requested operation (e.g., Split-K GEMM). The users (e.g., MIGraphX) will be responsible for scheduling pre-processing kernels. This approach has advantages similar to those of Approach 1. Additionally, such a solution can result in better performance if scheduling is intelligently implemented. However, the approach is going to add considerable programming complexity for both rocMLIR and MIGraphX teams.

Approach 3: rocMLIR will stay as a kernel provider and both (rocMLIR and MIGraphX) teams will solely focus on the enablement of the Split-K GEMM scheme. This approach is going to result in adding (presumably special) rocMLIR C APIs to query pre-processing actions which MIGraphX will have to perform before executing generated kernels. The disadvantage: the proposed solution is not generic and may bloat the rocMLIR C API in the future. The advantage: the approach is going to result in the least communication and programming effort for both teams.

Having analyzed advantages and disadvantages of the proposed approaches as well as the existing time constraints, the rocMLIR team selects Approach 3.

3. Detailed Overview of The Selected Approach

It is worth noting that rocMLIR acts as a kernel provider for MIGraphX – i.e., not a sub-graph provider. Some rocMLIR kernels (e.g., Split-K GEMM) may involve several pre-processing steps, which must be executed before invoking a generated kernel. Therefore, both rocMLIR and MIGraphX teams must agree upon several topics listed below.

3.1 Selection of the Split-K GEMM Kernels

The selection of the Split-K GEMM variant is going to be done via the rocMLIR perfConfig string (see the Glossary). The rocMLIR team is going to add a 9th parameter to the string which will indicate the Split-K factor - i.e., the number of splits along the K dimension. A value equal to 1 indicates that the Data Parallel GEMM scheme is in use. Values other than 1 indicate that the Split-K scheme is going to be utilized.
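As an illustration of how a caller could read the proposed 9th parameter, here is a minimal sketch. It assumes the perfConfig is a plain list of comma-separated integers, as the Glossary describes; real rocMLIR strings may carry extra decoration. The helper names `splitKFactor` and `usesSplitK` are hypothetical, and treating an 8-field string as a Split-K factor of 1 is an assumption rather than something the proposal specifies.

```cpp
#include <cassert>
#include <cstddef>
#include <exception>
#include <optional>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not part of rocMLIR or MIGraphX): returns the Split-K
// factor encoded in a perfConfig string, or std::nullopt if it is malformed.
std::optional<int> splitKFactor(const std::string& perfConfig) {
  std::vector<int> fields;
  std::stringstream stream(perfConfig);
  std::string token;
  while (std::getline(stream, token, ',')) {
    try {
      std::size_t consumed = 0;
      const int value = std::stoi(token, &consumed);
      if (consumed != token.size()) return std::nullopt; // trailing junk
      fields.push_back(value);
    } catch (const std::exception&) {
      return std::nullopt; // empty or non-numeric field
    }
  }
  // Assumption: an 8-field string predates Split-K, i.e. Data Parallel.
  if (fields.size() == 8) return 1;
  if (fields.size() == 9 && fields[8] >= 1) return fields[8];
  return std::nullopt;
}

// True when the config selects the Split-K scheme (factor other than 1).
bool usesSplitK(const std::string& perfConfig) {
  const std::optional<int> factor = splitKFactor(perfConfig);
  assert(factor.has_value() && "perfConfig must be well-formed");
  return *factor > 1;
}
```

For instance, `usesSplitK("128,128,8,64,64,4,1,1,4")` reports a Split-K kernel, while the same string ending in `,1` reports the Data Parallel scheme; the numeric values are made up for the example.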

As before, MIGraphX requests rocMLIR to provide a set of performance configs for quick or exhaustive tuning. Internally, rocMLIR will try to employ some heuristics to limit the number of candidate Split-K factors and, thus, speed up the tuning process, which is performed on the MIGraphX side.

3.2 MIGraphX Fusing Decision

[TODO] @pfultz2 and @causten, please, describe how MIGraphX is going to handle the fusion decision in this section. Based on our last discussion, @pfultz2 has a good idea how it can be done in MIGraphX.

3.3 Communication between rocMLIR and MIGraphX

3.3.1 An early warning about Split-K

Given the GEMM sizes and the number of CUs, rocMLIR should tell MIGraphX how likely the Split-K GEMM scheme is to be faster than the Data Parallel GEMM scheme: 1) always; 2) maybe; or 3) never. This information should be available as early as possible - i.e., before tuning.

// Example: MIGraphX side
auto isSplitKFaster = mlirIsSplitKFaster(M, N, K, numCUs);
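One possible way for the caller to act on such a three-valued answer is sketched below. The proposal does not define the return type of mlirIsSplitKFaster or how MIGraphX should use it, so the enum `SplitKHint`, the struct `Candidate`, and the pruning policy in `pruneCandidates` are all hypothetical.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical three-valued hint mirroring "always / maybe / never".
enum class SplitKHint { Always, Maybe, Never };

// Hypothetical tuning candidate: a perfConfig plus its Split-K factor.
struct Candidate {
  std::string perfConfig;
  int splitK; // 1 means Data Parallel, anything larger means Split-K
};

// Keeps only the candidates worth tuning for the given hint:
//   Always -> Split-K configs only, Never -> Data Parallel configs only,
//   Maybe  -> everything, so the tuner decides empirically.
std::vector<Candidate> pruneCandidates(SplitKHint hint,
                                       const std::vector<Candidate>& all) {
  std::vector<Candidate> kept;
  for (const Candidate& candidate : all) {
    assert(candidate.splitK >= 1);
    const bool isSplitK = candidate.splitK > 1;
    const bool keep = hint == SplitKHint::Maybe ||
                      (hint == SplitKHint::Always && isSplitK) ||
                      (hint == SplitKHint::Never && !isSplitK);
    if (keep) kept.push_back(candidate);
  }
  return kept;
}
```

With such a policy, a "never" answer would let the caller skip Split-K configs entirely and keep fusion available, while "maybe" leaves the decision to tuning.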

3.3.2 Checking whether a module is fusible

MIGraphX needs to know whether a particular module, which may consist of a GEMM kernel and several pointwise operations, is fusible inside rocMLIR or not.

// Example: MIGraphX side
bool isFusible = isModuleFusible(mmodule.get(), perfConfig);

3.4 Initialization of the Output Memory Buffers

The memory initialization problem can be solved generically and, thus, the same solution can be applied to handle operations which may appear in the future - e.g., consider operations with multiple output values. rocMLIR is going to provide a C API which users must use to query which kernel arguments require memory initialization. rocMLIR expects MIGraphX to perform the query and the corresponding pre-processing actions. Otherwise, rocMLIR does not guarantee the correctness of numerical results.

// Example: MIGraphX side
auto numPrefillArgs = mlirGetNumPrefillArgs(mmodule.get());
std::vector<size_t> prefillArgIndices(numPrefillArgs);
std::vector<MlirAttribute> prefillArgValues(numPrefillArgs);
mlirGetPrefillArgsInfo(mmodule.get(), prefillArgIndices.data(), prefillArgValues.data());

// MIGraphX performs memset of the corresponding memory buffers
...

Note, rocMLIR uses double (in the example above) because it is the widest supported data type. MIGraphX knows the data types of the corresponding kernel arguments; thus, it can perform appropriate data casting.
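The casting step mentioned in the note could look roughly like the host-side sketch below. It assumes the fill value has already been extracted as a double and that the caller knows the element type of the argument. The enum `ElemType` and the functions `fillAs` and `prefill` are hypothetical; a real implementation would fill device memory (and handle FP16, which standard C++ has no built-in type for) rather than loop over a host buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical element types known to the caller for each kernel argument.
enum class ElemType { F32, I32, I8 };

// Casts the double fill value to T once, then writes it to every element.
template <typename T>
void fillAs(void* buffer, std::size_t numElements, double value) {
  T* typed = static_cast<T*>(buffer);
  const T castValue = static_cast<T>(value);
  for (std::size_t i = 0; i < numElements; ++i) {
    typed[i] = castValue;
  }
}

// Dispatches on the element type; a host-side stand-in for a device fill.
void prefill(void* buffer, std::size_t numElements, ElemType type, double value) {
  assert(buffer != nullptr || numElements == 0);
  switch (type) {
    case ElemType::F32: fillAs<float>(buffer, numElements, value); break;
    case ElemType::I32: fillAs<std::int32_t>(buffer, numElements, value); break;
    case ElemType::I8:  fillAs<std::int8_t>(buffer, numElements, value); break;
  }
}
```

For the Split-K GEMM case the fill value would be zero, so a plain byte-wise memset is sufficient; the typed path only matters for non-zero initial values.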

3.5 Handling Auxiliary Memory Buffers

Starting from the enablement of the Split-K GEMM, each rocMLIR operation may request that the caller allocate additional memory buffers. The caller is obliged to allocate and initialize all buffers specified by the generated rocMLIR module.

// Example: MIGraphX side
auto numAuxBuffers = mlirGetNumAuxBuffers(mmodule.get());
std::vector<size_t> auxBuffersSizes(numAuxBuffers); // in bytes
std::vector<MlirAttribute> auxBuffersInitValues(numAuxBuffers);
// rocMLIR fills in both the sizes and the initial values
mlirGetAuxBuffersInfo(mmodule.get(), auxBuffersSizes.data(), auxBuffersInitValues.data());

// MIGraphX performs `hipMalloc` & `hipMemset` for the corresponding aux buffers
// and supplies them while invoking the generated kernel
//
std::vector<void*> auxBuffers(numAuxBuffers);
for (size_t i = 0; i < numAuxBuffers; ++i) {
  hipMalloc(&auxBuffers[i], auxBuffersSizes[i]);
}
...

[TODO] @krzysz00 @manupak, could you please double check Section 3.4 & 3.5

4. Glossary

Data Parallel GEMM Scheme – see Algorithm 2 from [1]
Split-K GEMM Scheme (also known as the Fixed-Split Scheme) - see Algorithm 4 from [1]
GEMM configuration - a tuple of $(M, N, K)$
Output Tile – $MPerBlock \times NPerBlock$
rocMLIR perfConfig - a string of 8 comma-separated integers. The configuration denotes: $MPerBlock, NPerBlock, KPerBlock, MPerWave, NPerWave, KPack, AThreadCopyMoreGemmK, BThreadCopyMoreGemmKPack$

5. References

[1] - Osama, Muhammad, et al. "Stream-K: Work-Centric Parallel Decomposition for Dense Matrix-Matrix Multiplication on the GPU." Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming. 2023.
