How to reduce time spent on cutlass::DeviceAllocation::copy_from_host?
#998
Replies: 3 comments
-
Hi, @wangruohui.
If your application has multiple kernels running one after another (e.g., a preceding layer feeding into the grouped GEMM), you could consider having the preceding kernel populate the device pointer arrays so as to avoid the memcpys. Otherwise, you will indeed need to copy from host to device.
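As a purely illustrative sketch of that first suggestion (not CUTLASS code; all names are made up), the per-problem pointer arrays can be filled by a tiny device kernel instead of a host-to-device memcpy, assuming the per-problem offsets already live on the device (e.g., produced by the preceding layer):

```cpp
#include <cstdint>

// Illustrative helper kernel (not part of CUTLASS): fill the per-problem
// pointer arrays directly on the device, so no host-to-device memcpy of
// ptr_A/ptr_B/ptr_C is needed before launching the grouped GEMM.
__global__ void fill_grouped_ptrs(float const *base_A,      // storage for all A operands
                                  float const *base_B,      // storage for all B operands
                                  float       *base_C,      // storage for all C/D operands
                                  int64_t const *offset_A,  // per-problem element offsets,
                                  int64_t const *offset_B,  // assumed already resident on
                                  int64_t const *offset_C,  // the device
                                  float const **ptr_A,      // outputs: device pointer arrays
                                  float const **ptr_B,      // consumed by the grouped GEMM
                                  float       **ptr_C,
                                  int problem_count) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < problem_count) {
    ptr_A[i] = base_A + offset_A[i];
    ptr_B[i] = base_B + offset_B[i];
    ptr_C[i] = base_C + offset_C[i];
  }
}

// Launched once per grouped GEMM call, on the same stream as the GEMM:
//   fill_grouped_ptrs<<<(problem_count + 255) / 256, 256, 0, stream>>>(...);
```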
One way to reduce these overheads is to allocate a single buffer of memory on the host and on the device which is used to populate all of the grouped GEMM args (e.g., all values for ptrA, then all values for ptrB, etc.). The pointers passed in to the grouped GEMM arguments are then just offsets within this large buffer (e.g., the ptrB argument would simply point into the buffer just past the ptrA values). We have an example of doing this in the CUTLASS Python interface: cutlass/python/cutlass/emit/pytorch.py, lines 237 to 357 (at commit f679663).

In fact, if you're interested in using this to build a PyTorch CUDA extension that uses grouped GEMM, you may be interested in the Python interface's PyTorch emitter. See this example for details: https://github.com/NVIDIA/cutlass/blob/main/examples/python/02_pytorch_extension_grouped_gemm.ipynb

I hope this helps!
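To make the single-buffer idea concrete, here is a rough sketch in plain CUDA C++ (not taken from the linked Python code; the packing layout, types, and names are assumptions): all argument arrays are packed back to back into one host buffer, moved with a single cudaMemcpy, and each grouped GEMM argument is then just an offset into the device copy.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>
#include <cuda_runtime.h>

// Sketch of the "single packed buffer" trick for 'problem_count' problems with
// float operands. Device buffer layout:
//   [ ptr_A | ptr_B | ptr_C | lda | ldb | ldc ]
struct PackedGroupedArgs {
  float const **ptr_A;
  float const **ptr_B;
  float       **ptr_C;
  int64_t      *lda;
  int64_t      *ldb;
  int64_t      *ldc;
};

PackedGroupedArgs pack_and_copy_args(int problem_count,
                                     std::vector<float const *> const &A,
                                     std::vector<float const *> const &B,
                                     std::vector<float *> const &C,
                                     std::vector<int64_t> const &lda,
                                     std::vector<int64_t> const &ldb,
                                     std::vector<int64_t> const &ldc) {
  size_t ptr_bytes = sizeof(void *) * problem_count;
  size_t ld_bytes  = sizeof(int64_t) * problem_count;
  size_t total     = 3 * ptr_bytes + 3 * ld_bytes;

  // Pack everything contiguously on the host.
  std::vector<uint8_t> host(total);
  uint8_t *h = host.data();
  std::memcpy(h,                                A.data(),   ptr_bytes);
  std::memcpy(h + ptr_bytes,                    B.data(),   ptr_bytes);
  std::memcpy(h + 2 * ptr_bytes,                C.data(),   ptr_bytes);
  std::memcpy(h + 3 * ptr_bytes,                lda.data(), ld_bytes);
  std::memcpy(h + 3 * ptr_bytes + ld_bytes,     ldb.data(), ld_bytes);
  std::memcpy(h + 3 * ptr_bytes + 2 * ld_bytes, ldc.data(), ld_bytes);

  // One allocation, one copy, instead of six (or eight) separate ones.
  uint8_t *d = nullptr;
  cudaMalloc(reinterpret_cast<void **>(&d), total);
  cudaMemcpy(d, host.data(), total, cudaMemcpyHostToDevice);

  // The grouped GEMM arguments are just offsets into the device buffer.
  PackedGroupedArgs args;
  args.ptr_A = reinterpret_cast<float const **>(d);
  args.ptr_B = reinterpret_cast<float const **>(d + ptr_bytes);
  args.ptr_C = reinterpret_cast<float **>(d + 2 * ptr_bytes);
  args.lda   = reinterpret_cast<int64_t *>(d + 3 * ptr_bytes);
  args.ldb   = reinterpret_cast<int64_t *>(d + 3 * ptr_bytes + ld_bytes);
  args.ldc   = reinterpret_cast<int64_t *>(d + 3 * ptr_bytes + 2 * ld_bytes);
  return args;
}
```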
-
Thank you very much! I have checked the code and understood the single-buffer trick. My ultimate goal is to port examples/41 (grouped attention) to PyTorch, so I will probably still need to deal with some C++. For the first trick, did you mean something like this?

```python
# For the first call, memcpy ptr_A/B/C to the device and keep them alive there,
# e.g., by wrapping the device allocation as a Python object and returning it.
pointers = cutlass_module.foo(matA, matB, out=matC)

# Same data, same pointers, but another operation.
# In C++, construct the arguments from `pointers` directly and launch the kernel.
cutlass_module.bar(matA, matB, out=matC, pointers=pointers)
```
-
Forgot to answer this question: you can use plain CUDA mallocs and memcpys (including the async versions) if you'd prefer.

Regarding your code snippet above: are you suggesting allocating one region of device memory for each operand A in the group and reusing these allocations across invocations (e.g., the previous layer writes its output to the same buffer each time, which is then reused in the call to the grouped kernel)? I think that could work, provided your problem sizes won't change and you can make sure that the buffer isn't overwritten while the grouped GEMM is still using it.
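For completeness, a hedged sketch of what that reuse could look like on the C++ side (names are made up; this is not the CUTLASS example code): the packed argument buffer is uploaded once with an async copy on the kernel's stream, cached, and then reused for every subsequent launch as long as the problem sizes and base addresses stay the same.

```cpp
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical cache of device-resident grouped-GEMM metadata. Filled once,
// then reused across invocations while the problem sizes stay fixed.
struct GroupedArgsCache {
  void  *device_args = nullptr;  // packed ptr_A/ptr_B/ptr_C/ld arrays
  size_t bytes       = 0;
  bool   initialized = false;
};

// Upload the packed host-side argument buffer once, asynchronously on the same
// stream the grouped GEMM will run on, so the copy is ordered before the kernel
// without blocking the host. For true copy/compute overlap the host buffer
// should be pinned (cudaMallocHost / cudaHostAlloc).
void ensure_args_on_device(GroupedArgsCache &cache,
                           void const *host_packed_args, size_t bytes,
                           cudaStream_t stream) {
  if (!cache.initialized) {
    cudaMalloc(&cache.device_args, bytes);
    cudaMemcpyAsync(cache.device_args, host_packed_args, bytes,
                    cudaMemcpyHostToDevice, stream);
    cache.bytes = bytes;
    cache.initialized = true;
  }
  // Subsequent calls skip the copy entirely: the previous layer keeps writing
  // its outputs into the same base buffers, so the cached pointers stay valid.
}
```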
-
Hello,

I am studying examples/24_gemm_grouped and making it available for PyTorch. But I found that copying ptrA/B/C/D and lda/b/c/d from host to device takes much more time than the kernel launch itself. In detail, there are 8 memcpys in total in the example: https://github.com/NVIDIA/cutlass/blob/main/examples/24_gemm_grouped/gemm_grouped.cu#L654-L685. Based on my understanding, this is unavoidable even when the actual data is already on the GPU, because the kernel runs on the device, so the pointers to the data must themselves be accessible on the device. (Am I right?)

So I am wondering whether there are methods to reduce the time spent moving this metadata around, for example by making copy_from_host async and overlapping it with other work. But it seems only a synchronous memcpy is implemented for DeviceAllocation.

Thank you very much!
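For reference, one possible direction for the async variant asked about here, as a hedged sketch (not an existing CUTLASS utility): since cutlass::DeviceAllocation exposes its raw device pointer via get(), the blocking copy_from_host can be bypassed by issuing cudaMemcpyAsync on the kernel's stream, assuming the host-side arrays are kept in pinned memory for real overlap.

```cpp
#include <cuda_runtime.h>
#include <cutlass/util/device_memory.h>

// Sketch: replace the blocking DeviceAllocation::copy_from_host with an
// explicit cudaMemcpyAsync on the allocation's raw device pointer, so the
// eight small metadata copies are queued on the same stream as the kernel
// instead of synchronizing the host each time. For real overlap the host
// arrays should live in pinned memory (cudaMallocHost / cudaHostAlloc).
template <typename T>
void copy_from_host_async(cutlass::DeviceAllocation<T> &dst,
                          T const *host_src, size_t count,
                          cudaStream_t stream) {
  cudaMemcpyAsync(dst.get(), host_src, sizeof(T) * count,
                  cudaMemcpyHostToDevice, stream);
}

// Usage (hypothetical names, mirroring the example's ptr_A / lda arrays):
//   copy_from_host_async(ptr_A_device, ptr_A_host.data(), problem_count, stream);
//   copy_from_host_async(lda_device,   lda_host.data(),   problem_count, stream);
//   ... then launch the grouped GEMM on the same stream.
```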