[Transpiler] relax stensors' innermost dimension's alignment to reduce shared memory usage #131

Open · jiazhihao opened this issue on Oct 30, 2024 · 0 comments
Labels: CUDA Transpiler (Issues and features related to the CUDA transpiler of Mirage), enhancement (New feature or request)

@jiazhihao (Member) commented:

Currently, the transpiler requires 16-byte alignment for the innermost dimension of all stensors:

```cpp
void calc_tensor_strides(size_t strides[],
                         size_t &num_phy_elems,
                         int num_dims,
                         int const dims[],
                         int innermost_dim,
                         int datatype_size) {
  // An order of dimensions. We layout elements according to this order
  vector<int> dim_order = {innermost_dim};
  for (int i = num_dims - 1; i >= 0; --i) {
    if (i != innermost_dim) {
      dim_order.push_back(i);
    }
  }
  // Alignment in elements, so a padded extent is a multiple of 16 bytes
  size_t alignment = std::max(16 / datatype_size, 1);
  size_t cur_stride = 1;
  bool encountered_non1_dim = false;
  for (int dim_idx : dim_order) {
    int cur_dim = dims[dim_idx];
    strides[dim_idx] = cur_stride;
    if (cur_dim != 1) {
      if (!encountered_non1_dim) {
        // The first non-size-1 dimension in layout order is rounded up
        cur_stride *= round_to_multiple((size_t)cur_dim, alignment);
        encountered_non1_dim = true;
      } else {
        cur_stride *= cur_dim;
      }
    }
  }
  if (cur_stride == 1) {
    // There is only one element in the tensor, we need to pad it to 16B
    cur_stride = alignment;
  }
  num_phy_elems = cur_stride;
}
```
This padding may result in more shared memory usage than necessary.
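To make the cost concrete, here is a standalone sketch of the current padding rule (the fp16 64×3 shape is an illustrative assumption, not from the issue): with 2-byte elements the alignment is 16 / 2 = 8 elements, so an innermost extent of 3 is padded to 8 and well over half of the allocation is padding.

```cpp
#include <cstdio>

// Standalone illustration of the current padding rule (mirrors
// calc_tensor_strides above; not part of Mirage).
int main() {
  int datatype_size = 2;                 // e.g. fp16
  int dims[2] = {64, 3};                 // innermost_dim = 1
  size_t alignment = 16 / datatype_size; // 8 elements per 16 bytes
  // The innermost extent 3 is rounded up to 8
  size_t padded_innermost =
      (dims[1] + alignment - 1) / alignment * alignment;
  size_t padded_elems = dims[0] * padded_innermost; // 512 elements
  size_t minimal_elems = (size_t)dims[0] * dims[1]; // 192 elements
  std::printf("padded: %zu, minimal: %zu (%.1f%% wasted)\n",
              padded_elems, minimal_elems,
              100.0 * (padded_elems - minimal_elems) / padded_elems);
  return 0;
}
```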

We should relax this constraint and enforce this alignment only for operators involving cp.async, ldmatrix, and other instructions that require it.
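A minimal sketch of one possible relaxation, assuming the caller knows which alignment an stensor actually needs (the required_alignment_bytes parameter is hypothetical and not an agreed design; round_to_multiple is the same helper used above):

```cpp
// Hypothetical sketch: take the required alignment (in bytes) as a
// parameter instead of hard-coding 16. Operators lowered to cp.async /
// ldmatrix would pass 16; everything else could pass datatype_size,
// which avoids padding the innermost dimension at all.
void calc_tensor_strides_relaxed(size_t strides[],
                                 size_t &num_phy_elems,
                                 int num_dims,
                                 int const dims[],
                                 int innermost_dim,
                                 int datatype_size,
                                 int required_alignment_bytes) {
  vector<int> dim_order = {innermost_dim};
  for (int i = num_dims - 1; i >= 0; --i) {
    if (i != innermost_dim) {
      dim_order.push_back(i);
    }
  }
  size_t alignment =
      std::max(required_alignment_bytes / datatype_size, 1);
  size_t cur_stride = 1;
  bool encountered_non1_dim = false;
  for (int dim_idx : dim_order) {
    int cur_dim = dims[dim_idx];
    strides[dim_idx] = cur_stride;
    if (cur_dim != 1) {
      if (!encountered_non1_dim) {
        cur_stride *= round_to_multiple((size_t)cur_dim, alignment);
        encountered_non1_dim = true;
      } else {
        cur_stride *= cur_dim;
      }
    }
  }
  // Even a single-element tensor must satisfy the requested alignment
  num_phy_elems = std::max(cur_stride, alignment);
}
```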

@jiazhihao added the CUDA Transpiler and enhancement labels on Oct 30, 2024
@jiazhihao moved this to Todo in CUDA Transpiler on Oct 30, 2024