-
Thoughts: One problem I see with affine maps is that they can't represent tensor padding the way our coordinate transforms can.

Ok, so, notes on why we even have those flags. For convolution, however, this is currently not the case. I'll give my examples for forward convolution, but these considerations are general. When we set up an implicit GEMM, we have an arbitrary choice to make: namely, which order to merge C, Y, and X in to form the K dimension of the GEMM. However, not all of those options have the same performance. We want reads from the underlying tensor to be as contiguous as we can get away with. That is, if the filter is KCYX, we should choose gemmK <- C * Y * X, and for KYXC, we should choose gemmK <- Y * X * C ... and all this is under the assumption that those layout strings are the actual layout. The other purpose of the convolution layout strings is so we could have one ...

Historically, we've been asked for convolution kernels and been given accurate layouts. We've had to do this tosa transpose folding thing because transposes were needed to encode NCHW convolutions in Tosa, and we needed to tell the rest of the pipeline what the actual layout was. Back when we did the whole transpose-folding machinery, there were thoughts (mainly @sjw36 and me) of doing away with this sort of signalling and trying to solve the problem generally. This didn't go anywhere because there wasn't any particularly urgent need for it, and because "given this transform chain, tell me what order to put C, Y, and X (or N, H, and W) in" was nonobvious.

The more general problem statement there, I think, is "tell me the [fastest] stride of each convolution or GEMM dimension", which looks kinda similar to the vectorizer and, in retrospect, is a feasible piece of code to write. If we had that sort of "memory structure discovery engine", we could make it so that we didn't need this Tosa-level transpose folding mechanism at all.
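For concreteness, here is a minimal sketch of that contiguity rule (plain Python; the helper names are assumptions for illustration, not rocMLIR code): pick the merge order for gemmK by sorting the K-forming dimensions by decreasing stride in the claimed layout.

```python
# Sketch only: choose the gemmK merge order that keeps reads contiguous,
# assuming the layout string really is the in-memory layout.

def row_major_strides(shape):
    strides, s = [], 1
    for extent in reversed(shape):
        strides.append(s)
        s *= extent
    return list(reversed(strides))

def gemm_k_order(layout, shape, k_dims=("C", "Y", "X")):
    strides = dict(zip(layout, row_major_strides(shape)))
    # Put the largest-stride dimension outermost in the merge, the smallest
    # innermost, so the merged gemmK index walks memory contiguously.
    return sorted(k_dims, key=lambda d: strides[d], reverse=True)

print(gemm_k_order("KCYX", [128, 64, 3, 3]))  # ['C', 'Y', 'X']
print(gemm_k_order("KYXC", [128, 3, 3, 64]))  # ['Y', 'X', 'C']
```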
-
... Ok, yeah, that wording leads me to a close-enough encoding for these sorts of strided setups. How about something like ... I don't want this on the type, necessarily - it's recoverable from the transform chain, but I think it makes a reasonable serialization key.
-
(That is, my proposal is "walk the transform chain to get the set of strides for each gemm/conv dimension".)

... Ok, to add a few more words: when you have that sort of split dimension, you need a set of strides rather than a single one. That is, suppose we have a convolution input in NCHWC : 2x4x2x2x4 layout; we'd have ... instead of any sort of ...
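Reading that 2x4x2x2x4 example with the proposal above, a small plain-Python illustration (dimension names assumed; the lowercase c just marks the split-off copy of the channel dimension): the logical C dimension ends up with the stride set {16, 1} instead of a single stride.

```python
# Sketch only: per-conv-dimension stride *sets* for an NCHWc 2x4x2x2x4 input,
# where the channel dimension C is split across two memory dimensions.

def row_major_strides(shape):
    strides, s = [], 1
    for extent in reversed(shape):
        strides.append(s)
        s *= extent
    return list(reversed(strides))

layout, shape = "NCHWc", [2, 4, 2, 2, 4]
strides = dict(zip(layout, row_major_strides(shape)))  # N=64, C=16, H=8, W=4, c=1
conv_dim_strides = {
    "N": {strides["N"]},
    "C": {strides["C"], strides["c"]},  # split dimension: two strides
    "H": {strides["H"]},
    "W": {strides["W"]},
}
print(conv_dim_strides)  # e.g. {'N': {64}, 'C': {16, 1}, 'H': {8}, 'W': {4}}
```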
-
Re: affine maps, I agree with the shortcoming of not being able to represent pads. Otherwise, @krzysz00, exactly my thoughts! I think we are going in the right direction. We can figure out the exact mechanics of how to serialize/deserialize to/from that representation. However, I would try to cover a few more cases to increase the longevity of the solution. So, to cite the e.g.:

%1 = migraphx.transpose(%arg1) {permutation = [0, 2, 1, 3]} : (tensor<1x12x384x64xf32>) -> tensor<1x384x12x64xf32>
%2 = migraphx.reshape(%1) {dims = [1, 384, 768]} : (tensor<1x384x12x64xf32>) -> tensor<1x384x768xf32>
%3 = migraphx.dot(%2, %0) : tensor<1x384x768xf32>, tensor<1x768x768xf32> -> tensor<1x384x768xf32>

So here, the gemm's k dim = 768 is collapsed from 12 and 64. The element strides for those original dims are 24576 and 1, respectively.
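As a quick sanity check of those numbers (plain Python, nothing rocMLIR-specific): follow tensor<1x12x384x64xf32> through the [0, 2, 1, 3] transpose and the collapse to 1x384x768, and read off the strides of the two pieces that form the gemm k dimension.

```python
# Sketch only: recover the element strides of the two pieces of k = 768.

def row_major_strides(shape):
    strides, s = [], 1
    for extent in reversed(shape):
        strides.append(s)
        s *= extent
    return list(reversed(strides))

strides = row_major_strides([1, 12, 384, 64])   # [294912, 24576, 64, 1]
perm = [0, 2, 1, 3]                             # 1x12x384x64 -> 1x384x12x64
t_strides = [strides[p] for p in perm]          # [294912, 64, 24576, 1]
# Reshaping to 1x384x768 collapses the trailing 12x64 into k = 768; its two
# constituent pieces keep their element strides:
print(t_strides[2], t_strides[3])               # 24576 1
```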
-
Problem definition
Our current mechanism of indicating the layout (for convs) or indicating whether the input tensors are transposed (for gemms) might not be sufficient (i.e. general enough) to faithfully recreate a kernel with the anchor op to be run in the tuning loop. I'd also note that we should not (prematurely) move away from generating a problem string (more like a hash) for a kernel (fused or otherwise), because most of the time fusion doesn't (yet) really affect the parameter space being tuned. I've seen this reduce the instances to be tuned drastically.
E.g. :
Here one can see that non-contiguous dimensions are being collapsed and presented as one of the dimensions of the tensor going into the gemm.
Potential solutions
All of the solutions presented here share the general tendency of avoiding folding transposes into flag-like attributes on the gemm/conv. Instead, the solutions focus on exposing the hierarchical striding used when accessing elements in a given dimension of the tensors coming into the gemm/conv, once they are in the rock (and friends) dialects.
Strided memrefs
This solution might handle transposes and slices; however, it is unable to express varying strides within a single dimension (post transpose-collapse).
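To make the limitation concrete with the 12x64 -> 768 collapse from the example above (a plain-Python check, not tied to any particular memref API): the element offsets along the collapsed dimension are not an arithmetic progression, so no single memref stride can describe them.

```python
# Sketch only: offsets along k = 768, built from a size-12 piece with stride
# 24576 and a size-64 piece with stride 1.
offsets = [outer * 24576 + inner for outer in range(12) for inner in range(64)]
deltas = {offsets[i + 1] - offsets[i] for i in range(len(offsets) - 1)}
print(deltas)  # {1, 24513}: two distinct step sizes, so one stride can't express it
```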
Affine maps
If we can come up with a way to serialize / convert-to-string the affine maps for each tensor, that should give all the information about how the kernel should be constructed for use in the tuning loop. However, since we would be using folded transforms (hence affine maps), our static optimizations might not work fully. On the other hand, this is a simpler representation that can represent many rock.transform chains with a single problem description.
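As a rough sketch of what such a serialization could look like (the map below is hand-derived for the 1x384x768 view in the example above, and the key format is purely illustrative, not an existing rocMLIR scheme):

```python
# Sketch only: use the folded affine map's textual form as part of a problem key.
folded_map = ("affine_map<(d0, d1, d2) -> "
              "(d0 * 294912 + d1 * 64 + (d2 floordiv 64) * 24576 + d2 mod 64)>")

def problem_key(op_name, shapes, maps):
    # Canonical, order-stable string; could be hashed if it grows too long.
    return "|".join([op_name, *("x".join(map(str, s)) for s in shapes), *maps])

print(problem_key("gemm", [[1, 384, 768], [1, 768, 768]], [folded_map]))
```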
rock.transform chains
This might be the most faithful way of running the tuning loop. However, I'm worried about permutations and combinations that more or less represent the same access pattern from the boundary tensors --> hence an explosion of tuning instances that the other approaches don't have.
Discussion
Combination of rock.transform chains & affine maps?
We can use the affine maps (obtained after folding the rock.transform chain) as a key, to avoid re-tuning problem descriptions whose rock.transform chains collapse down to the same access pattern. However, we can keep the rock.transform chain -- maybe as a string -- in the problem description.
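A minimal sketch of that combination (names and structure assumed, just to pin down the idea): key the tuning records by the folded affine maps so equivalent chains are tuned once, but carry the rock.transform chain string alongside for faithful kernel reconstruction.

```python
# Sketch only: affine-map key for deduplication, transform chain kept as payload.
tuning_db = {}

def record_problem(affine_map_key, transform_chain_str):
    # The first chain seen for an access pattern wins; later equivalent chains
    # reuse its tuning results instead of spawning new tuning instances.
    return tuning_db.setdefault(affine_map_key,
                                {"chain": transform_chain_str, "tuned_params": None})

record_problem("affine_map<(d0, d1, d2) -> (...)>", "<rock.transform chain as string>")
```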
(Feel free to suggest discussion topics)