-
Thoughts: One problem I see with affine maps is that they can't represent tensor padding the way our coordinate transforms can.

Ok, so, notes on why we even have those flags. For convolution, however, this is currently not the case. I'll give my examples for forward convolution, but these considerations are general. When we set up an implicit GEMM, we have an arbitrary choice to make: namely, which order to merge C, Y, and X in to form the K dimension of the GEMM. However, not all of those options have the same performance. We want reads from the underlying tensor to be as contiguous as we can get away with. That is, if the filter is KCYX, we should choose gemmK <- C * Y * X, and for KYXC, we should choose gemmK <- Y * X * C ... and all this is under the assumption that those layout strings are the actual layout. The other purpose of the convolution layout strings is so we could have one ...

Historically, we've been asked for convolution kernels and been given accurate layouts. We've had to do this tosa transpose folding thing because transposes were needed to encode NCHW convolutions in Tosa, and we needed to tell the rest of the pipeline what the actual layout was. Back when we did the whole transpose-folding machinery, there were thoughts (mainly @sjw36 and me) of doing away with this sort of signalling and trying to solve the problem generally. This didn't go anywhere because there wasn't any particularly urgent need for it, and because "given this transform chain, tell me what order to put C, Y, and X (or N, H, and W) in" was nonobvious.

The more general problem statement there, I think, is "tell me the [fastest] stride of each convolution or GEMM dimension", which looks kinda similar to the vectorizer and, in retrospect, is a feasible piece of code to write. If we had that sort of "memory structure discovery engine", we could make it so that we didn't need this Tosa-level transpose folding mechanism at all.
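For concreteness, here is a minimal sketch of that contiguity rule (plain Python; the helper names are assumptions for illustration, not rocMLIR code): pick the merge order for gemmK by sorting the K-forming dimensions by decreasing stride in the claimed layout.

```python
# Sketch only: choose the gemmK merge order that keeps reads contiguous,
# assuming the layout string really is the in-memory layout.

def row_major_strides(shape):
    strides, s = [], 1
    for extent in reversed(shape):
        strides.append(s)
        s *= extent
    return list(reversed(strides))

def gemm_k_order(layout, shape, k_dims=("C", "Y", "X")):
    strides = dict(zip(layout, row_major_strides(shape)))
    # Put the largest-stride dimension outermost in the merge, the smallest
    # innermost, so the merged gemmK index walks memory contiguously.
    return sorted(k_dims, key=lambda d: strides[d], reverse=True)

print(gemm_k_order("KCYX", [128, 64, 3, 3]))  # ['C', 'Y', 'X']
print(gemm_k_order("KYXC", [128, 3, 3, 64]))  # ['Y', 'X', 'C']
```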
-
... Ok, yeah, that wording leads me to a close-enough encoding for these sorts of strided setups. How about something like ... I don't want this on the type, necessarily - it's recoverable from the transform chain, but I think it makes a reasonable serialization key.
-
(That is, my proposal is "walk the transform chain to get the set of strides for each gemm/conv dimension".)

... Ok, to add a few more words: when you have that sort of split dimension, you need a set of strides rather than a single one. That is, suppose we have a convolution input in NCHWC : 2x4x2x2x4 layout; we'd have ... instead of any sort of ...
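Reading that 2x4x2x2x4 example with the proposal above, a small plain-Python illustration (dimension names assumed; the lowercase c just marks the split-off copy of the channel dimension): the logical C dimension ends up with the stride set {16, 1} instead of a single stride.

```python
# Sketch only: per-conv-dimension stride *sets* for an NCHWc 2x4x2x2x4 input,
# where the channel dimension C is split across two memory dimensions.

def row_major_strides(shape):
    strides, s = [], 1
    for extent in reversed(shape):
        strides.append(s)
        s *= extent
    return list(reversed(strides))

layout, shape = "NCHWc", [2, 4, 2, 2, 4]
strides = dict(zip(layout, row_major_strides(shape)))  # N=64, C=16, H=8, W=4, c=1
conv_dim_strides = {
    "N": {strides["N"]},
    "C": {strides["C"], strides["c"]},  # split dimension: two strides
    "H": {strides["H"]},
    "W": {strides["W"]},
}
print(conv_dim_strides)  # e.g. {'N': {64}, 'C': {16, 1}, 'H': {8}, 'W': {4}}
```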
-
Re: affine maps, I agree with the shortcoming of not being able to represent pads. Otherwise, @krzysz00, exactly my thoughts! I think we are going in the right direction. We can figure out the exact mechanics of how to serialize/deserialize to/from that representation. However, I would try to cover a few more cases to increase the longevity of the solution. So, to cite the e.g.:

%1 = migraphx.transpose(%arg1) {permutation = [0, 2, 1, 3]} : (tensor<1x12x384x64xf32>) -> tensor<1x384x12x64xf32>
%2 = migraphx.reshape(%1) {dims = [1, 384, 768]} : (tensor<1x384x12x64xf32>) -> tensor<1x384x768xf32>
%3 = migraphx.dot(%2, %0) : tensor<1x384x768xf32>, tensor<1x768x768xf32> -> tensor<1x384x768xf32>

So here, the gemm's k dim = 768 is collapsed from 12 and 64. The element strides for those original dims are 24576 and 1, respectively.
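As a quick sanity check of those numbers (plain Python, nothing rocMLIR-specific): follow tensor<1x12x384x64xf32> through the [0, 2, 1, 3] transpose and the collapse to 1x384x768, and read off the strides of the two pieces that form the gemm k dimension.

```python
# Sketch only: recover the element strides of the two pieces of k = 768.

def row_major_strides(shape):
    strides, s = [], 1
    for extent in reversed(shape):
        strides.append(s)
        s *= extent
    return list(reversed(strides))

strides = row_major_strides([1, 12, 384, 64])   # [294912, 24576, 64, 1]
perm = [0, 2, 1, 3]                             # 1x12x384x64 -> 1x384x12x64
t_strides = [strides[p] for p in perm]          # [294912, 64, 24576, 1]
# Reshaping to 1x384x768 collapses the trailing 12x64 into k = 768; its two
# constituent pieces keep their element strides:
print(t_strides[2], t_strides[3])               # 24576 1
```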
-
Problem definition
Our current mechanism of indicating the layout (for convs) or indicating whether the input tensors are transposed (for gemms) might not be sufficient (i.e. general enough) to faithfully recreate a kernel with the anchor op to be run in the tuning loop. I'd also note that we should not (prematurely) move away from generating a problem string (more like a hash) for a kernel (fused or otherwise), because most of the time fusion doesn't (yet) really affect the parameter space being tuned. I've seen this reduce the instances to be tuned drastically.
E.g. :
Here one can see that non-contiguous dimensions are being collapsed and presented as one of the dimensions of the tensor going into the gemm.
Potential solutions
All of the solutions presented here share the general tendency of avoiding folding transposes into flag-like attributes on the gemm/conv. Instead, the solutions focus on exposing the hierarchical striding used when accessing elements in a given dimension of the tensors coming into the gemm/conv, once they are in the rock (and friends) dialects.
Strided memrefs
This solution might handle transposes and slices; however, it is unable to express varying strides within a single dimension (post transpose-collapse).
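To make the limitation concrete with the 12x64 -> 768 collapse from the example above (a plain-Python check, not tied to any particular memref API): the element offsets along the collapsed dimension are not an arithmetic progression, so no single memref stride can describe them.

```python
# Sketch only: offsets along k = 768, built from a size-12 piece with stride
# 24576 and a size-64 piece with stride 1.
offsets = [outer * 24576 + inner for outer in range(12) for inner in range(64)]
deltas = {offsets[i + 1] - offsets[i] for i in range(len(offsets) - 1)}
print(deltas)  # {1, 24513}: two distinct step sizes, so one stride can't express it
```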
Affine maps
If we can come up with a way to serialize / convert-to-string the affine maps for each tensor, that should give all the information about how the kernel should be constructed for use in the tuning loop. However, since we would be using folded transforms (hence affine maps), our static optimizations might not work fully. On the other hand, this is a simpler representation that can represent many rock.transform chains with a single problem description.
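As a rough sketch of what such a serialization could look like (the map below is hand-derived for the 1x384x768 view in the example above, and the key format is purely illustrative, not an existing rocMLIR scheme):

```python
# Sketch only: use the folded affine map's textual form as part of a problem key.
folded_map = ("affine_map<(d0, d1, d2) -> "
              "(d0 * 294912 + d1 * 64 + (d2 floordiv 64) * 24576 + d2 mod 64)>")

def problem_key(op_name, shapes, maps):
    # Canonical, order-stable string; could be hashed if it grows too long.
    return "|".join([op_name, *("x".join(map(str, s)) for s in shapes), *maps])

print(problem_key("gemm", [[1, 384, 768], [1, 768, 768]], [folded_map]))
```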
rock.transform chains
This might be the most faithful way of running the tuning loop. However, I'm worried about permutations and combinations that more or less represent the same access pattern from the boundary tensors --> hence an explosion of tuning instances that the other approaches don't have.
Discussion
Combination of rock.transform chains & affine maps?
We can use the affine maps (obtained after folding the rock.transform chain) as a key, to avoid re-tuning problem descriptions whose rock.transform chains collapse down to the same access pattern. However, we can keep the rock.transform chain -- maybe as a string -- in the problem description.
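A minimal sketch of that combination (names and structure assumed, just to pin down the idea): key the tuning records by the folded affine maps so equivalent chains are tuned once, but carry the rock.transform chain string alongside for faithful kernel reconstruction.

```python
# Sketch only: affine-map key for deduplication, transform chain kept as payload.
tuning_db = {}

def record_problem(affine_map_key, transform_chain_str):
    # The first chain seen for an access pattern wins; later equivalent chains
    # reuse its tuning results instead of spawning new tuning instances.
    return tuning_db.setdefault(affine_map_key,
                                {"chain": transform_chain_str, "tuned_params": None})

record_problem("affine_map<(d0, d1, d2) -> (...)>", "<rock.transform chain as string>")
```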
(Feel free to suggest discussion topics)