Skip to content

Latest commit

 

History

History
833 lines (603 loc) · 34 KB

02_layout_operations.md

File metadata and controls

833 lines (603 loc) · 34 KB

CuTe Layout Operations

CuTe provides an "algebra of Layouts." Layouts can be combined and manipulated to construct more complicated Layouts. This includes tiling and partitioning Layouts across other Layouts. In this section, we explain some of these core operations in detail.

How do I print CuTe objects on host or device?

CuTe comes with different ways to print CuTe objects. You can print human-readable text, or you can print LaTeX commands for generating a beautifully formatted and colored table describing the CuTe object. Both of these can be helpful for reasoning about or debugging layouts, copy atoms, or matrix multiply atoms (don't worry, we'll explain all of these things in this tutorial).

CuTe's print functions work on either host or device. Note that on device, printing is expensive. Even just leaving print code in place on device, even if it is never called (e.g., printing in an if branch that is not taken at run time), may generate slower code. Thus, be sure to remove code that prints on device after debugging.

The following code examples assume that you have a using namespace cute; statement in scope.

Printing human-readable text

The cute::print function has overloads for almost all CuTe types, including Pointers, Layout, Shape, Stride, and Tensors. When in doubt, try calling print on it. You might also only want to print on thread 0 of each thread block, or block 0 of the grid. The thread0() function returns true only for global thread 0 of the kernel. A typical idiom for printing CuTe objects to print only on thread 0 of block 0.

if (thread0()) {
  print(some_cute_object);
}

Some algorithms do different things on different threads or blocks, so you might sometimes need to print on threads or blocks other than zero. The header file cute/util/debug.hpp, among other utilities, includes the function bool thread(int tid, int bid) that returns true if running on thread tid and block bid.

Some CuTe types have special printing functions that use a different output format. For example, print_layout can display a rank-2 layout in a table (using plain text formatting). It has an overload taking a rank-2 matrix layout and a thread layout, that displays a table with the mapping between threads and values.

Printing LaTeX output

The cute::print_latex function works like cute::print, but prints LaTeX commands that you can use to generate a nicely formatted and colored table.

Fundamental types

Layout and its components

This directory includes an overview of CuTe's fundamental types for describing layouts.

Tuple

CuTe starts with a Tuple, which is a finite ordered list of zero or more elements. In C++, we identify a Tuple with the cute::tuple class. cute::tuple behaves like std::tuple, but it works on device or host, and it imposes restrictions on its template arguments for performance and simplicity.

IntTuple

CuTe then defines an IntTuple as either an integer, or a Tuple of IntTuple. This recursive definition lets us build arbitrarily nested layouts. In C++, we identify an IntTuple with IntTuple, which is just an alias of cute::tuple. Any of the following are thus valid template arguments of IntTuple.

  1. "Run-time integers" (or "static integers") are just ordinary integral types like int or size_t.

  2. "Compile-time integers" include std::integral_constant or subclasses of it that CuTe defines, such as Int<Value> (see below). These types all have in common that the value is encoded in the type itself (as a public static constexpr value member). CuTe defines aliases _1, _2, _3 etc. to the types Int<1>, Int<2>, Int<3> etc.

  3. IntTuple with any valid template arguments.

CuTe reuses IntTuple for many different things, including Shape, Stride, Step, and Coord (see include/cute/layout.hpp). In C++, Shape, Stride, Step, and Coord are all aliases for IntTuple.

Layout

A Layout is a tuple of (Shape, Stride). Semantically, it implements a mapping from a "logical" Shape-shaped (multidimensional) index, to a "physical" 1-D index into an array. Here is an example of a 2 x 3 array with static strides (3, 1).

Layout layout = make_layout(make_shape (_2{}, _3{}),
                            make_stride(_3{}, _1{}));
print_layout(layout);
for (int i = 0; i < size(layout); ++i) {
  print(layout(i));
  print(", ");
}
print("\n");
print(layout(1, 1));
print("\n");

This code produces the following text output.

(_2,_3):(_3,_1)
      0   1   2
    +---+---+---+
 0  | 0 | 1 | 2 |
    +---+---+---+
 1  | 3 | 4 | 5 |
    +---+---+---+
0, 3, 1, 4, 2, 5,
4

print(layout(1, 1)) prints the mapping of the logical 2-D coordinate (1,1) to the 1-D index, which is 4. You can see that from the table, which shows the left logical index as the "row," and the right logical index as the "column."

Underscore (_)

An Underscore is a special type used for array slices. The underscore punctuation _ is a constant instance of Underscore. It acts like : (the colon punctuation) in Python or Fortran array slices. See include/cute/underscore.hpp.

Tile

"A Tile is not a Layout, it's a tuple of Layouts or Tiles or Underscores." See include/cute/tile.hpp.

The algebraic layout operations discussed below are defined on Layouts, but Tile allows these operations to recurse and to be applied to sublayouts or particular modes of a given Layout. These are referred to as by-mode operations.

See the section on "Logical Divide" to see an example of using Tile to extract portions of a row-mode and portions of a column-mode independently.

Layout definitions and operations

Layouts are functions from integers (logical 1-D coordinate) to integers (1-D index)

The for loop in the above print example shows how CuTe identifies 1-D coordinates with a column-major layout of logical 2-D coordinates. Iterating from i = 0 to size(layout) (which is 6), and indexing into our layout with the single integer coordinate i, traverses the layout in column-major fashion, even though this is a row-major layout. You can see this from the output of the for loop (0, 3, 1, 4, 2, 5). CuTe calls this index i a "1-D coordinate," versus the "natural coordinate," which would be the logical 2-D coordinate.

If you're familiar with the C++23 feature mdspan, this is an important difference between mdspan layout mappings and CuTe Layouts. mdspan layout mappings are one way: they always take a multidimensional logical coordinate, and they return an integer offset. Depending on the strides, the offset may skip over elements of the physical 1-D array. Thus, mdspan's offset does NOT mean the same thing as the 1-D logical coordinate i in the for loop above. You can iterate correctly over any CuTe Layout by using the 1-D logical coordinate. mdspan doesn't have an idea of a 1-D logical coordinate.

Rank, depth, size, cosize

Rank: the tuple size of the layout's shape.

Depth: the depth of the layout's shape. A single integer has depth 0. A tuple has depth 1 + the max depth of its components.

Size: Size of the shape; size of the domain of the function. This is the product of all extents in the layout's shape.

Cosize: Size of the function's codomain (not necessarily the range); for a layout A, A(size(A) - 1) + 1. (Here, we use size(A) - 1 as a 1-D logical coordinate input.)

Layout compatibility

We say that layouts A and B are compatible if their shapes are compatible. Shape A is compatible with shape B if any natural coordinate of A is also a valid coordinate for B.

Flatten

The flatten operation "un-nests" a potentially nested Layout. For example,

Layout layout = Layout<Shape <Shape <_4, _3>, _1>,
                     Stride<Stride<_3, _1>, _0>>{};
Layout flat_layout = flatten(layout);

results in flat_layout having the following type

Layout<Shape<_4, _3, _1>, Stride<_3, _1, _0>>

and

Layout layout = Layout<Shape <_4, Shape <_4,  _2>>,
                     Stride<_4, Stride<_1, _16>>>{};
Layout flat_layout = flatten(layout);

results in flat_layout having the following type

Layout<Shape<_4, _4, _2>, Stride<_4, _1, _16>>

Hierarchical Layouts and flattening let us reinterpret tensors in place as matrices, matrices as vectors, vectors as matrices, etc. This lets us implement arbitrary tensor contractions as batched matrix multiply, by combining the contraction modes into a single mode, and combining the A, B, C, and "batch" modes as needed to reach the desired form.

Coalesce

The coalesce operation first flattens the layout, then combines all the modes that are possible to combine, starting with mode 0 (the leftmost mode) and moving right. If all the modes can be combined, then this results in a 1-D layout expressing what array elements the original layout accesses.

For example,

layout: (_2,(_1,_6)):(_1,(_6,_2))
coalesce(layout): _12:_1

What does it mean to "combine" modes? In the above example, the flattened layout is (2, 1, 6) : (1, 6, 2).

  1. If we look at the leftmost two modes, this is just a vector of length 2 and stride 1. The middle mode has extent 1, so the corresponding stride 6 would not be observed anyway. This leaves us with (2, 6) : (1, 2).

  2. The intermediate result (2, 6) : (1, 2) is just a 2 x 6 column-major matrix, which can be coalesced into a vector of length 12 and stride 1.

More formally, "combining all the modes" means a left fold, where the binary operation that combines two modes has three cases.

  1. If the leftmost layout is s1:d1, and the next layout is 1:d0, then combine into s1:d1. This generalizes Step 1 above. If a mode has extent 1, we can't observe its stride, so we can skip the mode.

  2. If the leftmost layout is 1:d1, and the next layout is s0:d0, then combine into s0:d0. Again, if a mode has extent 1, we can't observe its stride, so we can skip the mode.

  3. If the leftmost layout is s1:d1, and the next layout is s0 : s1*d1, then combine into s0 * s1 : d1. This generalizes Step 2 above. One can call this "noticing a column-major layout sequence."

That's it! For example, the result of coalescing the row-major layout (2, 2) : (2, 1) is (2, 2) : (2, 1), the same layout, because none of the above three cases applies.

Complement

Definition

The complement B of a layout A with respect to an integer M satisfies the following properties.

  1. $A$ and $B$ are disjoint: $A(x) \neq B(x)$ for all $x \neq 0$ in the domain of $A$.

  2. B is ordered: $B(x-1) \lt B(x)$ for all $x$ in ${0, 1, \dots, size(B) - 1}$.

  3. B is bounded by M: $size(B) \geq M / size(A)$, and $cosize(B) \leq floor(M / cosize(A)) * cosize(A)$.

Regarding disjointness: we need to specify $x \neq 0$ because CuTe layouts are linear. That is, if the domain is nonempty, the range always contains zero.

Regarding the ordered property: CuTe layouts are hierarchically strided, so this implies that if size(B) is nonzero, then the strides of B are all positive.

Examples

complement(4:1, 24) is 6:4.

  1. The result is disjoint of 4:1, so it must have a stride of at least 4 (since it includes 0, but must skip over 1, 2, 3).

  2. The size of the result is $\geq 24 / 4 = 6$. (This plus Step (1) means that the cosize is at least 24.)

  3. The cosize of the result is $\leq (24 / 4) * 4 = 24$. (This plus Step (2) means that the cosize is exactly 24.)

  4. The only (1-D) layout with size 6 and cosize 24 is 6:4.

complement(6:4, 24) is 4:1.

  1. 4:1 is disjoint of 6:4, but so is s:d for any s > 0 and d > 20.

  2. The size of the result is $\geq 24 / 6 = 4$.

  3. The cosize of the result is $\leq (24 / 21) * 21 = 21$.

  4. The stride cannot be greater than 20 (else (2) would contradict (3)), so it must be less than 4.

  5. This leaves 4:1 by elimination.

Composition

Layouts are functions, so composition of layouts is just composition of functions. The composition $A \circ B$ means "apply the layout B first, then treat the result as a 1-D logical coordinate input to the layout A, and apply A to it." Very often, this composition can be represented as another Layout.

Rules for computing composition

Both humans and CuTe compute composition using the following rules.

  1. $A \circ B$ has a shape that is compatible with B. In function composition, the rightmost function defines the domain. For Layouts this means that any valid coordinate for $B$ can also be used as a coordinate for $A \circ B$.

  2. Concatenation: A layout can be expressed as the concatenation of its sublayouts. We denote concatenation with parentheses: $B = (B_0,B_1,...)$. The CuTe function make_layout, when given zero or more Layouts, concatenates them.

  3. Composition is (left-)distributive with concatenation: $A \circ B = A \circ (B_0, B_1, ...) = (A \circ B_0, A \circ B_1, ...)$.

  4. "Base case": For layouts $A = a : b$ and $B = c : d$ with integral shape and stride, $A \circ B = R = c : (b * d)$.

  5. By-mode composition: Let $\langle B, C \rangle$ (angle brackets, not parentheses) denote a tuple of two layouts B and C, not their concatenation. Let $A = (A_0, A_1)$. Then, $A \circ \langle B, C \rangle = (A_0, A_1) \circ \langle B, C \rangle = (A_0 \circ B, A_1 \circ C)$. This allows the application of composition independently to sublayouts of $A$.

Examples: Reshape a vector into a matrix

This section gives two composition examples. Both start with a vector with layout $20:2$ (that is, the vector has 20 elements, and the stride between each is 2). They compose this vector with a 4 x 5 matrix layout. This effectively "reshapes" the vector in place into a matrix.

Example 1

$20:2 \circ (4,5) : (1,4)$.

This describes interpreting the vector $20:2$ as a 4 x 5 column-major matrix.

The resulting layout has shape $(4,5)$, because in function composition, the rightmost function defines the domain. What are the strides?

  1. A layout can be expressed as the concatenation of its sublayouts, so $(4,5) : (1,4)$ is $(4:1, 5:4)$.

  2. Composition is distributive, so $20:2 \circ (4:1, 5:4)$ is $(20:2 \circ 4:1, 20:2 \circ 5:4)$.

  3. $20:2 \circ 4:1$ has shape 4 (rightmost function defines the domain) and stride $2 = 2 \cdot 1$.

  4. $20:2 \circ 5:4$ has shape 5 and stride $8 = 2 \cdot 4$.

  5. Result: (4:2, 5:8), which by concatenation is (4,5) : (2,8).

Example 2

$20:2 \circ (4,5) : (5,1)$.

This describes interpreting the vector 20:2 as a 4 x 5 row-major matrix.

The resulting layout has shape $(4,5)$, just as before. What are the strides?

  1. By deconcatenation, $(4,5) : (5,1)$ is $(4:5, 5:1)$.

  2. Composition is distributive, so $20:2 \circ (4:5, 5:1)$ is $(20:2 \circ 4:5, 20:2 \circ 5:1)$.

  3. $20:2 \circ 4:5$ has shape $4$ and stride $10 = 2 \cdot 5$.

  4. $20:2 \circ 5:1$ has shape $5$ and stride $2 = 2 \cdot 1$.

  5. Result: (4:10, 5:2), which by concatenation is (4,5) : (10,2).

Example: Reshape a matrix into another matrix

The composition $((20,2):(16,4) \circ (4,5):(1,4))$ expresses reshaping the matrix with layout (20,2):(16:4), into a 4 x 5 matrix in a column-major way.

  1. By deconcatenation, $(4,5) : (1,4)$ is $(4:1, 5:4)$.

  2. Composition is distributive, so $(20,2):(16,4) \circ (4:1, 5:4)$ is $((20,2):(16,4) \circ 4:1, (20,2):(16,4) \circ 5:4)$.

  3. $(20,2):(16,4) \circ 4:1$ has shape $4$ and stride $16$. (4:1 expresses picking the first 4 consecutive elements of (20,2):(16,4). These elements run down the 0th column (leftmost mode) of the layout, whose stride is 16.)

  4. $(20,2):(16,4) \circ 5:4$ has shape $5$ and stride $64 = 4 \cdot 16$.

  5. Result: $(4:16, 5:64)$, which by concatenation is $(4,5) : (16,64)$.

We get exactly this result with CuTe if we use compile-time shapes and strides. The following C++ code prints (_4,_5):(_16,_64).

using namespace cute;
auto a = make_layout(make_shape(Int<20>{}, _2{}), make_stride(_16{}, _4{}));
auto b = make_layout(make_shape(     _4{}, _5{}), make_stride( _1{}, _4{}));
auto c = composition(a, b);
printf("\n");
print(c);

Results may look different (but are the same mathematically) if we use run-time integers. The following C++ code prints ((4,1),(5,1)):((16,4),(64,4)).

using namespace cute;
auto a = make_layout(make_shape(20, 2), make_stride(16, 4));
auto b = make_layout(make_shape( 4, 5), make_stride( 1, 4));
auto c = composition(a, b);
printf("\n");
print(c);

((4,1),(5,1)):((16,4),(64,4)) is effectively the same layout as (4,5) : (16,64), because the 1s in the shape don't affect the layout (as a mathematical function from one integer to one integer). CuTe chooses not to simplify layout computations with run-time values in them as much as it could, because simplifications involving run-time values have a run-time cost.

Product

CuTe includes four different kinds of layout products.

  1. logical_product

  2. blocked_product

  3. raked_product

  4. tiled_product

logical_product(A, B) results in a layout where each element of layout B has been replaced by a "copy" of layout A. The other three products offer variations of this idea.

Example: Tiled matrix

Suppose that I want to make a matrix consisting of 3 x 4 tiles in a row-major arrangement, where each tile is a 2 x 2 column-major matrix.

The Layout of each tile (tile) has Shape (2,2) and Stride (1,2).

The Layout of the "matrix of tiles" (matrix_of_tiles) has Shape (3,4) and Stride (4,1).

Blocked product: the intuitive tiling

If I were to deduce by hand what the layout of the tiled matrix should be, it would look like this.

(0,0) (1,0) (0,1) (1,1) (0,2) (1,2) (0,3) (1,3)
(0,0) 0 2 4 6 8 10 12 14
(1,0) 1 3 5 7 9 11 13 15
(0,1) 16 18 20 22 24 26 28 30
(1,1) 17 19 21 23 25 27 29 31
(0,2) 32 34 36 38 40 42 44 46
(1,2) 33 35 37 39 41 43 45 47

The row and column labels use the equivalence of 1-D logical coordinates and 2-D column-major coordinates. The left index in each pair is the row resp. column coordinate of the tile, while the right index in each pair is the row resp. column coordinate of the matrix-of-tiles. The resulting layout has Shape ((2, 3), (2, 4)), and Stride ((1, 16), (2, 4)), and the second mode can be coalesced. The Shape ((2, 3), (2, 4)) is hierarchical, but it is still rank-2 and can be drawn in 2D as above. Note how the row mode of the tile remains part of the row mode of the product, and the column mode of the tile remains a column mode of the product.

The above layout is what blocked_product(tile, matrix_of_tiles) produces. A critical use case for blocked product is "tiling" an "atom" (some tile that relates to a hardware feature) over a matrix.

Layout tile            = Layout<Shape <_2,_2>,
                                Stride<_1,_2>>{};
Layout matrix_of_tiles = Layout<Shape <_3,_4>,
                                Stride<_4,_1>>{};

print_layout(blocked_product(tile, matrix_of_tiles));
Logical product

The logical product logical_product(tile, matrix_of_tiles) results in Shape ((2, 2), (3, 4)) and Stride ((1, 2), (16, 4)).

(0,0) (1,0) (2,0) (0,1) (1,1) (2,1) (0,2) (1,2) (2,2) (0,3) (1,3) (2,3)
(0,0) 0 16 32 4 20 36 8 24 40 12 28 44
(1,0) 1 17 33 5 21 37 9 25 41 13 29 45
(0,1) 2 18 34 6 22 38 10 26 42 14 30 46
(1,1) 3 19 35 7 23 39 11 27 43 15 31 47

Note how the tile appears in the leftmost column and is reproduced in each column in the same order as the matrix-of-tiles. That is, the tile can be indexed through the first mode of the result and the matrix-of-tiles can be indexed through the second mode.

Layout tile            = Layout<Shape <_2,_2>,
                                Stride<_1,_2>>{};
Layout matrix_of_tiles = Layout<Shape <_3,_4>,
                                Stride<_4,_1>>{};

print_layout(logical_product(tile, matrix_of_tiles));
Raked product

The raked product raked_product(tile, matrix_of_tiles) results in Shape ((3, 2), (4, 2)) and Stride ((16, 1), (4, 2)).

(0,0) (1,0) (2,0) (3,0) (0,1) (1,1) (2,1) (3,1)
(0,0) 0 4 8 12 2 6 10 14
(1,0) 16 20 24 28 18 22 26 30
(2,0) 32 36 40 44 34 38 42 46
(0,1) 1 5 9 13 3 7 11 15
(1,1) 17 21 25 29 19 23 27 31
(2,1) 33 37 41 45 35 39 43 47

The tile is now interleaved or "raked" with the other 3x4 matrix-of-tiles instead of appearing as blocks. Other references call this a "cyclic distribution."

This might look familiar if you have ever used ScaLAPACK. It expresses a 2-D block cyclic distribution of a 6 x 8 matrix over 4 processes in a 2 x 2 "process grid." See "The Two-dimensional Block-Cyclic Distribution" and "Local Storage Scheme and Block-Cyclic Mapping" in the ScaLAPACK Users' Guide.

In general, logical_product and these variations can produce any interleaving, including blocked, cyclic, by-mode blocked/cyclic, and intermediate interleavings that don't have common names.

Layout tile            = Layout<Shape <_2,_2>,
                                Stride<_1,_2>>{};
Layout matrix_of_tiles = Layout<Shape <_3,_4>,
                                Stride<_4,_1>>{};

print_layout(raked_product(tile, matrix_of_tiles));

Division

The previous section covered layout products, that reproduce one layout over another. This section covers layout division. Functions that divide a layout into components are useful as a basis for tiling and partitioning layouts.

For example, consider folding a vector into a matrix. We could imagine an operation, called logical_divide,

Layout vec = Layout<_16,_3>{};           //  16 : 3
Layout col = Layout< _4,_1>{};           //   4 : 1
Layout mat = logical_divide(vec, col);   // (4,4) : (3,12)

that "takes" the first 4 elements of the vector into the first mode and leaves the "rest" in the second mode. This is a column-major matrix view of the data in vec. What if we want a row-major matrix view?

Layout vec = Layout<_16,_3>{};           //  16 : 3
Layout col = Layout< _4,_4>{};           //   4 : 4
Layout mat = logical_divide(vec, col);   // (4,4) : (12,3)

Now, every fourth element of the vector is in the first mode and the "rest" are in the second mode. Multidimensional, hierarchical indices let us extend this operation to any layout that "divides" the vector.

Layout vec = Layout<_16,_3>{};           //  16 : 3
Layout col = Layout< _4,_2>{};           //   4 : 2
Layout mat = logical_divide(vec, col);   // (4,(2,2)) : (6,(3,24))
Layout vec = Layout<_16,_3>{};           //  16 : 3
Layout col = Layout<Shape <_2,_2>,
                    Stride<_4,_1>>{};    // (2,2) : (4,1)
Layout mat = logical_divide(vec, col);   // ((2,2),(2,2)) : ((12,3),(6,24))

All of the above examples produce a 4x4 matrix that can be indexed and treated like a normal 4x4 matrix, but each has a different underlying layout. Thus, our algorithms can be written using logical coordinates, without needing to address the detailed indexing that each layout requires.

CuTe includes 3 different kinds of layout division operations.

  1. logical_divide

  2. zipped_divide

  3. tiled_divide

We will summarize these in the sections that follow.

Logical divide

Example worked in detail

This section will work the following logical divide example in detail.

Layout a = make_layout(24, 2);
Layout b = make_layout( 4, 2);
Layout c = logical_divide(a, b);

Logical divide produces a rank-2 Layout, where mode 0 (the leftmost mode) corresponds to the divisor b, and mode 1 (the rightmost mode) corresponds to the "remainder." Intuitively, the remainder of 24 divided by 4 is 6, so we know that mode 1 has 6 elements. We just don't know its shape yet.

CuTe defines logical_divide(a, b) as composition(a, make_layout(b, complement(b, size(a)))). Here, size(a) is 24. What is complement(b, 24)? Intuitively, it means "the remainder," what's left over after applying b to 0, 1, 2, $\dots$, 23.

The layout 4:2 means "take 4 elements at even-numbered indices." The following table overlays the range of 4:2 atop the complement's codomain 0, 1, $\dots$, 23.

Range of 4:2 0 2 4 6
Codomain 0 1 2 3 4 5 6 7 8 9 $\dots$ 23

Layouts are linear, so their range must include zero. The complement of 4:2 with respect to 24 is thus a layout whose range

  • includes zero;

  • does not include any other elements of the range of 4:2 (i.e., satisfies the disjoint property; see above); and

  • includes as much of 0, 1, $\dots$, 23 as possible (so that it forms the "remainder" of 4:2 with respect to 24).

Intuitively, the range of the complement must look like this: 0, 1, 8, 9, 16, 17. The resulting layout is ordered. It has size 6 and cosize 18, so it satisfies the bounded property (see above). This is the layout (2, 3) : (1, 8). (Going from this intuitive sense of the complement to knowing how to compute it directly is out of scope for this part of the tutorial.)

The following table shows 4:2 with its complement (2, 3) : (1, 8).

Range of 4:2 0 2 4 6
Codomain 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 $\dots$ 23
--- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---
Range of complement 0 1 8 9 16 17

Now we know that logical_divide(24:2, 4:2) is composition(24:2, make_layout(4:2, (2,3):(1,8))). The composition of two layouts has the shape of the second (rightmost) layout, so the resulting shape is (4, (2, 3)). We see that the leftmost mode 4 corresponds to the divisor 4:2, and the rightmost mode (2, 3) describes what's "left over" from the original shape 24.

What are the strides? We can start from the leftmost mode. 4:2 takes every other element (the even-numbered elements) of 24:2. That's a stride-2 thing, striding over a stride-2 thing. The resulting stride is 4. Similarly, the stride 2 of 24:2 doubles the two strides of the rightmost mode. The resulting layout is (4, (2, 3)) : (4, (2, 16)).

Tiling example

Suppose I have the 6 x 8 matrix from the Raked Product section and want to "collect" the tile, turning the Raked Product into the Blocked Product.

To do this, we would like to gather two elements from the column and leave the rest, then gather two elements from the row and leave the rest. Thus, we want to apply logical_divide independently to the rows and cols in order to retrieve the appropriate elements.

In code, we copy the Layout from the result of the Raked Product section, then specify the elements in the rows and cols we would like to gather.

Layout raked_prod = Layout<Shape <Shape < _3,_2>,Shape <_4,_2>>,
                           Stride<Stride<_16,_1>,Stride<_4,_2>>>{};
Tile   subtile    = make_tile(Layout<_2,_3>{},    // Gather elements 2 : 3 from mode 0
                              Layout<_2,_4>{});   // Gather elements 2 : 4 from mode 1

print_layout(logical_divide(raked_prod, subtile));

Indeed, this does produce the result from the Blocked Product section.

(0,0) (1,0) (0,1) (1,1) (0,2) (1,2) (0,3) (1,3)
(0,0) 0 2 4 6 8 10 12 14
(1,0) 1 3 5 7 9 11 13 15
(0,1) 16 18 20 22 24 26 28 30
(1,1) 17 19 21 23 25 27 29 31
(0,2) 32 34 36 38 40 42 44 46
(1,2) 33 35 37 39 41 43 45 47

Of course, any other rearrangement of the rows and cols is also valid.

Zipped divide

The zipped_divide function applies logical_divide, and then gathers the "subtiles" into a single mode and the "rest" into a single mode.

For example, if we apply zipped_divide instead of logical_divide in the example above,

Layout raked_prod = Layout<Shape <Shape < _3,_2>,Shape <_4,_2>>,
                           Stride<Stride<_16,_1>,Stride<_4,_2>>>{};
Tile   subtile    = make_tile(Layout<_2,_3>{},    // Gather elements 2 : 3 from mode 0
                              Layout<_2,_4>{});   // Gather elements 2 : 4 from mode 1

print_layout(zipped_divide(raked_prod, subtile));

then we get the result

(0,0) (1,0) (2,0) (0,1) (1,1) (2,1) (0,2) (1,2) (2,2) (0,3) (1,3) (2,3)
(0,0) 0 16 32 4 20 36 8 24 40 12 28 44
(1,0) 1 17 33 5 21 37 9 25 41 13 29 45
(0,1) 2 18 34 6 22 38 10 26 42 14 30 46
(1,1) 3 19 35 7 23 39 11 27 43 15 31 47

Note that this is the same layout as the result in the Logical Product section. That is, the first mode is our original tile (and can be interpreted as a 2x2 matrix itself) and the second mode is its logical layout within the raked layout.

More Examples of Divide

For brevity, shapes can be used with logical_divide and tiled_divide to quickly split and tile modes of a tensor. For example, this C++ code

Layout layout     = Layout<Shape <_12, _32,_6>,
                           Stride< _1,_128,_0>>{};
Shape  tile_shape = make_shape(_4{},_8{});
Layout logical_divided_tile = logical_divide(layout, tile_shape);
Layout zipped_divided_tile  =  zipped_divide(layout, tile_shape);

print("layout               :  "); print(layout);               print("\n");
print("tile_shape           :  "); print(tile_shape);           print("\n");
print("logical_divided_tile :  "); print(logical_divided_tile); print("\n");
print("zipped_divided_tile  :  "); print(zipped_divided_tile);  print("\n\n");

produces the following output when we vary layout.

full_layout          :  (_12,_32,_6):(_1,_128,_0)
tile_shape           :  (_4,_8)
logical_divided_tile :  ((_4,_3),(_8,_4),_6):((_1,_4),(_128,_1024),_0)
zipped_divided_tile  :  ((_4,_8),(_3,_4,_6)):((_1,_128),(_4,_1024,_0))

full_layout          :  (_12,(_4,_8),_6):(_1,(_32,_512),_0)
tile_shape           :  (_4,_8)
logical_divided_tile :  ((_4,_3),((_4,_2),_4),_6):((_1,_4),((_32,_512),_1024),_0)
zipped_divided_tile  :  ((_4,(_4,_2)),(_3,_4,_6)):((_1,(_32,_512)),(_4,_1024,_0))

This code

Layout layout = make_layout(Shape<_8,_8>{},
                            Stride<_8,_1>{});
Layout tile = make_tile(make_layout(Shape<_4>{}),
                        make_layout(Shape<_2>{}));
print("layout: ");
print_layout(layout);
print("\n");
print("tile: ");
print(tile);
print("\n");
print("logical_divide: ");
print_layout(logical_divide(layout, tile));
print("zipped_divide: ");
print_layout(zipped_divide(layout, tile));

results in the following layouts.

logical_divide-and-zipped_divide

This code

Layout layout = make_layout(Shape<_8,_8>{},
                            Stride<_8,_1>{});
Layout tile = make_tile(make_layout(Shape<_2>{}),
                        make_layout(Shape<_4>{}));
print("layout: ");
print_layout(layout);
print("\n");
print("tile: ");
print(tile);
print("\n");
print("logical_divide: ");
print_layout(logical_divide(layout, tile));
print("zipped_divide: ");
print_layout(zipped_divide(layout, tile));

results in the following layouts.

logical_divide-and-zipped_divide-2

Tiled divide

The tiled_divide function works like zipped_divide, except that it unpacks the second mode. This is useful when you have a Tile that describes all of the elements for a particular operation, for example, and want to gather those together but retain the logical shape of those tiles within the original layout. That is,

Layout Shape : (M, N, L, ...)
Tile Shape   : <M', N'>
Tiled Result : ((M', N'), m, n, L, ...)

where m is M / M' and n is N / N'. We can consider m as the "number of Tiles in M" and n as the "number of Tiles in N". This style of operation is common when applying MMA Atoms and Copy Atoms.