GH-37988: [Format] Add VariableShapeTensor canonical extension type definition #37992

Closed
106 changes: 106 additions & 0 deletions docs/source/format/CanonicalExtensions.rst
@@ -148,6 +148,112 @@ Fixed shape tensor
by this specification. Instead, this extension type lets one use fixed shape tensors
as elements in a field of a RecordBatch or a Table.

.. _variable_shape_tensor_extension:

Variable shape tensor
=====================

* Extension name: ``arrow.variable_shape_tensor``.

* The storage type of the extension is ``StructArray``, where the struct
is composed of **data** and **shape** fields describing a single
tensor per row:

* **data** is a ``List`` holding tensor elements of a single tensor.
Data type of the list elements is uniform across the entire column.
* **shape** is a ``FixedSizeList<int32>[ndim]`` of the tensor shape where
the size of the list ``ndim`` is equal to the number of dimensions of the
tensor.
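As a rough illustration of the storage layout described above, the following pure-Python sketch flattens a nested list into one struct row with ``data`` and ``shape`` entries. The helper name ``to_storage_row`` is hypothetical and not part of the specification or of any Arrow library. It assumes rectangular nested lists and row-major flattening.

```python
from math import prod


def to_storage_row(nested):
    """Flatten a rectangular nested list into a {"data", "shape"} row.

    Hypothetical helper for illustration only; assumes row-major order.
    """
    # Walk down the first element of each level to discover the shape.
    shape = []
    level = nested
    while isinstance(level, list):
        shape.append(len(level))
        level = level[0] if level else None

    # Flatten one nesting level per dimension beyond the first.
    data = nested
    for _ in shape[1:]:
        data = [value for sub in data for value in sub]

    # The number of stored elements must match the product of the shape.
    assert len(data) == prod(shape)
    return {"data": list(data), "shape": shape}


# Two tensors with different shapes but the same number of dimensions (ndim = 2).
rows = [
    to_storage_row([[1, 2, 3], [4, 5, 6]]),
    to_storage_row([[7], [8], [9]]),
]
```

Each row carries its own ``shape``, while ``ndim`` (here 2) stays fixed for the whole column.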

* Extension type parameters:

* **value_type** = the Arrow data type of individual tensor elements.

Optional parameters describing the logical layout:

* **dim_names** = explicit names of the tensor dimensions,
as an array. Its length should be equal to the length of the
shape, that is, to the number of dimensions.

``dim_names`` can be used if the dimensions have well-known
names and they map to the physical layout (row-major).

* **permutation** = indices of the desired ordering of the
original dimensions, defined as an array.

The indices contain a permutation of the values [0, 1, .., N-1] where
N is the number of dimensions. The permutation indicates which
dimension of the logical layout corresponds to which dimension of the
physical tensor (the i-th dimension of the logical view corresponds
to the dimension with number ``permutation[i]`` of the physical tensor).

Permutation can be useful in case the logical order of
the tensor is a permutation of the physical order (row-major).

When logical and physical layout are equal, the permutation will always
be ([0, 1, .., N-1]) and can therefore be left out.
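The mapping from physical to logical layout can be sketched in a few lines of Python. The function name ``to_logical`` is an assumption made for this example. It checks that the permutation is valid and then picks, for each logical dimension ``i``, the physical entry at index ``permutation[i]``.

```python
def to_logical(physical_values, permutation=None):
    """Reorder per-dimension values (sizes or names) into logical order.

    Hypothetical helper: logical[i] = physical[permutation[i]].
    """
    ndim = len(physical_values)
    if permutation is None:
        # A missing permutation means logical and physical layout are equal.
        permutation = list(range(ndim))
    if sorted(permutation) != list(range(ndim)):
        raise ValueError("permutation must contain each of 0..N-1 exactly once")
    return [physical_values[p] for p in permutation]


logical_shape = to_logical([100, 200, 500], [2, 0, 1])
logical_names = to_logical(["x", "y", "z"], [2, 0, 1])
```

The same helper works for both shapes and dimension names, because each is just one value per physical dimension.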

* **uniform_dimensions** = indices of dimensions whose sizes are
guaranteed to remain constant. Indices are a subset of all possible
dimension indices ([0, 1, .., N-1]).
The uniform dimensions must still be represented in the ``shape`` field,
and must have the same value for all tensors in the array. This
allows code to interpret the tensor correctly without accounting for
uniform dimensions, while still permitting optional optimizations that
take advantage of the uniformity. ``uniform_dimensions`` can be left out,
in which case it is assumed that all dimensions might be variable.

* **uniform_shape** = shape of the dimensions that are guaranteed to stay
constant over all tensors in the array, with the shape of the ragged dimensions
set to 0.
For example, an array containing tensors with shape (2, 3, 4) and uniform
dimensions (0, 2) would have uniform shape (2, 0, 4).
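The relationship between ``uniform_dimensions`` and ``uniform_shape`` can be sketched as follows. Both function names are hypothetical and exist only for this example. The second function shows one way a reader might check that every tensor shape in an array agrees on the uniform dimensions.

```python
def uniform_shape(shape, uniform_dimensions):
    """Keep sizes of uniform dimensions; set ragged dimensions to 0."""
    uniform = set(uniform_dimensions)
    return [size if i in uniform else 0 for i, size in enumerate(shape)]


def shapes_are_consistent(shapes, uniform_dimensions):
    """Return True if all shapes share the same size on every uniform dimension."""
    if not shapes:
        return True
    expected = uniform_shape(shapes[0], uniform_dimensions)
    return all(uniform_shape(s, uniform_dimensions) == expected for s in shapes)


result = uniform_shape((2, 3, 4), (0, 2))
ok = shapes_are_consistent([(2, 3, 4), (2, 7, 4)], (0, 2))
bad = shapes_are_consistent([(2, 3, 4), (5, 3, 4)], (0, 2))
```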

* Description of the serialization:

The metadata must be a valid JSON object that optionally includes
dimension names with key **"dim_names"**, ordering of
dimensions with key **"permutation"**, indices of dimensions whose sizes
are guaranteed to remain constant with key **"uniform_dimensions"**, and
the shape of those dimensions with key **"uniform_shape"**.
Minimal metadata is an empty JSON object.

- Example of minimal metadata is:

``{}``

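- Example with ``dim_names`` metadata for NCHW ordered data (the batch
dimension N corresponds to the rows of the array, so each tensor has
dimensions C, H, W):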

``{ "dim_names": ["C", "H", "W"] }``

- Example with ``uniform_dimensions`` metadata for a set of color images
with variable width (height and channel count are constant):

``{ "dim_names": ["H", "W", "C"], "uniform_dimensions": [0, 2] }``

- Example of permuted 3-dimensional tensor:

``{ "permutation": [2, 0, 1] }``

The permutation is applied to the physical layout: given an individual
tensor of physical shape ``[100, 200, 500]``, the shape of the logical
layout would be ``[500, 100, 200]``.
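A minimal sketch of producing the metadata string with Python's standard ``json`` module is shown here. The function ``serialize_metadata`` is a hypothetical name, not an Arrow API. It emits only the keys that were supplied, so calling it with no arguments yields the minimal metadata.

```python
import json


def serialize_metadata(dim_names=None, permutation=None,
                       uniform_dimensions=None, uniform_shape=None):
    """Build the JSON metadata string, omitting parameters that are not set."""
    candidates = {
        "dim_names": dim_names,
        "permutation": permutation,
        "uniform_dimensions": uniform_dimensions,
        "uniform_shape": uniform_shape,
    }
    metadata = {key: value for key, value in candidates.items() if value is not None}
    return json.dumps(metadata)


minimal = serialize_metadata()
named = serialize_metadata(dim_names=["C", "H", "W"])
permuted = serialize_metadata(permutation=[2, 0, 1])
```

Reading the metadata back is the reverse: ``json.loads`` on the string, then treating any missing key as "not specified".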

.. note::

With the exception of ``permutation``, the parameters and storage
of VariableShapeTensor relate to the *physical* storage of the tensor.

For example, consider a tensor with:
shape = [10, 20, 30]
dim_names = [x, y, z]
permutation = [2, 0, 1]

This means the logical tensor has names [z, x, y] and shape [30, 10, 20].

Elements in a variable shape tensor extension array are stored
in row-major/C-contiguous order.
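To make the row-major convention concrete, the sketch below computes the position of an element inside the flat ``data`` list from its multi-dimensional index and the tensor's physical ``shape``. The function name ``flat_offset`` is an assumption for this example.

```python
def flat_offset(index, shape):
    """Return the row-major (C-contiguous) offset of a multi-dimensional index."""
    if len(index) != len(shape):
        raise ValueError("index and shape must have the same length")
    offset = 0
    for i, size in zip(index, shape):
        if not 0 <= i < size:
            raise IndexError("index out of bounds for shape")
        # The last dimension varies fastest in row-major order.
        offset = offset * size + i
    return offset


data = [1, 2, 3, 4, 5, 6]   # one tensor's "data" field
shape = [2, 3]              # the same tensor's "shape" field
element = data[flat_offset((1, 2), shape)]
```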

=========================
Community Extension Types
=========================