# Tensor introduction
The tensor object is the main functionality provided by Cyclops at the C++ and Python levels.
This wiki tutorial describes how Cyclops stores tensor data, introduces basic functionality provided by a tensor, and describes how these are exposed in the tensor interface. The description of C++ functionality provided here may also be of interest to Python interface users, as the more general form of the tensor object is used within Python-level Cyclops library functions.
The `examples` and `test` directories contain a large pool of programs that extensively use the CTF infrastructure and are a good point of reference.
A Cyclops tensor object is defined based on the following properties. Aside from the shape of the tensor, in simple cases, default parameters can be relied on for all of these specifications.
- order/lens/shape: specifies the size of a tensor, in terms of the number of modes (indices) by `order` and their dimensions (ranges) by `lens`
  - The C++ interface has Scalar/Vector/Matrix objects, which simplify specification of tensor size; the Python interface does not.
- element datatype: specifies the element type of the tensor
  - The C++ interface enables templated instantiation with any fixed-size type, given an associated MPI Datatype.
  - The Python interface supports only boolean, integer, and floating-point (real and complex) types.
- algebraic structure (monoid/group/ring/semiring): specifies the default additive and multiplicative operators for the tensor element datatype
  - In C++, lambda functions are used to specify these, while the Python version supports only default operators.
- sparsity: specifies whether a sparse or dense format should be used to store the tensor data
  - Cyclops either maintains a fixed-size buffer for local tensor data (dense format) or uses a structurally sparse format, which stores data values (that may be zero) along with their coordinates.
- symmetry: specifies whether a symmetric-packed format should be used to store the tensor data
  - In both C++ and Python, this is specified by an array of size equal to the tensor order, whose $i^{th}$ element describes whether the $i^{th}$ mode is part of a symmetry group with the $(i+1)^{th}$ mode: `SY=2` (symmetric), `AS=1` (skew/anti-symmetric), `SH=3` (symmetric hollow), and `NS=0` (nonsymmetric). The last element of the array describing the symmetry should be `NS`. For example, if a tensor `T` has symmetry `[SY,SY,NS,SH,NS]`, only elements `T[i,j,k,l,m]` with `i<=j<=k` and `l<m` are stored.
  - Symmetry results in (anti)symmetrization of any contraction/summation operation of which this tensor is an output, e.g., `T["ij"] += A["ik"]*B["kj"]` becomes `T["ij"] += A["ik"]*B["kj"] + A["jk"]*B["ki"]` if `T` has symmetry `[SY,NS]` (and even if `A=B` numerically or by reference).
  - Cyclops will store only unique entries of a symmetric tensor, which reduces the static memory footprint and may reduce cost for some operations while increasing cost for others (e.g., when unpacking is required).
- world/communicator: specifies the set of processors the tensor is distributed on (MPI communicator)
- partition: specifies a mapping of the tensor onto some grid of processors
  - For example, given a world with 60 processors, one may define a `[3,4,5]` processor grid, then specify how up to 3 modes of the tensor should be distributed on this grid, e.g., so that `T[i,j,k,l]` will be owned by processor `[p_j,0,p_i]` (here, the middle processor grid dimension is unmapped). A schematic sketch of this mapping appears after this list.
  - Cyclops always distributes data across processors in a cyclic manner, so in the above example `p_j = j % 3` and `p_i = i % 5`.
  - Cyclops will pick a suitable processor grid mapping by default (unless the tensor is very small, it will be distributed over all processors), so the mapping should only be specified for manual performance optimization.
- name/profile: assigns the tensor a name and toggles performance profiling
  - This helps provide human-readable output regarding the performance of operations involving a particular tensor, if Cyclops is built with `-DPROFILE` and `-DVERBOSE=1`.
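The cyclic mapping in the partition example above can be illustrated schematically; the `owner` helper below is hypothetical, for exposition only, and not part of the CTF API.

```python
# Schematic sketch (not CTF API): which processor owns T[i,j,k,l] on a [3,4,5]
# grid when mode j is mapped to grid dimension 0 and mode i to grid dimension 2.
def owner(i, j):
    return (j % 3, 0, i % 5)  # cyclic distribution: p_j = j % 3, p_i = i % 5

print(owner(7, 5))  # (2, 0, 2)
```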
At Python-level, a Cyclops tensor can be initialized in a variety of ways. All MPI processes executing the program should invoke the constructors, as well as subsequent routines, in a bulk-synchronous manner.
```python
import ctf
U = ctf.tensor([5,7])            # dense zero matrix
M = ctf.random.random((4,4))     # random dense tensor
O = ctf.ones((4,3,5))            # tensor full of ones
I = ctf.eye(9)                   # dense identity
T = ctf.tensor([5,3,4], sp=True) # sparse tensor
T.fill_sp_random(-1.,1.,.1)      # fill with 10% density
S = ctf.speye(9)                 # sparse identity
T = ctf.tensor([4,4,4], sym=[ctf.SYM.SY, ctf.SYM.SY, ctf.SYM.NS]) # symmetric tensor
```
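As a minimal sketch of the symmetrization semantics described above (assuming the indexed-assignment and `to_nparray` behavior shown elsewhere in this tutorial), summing a nonsymmetric matrix into a symmetric output symmetrizes it:

```python
import ctf
import numpy as np

A = ctf.random.random((4,4))
T = ctf.tensor([4,4], sym=[ctf.SYM.SY, ctf.SYM.NS])
T.i("ij") << A.i("ij")        # symmetrized output: t_{ij} = a_{ij} + a_{ji}
Tn = T.to_nparray()
assert np.allclose(Tn, Tn.T)  # the unpacked result is symmetric
```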
For C++, constructors enable specification of the aforementioned parameters, as well as types and algebraic structures that are not available at Python level.
```cpp
Vector<> v(8); // vector of doubles of dimension 8
Matrix<float> M(4,4,SP|SY); // 4-by-4 matrix of floats in sparse and symmetric packed format
Tensor< std::complex<float> > T(4,{3,4,6,5}); // 3-by-4-by-6-by-5 complex tensor
Semiring<float> max_semiring(0., [](float a, float b){ return std::max(a,b); }, MPI_MAX,
                             1.0, [](float a, float b){ return a*b; });
Vector<float> w(n,max_semiring); // vector with elementwise addition defined by the max operation and elementwise multiplication defined by the standard product
Tensor<> S(4,{m,m,n,n},{AS,NS,AS,NS}); // order 4 tensor with two 2-index antisymmetries, s_{ijkl} = -s_{jikl} = -s_{ijlk} = s_{jilk}
Partition processor_grid(3,{3,4,5});
Tensor<> T2(4,lens,"ijkl",processor_grid["jmi"]); // order 4 tensor distributed on a 3D processor grid, so that processor [p_j,0,p_i] owns T2[i',j',k,l] if i and j are equal to i' and j' modulo the processor grid dimension lengths
```
In the C++ interface, for operations such as summation and contraction of tensors, the output tensor needs to be defined before the operation.
On the other hand, Python interface functions often create the output tensor object (e.g., `ctf.einsum` does this by default, although `ctf.einsum(...,out=...)` may be used to specify an output tensor to accumulate into ahead of time, which enables specification of whether the output should be symmetrized, stored in symmetric packed format, and/or stored in sparse format).
Cyclops tensor interface functions are all high-level operations.
Since the library is designed for parallel execution in a distributed setting, where any given piece of data may not be available locally, each operation on a tensor must (with some exceptions) be invoked by all MPI processes in the world (MPI communicator) on which the tensor is distributed.
Operations execute synchronously.
Unless the sparsity pattern of a sparse Cyclops tensor is changed, a sparse or dense Cyclops tensor is stored in a statically allocated buffer.
In particular, the member variable `data` will point to the same buffer before and after the execution of a Cyclops operation (but additional data buffers are dynamically allocated during execution).
By default, a Cyclops tensor is initialized to be fully zero, with the buffer allocated in the dense case, and no buffer allocated in the sparse case. More generally, the data values are set to the additive identity element of the algebraic structure, or left uninitialized if no additive identity is provided at all. The tensor data buffer is padded, so that the local dimension of each tensor mode is chosen to be the closest multiple of the processor grid dimension onto which that mode is mapped. Consequently, the dense tensor data buffer size is always the same for all processors.
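The padding rule can be illustrated with a small schematic calculation (plain Python, not CTF API):

```python
# Schematic sketch: padded length of a tensor mode of dimension n that is
# mapped onto a processor grid dimension with p processors.
def padded_len(n, p):
    return ((n + p - 1) // p) * p  # nearest multiple of p that is >= n

print(padded_len(10, 4))  # 12, so each of the 4 processors stores 3 elements of this mode
```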
Tensor data is generally retrievable from a Cyclops tensor in batches.
To enable distribution-oblivious data access, any processor may input or query any subset of the tensor elements, specified by their position in the tensor.
These index-value pairs are stored internally in an array-of-structs format, where each element's position is represented by a single 64-bit signed integer, but they may also be input in a struct-of-arrays format, or with an array specifying the coordinate of each element's position along each dimension of the tensor.
The position of an element is calculated in column-major order in C++ and in the reverse (row-major) order in Python, i.e., if `A` is n-by-n, then `A[i,j]` (assuming 0-based indexing) is stored at position `i + j*n` in C++, while in Python it is stored at position `j + i*n`.
Tensors whose overall size exceeds the max value of a 64-bit integer (this includes some publicly available sparse tensor datasets) will not work with Cyclops due to the choice of internal format.
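For example, the following sketch (assuming the Python-level `read`/`write` methods accept 64-bit global indices, mirroring the C++ interface) writes and reads an element by its global position:

```python
import ctf
import numpy as np

n = 4
A = ctf.tensor([n,n])
# In Python, A[i,j] is stored at global position j + i*n, so A[1,2] -> 2 + 1*4 = 6.
inds = np.array([6], dtype=np.int64)
A.write(inds, np.array([3.14]))  # all processes participate; each may write a different set
print(A.read(inds))              # [3.14]
```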
Data stored locally by a processor may be accessed (without coordinating with other processors) via `get_local_data()`/`read_local()`.
Each data element is stored on a unique process (some processes may store no elements).
```cpp
World dw(MPI_COMM_WORLD);
Vector<int> v(n, SP, dw); // a sparse vector of size n
int64_t npairs;
Pair<int> * loc_pairs;
v.read_local(&npairs, &loc_pairs); // memory allocation for loc_pairs is handled by CTF
delete [] loc_pairs;
```
A user-defined set of elements may be queried from each processor via `read()` or accumulated to the tensor via `write()` (the sets defined by each process are allowed to overlap, in which case `write()` accumulates the overlapping values, starting from the additive identity).
```cpp
Pair<int> * update_pairs = new Pair<int>[x]; // x is the number of elements to update from this processor; the value of x need not be the same across processors
for (int64_t i = 0; i < x; i++) {
  update_pairs[i].k = element_tensor_index;
  update_pairs[i].d = new_data_value;
}
v.write(x, update_pairs); // write data to the CTF tensor bulk-synchronously from all processors
delete [] update_pairs;
```
All data of the tensor may be collected on each processor via `get_all_data()`/`read_all()`, or, in Python, by converting the tensor to a numpy array via `ctf.tensor.to_nparray()`.
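For instance, a small sketch of such conversions (here `ctf.from_nparray` is the assumed inverse conversion provided by the Python interface):

```python
import ctf
M = ctf.random.random((4,4))
A = M.to_nparray()       # replicates the full tensor as a numpy array on every process
B = ctf.from_nparray(A)  # converts a (replicated) numpy array back to a CTF tensor
```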
The most performant way of accessing the data of a dense CTF tensor in C++ is to work with the data pointer directly (via `get_raw_data()`).
The data available locally may be controlled by manually defining the input distribution of the tensor, via the `Partition`/`Idx_Partition` constructor interface.
However, usage of the data pointer must manage the possibility of padding along any given mode of the tensor, which may affect program semantics depending on the number of processors being used.
When data is queried via `read_local()`, padded elements are not exposed to the user, but a conversion to key-value pair format takes place for a dense tensor.
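At Python level, a minimal sketch of local data access (assuming `read_local()` returns arrays of 64-bit global indices and values, mirroring the C++ signature):

```python
import ctf
T = ctf.tensor([5,3,4], sp=True)
T.fill_sp_random(-1., 1., .1)
inds, vals = T.read_local()  # (index, value) pairs stored locally; padding is not exposed
```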
It is possible to extract a subtensor (slice) of a Cyclops tensor to define a new Cyclops tensor.
In C++, the `slice()` function most generally permits accumulation of a slice of one tensor into a slice of another.
The slice can be specified by ranges (offsets and end-points) along each mode, or by specifying the positions of two end-points of the bounding box (one defined by the coordinates of the offsets and the other defined by the end-point coordinates).
In Python, Cyclops tensor slices can be extracted in the same way as `numpy` array slices, as in the following example.
```python
T[n//2:n,0:n//2] = V[-n//2:,::2] # t_{ij} = v_{i,2j} for n/2 <= i < n, 0 <= j < n/2
```
The above example extracts every other column of the matrix `V`, which can also be accomplished by the more general C++ `permute()` function.
The `permute()` function extracts and reorders any subset of the rows/columns of a matrix, and more generally permutes and subsamples the index range along each mode.
For example, given arrays `r`, `p`, and `q`, `permute()` can form a tensor `T` of size equal to or smaller than `V`, such that
```
T[i,j,k] = V[r[i],p[j],q[k]]
```
Cyclops allows specification of tensor summation and transposition operations via syntax that follows the Einstein summation notation convention. In C++, tensor addition, transposition, summation, and reductions can be specified via indexed syntax as follows.
T["ijkl"] += 2.*V["jilk"]; // t_{ijkl} = t_{ijkl} + 2 * v_{jikl}
3.*T["ijkl"] += 2.*V["jil"]; // t_{ijkl} = 3. * t_{ijkl} + 2 * v_{jil}
V["jil"] = T["ijkl"]; // v_{jil} = sum_k t_{ijkl}
V["iiij"] = M["ij"]; // v_{iiij} = m_{ij}
double a = T["ijkl"]; // a = sum_{ijkl} t_{ijkl}
In Python, the same operations may be expressed in multiple ways:
3.*T.i("ijkl") << 2.*V.i("jil");
T = 3.*T + 2.*ctf.einsum("jil->ijkl",V)
The output of `einsum` (which handles a slightly more general set of operations than `numpy.einsum`) may also be prespecified, as in the following example for the same operation.
```python
T = 3.*T
ctf.einsum("jil->ijkl",V,out=T)
```
The Einstein notation syntax used for summation operations extends to contraction of two or more tensors in the natural way. In C++, tensor contractions may be expressed as follows.
C["ij"] = A["ik"]*B["kj"]; // matrix multiplication
T["klij"] += U["iajb"]*V["akbl"]; // tensor contraction
T["ijk"] -= Z["abc"]*U["ai"]*V["bj"]*W["ck"]; // multi-tensor contraction
M["ij"] += (A["ik"]*B["ik"] + 2.*C["ik"])*D["kj"]; // general expression
In Python, the `ctf.tensor.i(...)` or `ctf.einsum` syntax may be used to express these operations similarly.
Contractions may also be performed via functions such as `dot` and `tensordot`, which follow `numpy` conventions.
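For example, a brief sketch with assumed shapes:

```python
import ctf
A = ctf.random.random((4,5))
B = ctf.random.random((5,6))
C = ctf.dot(A, B)                        # matrix multiplication, c_{ij} = sum_k a_{ik}*b_{kj}
T = ctf.random.random((4,5,6))
U = ctf.tensordot(T, B, axes=([1],[0]))  # contract mode 1 of T with mode 0 of B
```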
When more than two tensors are contracted, given an Einstein summation (`einsum`) specification of the contractions, Cyclops will perform a sequence of binary (two-tensor) contractions to compute the full expression.
The order of contractions is determined by an exhaustive search over a sliding window of a fixed constant number (e.g., 8-10) of tensors in the contraction expression.
The time to execute each contraction is estimated using a coarse model that takes into account the tensor sizes and an estimate of the total flops.
Intermediate tensors will be defined as sparse if the contracted operands are sparse and the output is predicted to have sufficiently low density (estimates are computed under the assumption that nonzeros are distributed uniformly at random).
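As a sketch of a multi-tensor contraction (shapes assumed for illustration) that Cyclops decomposes into binary contractions:

```python
import ctf
Z = ctf.random.random((3,4,5))
U = ctf.random.random((3,6))
V = ctf.random.random((4,6))
# Cyclops picks a binary contraction order, e.g., first Z with U, then with V.
T = ctf.einsum("abc,ai,bj->ijc", Z, U, V)
```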
Cyclops also supports specialized (all-at-once contraction) kernels for some multi-tensor contraction routines. The extent of this functionality is under active development. Current support includes the matricized tensor times Khatri-Rao product (MTTKRP) and tensor times tensor product (TTTP) routines, which perform the following operations.
```python
# MTTKRP for an order 3 tensor: w_{ir} = sum_{j,k} t_{ijk}*u_{jr}*v_{kr}
W = ctf.einsum("ijk,jr,kr->ir",T,U,V) # MTTKRP via pairwise contractions
ctf.MTTKRP(T,[W,U,V])                 # all-at-once MTTKRP
# TTTP for an order 3 tensor: z_{ijk} = sum_{r} t_{ijk}*u_{ir}*v_{jr}*w_{kr}
Z = ctf.TTTP(T,[U,V,W])               # all-at-once TTTP
```
These kernels, which arise in computing tensor decompositions and completions with the CP model, are much more efficient than binary contraction when the tensor `T` is sparse.
At C++ level, Einstein summation operations may also be equipped with an arbitrary elementwise function, which enables parallelization of a general class of nested loop programs.
Given a function `f : dtype_A x dtype_B -> dtype_C` and an associative function (reduce operation) `g : dtype_C x dtype_C -> dtype_C`, consider a nested loop with the following structure.
```
for a
 ...
  for b
   for u
    ...
     for v
      for x
       ...
        for y
         C[a,...,b,u,...,v] = g(C[a,...,b,u,...,v], f(A[a,...,b,x,...,y], B[x,...,y,u,...,v]))
```
Given definitions of `f` and `g`, Cyclops can express this operation via the `Function` construct.
```cpp
Tensor<dtype_A> A(...);
Tensor<dtype_B> B(...);
Monoid<dtype_C> mon_g(..., g, ...);
Tensor<dtype_C> C(..., mon_g);
Function<dtype_A,dtype_B,dtype_C> func_f(f);
C["a...bu...v"] += func_f(A["a...bx...y"],B["x...yu...v"]);
```
The `Transform` construct goes a step further by also accepting the output element of `C` by reference.
Both `Function` and `Transform` may be executed with fewer tensor operands, e.g., as in the following rounding functions.
```cpp
Vector<int> v(n);
Vector<double> w(n);
v["i"] = Function<double,int>([](double a){ return (int)a; })(w["i"]);
Transform<double,int>([](double a, int & b){ b = (int)a; })(w["i"],v["i"]);
```
These constructs require passing statically defined functions, and so are not available in the Python interface. However, they are used by many of the implemented Python-level Cyclops library functions, and new functions in a similar style may be added upon request (see the discussion of the software hierarchy below).
The core functionality of tensor operations is implemented as part of the untyped C++ `tensor` class (see `src/tensor/untyped_tensor.h`).
Datatype and elementwise operator information is stored in the algebraic structure (the `algstrct * sr` member variable) of this class (so it is not technically an abstract class).
Typed algebraic structures, e.g., `Semiring<int>`, derive from the `algstrct` class and implement its abstract functions, including optimized batch kernels, such as sparse and dense sequential/threaded matrix-matrix products.
The typed `Tensor<dtype>` class derives from `tensor` and acts as the main C++ interface to the tensor object, providing more convenient typed interface functions for most tensor subroutines.
At Python-level, the tensor object is a Cython object that stores a pointer to a C++ `tensor` object (which is also associated with a pre-instantiated `Tensor<dtype>` for a few choices of `dtype`) and performs operations by calling appropriate functions on the C++ tensor object.
Consequently, modulo buffering/conversion of data inputs to tensor functions, the performance and memory usage of Python-level Cyclops routines should match their C++ counterparts.
The need to pre-instantiate templates and the inability to use lambda functions imply that more can be done via general routines on the C++ tensor object than is exposed at Python-level at any given time.
If such auxiliary tensor functions (e.g., elementwise functions such as `abs()` or `conj()`) are needed but missing at Python-level, please create a request via a GitHub issue.