-
Notifications
You must be signed in to change notification settings - Fork 689
Data Structures
Breeze has three basic kinds of data structures: vectors, matrices and counters. All of them are breeze.linalg.Tensor[K, V]
, which is more or less like a scala.collection.mutable.Map[K, V]
that is geared towards working with numeric value types. The key type K
is the index type. For instance, Vectors have a key type of Int
, while Matrices have a key type of (Int, Int)
. Counters differs from vectors and matrices in such way that they can be arbitrarily keyed.
Most of the time you will need only DenseVector and DenseMatrix.
For a UML-like diagram of implementation details, see here
Tensors have a lot of methods defined on them, mostly from their parent traits breeze.linalg.NumericOps
and breeze.linalg.QuasiTensor
. The former defines operators that you use for adding, multiplication, etc. The latter defines operations relevant to selecting elements from the tensor.
Tensors are a lot like a scala Map[K, V]
that support numeric operations. One important way they differ is that there is no automatic foreach
or map
method. Instead, you must explicitly ask for tensor.keys
, tensor.values
, or tensor.pairs
to iterate through the keys, values, and pairs respectively.
Below is an overview of the methods and what they mean. Here, t
and t2
are Tensors, k
is a key type, v
is a scalar value type, and u
is either a Tensor or a scalar value type.
Method | Meaning |
---|---|
t(k) |
returns the value associated with k
|
t(k) = v |
sets the index k 's value to v . |
t.size |
returns the logical size of the tensor. |
t.activeSize |
returns the number of elements that have storage allocated in this tensor. |
t.keys |
returns a TensorKeys object for this tensor. |
t.values |
returns a TensorValues object for this tensor. |
t.pairs |
returns a TensorPairs object for this tensor. |
t.active.keys |
returns a TensorKeys object for this tensor, selecting only elements with storage in this tensor. |
t.active.values |
returns a TensorValues object for this tensor, selecting only elements with storage in this tensor. |
t.active.pairs |
returns a TensorPairs object for this tensor, selecting only elements with storage in this tensor. |
t.iterator |
iterates through all pairs (k, v) . |
t.pairs |
returns a TensorPairs object offering methods like foreach and map |
t.keysIterator |
iterates through all keys k . |
t.valuesIterator |
iterates through all values v . |
t.activeIterator |
iterates through all pairs (k, v) that are stored in this tensor. Others are presumed to be zero. |
t.activeKeysIterator |
iterates through all keys k with values that are stored in this tensor. Others are presumed to be zero. |
t.activeValuesIterator |
iterates through all values v that are stored in this tensor. Others are presumed to be zero. |
t.keySet |
returns the set of keys in this tensor |
Method | Meaning |
---|---|
t.foreachPair(f) |
foreach applied to all pairs. |
t.pairs.foreach(f) |
same as the above. |
t.foreachKey(f) |
foreach applied to all keys. |
t.keys.foreach(f) |
same as the above. |
t.foreachValue(f) |
foreach applied to all values. |
t.values.foreach(f) |
same as the above. |
t.mapPairs(f) |
map applied to all pairs. |
t.pairs.map(f) |
same as the above. |
t.mapKeys(f) |
map applied to all keys. |
t.keys.map(f) |
same as the above. |
t.mapValues(f) |
map applied to all values. |
t.values.foreach(f) |
same as the above. |
t.findAll(f) |
returns all indices k whose value satisfies a predicate. |
See UFunc wiki page for full list of operators and reduction UFuncs.
The built-in Tensor
hierarchy looks something like this:
- Int-keyed
Tensor
s:-
Vector
- DenseVector: a "normal" array-backed Vector.
-
SparseVector: a sparse vector backed by binary search. O(log numNonZero) access to elements with optimizations for in-order traversal. Increasing the number of nonzero elements in the SparseVector (say, via
update
) is expensive, as O(numNonZero). -
HashVector: a sparse vector backed by a quadratic-probing open address hash array. O(1) access, but typically less memory efficient than
SparseVector
. No in-order traversal.
-
Matrix
-
DenseMatrix: a "normal" array-backed
Matrix
. Column-major unless isTranspose is true, in which case it is row-major. (This is consistent with standard BLAS/LAPACK assumptions.) - CSCMatrix: a still-under-development Compressed Sparse Columns Matrix. Each column of the matrix is analogous to a SparseVector: access to an element is O(log numNonZeroInColumn).
-
DenseMatrix: a "normal" array-backed
-
- Arbitrarily keyed
Tensor
s:
breeze.linalg.DenseVector
is the simplest Tensor type in Breeze; it's basically a glorified wrapper around an Array that supports numeric operations via the NumericOps
interface. (See the Operators section for more information.)
DenseVectors are your standard run-of-the-mill vector. They're backed by an Array field called data
, an offset
into the data array, a logical length
field, and a stride
field that is the width between components of the vector. Indexing into a DenseVector is defined as follows: vec(i) == vec.data(i * stride + offset)
.
DenseVectors are meant to be fast: their operators are either code-generated or defer to BLAS, and they are @specialized for basic types.
DenseVectors pair naturally with the DenseMatrix class.
SparseVector is a sparse vector backed by binary search. Indices and values are kept in two parallel arrays (index and data, respectively), with the indices sorted. Operations are:
- access to an element:
O(log numNonZero)
, with optimizations for in-order traversal. - updating a value for a index already stored in the Vector:
O(log numNonZero)
- inserting a value at a new index (via update): O(numNonZero)
- Note that appending to the end is amortized O(log numNonZero).
SparseVectors support efficient operations with DenseVectors and themselves. These include sum, elementwise product, dot product, etc. Elementwise dividing by a SparseVector is slow, but elementwise divide is kind of a weird operation for a SparseVector...
You should not be adding lots of values to a SparseVector if you want good speed. SparseVectors have to maintain the invariant that the index array is always sorted, which makes insertions expensive. Often one wants to create a SparseVector with a bunch of values, sort them, and then turn them into a SparseVector.
The class that supports this is VectorBuilder
, which is kind of like an unsorted SparseVector, where indices may be redundant. VectorBuilders have a method add(index, value)
that appends a pair to the end of its array, giving constant time updating. When you're done with the builder, you can call toSparseVector
, which sorts the values, collapses duplicate indices via summing, and then gives the result.
SparseVectors support the standard suite of iterators defined on Tensors, as well as a set of "active iterators" that only iterate over the keys in the SparseVector that have storage associated with them. This does not mean all non-zero values are skipped, but only zero values will be skipped.
However, these iterators have a boxing penalty associated with them, and until we have macro-supported iteration, the following pattern can be used when you need to efficiently iterate over active indices:
var offset = 0
while(offset < vec.activeSize) {
val index = vec.indexAt(offset)
val value = vec.valueAt(offset)
// use index and value
offset += 1
}
A HashVector is a sparse Vector
that uses an open address hash table to store non-zero values. Like a SparseVector, indices and values are kept in two parallel arrays. Unlike a SparseVector, these arrays are unsorted. Moreover, there can be unused slots in the arrays.
Accessing and updating a single value to a HashVector take expected constant time.
HashVectors support the standard suite of iterators defined on Tensors, as well as a set of "active iterators" that only iterate over the keys in the SparseVector that have storage associated with them. This does not mean all non-zero values are skipped, but only zero values will be skipped.
However, these iterators have a boxing penalty associated with them, and until we have macro-supported iteration, the following pattern can be used when you need to efficiently iterate over active indices:
var offset = 0
while(offset < vec.activeSize) {
if(vec.isUsed(offset)) {
val index = vec.indexAt(offset)
val value = vec.valueAt(offset)
// use index and value
offset += 1
}
}
The only difference from the pattern in SparseVector is the isUsed method, which checks that the slot is active. SparseVectors also have an isUsed method, which can be used for convenience.
The basic Matrix type in Breeze is the DenseMatrix, which is the Matrix equivalent of a DenseVector. It is a wrapper around an array of values. This array is column-major, like Matlab and most Fortran libraries, meaning that elements in the same column are laid out next too each other.
DenseMatrices have the following fields: the number of rows; the number of columns; the data array; an offset into the data array (useful for slices); an optional major stride, which dictates how far apart columns are; and a flag isTranspose, for if the matrix should be treated as being transposed or not.
DenseMatrices support the usual elementwise operations like addition and subtraction, as well as matrix multiply. This last operation is sent to native code via jblas when it is available. In addition, matrix solves are supported as well, and they are also sent to native code when possible.
There are a large number of Linear Algebra operations supported for DenseMatrices. These are also preferably solved using native code.
A CSCMatrix is a Compressed Sparse Column matrix. It is currently the only sparse matrix in Breeze, though others will be added eventually. CSCMatrices are in some sense the natural generalization of a SparseVector, being essentially a Matrix with SparseVectors as columns. It is best when every column is sparse, but most columns have some non-zero value.
CSCMatrices are not fully supported yet. They are missing several basic operations. Help is appreciated in fleshing it out.
A CSCMatrix has 3 arrays: a column pointers array called colPtr
, a rowIndices
array, and a data
array. colPtr
is of length cols + 1
(the number of columns + 1). colPtr(c)
is the index into rowIndices
and data
where the elements for the c'th column are stored, and colPtr(c+1)
is the index beyond the last element. colPtr.last
is always equal to rowIndices.length
. rowIndices
and data
are parallel arrays, containing the row index and value for that entry in the matrix (surprising, I know). Row indices between colPtr(c)
and colPtr(c+1)
are always sorted.
The complexity profile for CSCMatrix is a lot like a SparseVector. It needs O(cols + numNonZero) storage, so it is not the ideal choice if the number of columns with no nonzero elements is large. (Consider using/implementing a CSRMatrix or some kind of unstructured sparse matrix like a HashMatrix if so.)
Access to an individual element is O(log numNonZero), as is updating a value that is already stored in the matrix. Adding a new non-zero value to a CSCMatrix is O(numNonZero), and it should generally be avoided.
CSCMatrices support all the standard matrix construction methods, but in general creating a CSCMatrix should be done with a CSCMatrix.Builder, like the following:
val builder = new CSCMatrix.Builder[Double](rows=10, cols=10)
builder.add(3,4, 1.0)
// etc.
val myMatrix = builder.result()
CSCMatrix.Builder keeps its entries in an unsorted list, meaning that adding new values is amortized constant time.
Counters are like Vectors that can have an arbitrary key type, rather than an Int. Moreover, the range of the keys in a Counter is unrestricted: new key value pairs can be added at at time. For example:
val c = Counter("a"->1, "b"->2, "x"->3) // Counter(b -> 2, a -> 1, x -> 3)
c("y") = 17
println(c) // Counter(b -> 2, y -> 17, a -> 1, x -> 3)
Counters also support all the operations that are defined on vectors, including sums, elementwise operations, and dot products.
c dot c // 303
c += 10 // Counter(b -> 12, y -> 27, a -> 11, x -> 13)
c += Counter("x"->100, "c"->3) // Counter(b -> 12, y -> 27, a -> 11, x -> 113, c -> 3)
Counters are implemented as a mutable HashMap in the obvious way. Counters are not particularly efficient, and should only be used for non-critical computations.
Counter2s are to Counters what matrices are to vectors. They have two keys types: one for rows, and one for columns. Counter2s are basically HashMap[RowType,Counter[ColumnType,ValueType]]
with syntactic sugar to make them behave like matrices.
val m = Counter2((4,"b", -1), (5,"b", -2) ) // Counter2(5 -> Counter(b -> -2), 4 -> Counter(b -> -1))
m(5, ::) // Counter(b -> -2)
m(::, "b") // Counter(5 -> -2, 4 -> -1)
Breeze is a numerical processing library for Scala. http://www.scalanlp.org