
Breeze has three basic kinds of data structures: vectors, matrices, and counters. All of them are breeze.linalg.Tensor[K, V], which is more or less like a scala.collection.mutable.Map[K, V] geared towards working with numeric value types. The key type K is the index type. For instance, Vectors have a key type of Int, while Matrices have a key type of (Int, Int). Counters differ from vectors and matrices in that they can be arbitrarily keyed.

Most of the time you will need only DenseVector and DenseMatrix.
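
For instance, a vector is indexed by a single Int, while a matrix is indexed by a (row, column) pair. A minimal sketch of both (assuming the breeze.linalg package is available):

import breeze.linalg.{DenseVector, DenseMatrix}

val v = DenseVector(1.0, 2.0, 3.0)           // a Tensor[Int, Double]
v(1)                                         // 2.0

val m = DenseMatrix((1.0, 2.0), (3.0, 4.0))  // a Tensor[(Int, Int), Double]
m(1, 0)                                      // 3.0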

Tensor Methods

Tensors have a lot of methods defined on them, mostly from their parent traits breeze.linalg.NumericOps and breeze.linalg.QuasiTensor. The former defines the operators you use for addition, multiplication, and so on. The latter defines operations for selecting elements from the tensor.

Tensors are a lot like a scala Map[K, V] that supports numeric operations. One important way they differ is that there is no automatic foreach or map method. Instead, you must explicitly ask for tensor.keys, tensor.values, or tensor.pairs to iterate through the keys, values, and pairs respectively.

Below is an overview of the methods and what they mean. Here, t and t2 are Tensors, k is a key type, v is a scalar value type, and u is either a Tensor or a scalar value type.

Accessing and Iterating

Method Meaning
t(k) returns the value associated with k
t(k) = v sets the index k's value to v.
t.size returns the logical size of the tensor.
t.activeSize returns the number of elements that have storage allocated in this tensor.
t.keys returns a TensorKeys object for this tensor.
t.values returns a TensorValues object for this tensor.
t.pairs returns a TensorPairs object for this tensor, offering methods like foreach and map.
t.active.keys returns a TensorKeys object for this tensor, selecting only elements with storage in this tensor.
t.active.values returns a TensorValues object for this tensor, selecting only elements with storage in this tensor.
t.active.pairs returns a TensorPairs object for this tensor, selecting only elements with storage in this tensor.
t.iterator iterates through all pairs (k, v).
t.keysIterator iterates through all keys k.
t.valuesIterator iterates through all values v.
t.activeIterator iterates through all pairs (k, v) that are stored in this tensor. Others are presumed to be zero.
t.activeKeysIterator iterates through all keys k with values that are stored in this tensor. Others are presumed to be zero.
t.activeValuesIterator iterates through all values v that are stored in this tensor. Others are presumed to be zero.
t.keySet returns the set of keys in this tensor.
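
As a small sketch of the methods above, applied to a DenseVector (whose key type is Int):

import breeze.linalg.DenseVector

val t = DenseVector(1.0, 0.0, 3.0)

t.keys.foreach(println)                                     // 0, 1, 2
t.values.foreach(println)                                   // 1.0, 0.0, 3.0
t.iterator.foreach { case (k, v) => println(s"$k -> $v") }  // all (key, value) pairs
t.activeSize                                                // 3: a DenseVector allocates storage for every element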

Higher-Order Methods (like foreach and map)

Method Meaning
t.foreachPair(f) foreach applied to all pairs.
t.pairs.foreach(f) same as the above.
t.foreachKey(f) foreach applied to all keys.
t.keys.foreach(f) same as the above.
t.foreachValue(f) foreach applied to all values.
t.values.foreach(f) same as the above.
t.mapPairs(f) map applied to all pairs.
t.pairs.map(f) same as the above.
t.mapKeys(f) map applied to all keys.
t.keys.map(f) same as the above.
t.mapValues(f) map applied to all values.
t.values.map(f) same as the above.
t.findAll(f) returns all indices k whose value satisfies a predicate.
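
A quick sketch of a few of these, again on a DenseVector (the variable names here are just for illustration):

import breeze.linalg.DenseVector

val v = DenseVector(1.0, -2.0, 3.0)

v.foreachPair((k, x) => println(s"$k -> $x"))   // prints 0 -> 1.0, 1 -> -2.0, 2 -> 3.0
val doubled = v.mapValues(_ * 2.0)              // DenseVector(2.0, -4.0, 6.0)
val negative = v.findAll(_ < 0.0)               // all indices with a negative value (here: 1)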

Reduction and operators

See the UFunc wiki page for a full list of operators and reduction UFuncs.

Tensor hierarchy

The built-in Tensor hierarchy looks something like this:

  • Int-keyed Tensors:
    • Vector
      • DenseVector: a "normal" array-backed Vector.
      • SparseVector: a sparse vector backed by binary search. O(log numNonZero) access to elements, with optimizations for in-order traversal. Increasing the number of nonzero elements in a SparseVector (say, via update) is expensive: O(numNonZero).
      • HashVector: a sparse vector backed by a quadratic-probing open address hash array. O(1) access, but typically less memory efficient than SparseVector. No in-order traversal.
    • Matrix
      • DenseMatrix: a "normal" array-backed Matrix. Column-major unless isTranspose is true, in which case it is row-major. (This is consistent with standard BLAS/LAPACK assumptions.)
      • CSCMatrix: a still-under-development Compressed Sparse Column matrix. Each column of the matrix is analogous to a SparseVector: access to an element is O(log numNonZeroInColumn).
  • Arbitrarily keyed Tensors:
    • Counter: basically a wrapped Map[K, V] with nice defaults and operators.
    • Counter2: the "matrix" version of a Counter.

DenseVector

breeze.linalg.DenseVector is the simplest Tensor type in Breeze; it's basically a glorified wrapper around an Array that supports numeric operations via the NumericOps interface. (See the Operators section for more information.)

DenseVectors are your standard run-of-the-mill vector. They're backed by an Array field called data, an offset into the data array, a logical length field, and a stride field that is the width between components of the vector. Indexing into a DenseVector is defined as follows: vec(i) == vec.data(i * stride + offset).
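
A minimal sketch of construction, indexing, and slicing; a strided slice like the last line is where the offset and stride fields come into play (in Breeze such slices are typically views over the same data array):

import breeze.linalg.DenseVector

val v = DenseVector(10.0, 20.0, 30.0, 40.0, 50.0)
v(2)                              // 30.0
v(2) = 33.0                       // in-place update

val everyOther = v(0 to 4 by 2)   // DenseVector(10.0, 33.0, 50.0)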

DenseVectors are meant to be fast: their operators are either code-generated or defer to BLAS, and they are @specialized for basic types.

DenseVectors pair naturally with the DenseMatrix class.

SparseVector

SparseVector is a sparse vector backed by binary search. Indices and values are kept in two parallel arrays (index and data, respectively), with the indices sorted. Operations are:

  • access to an element: O(log numNonZero), with optimizations for in-order traversal.
  • updating a value for an index already stored in the Vector: O(log numNonZero)
  • inserting a value at a new index (via update): O(numNonZero)
    • Note that appending to the end is amortized O(log numNonZero).
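
A minimal sketch of building and reading a SparseVector element by element (fine for a few entries; see VectorBuilder below for bulk construction):

import breeze.linalg.SparseVector

val sv = SparseVector.zeros[Double](10)
sv(3) = 2.0          // inserting a new index is O(numNonZero)
sv(7) = 5.0
sv(3)                // 2.0, found by binary search
sv(4)                // 0.0: unstored entries read as zero
sv.activeSize        // 2 stored entries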

SparseVectors support efficient operations with DenseVectors and themselves. These include sum, elementwise product, dot product, etc. Elementwise dividing by a SparseVector is slow, but elementwise divide is kind of a weird operation for a SparseVector...

VectorBuilder: Efficiently Building SparseVectors

You should not be adding lots of values to a SparseVector if you want good speed. SparseVectors have to maintain the invariant that the index array is always sorted, which makes insertions expensive. Often one wants to accumulate a bunch of (index, value) pairs, sort them once, and only then build the SparseVector.

The class that supports this is VectorBuilder, which is kind of like an unsorted SparseVector, where indices may be redundant. VectorBuilders have a method append(index, value) that appends a pair to the end of its array, giving constant time updating. When you're done with the builder, you can call toSparseVector, which sorts the values, collapses duplicate indices via summing, and then gives the result.
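
A minimal sketch of that workflow (note: depending on the Breeze version, the appending method may be spelled add(index, value) rather than append; this sketch assumes add):

import breeze.linalg.VectorBuilder

val vb = new VectorBuilder[Double](10)   // logical length 10
vb.add(3, 1.0)                           // constant-time append of an (index, value) pair
vb.add(7, 2.0)
vb.add(3, 0.5)                           // duplicate indices are summed when the result is built
val sv = vb.toSparseVector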

Efficiently Iterating over a SparseVector

SparseVectors support the standard suite of iterators defined on Tensors, as well as a set of "active iterators" that only iterate over the keys in the SparseVector that have storage associated with them. This does not mean that all zero values are skipped, only that any value that is skipped is zero.

However, these iterators have a boxing penalty associated with them, and until we have macro-supported iteration, the following pattern can be used when you need to efficiently iterate over active indices:

var offset = 0
while(offset < vec.activeSize) {
  val index = vec.indexAt(offset)
  val value = vec.valueAt(offset)
  // use index and value
  offset += 1
}

HashVector

A HashVector is a sparse Vector that uses an open-address hash table to store non-zero values. Like a SparseVector, indices and values are kept in two parallel arrays. Unlike a SparseVector, these arrays are unsorted. Moreover, there can be unused slots in the arrays.

Accessing or updating a single value in a HashVector takes expected constant time.
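
A minimal sketch (the surface API mirrors the other vector types):

import breeze.linalg.HashVector

val hv = HashVector.zeros[Double](10)
hv(3) = 2.0     // expected O(1) update
hv(3)           // 2.0, expected O(1) lookup
hv(5)           // 0.0: unstored entries read as zero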

Efficiently Iterating over a HashVector

HashVectors support the standard suite of iterators defined on Tensors, as well as a set of "active iterators" that only iterate over the keys in the HashVector that have storage associated with them. As with SparseVector, this does not mean that all zero values are skipped, only that any value that is skipped is zero.

However, these iterators have a boxing penalty associated with them, and until we have macro-supported iteration, the following pattern can be used when you need to efficiently iterate over active indices:

var offset = 0
while(offset < vec.activeSize) {
  if(vec.isUsed(offset)) {
    val index = vec.indexAt(offset)
    val value = vec.valueAt(offset)
    // use index and value
  }
  offset += 1  // advance to the next slot whether or not it was used
}

The only difference from the SparseVector pattern is the isUsed check, which verifies that the slot is actually in use. SparseVectors also have an isUsed method, so for convenience the same pattern can be used for both.

DenseMatrix

The basic Matrix type in Breeze is the DenseMatrix, which is the Matrix equivalent of a DenseVector. It is a wrapper around an array of values. This array is column-major, like Matlab and most Fortran libraries, meaning that elements in the same column are laid out next to each other.

DenseMatrices have the following fields: the number of rows; the number of columns; the data array; an offset into the data array (useful for slices); an optional major stride, which dictates how far apart columns are in the data array; and an isTranspose flag indicating whether the matrix should be treated as transposed.

DenseMatrices support the usual elementwise operations like addition and subtraction, as well as matrix multiply. This last operation is sent to native code via jblas when it is available. (See Breeze and Native Libraries for more information.) In addition, matrix solves are supported as well, and they are also sent to native code when possible.

A large number of linear algebra operations are supported for DenseMatrices. These are also handled by native code when possible.
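
A minimal sketch of common DenseMatrix operations (per the above, multiplication and solves use native code when it is available):

import breeze.linalg.{DenseMatrix, DenseVector}

val m = DenseMatrix((1.0, 2.0),
                    (3.0, 4.0))      // 2x2, stored column-major
m + m                                // elementwise addition
m * m                                // matrix multiplication
m.t                                  // transpose

val b = DenseVector(1.0, 2.0)
val x = m \ b                        // solve m * x = b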

CSCMatrix (sparse matrix)

A CSCMatrix is a Compressed Sparse Column matrix. It is currently the only sparse matrix in Breeze, though others will be added eventually. CSCMatrices are in some sense the natural generalization of a SparseVector, being essentially a Matrix with SparseVectors as columns. It works best when every column is sparse but most columns have at least one non-zero value.

CSCMatrices are not fully supported yet; they are missing several basic operations. Help fleshing them out is appreciated.

A CSCMatrix has 3 arrays: a column pointers array called colPtr, a rowIndices array, and a data array. colPtr is of length cols + 1 (the number of columns + 1). colPtr(c) is the index into rowIndices and data where the elements for the c'th column are stored, and colPtr(c+1) is the index beyond the last element. colPtr.last is always equal to rowIndices.length. rowIndices and data are parallel arrays, containing the row index and value for that entry in the matrix (surprising, I know). Row indices between colPtr(c) and colPtr(c+1) are always sorted.
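
For example, a hypothetical 3x3 matrix with non-zero entries at (row, col) positions (0,0)=1.0, (2,0)=2.0, (1,1)=3.0, and (2,2)=4.0 would be stored as:

colPtr     = Array(0, 2, 3, 4)     // column c occupies slots colPtr(c) until colPtr(c+1)
rowIndices = Array(0, 2, 1, 2)
data       = Array(1.0, 2.0, 3.0, 4.0)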

The complexity profile for CSCMatrix is a lot like a SparseVector. It needs O(cols + numNonZero) storage, so it is not the ideal choice if the number of columns with no nonzero elements is large. (Consider using/implementing a CSRMatrix or some kind of unstructured sparse matrix like a HashMatrix if so.)

Access to an individual element is O(log numNonZero), as is updating a value that is already stored in the matrix. Adding a new non-zero value to a CSCMatrix is O(numNonZero), and it should generally be avoided.

Creating CSCMatrices

CSCMatrices support all the standard matrix construction methods, but in general creating a CSCMatrix should be done with a CSCMatrix.Builder, like the following:

val builder = new CSCMatrix.Builder[Double](rows = 10, cols = 10)
builder.add(3, 4, 1.0)   // set the entry at row 3, column 4
// etc.
val myMatrix = builder.result()

CSCMatrix.Builder keeps its entries in an unsorted list, meaning that adding new values is amortized constant time.

Counter

Counters are like Vectors that can have an arbitrary key type, rather than an Int. Moreover, the range of the keys in a Counter is unrestricted: new key/value pairs can be added at any time. For example:

val c = Counter("a"->1, "b"->2, "x"->3) //  Counter(b -> 2, a -> 1, x -> 3)
c("y") = 17
println(c) // Counter(b -> 2, y -> 17, a -> 1, x -> 3)

Counters also support all the operations that are defined on vectors, including sums, elementwise operations, and dot products.

c dot c // 303
c += 10 // Counter(b -> 12, y -> 27, a -> 11, x -> 13)
c += Counter("x"->100, "c"->3) // Counter(b -> 12, y -> 27, a -> 11, x -> 113, c -> 3)

Counters are implemented as a mutable HashMap in the obvious way. They are not particularly efficient and should only be used when performance is not critical.

Counter2

Counter2s are to Counters what Matrices are to Vectors. They have two key types: one for rows and one for columns. Counter2s are basically a HashMap[RowType, Counter[ColumnType, ValueType]] with syntactic sugar to make them behave like matrices.

val m = Counter2((4, "b", -1), (5, "b", -2)) // Counter2(5 -> Counter(b -> -2), 4 -> Counter(b -> -1))
m(5, ::) //  Counter(b -> -2)
m(::, "b") // Counter(5 -> -2, 4 -> -1)