algorithms for feature extraction from spatio-temporal data
Source or feature extraction is the process of identifying spatial features of interest from data that varies over space and time. It can be either unsupervised or supervised, and is common in biological data analysis problems, like identifying neurons in calcium imaging data.
This package contains a collection of approaches for solving this problem. It defines a set of algorithms in the scikit-learn style, each of which can be fit to data and returns a model that can be used to transform new data. Compatible with Python 2.7+ and 3.4+. Works well alongside thunder and supports parallelization via spark, but can be used as a standalone package on local numpy arrays.
pip install thunder-extraction
# generate data
from extraction.utils import make_gaussian
data = make_gaussian()
# fit a model
from extraction import NMF
model = NMF().fit(data)
# extract sources by transforming data
sources = model.transform(data)
Analysis starts by importing and constructing an algorithm
from extraction import NMF
algorithm = NMF(k=10)
Algorithms can be fit to data in the form of a thunder images object or a t,x,y(,z) numpy array
model = algorithm.fit(data)
The model is a collection of identified features that can be used to extract temporal signals from new data
signals = model.transform(data)
All algorithms have the following methods
The fit method fits the algorithm to the data, which should be a collection of time-varying images. The data can be either a thunder images object or a numpy array with shape t,x,y(,z).
For many algorithms, fit will take the optional arguments chunk_size and padding, which allow the algorithm to be performed on smaller chunks of the data, either in serial (if running locally) or in parallel (if running on a cluster).
A chunk is defined as a subset of the image in space, including all time points. The chunk_size is the size of each chunk in pixels, and padding is the amount by which to pad the chunks in each dimension. For example, given a data set of 500 time points with 100 x 100 pixels each, we could set chunk_size=(50,50), resulting in four chunks, each of which covers 50 x 50 pixels across all 500 time points.
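The chunking described above can be sketched in plain Python. This is only an illustration and not the package's own implementation: the helper name chunk_slices is hypothetical, and it assumes that padding extends each chunk by the given number of pixels on every side, clipped to the image bounds.

```python
from itertools import product


def chunk_slices(spatial_shape, chunk_size, padding=0):
    """Return one tuple of slices per chunk, together covering spatial_shape.

    Assumption: padding extends each chunk by `padding` pixels on every
    side, clipped to the image bounds.
    """
    per_axis = []
    for length, size in zip(spatial_shape, chunk_size):
        axis_slices = []
        for start in range(0, length, size):
            lo = max(start - padding, 0)
            hi = min(start + size + padding, length)
            axis_slices.append(slice(lo, hi))
        per_axis.append(axis_slices)
    # Every combination of per-axis slices is one spatial chunk.
    return list(product(*per_axis))


# A 100 x 100 image split into 50 x 50 chunks gives four chunks.
chunks = chunk_slices((100, 100), (50, 50))
padded = chunk_slices((100, 100), (50, 50), padding=5)
```

For a numpy array named data with shape t,x,y, indexing with (slice(None),) + chunks[0] would select all time points for the first chunk.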
The result of fitting an algorithm is a model. Every model has the following properties and methods.
The regions property holds the spatial regions identified during fitting.
The transform method transforms a new data set using the model by averaging the pixels within each of the regions. As with fitting, data can be either a thunder images object or a numpy array with shape t,x,y(,z). It returns a thunder series object, which can be converted to a numpy array by calling toarray().
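The averaging step described above can be illustrated with numpy alone. This is a minimal sketch and not the package's code: the function name average_regions is hypothetical, and it assumes each region is given as an array of (x, y) pixel coordinates and that the data is a t,x,y array.

```python
import numpy as np


def average_regions(data, regions):
    """Average the pixels of each region at every time point.

    data: array with shape (t, x, y)
    regions: list of (n_pixels, 2) integer arrays of (x, y) coordinates
    Returns an array with shape (n_regions, t).
    """
    data = np.asarray(data, dtype=float)
    signals = []
    for coords in regions:
        coords = np.asarray(coords)
        xs, ys = coords[:, 0], coords[:, 1]
        # Fancy indexing picks the region's pixels for all time points,
        # giving shape (t, n_pixels); the mean collapses the pixel axis.
        signals.append(data[:, xs, ys].mean(axis=1))
    return np.array(signals)


# Toy data: 3 time points of a 4 x 4 image whose value equals the time index.
data = np.stack([np.full((4, 4), float(t)) for t in range(3)])
regions = [np.array([[0, 0], [0, 1]]), np.array([[2, 2], [3, 3], [2, 3]])]
signals = average_regions(data, regions)
```

Each row of signals is the time course of one region, so the result has one row per region and one column per time point.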
Merges overlapping regions in the model by greedily comparing nearby regions and merging those whose similarity to one another exceeds the specified overlap. Repeats the greedy merging process max_iter times. Only considers the k_nearest neighbors of each region to speed up computation.
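The greedy merging described above might look roughly like the following sketch. It is not the package's implementation: the function names, the default values, and the similarity measure (shared pixels relative to the smaller region) are all assumptions made for illustration, and regions are represented as sets of (x, y) tuples.

```python
def center(region):
    xs = [p[0] for p in region]
    ys = [p[1] for p in region]
    return (sum(xs) / len(xs), sum(ys) / len(ys))


def overlap_fraction(a, b):
    # Assumed similarity measure: shared pixels relative to the smaller region.
    return len(a & b) / min(len(a), len(b))


def merge_regions(regions, overlap=0.5, max_iter=2, k_nearest=10):
    regions = [set(r) for r in regions]
    for _ in range(max_iter):
        merged, used = [], set()
        centers = [center(r) for r in regions]
        for i, region in enumerate(regions):
            if i in used:
                continue
            used.add(i)
            # Only examine the k_nearest unused regions, ranked by center distance.
            others = [j for j in range(len(regions)) if j not in used]
            others.sort(key=lambda j: (centers[i][0] - centers[j][0]) ** 2
                        + (centers[i][1] - centers[j][1]) ** 2)
            current = set(region)
            for j in others[:k_nearest]:
                if overlap_fraction(current, regions[j]) > overlap:
                    current |= regions[j]
                    used.add(j)
            merged.append(current)
        if len(merged) == len(regions):
            break  # nothing was merged in this pass, so stop early
        regions = merged
    return regions


a = {(0, 0), (0, 1), (1, 0), (1, 1)}
b = {(0, 1), (1, 0), (1, 1), (1, 2)}   # shares 3 of its 4 pixels with a
c = {(8, 8), (8, 9)}                   # far away, no shared pixels
merged = merge_regions([a, b, c], overlap=0.5)
```

In this toy case a and b are merged into one region and c is left alone. A higher overlap value makes the comparison stricter and so produces fewer merges.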
Here are all the algorithms currently available.
NMF: local non-negative matrix factorization followed by thresholding to yield binary spatial regions. The factorization is applied either to image blocks or to the entire image.
The algorithm takes the following parameters.
k: number of components to estimate per block
max_size: maximum size of each region
min_size: minimum size for each region
max_iter: maximum number of algorithm iterations
percentile: value for thresholding (higher means more thresholding)
overlap: value for determining whether to merge (higher means fewer merges)
The fit method takes the following options.
block_size: a size in megabytes like 150 or a size in pixels like (10,10); if None, will use the full image
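To make the idea of factorization followed by thresholding concrete, here is a small numpy-only sketch applied to a whole image rather than to blocks. It is not the package's NMF class: it uses plain multiplicative-update NMF as a stand-in, the function name nmf_regions, the seed argument and the default values are hypothetical, and the size limits and merging step are left out.

```python
import numpy as np


def nmf_regions(data, k=2, max_iter=200, percentile=90, seed=0):
    """Factorize a (t, x, y) movie and threshold each spatial component.

    Plain multiplicative-update NMF, used here only as an illustration;
    returns k boolean masks with shape (x, y).
    """
    t = data.shape[0]
    spatial_shape = data.shape[1:]
    # Rows are pixels, columns are time points; NMF needs non-negative input.
    v = np.clip(data.reshape(t, -1).T.astype(float), 0, None)
    rng = np.random.RandomState(seed)
    w = rng.rand(v.shape[0], k) + 1e-3   # spatial components
    h = rng.rand(k, t) + 1e-3            # temporal components
    eps = 1e-9
    for _ in range(max_iter):
        h *= w.T.dot(v) / (w.T.dot(w).dot(h) + eps)
        w *= v.dot(h.T) / (w.dot(h).dot(h.T) + eps)
    masks = []
    for component in w.T:
        # Keep only pixels above the chosen percentile of this component.
        threshold = np.percentile(component, percentile)
        masks.append((component > threshold).reshape(spatial_shape))
    return masks


# Toy movie: two 2 x 2 sources that are active at alternating time points.
data = np.zeros((20, 8, 8))
data[0::2, 1:3, 1:3] = 1.0
data[1::2, 5:7, 5:7] = 1.0
masks = nmf_regions(data, k=2, percentile=95)
```

Each mask is a binary image marking the pixels assigned to one component. Raising percentile keeps fewer pixels per component, which matches the "higher means more thresholding" note in the parameter list.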