algorithms for mass univariate regression
Mass univariate regression is the process of independently regressing multiple response variables against a single set of explanatory features. It is common in any domain in which a large number of response variables are measured, and fitting large collections of such models can benefit significantly from parallelization.
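To make the idea concrete, here is a minimal sketch in plain numpy, not this package's implementation. It fits one ordinary least squares model per response variable against the same design matrix. The variable names and the synthetic data are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, n_targets = 100, 3, 10

# One shared design matrix (samples x features)
X = rng.normal(size=(n_samples, n_features))

# Synthetic responses (targets x samples), each with its own coefficients
true_betas = rng.normal(size=(n_targets, n_features))
Y = true_betas @ X.T + 0.1 * rng.normal(size=(n_targets, n_samples))

# One independent least squares fit per response variable.
# Each iteration is independent of the others, which is what makes
# this kind of analysis straightforward to parallelize.
betas = np.vstack([np.linalg.lstsq(X, y, rcond=None)[0] for y in Y])

print(betas.shape)  # (targets, features)
```

Because every fit uses the same `X` and touches only one row of `Y`, the loop body can be distributed across workers without any coordination between fits.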
This package provides a simple API for fitting these kinds of models. It provides a collection of algorithms for performing different types of mass regression, all following the scikit-learn style. It also supports custom algorithms provided directly from scikit-learn. The algorithms are fit to data, returning a fitted model that contains regression coefficients and allows for prediction and scoring on new data. Compatible with Python 2.7+ and 3.4+. Works well alongside thunder and supports parallelization via spark, but can also be used as a standalone module on local numpy arrays.
pip install thunder-regression
In this example we'll create data and fit a collection of models
# generate data
from sklearn.datasets import make_regression
X, Y = make_regression(n_samples=100, n_features=3, n_informative=3, n_targets=10, noise=1.0)
# create and fit the model
from regression import LinearRegression
algorithm = LinearRegression(fit_intercept=False)
model = algorithm.fit(X, Y.T)
After fitting, model.betas is an array with the 3 coefficients for each of the 10 response variables.
Import and construct an algorithm
from regression import LinearRegression
algorithm = LinearRegression(fit_intercept=False)
Fit the algorithm to data in the form of a samples x features design matrix X and a targets x samples response matrix Y.
model = algorithm.fit(X, Y)
The results of the fit are accessible on the fitted model, and the model can be used to score new data
betas = model.betas
rsq = model.score(X, Y)
For all methods, X should be a local numpy array, and Y can be either a local numpy array, a bolt array, or a thunder Series object.
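The shape convention above (X as samples x features, Y as targets x samples) is easy to get wrong when data comes from scikit-learn, which usually lays responses out as samples x targets. The sketch below is a hypothetical helper, not part of this package, that checks the two shapes against each other for local numpy arrays.

```python
import numpy as np

def check_mass_regression_shapes(X, Y):
    """Hypothetical helper: validate X (samples x features) against
    Y (targets x samples) and return (n_samples, n_features, n_targets)."""
    X = np.asarray(X)
    Y = np.asarray(Y)
    if X.ndim != 2:
        raise ValueError("X must be 2-D: samples x features")
    if Y.ndim != 2:
        raise ValueError("Y must be 2-D: targets x samples")
    if Y.shape[1] != X.shape[0]:
        raise ValueError(
            "Y has %d samples per target but X has %d samples; "
            "Y may need to be transposed" % (Y.shape[1], X.shape[0])
        )
    return X.shape[0], X.shape[1], Y.shape[0]

X = np.zeros((100, 3))   # 100 samples, 3 features
Y = np.zeros((10, 100))  # 10 targets, 100 samples each
dims = check_mass_regression_shapes(X, Y)
```

Passing `Y.T` here instead of `Y` raises a `ValueError`, which is the mistake the check is meant to catch.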
All algorithms have the following methods:
Fit the algorithm to data
- X: design matrix, dimensions samples x features
- Y: collection of responses, dimensions targets x samples
- returns a fitted MassRegressionModel
The result of fitting an algorithm is a model with the following properties and methods:
Array of regression coefficients, dimensions targets x features. If an intercept was fit, it will be the first feature.
Array of regression coefficients, followed by prediction scores on the fitted data, dimensions targets x (features + 1). If an intercept was fit, it will be the first feature.
Array of individual fitted models, dimensions 1 x targets.
Array of coefficients, not including a possible intercept term, for consistency with scikit-learn.
Array of intercepts, for consistency with scikit-learn. If no intercepts were fit, all will have values 0.0.
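The relationship between a coefficient array with the intercept in the first position and the separate scikit-learn-style coefficient and intercept arrays can be sketched as follows. This is an illustration in plain numpy under the assumption that the intercept is fit by prepending a column of ones to the design matrix; the package's internals may differ, and the names `betas`, `coef_`, and `intercept_` here are local variables, not attributes read from a fitted model.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))                       # samples x features
true_intercepts = np.array([1.0, -2.0, 0.5, 3.0])  # one per target
true_coefs = rng.normal(size=(4, 3))               # targets x features

# Noiseless responses, targets x samples
Y = true_intercepts[:, None] + true_coefs @ X.T

# Prepend a column of ones so the intercept becomes the first "feature"
X1 = np.column_stack([np.ones(len(X)), X])
betas = np.linalg.lstsq(X1, Y.T, rcond=None)[0].T  # targets x (1 + features)

# Split into the scikit-learn-style pieces
intercept_ = betas[:, 0]
coef_ = betas[:, 1:]
```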
Predicts the response to new inputs.
- X: design matrix, dimensions new samples x features
- returns an array of responses, dimensions targets x new samples
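For a linear model, the shapes involved in prediction can be sketched with a single matrix product. The `predict` function below is a hypothetical standalone helper, not the model's method, and it assumes the coefficient array is targets x features with any intercepts passed separately.

```python
import numpy as np

def predict(betas, X_new, intercepts=None):
    """Hypothetical helper: return responses with shape targets x new samples."""
    Y_hat = np.asarray(betas) @ np.asarray(X_new).T
    if intercepts is not None:
        # Broadcast one intercept per target across all new samples
        Y_hat = Y_hat + np.asarray(intercepts)[:, None]
    return Y_hat

betas = np.array([[1.0, 0.0],
                  [0.5, -1.0],
                  [2.0, 2.0]])   # 3 targets x 2 features
X_new = np.array([[1.0, 2.0],
                  [3.0, 4.0],
                  [0.0, 1.0],
                  [-1.0, 0.0]])  # 4 new samples x 2 features

Y_hat = predict(betas, X_new)
print(Y_hat.shape)  # (3, 4): targets x new samples
```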
Computes the goodness of fit (r-squared, unless otherwise stated) of the model for given data
- X: design matrix, dimensions samples x features
- Y: collection of responses, dimensions targets x samples
- returns an array of scores
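As a reference for what an r-squared score per target means, here is a small sketch using the usual definition (one minus the ratio of the residual sum of squares to the total sum of squares), computed row by row. The function name `r_squared` is hypothetical, and the sketch takes predictions directly rather than a design matrix.

```python
import numpy as np

def r_squared(Y, Y_hat):
    """Hypothetical helper: one r-squared value per target.

    Y and Y_hat both have shape targets x samples.
    """
    Y = np.asarray(Y, dtype=float)
    Y_hat = np.asarray(Y_hat, dtype=float)
    ss_res = ((Y - Y_hat) ** 2).sum(axis=1)
    ss_tot = ((Y - Y.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)
    return 1.0 - ss_res / ss_tot

Y = np.array([[1.0, 2.0, 3.0, 4.0],
              [2.0, 4.0, 6.0, 8.0]])
Y_hat = np.array([[1.0, 2.0, 3.0, 4.0],   # perfect prediction
                  [5.0, 5.0, 5.0, 5.0]])  # always predicts the target's mean

scores = r_squared(Y, Y_hat)
```

A perfect prediction scores 1.0, and always predicting the mean of the target scores 0.0, so `scores` holds one value per row of `Y`.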
Simultaneously computes the results of predict(X) and score(X, Y)
- X: design matrix, dimensions samples x features
- Y: collection of responses, dimensions targets x samples
- returns an array of predictions and an array of scores
Here are all the algorithms currently available.
Linear regression through ordinary least squares as implemented in scikit-learn's LinearRegression algorithm.
- fit_intercept: whether or not to fit intercept terms
- normalize: whether or not to normalize the data before fitting the models
Use a custom regression algorithm in a mass regression analysis. The provided algorithm should operate on single response variables, and must conform to the scikit-learn API as follows:
- Must implement a .fit(X, Y) method that takes a design matrix (samples x features) and a response vector and returns an object representing the fitted model.
- The returned fitted model must have attributes .coef_ and .intercept_ that hold the results of the fit (.coef_ having dimensions 1 x features and .intercept_ being a scalar).
- The returned fitted model must also have methods .predict(X) and .score(X, y) (X having dimensions new samples x features and y having dimensions 1 x new samples). The former should return a vector of predictions (dimensions 1 x new samples) and the latter should return a scalar score (likely r-squared).
This allows you to define an algorithm in scikit-learn and then wrap it for mass fitting, for example:
from regression import CustomRegression
from sklearn.linear_model import LassoCV
algorithm = CustomRegression(LassoCV(normalize=True, fit_intercept=False))
model = algorithm.fit(X, Y)
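The contract listed above does not strictly require a scikit-learn class; any object with the same methods and attributes has the right shape. The sketch below is a hypothetical single-response ridge estimator written in plain numpy to illustrate that contract. The class name `RidgeSketch` and its `alpha` parameter are assumptions, and whether CustomRegression accepts such an object unchanged has not been verified here.

```python
import numpy as np

class RidgeSketch(object):
    """Hypothetical single-response estimator following the contract above."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float).ravel()
        # Center the data so the intercept is not penalized
        x_mean, y_mean = X.mean(axis=0), y.mean()
        Xc = X - x_mean
        A = Xc.T @ Xc + self.alpha * np.eye(X.shape[1])
        w = np.linalg.solve(A, Xc.T @ (y - y_mean))
        self.coef_ = w                                 # one value per feature
        self.intercept_ = float(y_mean - x_mean @ w)   # scalar
        return self

    def predict(self, X):
        # Vector of predictions, one per new sample
        return np.asarray(X, dtype=float) @ self.coef_ + self.intercept_

    def score(self, X, y):
        # Scalar r-squared
        y = np.asarray(y, dtype=float).ravel()
        resid = y - self.predict(X)
        return float(1.0 - (resid ** 2).sum() / ((y - y.mean()) ** 2).sum())

# Quick check on noiseless synthetic data for a single response
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(40, 3))
true_w = np.array([1.5, -0.5, 2.0])
y_demo = X_demo @ true_w + 0.7

est = RidgeSketch(alpha=1e-10).fit(X_demo, y_demo)
demo_score = est.score(X_demo, y_demo)
```

Note that `coef_` is stored here as a flat vector of length features, which is how scikit-learn estimators store it for a single response.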
Run tests with
py.test
Tests run locally with numpy by default, but the same tests can be run against a local spark installation using
py.test --engine=spark