Causal Inference


IPTW

BIDMach has several basic causal estimators. IPTW stands for Inverse Probability of Treatment Weighting and is a widely used technique for causal inference with binary treatments. We start with some data features <math>X</math>, a response <math>Y</math>, and a "treatment" <math>Z</math>. In causal inference we are interested in the effects on <math>Y</math> of directly changing <math>Z</math>. This is different from the conditional probability of <math>Y</math> given <math>Z</math>, which depends on the joint probability distribution of the system "as is". The causal effect instead models a change to the system where we force <math>Z</math> to a new value. The simplest approach is to regress <math>Y</math> on <math>X,Z</math>, using linear or logistic regression. The coefficient of <math>Z</math> in the regression model captures the direct influence of <math>Z</math> on <math>Y</math>. If the regression model is exact, i.e. if <math>Y</math> really is a linear function of the inputs, then this coefficient accurately captures the influence of <math>Z</math> on <math>Y</math>. For logistic regression we use instead the change in the prediction, which is <math>L(1,X)-L(0,X)</math> for a particular input <math>X</math>, where <math>L(Z,X)</math> is the logistic predictor.
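To make this concrete, here is a minimal plain-Scala sketch (not part of BIDMach; the names sigmoid, avgLogisticEffect, beta and betaZ are hypothetical) that averages <math>L(1,X)-L(0,X)</math> over the data, given fitted logistic coefficients beta for <math>X</math> and betaZ for <math>Z</math>:

> // Sketch: average effect L(1,X) - L(0,X) under a fitted logistic model.
> // beta and betaZ are hypothetical fitted coefficients, not BIDMach output.
> def sigmoid(t: Double) = 1.0 / (1.0 + math.exp(-t))
> def avgLogisticEffect(xs: Array[Array[Double]], beta: Array[Double], betaZ: Double) = {
>   val diffs = xs.map { x =>
>     val base = x.zip(beta).map { case (a, b) => a * b }.sum
>     sigmoid(base + betaZ) - sigmoid(base)   // L(1,X) - L(0,X)
>   }
>   diffs.sum / xs.length
> }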

However, a regression model often won't be exact, and a different kind of estimate is needed. The next approach is to simulate randomization of <math>Z</math>. If we had randomly assigned each user to classes <math>Z=0</math> and <math>Z=1</math>, we could simply use the difference in responses as the causal effect. This is the approach taken in randomized trials. But given a dataset, we can't change the assignments to <math>Z</math> that were made. The actual assignment could depend in an arbitrary fashion on the other features <math>X</math>, and then the difference in response will depend on the influence of those features as well as <math>Z</math>. We could partition or cluster the data according to <math>X</math>, and then look within each subset at the difference in outcome for samples with <math>Z=0</math> and <math>Z=1</math>. But this is very difficult with high-dimensional features, and there turns out to be a much more efficient approach based on the *propensity score*. The propensity score is the probability of the treatment assignment, i.e. <math>P(Z|X)</math>, and the notations <math>g_0(X)=P(Z=0|X)</math> and <math>g_1(X)=P(Z=1|X)</math> are often used. Stratification by the propensity score turns out to be sufficient for accurate estimation of the causal effect.
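For example, if the propensity model is logistic (as in the estimator described below), then <math>g_1(X) = 1/(1+e^{-\beta^T X})</math> for some coefficient vector <math>\beta</math>, and <math>g_0(X) = 1 - g_1(X)</math>.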

Rather than stratifying, we can treat the propensity score as a sampling bias. By dividing by it, we create a pseudo-sample in which units have equal (and random) probability of assignment to <math>Z=0</math> or <math>Z=1</math>. Then the IPTW estimator for the causal effect is defined as:

<math>E(Y_1)-E(Y_0) = \frac{1}{n}\sum\limits_{i=1}^n \left( \frac{Z_i Y_i}{g_1(X_i)} - \frac{(1-Z_i)Y_i}{g_0(X_i)}\right)</math>

where <math>Y_0</math> and <math>Y_1</math> are the responses <math>Y</math> when <math>Z</math> is forced respectively to 0 or 1. There is a simple estimator for IPTW effects in the causal package. It concurrently computes <math>P(Z|X)</math> using logistic regression and the estimator above. It expects a targets option which encodes both the treatment(s) <math>Z</math> and the effects <math>Y</math>. It can analyze k effects at once, and targets should be a 2k x nfeats matrix. The first k rows encode the k treatments, and the next k rows encode the corresponding effects. The treatment and effects should therefore be among the input features, and the <math>j^{th}</math> row of targets will normally be zeros with a single 1 in the position of the feature that encodes the <math>j^{th}</math> treatment. The <math>(j+k)^{th}</math> row should have a single 1 in the position that encodes the feature for the <math>j^{th}</math> effect.
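As a plain-Scala illustration of this sum (a sketch, not the BIDMach implementation; iptwEffect is a hypothetical name), with z(i) the observed treatment, y(i) the response, and g1(i) the estimated propensity <math>g_1(X_i)</math>:

> // Sketch of the IPTW sum; g0(i) = 1 - g1(i) for a binary treatment.
> def iptwEffect(z: Array[Double], y: Array[Double], g1: Array[Double]) = {
>   val terms = (0 until y.length).map { i =>
>     z(i) * y(i) / g1(i) - (1 - z(i)) * y(i) / (1 - g1(i))
>   }
>   terms.sum / y.length
> }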

The output is in the learner's modelmats(1) field. This will be an FMat with k entries corresponding to the k effects.

A-IPTW

Regression estimates will be inaccurate if the outcome <math>Y</math> is not accurately modeled as a linear or logistic function of <math>X</math>. IPTW estimates can be quite inaccurate if the model used for <math>P(Z|X)</math> is poor. Doubly-robust estimators combine regression and propensity score estimators in such a way that if either estimator is accurate, the errors from the other estimator cancel out. In other words, a doubly-robust estimator is accurate if either the base regression model or the propensity score model is accurate. One class of these estimators is called A-IPTW, for Augmented IPTW:

<math>E(Y_1)-E(Y_0) = \frac{1}{n}\sum\limits_{i=1}^n \left( \frac{Z_i Y_i}{g_1(X_i)} - \frac{(1-Z_i)Y_i}{g_0(X_i)}- (Z_i-g_1(X_i))\left( \frac{m_1(X_i)}{g_1(X_i)} + \frac{m_0(X_i)}{g_0(X_i)}\right)\right)</math>

where <math>m_0(X_i)=R(X_i,0)</math> and <math>m_1(X_i)=R(X_i,1)</math> are the predictions of a regression model <math>R(X,Z)</math> for <math>Y</math> given <math>X,Z</math>.
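Extending the earlier plain-Scala sketch (again illustrative, not the BIDMach implementation), the A-IPTW sum adds the augmentation term, with m0(i) and m1(i) the outcome-model predictions above:

> // Sketch of the A-IPTW sum; m0(i) = R(X_i,0), m1(i) = R(X_i,1).
> def aiptwEffect(z: Array[Double], y: Array[Double], g1: Array[Double],
>                 m0: Array[Double], m1: Array[Double]) = {
>   val terms = (0 until y.length).map { i =>
>     val g0 = 1 - g1(i)
>     z(i) * y(i) / g1(i) - (1 - z(i)) * y(i) / g0 -
>       (z(i) - g1(i)) * (m1(i) / g1(i) + m0(i) / g0)
>   }
>   terms.sum / y.length
> }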

Creation of an A-IPTW model is the same as for an IPTW model. Both estimators are computed by the same IPTW class. IPTW estimates appear in modelmats(1), while the A-IPTW estimates are in modelmats(2).

Data Preparation

We assume that there is a data matrix dat which is nfeatures x nsamples, and which holds the vectors <math>X</math> above. We further assume that the treatment and outcome are encoded as features in dat, and that there are maps <math>Z = AX</math> and <math>Y = BX</math> that select <math>Z</math> and <math>Y</math>. So <math>A</math> and <math>B</math> should be zero-one-valued matrices, with <math>A_{i,j}=1</math> when feature <math>j</math> encodes <math>Z_i</math> and <math>B_{i,j}=1</math> when feature <math>j</math> encodes <math>Y_i</math>. Thus <math>A</math> and <math>B</math> should have the same size. For efficiency they are combined into a single target matrix, so preparing a causal estimation run should look like this:

> val (nn, opts) = IPTW.learner(dat)
> opts.targets = A on B
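For concreteness, the matrices A and B above might be built like this for a single treatment/effect pair (k=1); the feature indices iz and iy are hypothetical and stand for the positions in dat that encode <math>Z</math> and <math>Y</math>:

> // Hypothetical k=1 example: feature iz encodes Z, feature iy encodes Y.
> val nfeats = dat.nrows
> val A = zeros(1, nfeats); A(0, iz) = 1f
> val B = zeros(1, nfeats); B(0, iy) = 1f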

As with regression modeling, it's essential to mask out these features so that they are not allowed to predict themselves. Thus the rmask matrix should have the same size as opts.targets, and (at least) zeros wherever opts.targets has ones. Furthermore, we don't allow outcome features (Y) to be used to predict the treatment (Z), so the corresponding values in rmask should also be set to zero. Then

> opts.rmask = 1 - opts.targets
> // Zero Y --> Z terms as well; e.g. for the hypothetical k=1 layout above:
> opts.rmask(0, iy) = 0f
> nn.train
...

By default, the IPTW class trains logistic models for Z and Y. You can change that by setting opts.links, which has the same behavior as for the GLM class. When training is done, nn.modelmats(1) and nn.modelmats(2) contain vectors (GMat or FMat) of estimates for the IPTW and A-IPTW estimators respectively. You can retrieve them as FMats with:

> val iptw = FMat(nn.modelmats(1))
> val aiptw = FMat(nn.modelmats(2))