Adding g-formula stochastic treatments
pzivich committed Mar 11, 2019
1 parent 5fadacc commit bfbab69
Showing 2 changed files with 275 additions and 0 deletions.
@@ -0,0 +1,208 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Parametric g-formula: stochastic interventions\n",
"In the previous tutorial we went over the basics of the parametric g-formula using `TimeFixedGFormula` for basic interventions. Additionally, we can use the g-formula to look at stochastic interventions. Stochastic interventions are treatment plans under which not necessarily everyone is treated, but some random percentage are treated.\n",
"\n",
"To estimate the g-formula for stochastic treatments, the process is fairly similar. However, instead of treating everyone, some percentage are treated. A random percentage are treated and then $\\hat{Y_i^a}$ are predicted and averaged. This process is repeated some number times and the average of the averaged potential outcomes is returned.\n",
"\n",
"For our example, we will return to the previous data set on ART among HIV-infected individuals and all-cause mortality. First, we will load the data (again ignoring missing data)"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"Int64Index: 517 entries, 0 to 546\n",
"Data columns (total 9 columns):\n",
"id 517 non-null int64\n",
"male 517 non-null int64\n",
"age0 517 non-null int64\n",
"cd40 517 non-null int64\n",
"dvl0 517 non-null int64\n",
"art 517 non-null int64\n",
"dead 517 non-null float64\n",
"t 517 non-null float64\n",
"cd4_wk45 430 non-null float64\n",
"dtypes: float64(3), int64(6)\n",
"memory usage: 40.4 KB\n"
]
}
],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"from zepid import load_sample_data, spline\n",
"from zepid.causal.gformula import TimeFixedGFormula\n",
"\n",
"df = load_sample_data(timevary=False)\n",
"dfs = df.dropna(subset=['dead']).copy()\n",
"dfs.info()\n",
"\n",
"dfs[['cd4_rs1', 'cd4_rs2']] = spline(dfs, 'cd40', n_knots=3, term=2, restricted=True)\n",
"dfs[['age_rs1', 'age_rs2']] = spline(dfs, 'age0', n_knots=3, term=2, restricted=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Similar to the previous tutorial, we initialize the `TimeFixedGFormula` with the data set (`dfs`), our treatment variable (`art`), and binary outcome (`dead`). Then we fit a regression model predicting all-cause mortality as a function of ART and our set of confounding variables (age, CD4 T-cell count, detectable viral load, gender)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" Generalized Linear Model Regression Results \n",
"==============================================================================\n",
"Dep. Variable: dead No. Observations: 517\n",
"Model: GLM Df Residuals: 507\n",
"Model Family: Binomial Df Model: 9\n",
"Link Function: logit Scale: 1.0000\n",
"Method: IRLS Log-Likelihood: -202.83\n",
"Date: Mon, 11 Mar 2019 Deviance: 405.67\n",
"Time: 07:08:33 Pearson chi2: 534.\n",
"No. Iterations: 6 Covariance Type: nonrobust\n",
"==============================================================================\n",
" coef std err z P>|z| [0.025 0.975]\n",
"------------------------------------------------------------------------------\n",
"Intercept -3.9822 2.621 -1.520 0.129 -9.119 1.154\n",
"art -0.7278 0.393 -1.854 0.064 -1.497 0.042\n",
"male -0.0773 0.334 -0.231 0.817 -0.732 0.578\n",
"age0 0.1548 0.092 1.689 0.091 -0.025 0.334\n",
"age_rs1 -0.0059 0.004 -1.493 0.135 -0.014 0.002\n",
"age_rs2 0.0129 0.006 2.035 0.042 0.000 0.025\n",
"cd40 -0.0121 0.004 -3.028 0.002 -0.020 -0.004\n",
"cd4_rs1 1.887e-05 1.19e-05 1.581 0.114 -4.52e-06 4.23e-05\n",
"cd4_rs2 -3.866e-05 4.57e-05 -0.846 0.398 -0.000 5.09e-05\n",
"dvl0 -0.1254 0.398 -0.315 0.753 -0.905 0.654\n",
"==============================================================================\n"
]
}
],
"source": [
"g = TimeFixedGFormula(dfs, exposure='art', outcome='dead')\n",
"g.outcome_model(model='art + male + age0 + age_rs1 + age_rs2 + cd40 + cd4_rs1 + cd4_rs2 + dvl0')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"However, this time we do some backgound research and find that one potential intervention to increase ART prescriptions increases the probability of ART treatment to 80%. As a result, it is potentially misleading to compare to compare the treat-all vs treat-none scenarios. Instead, we will compare the stochastic treatment where 80% of individuals are treated with ART to the scenario where no one is treated.\n",
"\n",
"## Stochastic Treatment Plans\n",
"To do this using `TimeFixedGFormula` we will instead call `fit_stochastic()` function instead of `fit()`. This function allows us to estimate a stochastic treatment. We specify `p=0.8` to have 80% of the population treated at random. By default, `fit_stochastic()` repeats this process 100 times and takes the average of these repeated random treatments. I will also use the `seed` argument to get replicable results. Let's look at the example"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"RD: -0.06041404870415\n"
]
}
],
"source": [
"g.fit_stochastic(p=0.8, seed=1000191)\n",
"r_80 = g.marginal_outcome\n",
"\n",
"g.fit(treatment='none')\n",
"r_none = g.marginal_outcome\n",
"\n",
"print('RD:', r_80 - r_none)"
]
},
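{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough sketch of the loop that `fit_stochastic()` performs (this is only an illustration of the logic described at the start of this notebook, not the actual `TimeFixedGFormula` implementation), the procedure looks something like the following, where `outcome_model` stands in for a fitted outcome regression:\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def stochastic_estimate(data, outcome_model, p, samples=100, seed=None):\n",
"    rng = np.random.RandomState(seed)\n",
"    means = []\n",
"    for _ in range(samples):\n",
"        d = data.copy()\n",
"        # randomly assign treatment to a proportion p of the population\n",
"        d['art'] = (rng.uniform(size=len(d)) < p).astype(int)\n",
"        # predict each person's outcome under that assignment and average\n",
"        means.append(outcome_model.predict(d).mean())\n",
"    # average of the averaged potential outcomes\n",
"    return np.mean(means)\n",
"```"
]
},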
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Under the treatment plan where 80% of people are randomly treated, the risk of all-cause mortality would have been 6.0% points lower than if no one was treated. \n",
"\n",
"After reading some more articles, we find an alternative treatment plan. Under this plan, 75% of men and 90% of women start using HIV. For this plan, we are interested in a conditional stochastic treatment. Again, we want to compare this to the scenario where no one is treated\n",
"\n",
"## Conditional Stochastic Treatment Plans\n",
"For conditionally stochastic treatments, we instead provide `p` a list of probabilities. Additionally, we specify the `conditional` argument with the group restrictions. Again, we will need to use the magic-g functionality. Below is the example of the stochastic plan where 75% of men are treated and 90% of women"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"RD: -0.058656195525173926\n"
]
}
],
"source": [
"g.fit_stochastic(p=[0.75, 0.90], conditional=[\"g['male']==1\", \"g['male']==0\"], seed=518012)\n",
"r_cs = g.marginal_outcome\n",
"\n",
"print('RD:', r_cs - r_none)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Under the treatment plan where 75% of men and 90% of women are randomly treated, the risk of all-cause mortality would have been 5.9% points lower than if no one was treated. This plan reduces the marginal mortality less than the previous stochastic plan because our HIV-infected population is predominantly men. \n",
"\n",
"# Conclusion\n",
"In this tutorial, I detailed stochastic treatment plans using the g-formula. While presented for a binary outcome, the same procedure can also be used to estimate stochastic treatments for continuous outcomes. Please view other tutorials for information other functions in *zEpid*\n",
"\n",
"## Further Readings\n",
"Ahern et al. (2016). Predicting the population health impacts of community interventions: the case of alcohol outlets and binge drinking. *AJPH*, 106(11), 1938-1943.\n",
"\n",
"Snowden et al. (2011) \"Implementation of G-computation on a simulated data set: demonstration of a causal inference technique.\" *AJE* 173.7: 731-738.\n",
"\n",
"Robins. (1986) \"A new approach to causal inference in mortality studies with a sustained exposure period—application to control of the healthy worker survivor effect.\" *Mathematical modelling* 7.9-12: 1393-1512"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
67 changes: 67 additions & 0 deletions 3_Epidemiology_Analysis/c_causal_inference/README.md
@@ -0,0 +1,67 @@
Throughout the following tutorials in this branch, we will make the following identifiability assumptions.
We will additionally assume no measurement error, no selection bias, and no interference.

# Assumptions

## Conditional Exchangeability
Conditional exchangeability is the assumption that potential outcomes are independent of the treatment received
conditional on some set of covariates. Using causal diagrams, this amounts to no open backdoor paths between the
treatment and outcome. See the further reading list for publications on the assumption of conditional exchangeability
and introductions to two different approaches to causal diagrams (directed acyclic graphs (DAGs) and single-world
intervention graphs (SWIGs)).

### Further Reading
Hernán MA, Robins JM. (2006). Estimating causal effects from epidemiological data. *Journal of Epidemiology
& Community Health*, 60(7), 578-586.

Greenland S, Pearl J, Robins JM. (1999). Causal diagrams for epidemiologic research. *Epidemiology*, 10, 37-48.

Richardson TS, Robins JM. (2013). Single world intervention graphs: a primer. *In Second UAI workshop on
causal structure learning*, Bellevue, Washington.

Breskin A, Cole SR, Hudgens MG. (2018). A practical example demonstrating the utility of single-world
intervention graphs. *Epidemiology*, 29(3), e20-e21.

## Positivity
The positivity assumption is that there are both treated and untreated individuals at every combination of covariates.
There are two types of positivity violations: deterministic and random. Deterministic positivity violations cannot be
resolved by collecting additional data. For an example of a deterministic positivity violation, consider the risk of
death by hysterectomy. Since men lack a uterus, they are unable to receive a hysterectomy. Random positivity violations
occur as a result of finite samples. In a small sample, it may simply happen that we didn't observe anyone treated
between ages 32 and 35. It isn't that no one could have been treated in that age group; we just didn't observe it in
our sample. For these scenarios, we will assume that our statistical model correctly interpolates over these areas
(often a strong assumption in small data sets).
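
As a quick illustration of checking for random positivity violations in the sample data, one could cross-tabulate
treatment against coarse covariate strata and look for empty cells. This is only a sketch; the age bands below are
arbitrary, and the column names come from the packaged sample data set.

```python
import pandas as pd
from zepid import load_sample_data

df = load_sample_data(timevary=False)

# cross-tabulate ART treatment by coarse age bands; an empty cell would
# suggest a (random) positivity violation within this sample
age_bands = pd.cut(df['age0'], bins=[20, 30, 40, 50, 60, 70])
print(pd.crosstab(age_bands, df['art']))
```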

### Further Reading
Westreich D, Cole SR. (2010). Invited commentary: positivity in practice. *American Journal of Epidemiology*,
171(6), 674-677.

Cole SR, Hernán MA. (2008). Constructing inverse probability weights for marginal structural models.
*American Journal of Epidemiology*, 168(6), 656-664.

## Causal Consistency
Causal consistency is also referred to as treatment variation irrelevance. Under this assumption, there is only one
version of treatment (consistency), or any remaining differences between versions of treatment are irrelevant
(treatment variation irrelevance). For example, consider a study of 200mg daily aspirin and all-cause mortality. In our
study, we may be willing to assume that taking aspirin in the morning versus at night is irrelevant to all-cause
mortality. This is an example of assuming treatment variation irrelevance. Generally, defining the treatment more
precisely resolves this issue. There are also some additional approaches. I recommend reviewing the readings below for
further discussion.

### Further Reading
Cole SR, Frangakis CE. (2009). The consistency statement in causal inference: a definition or an assumption?.
*Epidemiology*, 20(1), 3-5.

VanderWeele TJ. (2009). Concerning the consistency assumption in causal inference. *Epidemiology*, 20(6), 880-883.

VanderWeele TJ. (2018). On well-defined hypothetical interventions in the potential outcomes framework.
*Epidemiology*, 29(4), e24-e25.

## Correctly specified model
Since we will be working with continuous and high-dimensional data, we will be using parametric regression models.
We assume that these models are correctly specified. To make less restrictive assumptions regarding the functional
forms of continuous variables, we will use splines throughout. Please refer to the Data Basics tutorial for an
introduction to using splines with *zEpid*.
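
As a quick flavor (the Data Basics tutorial covers this in detail), restricted quadratic spline terms for a continuous
covariate in the packaged sample data can be generated along the following lines; this is only a sketch.

```python
from zepid import load_sample_data, spline

df = load_sample_data(timevary=False)

# generate two restricted quadratic spline terms for baseline CD4 count
df[['cd4_rs1', 'cd4_rs2']] = spline(df, 'cd40', n_knots=3, term=2, restricted=True)
```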

Additionally, we will sometimes use machine learning approaches to relax this assumption further (see the TMLE
tutorials for some examples).
