diff --git a/3_Epidemiology_Analysis/c_causal_inference/1_time-fixed-treatments/2_gformula_stochastic.ipynb b/3_Epidemiology_Analysis/c_causal_inference/1_time-fixed-treatments/2_gformula_stochastic.ipynb new file mode 100644 index 0000000..3ba9bff --- /dev/null +++ b/3_Epidemiology_Analysis/c_causal_inference/1_time-fixed-treatments/2_gformula_stochastic.ipynb @@ -0,0 +1,208 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Parametric g-formula: stochastic interventions\n", + "In the previous tutorial we went over the basics of the parametric g-formula using `TimeFixedGFormula` for basic interventions. Additionally, we can use the g-formula to look at stochastic interventions. Stochastic interventions are treatment plans under which not necessarily everyone is treated, but some random percentage are treated.\n", + "\n", + "To estimate the g-formula for stochastic treatments, the process is fairly similar. However, instead of treating everyone, some percentage are treated. A random percentage are treated and then $\\hat{Y_i^a}$ are predicted and averaged. This process is repeated some number times and the average of the averaged potential outcomes is returned.\n", + "\n", + "For our example, we will return to the previous data set on ART among HIV-infected individuals and all-cause mortality. First, we will load the data (again ignoring missing data)" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "Int64Index: 517 entries, 0 to 546\n", + "Data columns (total 9 columns):\n", + "id 517 non-null int64\n", + "male 517 non-null int64\n", + "age0 517 non-null int64\n", + "cd40 517 non-null int64\n", + "dvl0 517 non-null int64\n", + "art 517 non-null int64\n", + "dead 517 non-null float64\n", + "t 517 non-null float64\n", + "cd4_wk45 430 non-null float64\n", + "dtypes: float64(3), int64(6)\n", + "memory usage: 40.4 KB\n" + ] + } + ], + "source": [ + "import numpy as np\n", + "import pandas as pd\n", + "\n", + "from zepid import load_sample_data, spline\n", + "from zepid.causal.gformula import TimeFixedGFormula\n", + "\n", + "df = load_sample_data(timevary=False)\n", + "dfs = df.dropna(subset=['dead']).copy()\n", + "dfs.info()\n", + "\n", + "dfs[['cd4_rs1', 'cd4_rs2']] = spline(dfs, 'cd40', n_knots=3, term=2, restricted=True)\n", + "dfs[['age_rs1', 'age_rs2']] = spline(dfs, 'age0', n_knots=3, term=2, restricted=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Similar to the previous tutorial, we initialize the `TimeFixedGFormula` with the data set (`dfs`), our treatment variable (`art`), and binary outcome (`dead`). Then we fit a regression model predicting all-cause mortality as a function of ART and our set of confounding variables (age, CD4 T-cell count, detectable viral load, gender)" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " Generalized Linear Model Regression Results \n", + "==============================================================================\n", + "Dep. Variable: dead No. Observations: 517\n", + "Model: GLM Df Residuals: 507\n", + "Model Family: Binomial Df Model: 9\n", + "Link Function: logit Scale: 1.0000\n", + "Method: IRLS Log-Likelihood: -202.83\n", + "Date: Mon, 11 Mar 2019 Deviance: 405.67\n", + "Time: 07:08:33 Pearson chi2: 534.\n", + "No. Iterations: 6 Covariance Type: nonrobust\n", + "==============================================================================\n", + " coef std err z P>|z| [0.025 0.975]\n", + "------------------------------------------------------------------------------\n", + "Intercept -3.9822 2.621 -1.520 0.129 -9.119 1.154\n", + "art -0.7278 0.393 -1.854 0.064 -1.497 0.042\n", + "male -0.0773 0.334 -0.231 0.817 -0.732 0.578\n", + "age0 0.1548 0.092 1.689 0.091 -0.025 0.334\n", + "age_rs1 -0.0059 0.004 -1.493 0.135 -0.014 0.002\n", + "age_rs2 0.0129 0.006 2.035 0.042 0.000 0.025\n", + "cd40 -0.0121 0.004 -3.028 0.002 -0.020 -0.004\n", + "cd4_rs1 1.887e-05 1.19e-05 1.581 0.114 -4.52e-06 4.23e-05\n", + "cd4_rs2 -3.866e-05 4.57e-05 -0.846 0.398 -0.000 5.09e-05\n", + "dvl0 -0.1254 0.398 -0.315 0.753 -0.905 0.654\n", + "==============================================================================\n" + ] + } + ], + "source": [ + "g = TimeFixedGFormula(dfs, exposure='art', outcome='dead')\n", + "g.outcome_model(model='art + male + age0 + age_rs1 + age_rs2 + cd40 + cd4_rs1 + cd4_rs2 + dvl0')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "However, this time we do some backgound research and find that one potential intervention to increase ART prescriptions increases the probability of ART treatment to 80%. As a result, it is potentially misleading to compare to compare the treat-all vs treat-none scenarios. Instead, we will compare the stochastic treatment where 80% of individuals are treated with ART to the scenario where no one is treated.\n", + "\n", + "## Stochastic Treatment Plans\n", + "To do this using `TimeFixedGFormula` we will instead call `fit_stochastic()` function instead of `fit()`. This function allows us to estimate a stochastic treatment. We specify `p=0.8` to have 80% of the population treated at random. By default, `fit_stochastic()` repeats this process 100 times and takes the average of these repeated random treatments. I will also use the `seed` argument to get replicable results. Let's look at the example" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "RD: -0.06041404870415\n" + ] + } + ], + "source": [ + "g.fit_stochastic(p=0.8, seed=1000191)\n", + "r_80 = g.marginal_outcome\n", + "\n", + "g.fit(treatment='none')\n", + "r_none = g.marginal_outcome\n", + "\n", + "print('RD:', r_80 - r_none)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Under the treatment plan where 80% of people are randomly treated, the risk of all-cause mortality would have been 6.0% points lower than if no one was treated. \n", + "\n", + "After reading some more articles, we find an alternative treatment plan. Under this plan, 75% of men and 90% of women start using HIV. For this plan, we are interested in a conditional stochastic treatment. Again, we want to compare this to the scenario where no one is treated\n", + "\n", + "## Conditional Stochastic Treatment Plans\n", + "For conditionally stochastic treatments, we instead provide `p` a list of probabilities. Additionally, we specify the `conditional` argument with the group restrictions. Again, we will need to use the magic-g functionality. Below is the example of the stochastic plan where 75% of men are treated and 90% of women" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "RD: -0.058656195525173926\n" + ] + } + ], + "source": [ + "g.fit_stochastic(p=[0.75, 0.90], conditional=[\"g['male']==1\", \"g['male']==0\"], seed=518012)\n", + "r_cs = g.marginal_outcome\n", + "\n", + "print('RD:', r_cs - r_none)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Under the treatment plan where 75% of men and 90% of women are randomly treated, the risk of all-cause mortality would have been 5.9% points lower than if no one was treated. This plan reduces the marginal mortality less than the previous stochastic plan because our HIV-infected population is predominantly men. \n", + "\n", + "# Conclusion\n", + "In this tutorial, I detailed stochastic treatment plans using the g-formula. While presented for a binary outcome, the same procedure can also be used to estimate stochastic treatments for continuous outcomes. Please view other tutorials for information other functions in *zEpid*\n", + "\n", + "## Further Readings\n", + "Ahern et al. (2016). Predicting the population health impacts of community interventions: the case of alcohol outlets and binge drinking. *AJPH*, 106(11), 1938-1943.\n", + "\n", + "Snowden et al. (2011) \"Implementation of G-computation on a simulated data set: demonstration of a causal inference technique.\" *AJE* 173.7: 731-738.\n", + "\n", + "Robins. (1986) \"A new approach to causal inference in mortality studies with a sustained exposure period—application to control of the healthy worker survivor effect.\" *Mathematical modelling* 7.9-12: 1393-1512" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.3" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/3_Epidemiology_Analysis/c_causal_inference/README.md b/3_Epidemiology_Analysis/c_causal_inference/README.md new file mode 100644 index 0000000..6c5efd6 --- /dev/null +++ b/3_Epidemiology_Analysis/c_causal_inference/README.md @@ -0,0 +1,67 @@ +Throughout the following tutorials in this branch, we will make the following identifiability assumptions. +We additionally will assume no measurement error, no selection bias, and no interference. + +# Assumptions + +## Conditional Exchangeability +Conditional exchangeability is the assumption that potential outcomes are independent of the treatment received +conditional on some set of covariates. Using causal diagrams, this amounts to no open backdoor paths between the +treatment and outcome. See the further reading list for publications on the assumption of conditional exchangeability +and introductions to two different approaches to causal diagrams (directed acyclic graphs (DAG) and single-world +intervention graphs (SWIG)) + +### Further Reading +Hernán MA, Robins JM. (2006). Estimating causal effects from epidemiological data. *Journal of Epidemiology +& Community Health*, 60(7), 578-586. + +Greenland S, Pearl J, Robins JM. (1999). Causal diagrams for epidemiologic research. *Epidemiology*, 10, 37-48. + +Richardson TS, Robins JM. (2013). Single world intervention graphs: a primer. *In Second UAI workshop on +causal structure learning*, Bellevue, Washington. + +Breskin A, Cole SR, Hudgens MG. (2018). A practical example demonstrating the utility of single-world +intervention graphs. *Epidemiology*, 29(3), e20-e21. + +## Positivity +The positivity assumption is that there are treated and untreated individuals at every combination of covariates. There +are two potential positivity violations; deterministic or random. Deterministic positivity violations can never occur +despite additional data collection. For an example of a deterministic positivity violation, consider the risk of death +by hysterectomy. Since men lack a uterus, they are unable to receive a hysterectomy. Random positivity violations +occur as a result of finite samples. In a small sample, it may just occur that we didn't observe anyone treated between +ages 32-35. It isn't that no one could have been treated in that age group, we just didn't observe it in our sample. +For these scenarios, we will assume that our statistical model correctly interpolates over these areas (often a +strong assumption in small data sets) + +### Further Reading +Westreich D, Cole SR. (2010). Invited commentary: positivity in practice. *American Journal of Epidemiology*, +171(6), 674-677. + +Cole SR, Hernán MA. (2008). Constructing inverse probability weights for marginal structural models. +*American Journal of Epidemiology*, 168(6), 656-664. + +## Causal Consistency +Causal consistency is also referred to as treatment variation irrelevance. Under this assumption we assume that there +is only one version of treatment (consistency) or that any differences remaining between treatments is irrelevant +(treatment variation irrelevance). For example, consider a study on 200mg daily aspirin and all-cause mortality. In our +study, we may be willing to assume that taking aspirin in the morning versus at night is irrelevant to all-cause +mortality. This is an example of assuming treatment variation irrelevance. Generally, defining the treatment more +precisely can get you out of this as an issue. There are also some additional approaches. I recommend reviewing the +below readings for further discussions + +### Further Reading +Cole SR, Frangakis CE. (2009). The consistency statement in causal inference: a definition or an assumption?. +*Epidemiology*, 20(1), 3-5. + +VanderWeele TJ. (2009). Concerning the consistency assumption in causal inference. *Epidemiology*, 20(6), 880-883. + +VanderWeele TJ. (2018). On well-defined hypothetical interventions in the potential outcomes framework. +*Epidemiology*, 29(4), e24-e25. + +## Correctly specified model +Since we will be working with continuous and high-dimensional data, we will be using parametric regression models. +We assume that these models are correctly specified. To make less restrictive assumptions regarding the functional +forms of continuous variables, we will use splines throughout. Please refer to the Data Basics for an intro to +using splines with *zEpid* + +Additionally, we will sometime uses machine learning approaches to relax this assumption further (see TMLE tutorials +for some examples)