From 6d5d2137f09a22550c712c28f92b1ef358e3e7b1 Mon Sep 17 00:00:00 2001 From: pzivich Date: Mon, 11 Mar 2019 08:38:39 -0400 Subject: [PATCH] AIPTW tutorial --- .../1_time-fixed-treatments/5_AIPTW.ipynb | 310 ++++++++++++++++++ 1 file changed, 310 insertions(+) create mode 100644 3_Epidemiology_Analysis/c_causal_inference/1_time-fixed-treatments/5_AIPTW.ipynb diff --git a/3_Epidemiology_Analysis/c_causal_inference/1_time-fixed-treatments/5_AIPTW.ipynb b/3_Epidemiology_Analysis/c_causal_inference/1_time-fixed-treatments/5_AIPTW.ipynb new file mode 100644 index 0000000..dcfdb1c --- /dev/null +++ b/3_Epidemiology_Analysis/c_causal_inference/1_time-fixed-treatments/5_AIPTW.ipynb @@ -0,0 +1,310 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Augmented Inverse Probability of Treatment Weights\n", + "Augmented-IPTW (AIPTW) is a doubly robust estimator. Essentially, AIPTW combines the IPTW estimator and g-formula into a single estimate. Before continuing, I will briefly outline what a doubly-robust estimator is and why you would want to use one. In observational research with high-dimensional data, we (generally) are forced to use parametric models to adjust for many confounders. In this scenario, we assume that our parametric models are correctly specified. Our statistical model, $\\mathcal{M}$, must include the distribution that the data came from. \n", + "\n", + "With other estimators, like IPTW or g-formula, we have one chance to specify $\\mathcal{M}$ correctly. Doubly-robust estimators use a model to predict the treatment (like IPTW) and another model to predict the outcome (like g-formula). The estimator then combines the estimates, such that if either is correct, then our estimate will be consistent. Essentially, we get two chances to get the statistical model correct.\n", + "\n", + "A more in-depth description of doubly robust estimators is available in [this pre-print](https://statnav.files.wordpress.com/2017/10/doublerobustness-preprint.pdf)\n", + "\n", + "## AIPTW\n", + "\n", + "AIPTW takes the following form\n", + "$$E[Y^a] = \\frac{1}{n} \\sum_i^n \\left(\\frac{Y \\times I(A=a)}{\\widehat{\\Pr}(A=a|L)} - \\frac{\\hat{E}[Y|A=a, L] \\times (I(A=a) - \\widehat{\\Pr}(A=a|L))}{1 - \\widehat{\\Pr}(A=a|L)}\\right)$$\n", + "where $\\widehat{\\Pr}(A=a|L)$ comes from the IPTW model and $\\hat{E}[Y|A=a,L]$ comes from the g-formula. If we do some manipulations and assume an infinite sample size, we can end up with\n", + "$$\\hat{E}^{IPW}[Y^a] \\times \\frac{\\Pr(A=a|L)}{\\widehat{\\Pr}(A=a|L, \\mathcal{M})} - \\hat{E}^{STD}[Y^a] \\times \\frac{\\Pr(A=a|L) - \\widehat{\\Pr}(A=a|L, \\mathcal{M})}{\\widehat{\\Pr}(A=a|L, \\mathcal{M})}$$\n", + "from this form, we can see that if as long as one estimate is correct then AIPTW will be unbiased\n", + "\n", + "## An example\n", + "To motivate our example, we will use a simulated data set included with *zEpid*. In the data set, we have a cohort of HIV-positive individuals. We are interested in the sample average treatment effect of antiretroviral therapy (ART) on all-cause mortality at 45-weeks. Based on substantive background knowledge, we believe that the treated and untreated population are exchangeable based gender, age, CD4 T-cell count, and detectable viral load. \n", + "\n", + "In this tutorial, we will focus on a complete case analysis. Therefore, we will drop the `cd4_wk45` column and all the missing data in `dead`. This will leave 517 observations with no missing data" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "Int64Index: 517 entries, 0 to 546\n", + "Data columns (total 12 columns):\n", + "id 517 non-null int64\n", + "male 517 non-null int64\n", + "age0 517 non-null int64\n", + "cd40 517 non-null int64\n", + "dvl0 517 non-null int64\n", + "art 517 non-null int64\n", + "dead 517 non-null float64\n", + "t 517 non-null float64\n", + "age_rs1 517 non-null float64\n", + "age_rs2 517 non-null float64\n", + "cd4_rs1 517 non-null float64\n", + "cd4_rs2 517 non-null float64\n", + "dtypes: float64(6), int64(6)\n", + "memory usage: 52.5 KB\n" + ] + } + ], + "source": [ + "import numpy as np\n", + "import pandas as pd\n", + "\n", + "from zepid import load_sample_data, spline\n", + "from zepid.causal.doublyrobust import AIPTW\n", + "\n", + "df = load_sample_data(False)\n", + "df[['age_rs1', 'age_rs2']] = spline(df, 'age0', n_knots=3, term=2, restricted=True)\n", + "df[['cd4_rs1', 'cd4_rs2']] = spline(df, 'cd40', n_knots=3, term=2, restricted=True)\n", + "\n", + "dfcc = df.drop(columns=['cd4_wk45']).dropna()\n", + "dfcc.info()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Our data is now ready to conduct a complete case analysis using TMLE. First, we initialize TMLE with our complete-case data (dfcc), the treatment (art), and the outcome (dead)" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "aipw = AIPTW(dfcc, exposure='art', outcome='dead')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Treatment Model\n", + "First, we will specify our treatment model. We believe the sufficient set for the treatment model is gender (`male`), age (`age0`), CD4 T-cell (`cd40`) and detectable viral load (`dvl0`). To relax the functional for assumptions, we will model age and CD4 using restricted quadratic splines" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "----------------------------------------------------------------\n", + "MODEL: art ~ male + age0 + age_rs1 + age_rs2 + cd40 + cd4_rs1 + cd4_rs2 + dvl0\n", + "-----------------------------------------------------------------\n", + " Generalized Linear Model Regression Results \n", + "==============================================================================\n", + "Dep. Variable: art No. Observations: 517\n", + "Model: GLM Df Residuals: 508\n", + "Model Family: Binomial Df Model: 8\n", + "Link Function: logit Scale: 1.0000\n", + "Method: IRLS Log-Likelihood: -206.06\n", + "Date: Mon, 11 Mar 2019 Deviance: 412.12\n", + "Time: 08:24:05 Pearson chi2: 510.\n", + "No. Iterations: 5 Covariance Type: nonrobust\n", + "==============================================================================\n", + " coef std err z P>|z| [0.025 0.975]\n", + "------------------------------------------------------------------------------\n", + "Intercept 1.4498 1.679 0.864 0.388 -1.841 4.741\n", + "male -0.1159 0.321 -0.361 0.718 -0.745 0.513\n", + "age0 -0.1026 0.059 -1.726 0.084 -0.219 0.014\n", + "age_rs1 0.0048 0.003 1.706 0.088 -0.001 0.010\n", + "age_rs2 -0.0077 0.006 -1.373 0.170 -0.019 0.003\n", + "cd40 0.0041 0.004 0.964 0.335 -0.004 0.012\n", + "cd4_rs1 -2.422e-05 1.2e-05 -2.014 0.044 -4.78e-05 -6.49e-07\n", + "cd4_rs2 8.875e-05 4.55e-05 1.952 0.051 -3.81e-07 0.000\n", + "dvl0 -0.0158 0.399 -0.040 0.968 -0.797 0.765\n", + "==============================================================================\n" + ] + } + ], + "source": [ + "aipw.exposure_model('male + age0 + age_rs1 + age_rs2 + cd40 + cd4_rs1 + cd4_rs2 + dvl0')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "`AIPTW` uses a logistic regression model to estimate the probabilities of treatment and the corresponding summary of the model fit are printed to the console. \n", + "\n", + "### Outcome Model\n", + "Now, we will estimate the outcome model. We will model the outcomes as ART (`art`), gender (`male`), age (`age0`), CD4 T-cell (`cd40`) and detectable viral load (`dvl0`). Again, we will model age and CD4 using restricted quadratic splines " + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "----------------------------------------------------------------\n", + "MODEL: dead ~ art + male + age0 + age_rs1 + age_rs2 + cd40 + cd4_rs1 + cd4_rs2 + dvl0\n", + "-----------------------------------------------------------------\n", + " Generalized Linear Model Regression Results \n", + "==============================================================================\n", + "Dep. Variable: dead No. Observations: 517\n", + "Model: GLM Df Residuals: 507\n", + "Model Family: Binomial Df Model: 9\n", + "Link Function: logit Scale: 1.0000\n", + "Method: IRLS Log-Likelihood: -202.85\n", + "Date: Mon, 11 Mar 2019 Deviance: 405.71\n", + "Time: 08:25:18 Pearson chi2: 535.\n", + "No. Iterations: 6 Covariance Type: nonrobust\n", + "==============================================================================\n", + " coef std err z P>|z| [0.025 0.975]\n", + "------------------------------------------------------------------------------\n", + "Intercept -4.0961 2.713 -1.510 0.131 -9.413 1.220\n", + "art -0.7274 0.392 -1.853 0.064 -1.497 0.042\n", + "male -0.0774 0.334 -0.232 0.817 -0.732 0.577\n", + "age0 0.1605 0.096 1.670 0.095 -0.028 0.349\n", + "age_rs1 -0.0058 0.004 -1.481 0.139 -0.013 0.002\n", + "age_rs2 0.0128 0.006 2.026 0.043 0.000 0.025\n", + "cd40 -0.0123 0.004 -2.987 0.003 -0.020 -0.004\n", + "cd4_rs1 1.872e-05 1.18e-05 1.584 0.113 -4.45e-06 4.19e-05\n", + "cd4_rs2 -3.868e-05 4.59e-05 -0.842 0.400 -0.000 5.13e-05\n", + "dvl0 -0.1261 0.398 -0.317 0.751 -0.906 0.653\n", + "==============================================================================\n" + ] + } + ], + "source": [ + "aipw.outcome_model('art + male + age0 + age_rs1 + age_rs2 + cd40 + cd4_rs1 + cd4_rs2 + dvl0')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Again, logistic regression is used to predict the outcome data\n", + "\n", + "### Estimation\n", + "To estimate the risk difference and risk ratio, we will now call the `fit()` function. After this, `AIPTW` gains the attributes `risk_difference` and `risk_ratio`. Additionally, results can be printed to the console using the `summary()` function" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "RD: -0.06857139263314216\n", + "RR: 0.5844630051369846\n", + "----------------------------------------------------------------------\n", + "Risk Difference: -0.0686\n", + "Risk Ratio: 0.5845\n", + "----------------------------------------------------------------------\n" + ] + } + ], + "source": [ + "aipw.fit()\n", + "\n", + "print('RD:', aipw.risk_difference)\n", + "print('RR:', aipw.risk_ratio)\n", + "\n", + "aipw.summary()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Interpreting the risk difference, we would conclude that had everyone in our cohort been treated with ART, the risk of all-cause mortality would have been 6.9% points lower than had no one been treated.\n", + "\n", + "### Confidence Intervals\n", + "To obtain correct confidence intervals, we need to use a bootstrap procedure. Influence curves are an alternative, but not currently available" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "95% LCL: -0.12324970574432371\n", + "95% UCL -0.005298084807482341\n" + ] + } + ], + "source": [ + "rd_results = []\n", + "for i in range(1000):\n", + " dfs = dfcc.sample(n=dfcc.shape[0],replace=True)\n", + " s = AIPTW(dfs,exposure='art',outcome='dead')\n", + " s.exposure_model('male + age0 + age_rs1 + age_rs2 + cd40 + cd4_rs1 + cd4_rs2 + dvl0',\n", + " print_results=False)\n", + " s.outcome_model('art + male + age0 + age_rs1 + age_rs2 + cd40 + cd4_rs1 + cd4_rs2 + dvl0',\n", + " print_results=False)\n", + " s.fit()\n", + " rd_results.append(s.risk_difference)\n", + "\n", + "\n", + "print('95% LCL:', np.percentile(rd_results, q=2.5))\n", + "print('95% UCL', np.percentile(rd_results, q=97.5))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Under the counterfactual of everyone receiving treatment with ART, the risk of all-cause mortality was 6.9% points lower (95% CL: ) than the counterfactual where no one had been treated.\n", + "\n", + "# Conclusion\n", + "In this tutorial, I introduced the concept of doubly-robust estimators and detailed augmented-IPTW. I demonstrated estimation with `AIPTW` using *zEpid* and how to obtain confidence intervals. Please view other tutorials for information on other functionality within *zEpid*\n", + "\n", + "## References\n", + "Funk MJ, Westreich D, Wiesen C, Stürmer T, Brookhart MA, Davidian M. (2011). Doubly robust estimation of causal effects. *AJE*, 173(7), 761-767.\n", + "\n", + "Keil AP et al. (2018). Resolving an apparent paradox in doubly robust estimators. *AJE*, 187(4), 891-892.\n", + "\n", + "Robins JM, Rotnitzky A, Zhao LP. (1994). Estimation of regression coefficients when some regressors are not always observed. *JASA*, 89(427), 846-866." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.3" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +}