{ "cells": [ { "cell_type": "markdown", "id": "629df219", "metadata": {}, "source": [ "# Hernan & Robins (2023): Causal Inference with Models\n", "\n", "The following section replicates selected examples from the textbook [\"Causal Inference: What If\"](https://www.routledge.com/Causal-Inference-What-If/Hernan-Robins/p/book/9781420076165) by Hernan and Robins. These replications focus on Part II of the textbook. I recommend reading Part I prior to looking through the following code. It is a great, approachable, and freely available resource on causal inference. \n", "\n", "Here, we demonstrate application of `delicatessen` for causal inference. Throughout, we use the empirical sandwich variance estimator. This is not described in the textbook, but is an alternative to the bootstrap. Importantly, it is a consistent estimator of the variance that is computationally simpler (in terms of the computers computational effort). `delicatessen` automates the whole procedure, so it is even simpler for us.\n", "\n", "Broadly, interest will be in estimating the average causal effect of stopping smoking (variable name: `qsmk`) on 10-year weight change (variable name: `wt82_71`). If we let $Y^a$ indicate the potential weight change under smoking status $a$, then the average causal effect can be written as\n", "$$E[Y^1] - E[Y^0]$$\n", "Herafter, we assume that the interest parameter is identified (see the book for details on what this means). Our focus will be on estimators described in the book for this quantity (or related ones).\n", "\n", "## Setup" ] }, { "cell_type": "code", "id": "23a12509", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:17.916733900Z", "start_time": "2026-04-07T16:31:17.152508300Z" } }, "source": [ "import numpy as np\n", "import scipy as sp\n", "import pandas as pd\n", "\n", "import delicatessen as deli\n", "from delicatessen import MEstimator\n", "from delicatessen.estimating_equations import (ee_regression, ee_glm, ee_ipw_msm, \n", " ee_gformula, ee_aipw, \n", " ee_gestimation_snmm, \n", " ee_iv_causal, ee_2sls, ee_gestimation_snmm_iv)\n", "from delicatessen.utilities import inverse_logit, regression_predictions\n", "\n", "print(\"Versions\")\n", "print(\"NumPy: \", np.__version__)\n", "print(\"SciPy: \", sp.__version__)\n", "print(\"pandas: \", pd.__version__)\n", "print(\"Delicatessen: \", deli.__version__)" ], "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Versions\n", "NumPy: 2.3.5\n", "SciPy: 1.16.3\n", "pandas: 2.3.3\n", "Delicatessen: 4.2\n" ] } ], "execution_count": 1 }, { "cell_type": "markdown", "id": "67174499-e54e-4a45-bc94-afe6492734c3", "metadata": {}, "source": [ "Data for these replications is available at the following Harvard School of Public Health [website](https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/). Data comes from the National Health and Nutrition Examination Survey Data I Epidemiologic Follow-up Study (NHEFS). The NHEFS was jointly initiated by the National Center for Health Statistics and the National Institute on Aging in collaboration with other agencies of the United States Public Health Service. A detailed description of the NHEFS, together with publicly available data sets and documentation, can be found at wwwn.cdc.gov/nchs/nhanes/nhefs/\n", "\n", "The data set used in the book and this tutorial is a subset of the full NHEFS. First, we will load the data and run some basic variable manipulations. " ] }, { "cell_type": "code", "id": "c34711ca", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:17.950540700Z", "start_time": "2026-04-07T16:31:17.930977500Z" } }, "source": [ "df = pd.read_csv('data/nhefs.csv')\n", "df.dropna(subset=['sex', 'age', 'race', 'ht', \n", " 'school', 'alcoholpy', 'smokeintensity'], \n", " inplace=True)\n", "\n", "# recoding some variables\n", "df['inactive'] = np.where(df['active'] == 2, 1, 0)\n", "df['no_exercise'] = np.where(df['exercise'] == 2, 1, 0)\n", "df['university'] = np.where(df['education'] == 5, 1, 0)\n", "\n", "# Subsetting only variables of interest\n", "df = df[['wt82_71', 'qsmk', 'sex', 'age', 'race', 'wt71', 'wt82', 'ht', \n", " 'school', 'alcoholpy', 'smokeintensity', 'smokeyrs', 'smkintensity82_71',\n", " 'education', 'exercise', 'active', 'death']]\n", "\n", "# creating quadratic terms\n", "for col in ['age', 'wt71', 'smokeintensity', 'smokeyrs']:\n", " df[col+'_sq'] = df[col] * df[col]\n", "\n", "df['I'] = 1\n", "\n", "# Indicator terms\n", "df['educ_2'] = np.where(df['education'] == 2, 1, 0)\n", "df['educ_3'] = np.where(df['education'] == 3, 1, 0)\n", "df['educ_4'] = np.where(df['education'] == 4, 1, 0)\n", "df['educ_5'] = np.where(df['education'] == 5, 1, 0)\n", "df['exer_1'] = np.where(df['exercise'] == 1, 1, 0)\n", "df['exer_2'] = np.where(df['exercise'] == 2, 1, 0)\n", "df['active_1'] = np.where(df['active'] == 1, 1, 0)\n", "df['active_2'] = np.where(df['active'] == 2, 1, 0)\n", "\n", "# Interaction terms\n", "df['qsmk_smkint'] = df['qsmk'] * df['smokeintensity']\n", "\n", "# Complete-case data\n", "dc = df.dropna(subset=['wt82_71']).copy()" ], "outputs": [], "execution_count": 2 }, { "cell_type": "markdown", "id": "b0009679", "metadata": {}, "source": [ "## Chapter 12: IP weighting and marginal structural models\n", "\n", "The first estimation approach is inverse probability weighting. Inverse probability weights are defined as \n", "$$ \\frac{1}{\\Pr(A=a | W)} $$\n", "where $W$ is the set of confounders. We will estimate these weights and then use them to estimate the parameters of a marginal structural model. An example of a marginal structural model is\n", "$$ E[Y^a] = \\alpha_0 + \\alpha_1 a $$\n", "where $\\alpha$ are the parameters to estimate. Here, $\\alpha_1$ represents the average causal effect. We will estimate the marginal structural model using the observed data and a regression model weighted by the inverse probability weights (see the books for details). \n", "\n", "### 12.1: The causal question\n", "Chapter 12 starts out with estimating the crude association between `qsmk` and `wt82_71` (12.1). We will do this by fitting a linear regression model with `delicatessen`." ] }, { "cell_type": "code", "id": "278b397c", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:17.958612700Z", "start_time": "2026-04-07T16:31:17.950540700Z" } }, "source": [ "def psi_ols(theta):\n", " return ee_regression(theta=theta, X=dc[['I', 'qsmk']],\n", " y=dc['wt82_71'], model='linear')" ], "outputs": [], "execution_count": 3 }, { "cell_type": "code", "id": "00e71d13", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:17.975791700Z", "start_time": "2026-04-07T16:31:17.958612700Z" } }, "source": [ "estr = MEstimator(psi_ols, init=[0., 0.])\n", "estr.estimate()" ], "outputs": [], "execution_count": 4 }, { "cell_type": "code", "id": "679e5922", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:17.992527200Z", "start_time": "2026-04-07T16:31:17.975791700Z" } }, "source": [ "estr.print_results(subset=[1, ])" ], "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "==============================================================\n", " Estimation Method: M-estimator\n", "--------------------------------------------------------------\n", "No. Observations: 1566 | No. Parameters: 2\n", "Solving algorithm: lm | Max Iterations: 5000\n", "Solving tolerance: 1e-09 | Allow P-Inverse: 1\n", "Derivative Method: approx | Deriv Approx: 1e-09\n", "Small N Correction: None | Distribution: Z-stat\n", "==============================================================\n", " Theta StdErr Z-score LCL UCL P-value S-value \n", "--------------------------------------------------------------\n", " 2.54 0.49 5.22 1.59 3.49 0.00 22.39 \n", "==============================================================\n" ] } ], "execution_count": 5 }, { "cell_type": "markdown", "id": "ad503833", "metadata": {}, "source": [ "The book reports an unadjusted estimate of 2.5 (95% CI: 1.7, 3.4)\n", "\n", "You may notice that the confidence interval differs slightly. That is because we are using the sandwich variance (the book uses a different approach). While the variance estimators used here and in the book are expected to be asymptotically equal (i.e., equal as $n$ goes to $\\infty$), they can produce different results in finite samples as we see here.\n", "\n", "### 12.2: Estimating inverse probability weights via modeling\n", "\n", "Now we will estimate the parameters of the marginal structural model using unstabilized inverse probability weights.\n", "\n", "To do this with `delicatessen`, we will define the corresponding design matrices. Then we will define the stacked estimating equations. Then we will estimate the parameters and covariance using `MEstimator` and present the output." ] }, { "cell_type": "code", "id": "60e6be04", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:18.007826400Z", "start_time": "2026-04-07T16:31:17.992527200Z" } }, "source": [ "# Design matrix for the propensity score model\n", "W = dc[['I', 'sex', 'race', 'age', 'age_sq', \n", " 'educ_2', 'educ_3', 'educ_4', 'educ_5',\n", " 'smokeintensity', 'smokeintensity_sq',\n", " 'smokeyrs', 'smokeyrs_sq', \n", " 'exer_1', 'exer_2', 'active_1', 'active_2',\n", " 'wt71', 'wt71_sq']]\n", "# Design matrix for the marginal structural model\n", "msm = dc[['I', 'qsmk']]\n", "# treatment variable\n", "a = dc['qsmk']\n", "# outcome variable\n", "y = dc['wt82_71']" ], "outputs": [], "execution_count": 6 }, { "cell_type": "code", "id": "d3143781", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:18.015791500Z", "start_time": "2026-04-07T16:31:18.007826400Z" } }, "source": [ "def psi_ipw_msm1(theta):\n", " # Dividing parameters into corresponding estimation equations\n", " alpha = theta[0:2]\n", " beta = theta[2:]\n", " \n", " # Estimating the propensity scores\n", " ee_ps = ee_regression(theta=beta, # Estimate propensity scores\n", " X=W, y=a, # ... given observed A,W\n", " model='logistic') # ... with logit model\n", " pi = inverse_logit(np.dot(W, beta)) # Get Pr(A = 1 | W)\n", " ipw = 1 / np.where(a == 1, pi, 1-pi) # Convert to IPW\n", " \n", " # Estimating the MSM using a weighted linear model\n", " ee_msm = ee_regression(theta=alpha, # MSM parameters \n", " X=msm, y=y, # ... observed data \n", " model='linear', # ... with linear model\n", " weights=ipw) # ... but weighted by IPW\n", " \n", " # Stacking the estimating equations and returning\n", " return np.vstack([ee_msm, ee_ps])" ], "outputs": [], "execution_count": 7 }, { "cell_type": "code", "id": "574fdabd", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:18.046031600Z", "start_time": "2026-04-07T16:31:18.019534Z" } }, "source": [ "# M-estimator\n", "init_vals = [0., 0., ] + [0.,]*W.shape[1]\n", "estr = MEstimator(psi_ipw_msm1, init=init_vals)\n", "estr.estimate(solver='hybr', maxiter=5000)" ], "outputs": [], "execution_count": 8 }, { "cell_type": "code", "id": "4e9976f8", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:18.058482800Z", "start_time": "2026-04-07T16:31:18.046031600Z" } }, "source": [ "estr.print_results(subset=[0, 1])" ], "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "==============================================================\n", " Estimation Method: M-estimator\n", "--------------------------------------------------------------\n", "No. Observations: 1566 | No. Parameters: 21\n", "Solving algorithm: hybr | Max Iterations: 5000\n", "Solving tolerance: 1e-09 | Allow P-Inverse: 1\n", "Derivative Method: approx | Deriv Approx: 1e-09\n", "Small N Correction: None | Distribution: Z-stat\n", "==============================================================\n", " Theta StdErr Z-score LCL UCL P-value S-value \n", "--------------------------------------------------------------\n", " 1.78 0.22 8.16 1.35 2.21 0.00 51.42 \n", " 3.44 0.49 7.06 2.49 4.40 0.00 39.17 \n", "==============================================================\n" ] } ], "execution_count": 9 }, { "cell_type": "markdown", "id": "1097cb4d", "metadata": {}, "source": [ "which are the estimates of the $\\alpha_0$ and $\\alpha_1$ parameters. The book provides the point estimate for `qsmk` ($\\hat{\\alpha}_1$) as 3.4 (95% CI: 2.4, 4.5).\n", "\n", "The point estimates presented here are the same as the book. However, the confidence intervals differ slightly. The confidence intervals in the book use the 'GEE trick', treating the weights as known in the weighted linear model. The GEE trick provides overly conservative confidence intervals. With `delicatessen` and the sandwich variance, we can estimate the variance *without being overly conservative*. So, the variance reported above would be preferred over the approach described in the book. This highlights the advantage of M-estimators for computation of the variance with nuisance parameters (like the IPW estimator when the propensity score is estimated).\n", "\n", "Instead of coding this by-hand, we can also use the built-in `ee_ipw_msm` function. This function estimates a marginal structural model using inverse probability weights, as done above. Below is how to apply this functionality" ] }, { "cell_type": "code", "id": "b8343216", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:18.080477100Z", "start_time": "2026-04-07T16:31:18.070985Z" } }, "source": [ "def psi_ipw_msm1a(theta):\n", " # Built-in estimating equation\n", " return ee_ipw_msm(theta, y=y, A=a, W=W, V=msm, \n", " distribution='normal', \n", " link='identity')" ], "outputs": [], "execution_count": 10 }, { "cell_type": "code", "id": "6326bbd2", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:18.101037700Z", "start_time": "2026-04-07T16:31:18.080477100Z" } }, "source": [ "# M-estimator\n", "init_vals = [0., 0., ] + [0.,]*W.shape[1]\n", "estr = MEstimator(psi_ipw_msm1a, init=init_vals)\n", "estr.estimate(solver='hybr', maxiter=5000)" ], "outputs": [], "execution_count": 11 }, { "cell_type": "code", "id": "a300d352", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:18.111850800Z", "start_time": "2026-04-07T16:31:18.101037700Z" } }, "source": [ "estr.print_results(subset=[0, 1])" ], "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "==============================================================\n", " Estimation Method: M-estimator\n", "--------------------------------------------------------------\n", "No. Observations: 1566 | No. Parameters: 21\n", "Solving algorithm: hybr | Max Iterations: 5000\n", "Solving tolerance: 1e-09 | Allow P-Inverse: 1\n", "Derivative Method: approx | Deriv Approx: 1e-09\n", "Small N Correction: None | Distribution: Z-stat\n", "==============================================================\n", " Theta StdErr Z-score LCL UCL P-value S-value \n", "--------------------------------------------------------------\n", " 1.78 0.22 8.16 1.35 2.21 0.00 51.42 \n", " 3.44 0.49 7.06 2.49 4.40 0.00 39.17 \n", "==============================================================\n" ] } ], "execution_count": 12 }, { "cell_type": "markdown", "id": "ac5fb883", "metadata": {}, "source": [ "As expected, this built-in functionality produces the same results as the by-hand version.\n", "\n", "### 12.3: Stabilized inverse probability weights\n", "\n", "Next, we are going to use stabilized weights. The stabilized weights will require us to estimate an additional parameter. We will accomplish this by stacking an estimating equation for that parameter. This extra estimating equation is for an intercept-only model for the probability of `qsmk`." ] }, { "cell_type": "code", "id": "f6036925", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:18.131153400Z", "start_time": "2026-04-07T16:31:18.119554800Z" } }, "source": [ "def psi_ipw_msm2(theta):\n", " # Dividing parameters into corresponding estimation equations\n", " alpha = theta[0:2] # MSM parameters\n", " gamma = np.array([theta[2], ]) # Numerator parameter\n", " beta = theta[3:] # Propensity score parameters\n", " \n", " # Estimating the propensity scores using a logit model\n", " ee_ps = ee_regression(theta=beta, # Propensity score model\n", " X=W, y=a, # ... with observed data\n", " model='logistic') # ... and logit model\n", " pi = inverse_logit(np.dot(W, beta)) # Predicted probability of A=1\n", "\n", " # Estimating intercept-only for numerator\n", " ee_num = ee_regression(theta=gamma, # Numerator model\n", " X=dc[['I']], y=a, # ... intercept-only\n", " model='logistic') # ... logit model\n", " num = inverse_logit(np.dot(dc[['I']], gamma)) # Marginal probability\n", " \n", " # Construct stabilized weights\n", " ipw = np.where(a == 1, num/pi, (1-num)/(1-pi))\n", " \n", " # Estimating the MSM using a weighted linear model\n", " ee_msm = ee_regression(theta=alpha, # MSM\n", " X=msm, y=y, # ... with observed data\n", " model='linear', # ... linear \n", " weights=ipw) # ... weighted by stabilized\n", " \n", " # Stacking the estimating equations and returning\n", " return np.vstack([ee_msm, ee_num, ee_ps])" ], "outputs": [], "execution_count": 13 }, { "cell_type": "code", "id": "cfb7adc1", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:18.195805200Z", "start_time": "2026-04-07T16:31:18.131153400Z" } }, "source": [ "# M-estimator\n", "init_vals = [0., 0., ] + [0., ] + [0.,]*W.shape[1]\n", "estr = MEstimator(psi_ipw_msm2, init=init_vals)\n", "estr.estimate(solver='hybr', maxiter=5000)" ], "outputs": [], "execution_count": 14 }, { "cell_type": "code", "id": "eef650ee", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:18.207720200Z", "start_time": "2026-04-07T16:31:18.198808900Z" } }, "source": [ "estr.print_results(subset=[0, 1])" ], "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "==============================================================\n", " Estimation Method: M-estimator\n", "--------------------------------------------------------------\n", "No. Observations: 1566 | No. Parameters: 22\n", "Solving algorithm: hybr | Max Iterations: 5000\n", "Solving tolerance: 1e-09 | Allow P-Inverse: 1\n", "Derivative Method: approx | Deriv Approx: 1e-09\n", "Small N Correction: None | Distribution: Z-stat\n", "==============================================================\n", " Theta StdErr Z-score LCL UCL P-value S-value \n", "--------------------------------------------------------------\n", " 1.78 0.22 8.16 1.35 2.21 0.00 51.42 \n", " 3.44 0.49 7.06 2.49 4.40 0.00 39.17 \n", "==============================================================\n" ] } ], "execution_count": 15 }, { "cell_type": "markdown", "id": "a57fc3a2", "metadata": {}, "source": [ "The estimate is the same in the previous section. This is expected because as long as the marginal structural model is saturated, the unstabilized and stabilized IPTW should produce the same answer. (Note there are some differences out in the smaller decimals places, but this is due to floating point errors, not differences between the estimators).\n", "\n", "Note: `ee_ipw_msm` does not support the computation of stabilized weights (results are equivalent in this setting, so we can be more computationally efficient by opting for the unstabilized weights).\n", "\n", "### 12.4: Marginal structural models\n", "\n", "Now we will consider the IPW estimator for a continuous action. We will look at `smokeintensity` on `wt82_71`." ] }, { "cell_type": "code", "id": "3420cde5", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:18.229861400Z", "start_time": "2026-04-07T16:31:18.219728200Z" } }, "source": [ "# Restricting data by smoking intensity\n", "ds = dc.loc[dc['smokeintensity'] <= 25].copy()\n", "\n", "# Design matrix for the propensity score model\n", "W = ds[['I', 'sex', 'race', 'age', 'age_sq', \n", " 'educ_2', 'educ_3', 'educ_4', 'educ_5',\n", " 'smokeintensity', 'smokeintensity_sq',\n", " 'smokeyrs', 'smokeyrs_sq', \n", " 'exer_1', 'exer_2', 'active_1', 'active_2',\n", " 'wt71', 'wt71_sq']]\n", "# Design matrix for the marginal structural model\n", "ds['smkint_sq'] = ds['smkintensity82_71']**2\n", "msm = ds[['I', 'smkintensity82_71', 'smkint_sq']]\n", "# Treatment array\n", "a = ds['smkintensity82_71']\n", "# Outcome array\n", "y = ds['wt82_71']" ], "outputs": [], "execution_count": 16 }, { "cell_type": "code", "id": "b269e3ab", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:18.236728400Z", "start_time": "2026-04-07T16:31:18.231860700Z" } }, "source": [ "def psi_ipw_msm3(theta):\n", " # Dividing parameters into corresponding estimation equations\n", " alpha = theta[0:3] # Marginal structural model\n", " gamma = np.array([theta[3], ]) # Numerator \n", " beta = theta[4:] # Propensity score model\n", " div_ps = W.shape[0] - len(beta) # Divisor for PS SD\n", " div_nm = W.shape[0] - len(gamma) # Divisor for Num SD\n", " \n", " # Estimating the propensity scores using a logit model\n", " ee_ps = ee_regression(theta=beta, # Generalized PS model \n", " X=W, y=a, # ... for observed data\n", " model='linear') # ... linear regression\n", " mu = np.dot(W, beta) # Predicted values\n", " mu_resid = np.sum((a - mu)**2) / div_ps # Standard deviation\n", " fAL = sp.stats.norm.pdf(a, mu, # PDF from normal\n", " np.sqrt(mu_resid)) # ... with estimates\n", "\n", " # Estimating intercept-only for numerator\n", " ee_num = ee_regression(theta=gamma, # Numerator for stabilized\n", " X=ds[['I', ]], y=a, # ... for observed data\n", " model='linear') # ... linear regression\n", " num = np.dot(ds[['I', ]], gamma) # Predicted values\n", " num_resid = np.sum((a - num)**2) / div_nm # Standard deviation\n", " fA = sp.stats.norm.pdf(a, num, # PDF from normal\n", " np.sqrt(num_resid)) # ... with estimates\n", " \n", " # Stabilized weights\n", " ipw = fA / fAL\n", " \n", " # Estimating the MSM using a weighted linear model\n", " ee_msm = ee_regression(theta=alpha, # Marginal structural model\n", " X=msm, y=y, # ... observed data\n", " model='linear', # ... linear model\n", " weights=ipw) # ... weighted by IPW\n", " \n", " # Stacking the estimating equations and returning\n", " return np.vstack([ee_msm, ee_num, ee_ps])" ], "outputs": [], "execution_count": 17 }, { "cell_type": "code", "id": "2cd2b330", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:18.319785100Z", "start_time": "2026-04-07T16:31:18.236728400Z" } }, "source": [ "# M-estimator\n", "init_vals = [0., 0., 0., ] + [0., ] + [0.,]*W.shape[1]\n", "estr = MEstimator(psi_ipw_msm3, init=init_vals)\n", "estr.estimate(solver='hybr', tolerance=1e-12, maxiter=5000)" ], "outputs": [], "execution_count": 18 }, { "cell_type": "code", "id": "8e777bac", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:18.349952400Z", "start_time": "2026-04-07T16:31:18.319785100Z" } }, "source": [ "estr.print_results(subset=[0, 1, 2], decimals=3)" ], "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "==============================================================\n", " Estimation Method: M-estimator\n", "--------------------------------------------------------------\n", "No. Observations: 1162 | No. Parameters: 23\n", "Solving algorithm: hybr | Max Iterations: 5000\n", "Solving tolerance: 1e-12 | Allow P-Inverse: 1\n", "Derivative Method: approx | Deriv Approx: 1e-09\n", "Small N Correction: None | Distribution: Z-stat\n", "==============================================================\n", " Theta StdErr Z-score LCL UCL P-value S-value \n", "--------------------------------------------------------------\n", " 2.005 0.288 6.967 1.441 2.568 0.000 38.165 \n", " -0.109 0.030 -3.691 -0.167 -0.051 0.000 12.128 \n", " 0.003 0.002 1.188 -0.002 0.007 0.235 2.091 \n", "==============================================================\n" ] } ], "execution_count": 19 }, { "cell_type": "markdown", "id": "2453b34c", "metadata": {}, "source": [ "The book reports coefficients of: 2.005, −0.109, 0.003. This matches the output shown above.\n", "\n", "As done in the book, we want to know the weight change for no change in smoking intensity and a +20 in smoking intensity. Rather than add corresponding estimation equations for those parts, we can directly manipulate the point and covariance estimates to get this. The function `regression_predictions` will do this for us given the estimated parameters and the values of interest" ] }, { "cell_type": "code", "id": "d3e00f31", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:18.359686900Z", "start_time": "2026-04-07T16:31:18.349952400Z" } }, "source": [ "# Creating dataframe for combinations to predict\n", "p = pd.DataFrame()\n", "p['smkintensity82_71'] = [0, 20]\n", "p['smkint_sq'] = [0**2, 20**2]\n", "p['I'] = 1\n", "vals = np.asarray(p[['I', 'smkintensity82_71', 'smkint_sq']])\n", "\n", "# Getting predicted values and variance for combinations\n", "pred_y = regression_predictions(vals, \n", " estr.theta[0:3], \n", " estr.variance[0:3, 0:3])\n", "\n", "# Displaying results\n", "print(\"No change in smoking\")\n", "print(pred_y[0, 0])\n", "print(\"95% CI:\", pred_y[0, 2:])\n", "print()\n", "print(\"+20 smoking intensity\")\n", "print(pred_y[1, 0])\n", "print(\"95% CI:\", pred_y[1, 2:])" ], "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "No change in smoking\n", "2.004524735075372\n", "95% CI: [1.44058093 2.56846854]\n", "\n", "+20 smoking intensity\n", "0.9027234481660886\n", "95% CI: [-1.49966123 3.30510813]\n" ] } ], "execution_count": 20 }, { "cell_type": "markdown", "id": "e376193d", "metadata": {}, "source": [ "Again, we get similar results to those reported in the book: 2.0 (95% CI: 1.4, 2.6) and 0.9 (95% CI: −1.7, 3.5). However, our confidence intervals are slightly more narrow. Again this would be expected since we are using a variance estimator that is not overly conservative.\n", "\n", "Note: `ee_ipw_msm` does not support non-binary treatments. Allowing for generic distributions for the generalized propensity score model is difficult. So, you will need to implement them by-hand (using a similar approach to above).\n", "\n", "#### Binary outcome\n", "\n", "We now repeat the process, but for a binary outcome and treatment. We will use a GLM with the binomial distribution and logistic link (this will estimate the causal odds ratio). This is avaible in `ee_glm`. We will also be using `qsmk` again." ] }, { "cell_type": "code", "id": "5f78c896", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:18.383473700Z", "start_time": "2026-04-07T16:31:18.376056300Z" } }, "source": [ "# Design matrix for propensity scores\n", "W = dc[['I', 'sex', 'race', 'age', 'age_sq', \n", " 'educ_2', 'educ_3', 'educ_4', 'educ_5',\n", " 'smokeintensity', 'smokeintensity_sq',\n", " 'smokeyrs', 'smokeyrs_sq', \n", " 'exer_1', 'exer_2', 'active_1', 'active_2',\n", " 'wt71', 'wt71_sq']]\n", "# Design matrix for marginal structural model\n", "msm = dc[['I', 'qsmk']]\n", "# Treatment array\n", "a = dc['qsmk']\n", "# Outcome array\n", "y = dc['death']" ], "outputs": [], "execution_count": 21 }, { "cell_type": "code", "id": "a5be66b9", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:18.391102600Z", "start_time": "2026-04-07T16:31:18.385630100Z" } }, "source": [ "def psi_ipw_msm4(theta):\n", " # Dividing parameters into corresponding estimation equations\n", " alpha = theta[0:2] # Marginal structural model\n", " gamma = np.array([theta[2], ]) # Numerator model\n", " beta = theta[3:] # Propensity score model\n", " \n", " # Estimating the propensity scores using a logit model\n", " ee_ps = ee_regression(theta=beta, # Propensity score\n", " X=W, y=a, # ... observed data\n", " model='logistic') # ... logistic model\n", " pi = inverse_logit(np.dot(W, beta)) # Predicted probability\n", "\n", " # Estimating intercept-only for numerator\n", " ee_num = ee_regression(theta=gamma, # Numerator model\n", " X=dc[['I']], y=a, # ... observed data\n", " model='logistic') # ... logit model\n", " num = inverse_logit(np.dot(dc[['I']], gamma)) # Predicted prob\n", " \n", " # Stabilized inverse probability weights\n", " ipw = np.where(a == 1, num/pi, (1-num)/(1-pi))\n", " \n", " # Estimating the MSM using a weighted linear model\n", " ee_msm = ee_glm(theta=alpha, # MSM\n", " X=msm, y=y, # ... observed data\n", " link='logit', # ... logit link\n", " distribution='binomial', # ... binomial dist\n", " weights=ipw) # ... weighted\n", " \n", " # Stacking the estimating equations and returning\n", " return np.vstack([ee_msm, ee_num, ee_ps])" ], "outputs": [], "execution_count": 22 }, { "cell_type": "code", "id": "d25c4f4f", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:18.453700400Z", "start_time": "2026-04-07T16:31:18.391102600Z" } }, "source": [ "# M-estimator\n", "init_vals = [0., 0., ] + [0., ] + [0.,]*W.shape[1]\n", "estr = MEstimator(psi_ipw_msm4, init=init_vals)\n", "estr.estimate(solver='hybr', maxiter=5000)" ], "outputs": [], "execution_count": 23 }, { "cell_type": "code", "id": "5a88a6e6", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:18.475733800Z", "start_time": "2026-04-07T16:31:18.463576900Z" } }, "source": [ "ci = estr.confidence_intervals()\n", "print(\"qsmk:\", np.exp(estr.theta[1]))\n", "print(\"95% CI:\", np.exp(ci[1, :]))" ], "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "qsmk: 1.0305778926815938\n", "95% CI: [0.78938459 1.34546684]\n" ] } ], "execution_count": 24 }, { "cell_type": "markdown", "id": "896d2679", "metadata": {}, "source": [ "The previous results are for the causal odds ratio. Again, they are similar to the book with slight differences in the confidence intervals (i.e., 1.0; 95% CI: 0.8, 1.4).\n", "\n", "We can replicate this approach using `ee_ipw_msm`. To simplify the process, I am going to use the previous propensity score model parameters as the starting values." ] }, { "cell_type": "code", "id": "9cae841b", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:18.504759300Z", "start_time": "2026-04-07T16:31:18.489247700Z" } }, "source": [ "init_ps = list(estr.theta[3:])" ], "outputs": [], "execution_count": 25 }, { "cell_type": "code", "id": "e6f83185", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:18.521195700Z", "start_time": "2026-04-07T16:31:18.513263900Z" } }, "source": [ "def psi_ipw_msm4a(theta):\n", " # Built-in estimating equation\n", " return ee_ipw_msm(theta, y=y, A=a, W=W, V=msm, \n", " distribution='binomial', \n", " link='logit')" ], "outputs": [], "execution_count": 26 }, { "cell_type": "code", "id": "3defde1b", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:18.541620100Z", "start_time": "2026-04-07T16:31:18.522194600Z" } }, "source": [ "# M-estimator\n", "init_vals = [0., 0., ] + init_ps\n", "estr = MEstimator(psi_ipw_msm4a, init=init_vals)\n", "estr.estimate(solver='hybr', maxiter=5000)" ], "outputs": [], "execution_count": 27 }, { "cell_type": "code", "id": "200622ec", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:18.554280700Z", "start_time": "2026-04-07T16:31:18.543620100Z" } }, "source": [ "ci = estr.confidence_intervals()\n", "print(np.exp(estr.theta[0:2]))\n", "print(\"95% CI\")\n", "print(np.exp(ci[0:2, :]).T)" ], "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0.22527109 1.030578 ]\n", "95% CI\n", "[[0.19459809 0.78938473]\n", " [0.26077884 1.3454669 ]]\n" ] } ], "execution_count": 28 }, { "cell_type": "markdown", "id": "2af90744", "metadata": {}, "source": [ "Which are the same (exponentiated) coefficients as the previous approach. There is a slight discrepancy you may notice in the confidence intervals, but this is a floating point error resulting from a difference between the stabilized weights (by-hand) and the unstabilized weights (`ee_ipw_msm`).\n", "\n", "### 12.5: Effect modification and marginal structural models\n", "\n", "We will now use marginal structural models to study effect measure modification. We will look at effect modification by sex (`sex`) of quitting smoking (`qsmk`) on 10-year weight change (`wt82_71`)." ] }, { "cell_type": "code", "id": "0b317996", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:18.582556600Z", "start_time": "2026-04-07T16:31:18.565782500Z" } }, "source": [ "# Design matrix for propensity scores\n", "W = dc[['I', 'sex', 'race', 'age', 'age_sq', \n", " 'educ_2', 'educ_3', 'educ_4', 'educ_5',\n", " 'smokeintensity', 'smokeintensity_sq',\n", " 'smokeyrs', 'smokeyrs_sq', \n", " 'exer_1', 'exer_2', 'active_1', 'active_2',\n", " 'wt71', 'wt71_sq']]\n", "# Design matrix for marginal structural model\n", "dc['qsmk_sex'] = dc['qsmk']*dc['sex']\n", "msm = dc[['I', 'qsmk', 'sex', 'qsmk_sex']]\n", "# Treatment array\n", "a = dc['qsmk']\n", "# Outcome array\n", "y = dc['wt82_71']" ], "outputs": [], "execution_count": 29 }, { "cell_type": "code", "id": "b94f358c", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:18.594077500Z", "start_time": "2026-04-07T16:31:18.585556200Z" } }, "source": [ "def psi_ipw_msm5(theta):\n", " # Dividing parameters into corresponding estimation equations\n", " alpha = theta[0:4] # Marginal structural model\n", " gamma = np.array([theta[4], ]) # Numerator parameter\n", " beta = theta[5:] # Propensity score\n", " \n", " # Estimating the propensity scores using a logit model\n", " ee_ps = ee_regression(theta=beta, # Propensity score\n", " X=W, y=a, # ... observed data\n", " model='logistic') # ... logit model\n", " pi = inverse_logit(np.dot(W, beta)) # Predicted prob\n", "\n", " # Estimating intercept-only for numerator\n", " ee_num = ee_regression(theta=gamma, # Numerator model\n", " X=dc[['I']], y=a, # ... observed data\n", " model='logistic') # ... logit model\n", " num = inverse_logit(np.dot(dc[['I']], gamma)) # Predicted prob\n", " \n", " # Stabilized inverse probability weights\n", " ipw = np.where(a == 1, num/pi, (1-num)/(1-pi))\n", " \n", " # Estimating the MSM using a weighted linear model\n", " ee_msm = ee_regression(theta=alpha, # Marginal structural model\n", " X=msm, y=y, # ... observed data\n", " model='linear', # ... linear model\n", " weights=ipw) # ... weighted\n", " \n", " # Stacking the estimating equations and returning\n", " return np.vstack([ee_msm, ee_num, ee_ps])" ], "outputs": [], "execution_count": 30 }, { "cell_type": "code", "id": "8b3d0d83", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:18.654620Z", "start_time": "2026-04-07T16:31:18.594077500Z" } }, "source": [ "# M-estimator\n", "init_vals = [0., 0., 0., 0., ] + [0., ] + [0.,]*W.shape[1]\n", "estr = MEstimator(psi_ipw_msm5, init=init_vals)\n", "estr.estimate(solver='hybr', maxiter=5000)" ], "outputs": [], "execution_count": 31 }, { "cell_type": "code", "id": "ff308a62", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:18.666776Z", "start_time": "2026-04-07T16:31:18.657135600Z" } }, "source": [ "estr.print_results(subset=[0, 1, 2, 3])" ], "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "==============================================================\n", " Estimation Method: M-estimator\n", "--------------------------------------------------------------\n", "No. Observations: 1566 | No. Parameters: 24\n", "Solving algorithm: hybr | Max Iterations: 5000\n", "Solving tolerance: 1e-09 | Allow P-Inverse: 1\n", "Derivative Method: approx | Deriv Approx: 1e-09\n", "Small N Correction: None | Distribution: Z-stat\n", "==============================================================\n", " Theta StdErr Z-score LCL UCL P-value S-value \n", "--------------------------------------------------------------\n", " 1.78 0.30 5.91 1.19 2.38 0.00 28.09 \n", " 3.52 0.63 5.57 2.28 4.76 0.00 25.22 \n", " -0.01 0.45 -0.02 -0.88 0.86 0.98 0.02 \n", " -0.16 1.02 -0.16 -2.16 1.84 0.88 0.19 \n", "==============================================================\n" ] } ], "execution_count": 32 }, { "cell_type": "markdown", "id": "f208c5a9", "metadata": {}, "source": [ "While not reported in the book, other online references report the following coefficients: 1.7844,\t3.5220, -0.0087, -0.1595.\n", "\n", "As before, we can replicate this result using the built-in `ee_ipw_msm` function." ] }, { "cell_type": "code", "id": "b351bfac", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:18.699613100Z", "start_time": "2026-04-07T16:31:18.682275Z" } }, "source": [ "def psi_ipw_msm5a(theta):\n", " # Built-in estimating equation\n", " return ee_ipw_msm(theta, y=y, A=a, W=W, V=msm, \n", " distribution='normal', \n", " link='identity')" ], "outputs": [], "execution_count": 33 }, { "cell_type": "code", "id": "fd07dd28", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:18.720379800Z", "start_time": "2026-04-07T16:31:18.701610100Z" } }, "source": [ "# M-estimator\n", "init_vals = [0., ]*msm.shape[1] + init_ps\n", "estr = MEstimator(psi_ipw_msm5a, init=init_vals)\n", "estr.estimate(solver='hybr', maxiter=5000)" ], "outputs": [], "execution_count": 34 }, { "cell_type": "code", "id": "a5c18154", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:18.731938200Z", "start_time": "2026-04-07T16:31:18.721905200Z" } }, "source": [ "estr.print_results(subset=[0, 1, 2, 3])" ], "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "==============================================================\n", " Estimation Method: M-estimator\n", "--------------------------------------------------------------\n", "No. Observations: 1566 | No. Parameters: 23\n", "Solving algorithm: hybr | Max Iterations: 5000\n", "Solving tolerance: 1e-09 | Allow P-Inverse: 1\n", "Derivative Method: approx | Deriv Approx: 1e-09\n", "Small N Correction: None | Distribution: Z-stat\n", "==============================================================\n", " Theta StdErr Z-score LCL UCL P-value S-value \n", "--------------------------------------------------------------\n", " 1.78 0.30 5.91 1.19 2.38 0.00 28.09 \n", " 3.52 0.63 5.57 2.28 4.76 0.00 25.22 \n", " -0.01 0.45 -0.02 -0.88 0.86 0.98 0.02 \n", " -0.16 1.02 -0.16 -2.16 1.84 0.88 0.19 \n", "==============================================================\n" ] } ], "execution_count": 35 }, { "cell_type": "markdown", "id": "f64bd4ee", "metadata": {}, "source": [ "which again replicates the by-hand results.\n", "\n", "### 12.6: Censoring and missing data\n", "\n", "To conclude, we will now consider the missing outcomes that were ignored earlier. To do this, we will use stabilized inverse probability of missingness weights (IPCW in the book). As we have done so many times already, we will use `delicatessen` to stack additional nuisance models together." ] }, { "cell_type": "code", "id": "2eef5bcd", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:18.754613900Z", "start_time": "2026-04-07T16:31:18.745958900Z" } }, "source": [ "# Design matrix for propensity score model\n", "W = df[['I', 'sex', 'race', 'age', 'age_sq', \n", " 'educ_2', 'educ_3', 'educ_4', 'educ_5',\n", " 'smokeintensity', 'smokeintensity_sq',\n", " 'smokeyrs', 'smokeyrs_sq', \n", " 'exer_1', 'exer_2', 'active_1', 'active_2',\n", " 'wt71', 'wt71_sq']]\n", "# Design matrix for missing model\n", "X = df[['I', 'qsmk', 'sex', 'race', 'age', 'age_sq', \n", " 'educ_2', 'educ_3', 'educ_4', 'educ_5',\n", " 'smokeintensity', 'smokeintensity_sq',\n", " 'smokeyrs', 'smokeyrs_sq', \n", " 'exer_1', 'exer_2', 'active_1', 'active_2',\n", " 'wt71', 'wt71_sq']]\n", "# Design matrix for marginal structural model\n", "msm = df[['I', 'qsmk']]\n", "# Treatment array\n", "a = df['qsmk']\n", "# Outcome array\n", "y = df['wt82_71']\n", "# Missing indicator\n", "r = np.where(df['wt82_71'].isna(), 0, 1)" ], "outputs": [], "execution_count": 36 }, { "cell_type": "code", "id": "7fb0a19b", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:18.770900500Z", "start_time": "2026-04-07T16:31:18.755610400Z" } }, "source": [ "def psi_ipw_msm6(theta):\n", " # Dividing parameters up for their corresponding estimation equations\n", " alpha = theta[0:2] # MSM\n", " gamma_n = np.array([theta[2], ]) # Numerator PS\n", " beta_n = np.array([theta[3], ]) # Numerator MW\n", " gamma_d = theta[4:4+W.shape[1]] # Propensity score\n", " beta_d = theta[4+W.shape[1]:] # Missing model\n", " \n", " # Estimating the propensity scores using a logit model\n", " ee_ps = ee_regression(theta=gamma_d, # Propensity score\n", " X=W, y=a, # ... observed data\n", " model='logistic') # ... logit model\n", " pi_a = inverse_logit(np.dot(W, gamma_d)) # Predicted prob\n", "\n", " # Estimating intercept-only for numerator of IPTW\n", " ee_num = ee_regression(theta=gamma_n, # Numerator\n", " X=df[['I']], y=a, # ... observed data\n", " model='logistic') # ... logit model\n", " num_a = inverse_logit(np.dot(df[['I']], gamma_n))\n", "\n", " # Estimating the missing scores using a logit model\n", " ee_ms = ee_regression(theta=beta_d, # Missing score\n", " X=X, y=r, # ... observed data\n", " model='logistic') # ... logit model\n", " pi_m = inverse_logit(np.dot(X, beta_d)) # Predicted prob\n", "\n", " # Estimating intercept-only for numerator of IPMW\n", " ee_sms = ee_regression(theta=beta_n, # Numerator\n", " X=df[['I']], y=r, # ... observed data\n", " model='logistic') # ... logit model\n", " num_m = inverse_logit(np.dot(df[['I']], beta_n))\n", " \n", " # Stabilized inverse probability weights\n", " iptw = np.where(a == 1, num_a/pi_a, (1-num_a)/(1-pi_a))\n", " ipmw = np.where(r == 1, num_m/pi_m, 0)\n", " ipw = iptw*ipmw\n", " \n", " # Estimating the MSM using a weighted linear model\n", " ee_msm = ee_regression(theta=alpha, # MSM\n", " X=msm, y=y, # ... observed data\n", " model='linear', # ... linear model\n", " weights=ipw) # ... weighted\n", " # Setting rows with missing Y's as zero (no contribution)\n", " ee_msm = np.nan_to_num(ee_msm, copy=False, nan=0.)\n", " \n", " # Stacking the estimating equations and returning\n", " return np.vstack([ee_msm, ee_num, ee_sms, ee_ps, ee_ms])" ], "outputs": [], "execution_count": 37 }, { "cell_type": "code", "id": "6838a60a", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:19.087490Z", "start_time": "2026-04-07T16:31:18.773899900Z" } }, "source": [ "# M-estimator\n", "init_vals = [0., 0., ] + [0., 0., ] + [0.,]*W.shape[1] + [0.,]*X.shape[1]\n", "estr = MEstimator(psi_ipw_msm6, init=init_vals)\n", "estr.estimate(solver='hybr', maxiter=5000)" ], "outputs": [], "execution_count": 38 }, { "cell_type": "code", "id": "c1914aad", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:19.115416Z", "start_time": "2026-04-07T16:31:19.088491700Z" } }, "source": [ "estr.print_results(subset=[0, 1])" ], "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "==============================================================\n", " Estimation Method: M-estimator\n", "--------------------------------------------------------------\n", "No. Observations: 1629 | No. Parameters: 43\n", "Solving algorithm: hybr | Max Iterations: 5000\n", "Solving tolerance: 1e-09 | Allow P-Inverse: 1\n", "Derivative Method: approx | Deriv Approx: 1e-09\n", "Small N Correction: None | Distribution: Z-stat\n", "==============================================================\n", " Theta StdErr Z-score LCL UCL P-value S-value \n", "--------------------------------------------------------------\n", " 1.66 0.22 7.47 1.23 2.10 0.00 43.50 \n", " 3.50 0.49 7.20 2.54 4.45 0.00 40.62 \n", "==============================================================\n" ] } ], "execution_count": 39 }, { "cell_type": "markdown", "id": "6c9a3721", "metadata": {}, "source": [ "Here, the book reports 3.5 (95% CI: 2.5, 4.5).\n", "\n", "Again, we will replicate this using `ee_ipw_msm` instead. To incorporate the missingness weights, we will fit a separate model and then use the optional `weights` argument. Below is how this can be done" ] }, { "cell_type": "code", "id": "73faacf3", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:19.125640700Z", "start_time": "2026-04-07T16:31:19.116420200Z" } }, "source": [ "def psi_ipw_msm6a(theta):\n", " # Separating parameters out\n", " alpha = theta[:2+W.shape[1]] # MSM & PS\n", " gamma = theta[2+W.shape[1]:] # Missing score\n", " \n", " # Estimating equation for IPMW\n", " ee_ms = ee_regression(theta=gamma, # Missing score\n", " X=X, y=r, # ... observed data\n", " model='logistic') # ... logit model\n", " pi_m = inverse_logit(np.dot(X, gamma)) # Predicted prob\n", " ipmw = r / pi_m # Missing weights\n", "\n", " # Estimating equations for MSM and PS\n", " ee_msm = ee_ipw_msm(theta=alpha, y=y, A=a, W=W, V=msm, \n", " distribution='normal', \n", " link='identity',\n", " weights=ipmw)\n", " # Setting rows with missing Y's as zero (no contribution)\n", " ee_msm = np.nan_to_num(ee_msm, copy=False, nan=0.)\n", " \n", " # Return the stacked estimating equations\n", " return np.vstack([ee_msm, ee_ms])" ], "outputs": [], "execution_count": 40 }, { "cell_type": "code", "id": "3b638089", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:19.194959700Z", "start_time": "2026-04-07T16:31:19.125640700Z" } }, "source": [ "# M-estimator\n", "init_vals = [0., ]*msm.shape[1] + init_ps + [0., ]*X.shape[1]\n", "estr = MEstimator(psi_ipw_msm6a, init=init_vals)\n", "estr.estimate(solver='hybr', maxiter=5000)" ], "outputs": [], "execution_count": 41 }, { "cell_type": "code", "id": "696754ca", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:19.213431Z", "start_time": "2026-04-07T16:31:19.196468900Z" } }, "source": [ "estr.print_results(subset=[0, 1])" ], "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "==============================================================\n", " Estimation Method: M-estimator\n", "--------------------------------------------------------------\n", "No. Observations: 1629 | No. Parameters: 41\n", "Solving algorithm: hybr | Max Iterations: 5000\n", "Solving tolerance: 1e-09 | Allow P-Inverse: 1\n", "Derivative Method: approx | Deriv Approx: 1e-09\n", "Small N Correction: None | Distribution: Z-stat\n", "==============================================================\n", " Theta StdErr Z-score LCL UCL P-value S-value \n", "--------------------------------------------------------------\n", " 1.66 0.22 7.47 1.23 2.10 0.00 43.50 \n", " 3.50 0.49 7.20 2.54 4.45 0.00 40.62 \n", "==============================================================\n" ] } ], "execution_count": 42 }, { "cell_type": "markdown", "id": "eae99648", "metadata": {}, "source": [ "Again, we are able to replicate the by-hand implementation. This concludes chapter 12.\n", "\n", "## Chapter 13: Standardization and the Parametric G-Formula\n", "\n", "For Chapter 13, the book reviews the g-formula. Unlike IPW, the g-formula relies on modeling the outcome process. The g-computation algorithm estimator is \n", "$$ \\hat{E}[Y^a] = n^{-1} \\sum_{i=1}^n m(a, W_i; \\beta) $$\n", "where $m$ is a statistical model for $E[Y | A, W]$ and $\\beta$ are the parameters defining the model. This version is slightly different from the standardization form given in the book, but it is equivalent. Broadly, we apply the g-computation algorithm via (1) estimate an outcome model, (2) predict the outcomes had everyone been assigned $a$, and (3) compute the mean of those predictions. \n", "\n", "### 13.2: Estimating the mean outcome via modeling\n", "\n", "First, we will fit a linear model for `wt82_71` conditional on `qsmk` and the set of confounding variables. To begin, we will ignore the missing outcomes." ] }, { "cell_type": "code", "id": "0f92d4b7", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:19.232941100Z", "start_time": "2026-04-07T16:31:19.224452500Z" } }, "source": [ "# Design matrix for outcome model\n", "X = dc[['I', 'qsmk', 'qsmk_smkint',\n", " 'sex', 'race', 'age', 'age_sq', \n", " 'educ_2', 'educ_3', 'educ_4', 'educ_5',\n", " 'smokeintensity', 'smokeintensity_sq',\n", " 'smokeyrs', 'smokeyrs_sq', \n", " 'exer_1', 'exer_2', 'active_1', 'active_2',\n", " 'wt71', 'wt71_sq']]\n", "# Outcome array\n", "y = dc['wt82_71']" ], "outputs": [], "execution_count": 43 }, { "cell_type": "code", "id": "f7b43bc4", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:19.241014200Z", "start_time": "2026-04-07T16:31:19.233938100Z" } }, "source": [ "def psi_regression(theta): \n", " # Estimating the linear outcome model\n", " return ee_regression(theta=theta, \n", " X=X, y=y, \n", " model='linear')" ], "outputs": [], "execution_count": 44 }, { "cell_type": "code", "id": "2c09a41f", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:19.254005500Z", "start_time": "2026-04-07T16:31:19.241014200Z" } }, "source": [ "# M-estimator\n", "init_vals = [0., ]*X.shape[1]\n", "estr = MEstimator(psi_regression, init=init_vals)\n", "estr.estimate(solver='hybr', maxiter=5000)" ], "outputs": [], "execution_count": 45 }, { "cell_type": "code", "id": "b7216d18", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:19.299496100Z", "start_time": "2026-04-07T16:31:19.257805300Z" } }, "source": [ "estr.print_results()" ], "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "==============================================================\n", " Estimation Method: M-estimator\n", "--------------------------------------------------------------\n", "No. Observations: 1566 | No. Parameters: 21\n", "Solving algorithm: hybr | Max Iterations: 5000\n", "Solving tolerance: 1e-09 | Allow P-Inverse: 1\n", "Derivative Method: approx | Deriv Approx: 1e-09\n", "Small N Correction: None | Distribution: Z-stat\n", "==============================================================\n", " Theta StdErr Z-score LCL UCL P-value S-value \n", "--------------------------------------------------------------\n", " -1.59 4.71 -0.34 -10.83 7.65 0.74 0.44 \n", " 2.56 0.84 3.06 0.92 4.20 0.00 8.82 \n", " 0.05 0.04 1.14 -0.03 0.13 0.25 1.99 \n", " -1.43 0.51 -2.83 -2.42 -0.44 0.00 7.75 \n", " 0.56 0.60 0.94 -0.61 1.73 0.35 1.52 \n", " 0.36 0.19 1.92 -0.01 0.73 0.06 4.18 \n", " -0.01 0.00 -3.15 -0.01 -0.00 0.00 9.25 \n", " 0.79 0.63 1.25 -0.45 2.03 0.21 2.25 \n", " 0.56 0.55 1.01 -0.53 1.64 0.31 1.67 \n", " 1.49 0.76 1.97 0.01 2.97 0.05 4.36 \n", " -0.19 0.72 -0.27 -1.60 1.21 0.79 0.35 \n", " 0.05 0.06 0.86 -0.06 0.16 0.39 1.36 \n", " -0.00 0.00 -0.98 -0.00 0.00 0.33 1.62 \n", " 0.13 0.11 1.21 -0.08 0.35 0.23 2.14 \n", " -0.00 0.00 -1.07 -0.01 0.00 0.29 1.81 \n", " 0.30 0.47 0.64 -0.62 1.21 0.53 0.93 \n", " 0.35 0.52 0.68 -0.66 1.37 0.50 1.01 \n", " -0.95 0.40 -2.38 -1.73 -0.17 0.02 5.84 \n", " -0.26 0.75 -0.35 -1.72 1.20 0.73 0.46 \n", " 0.05 0.10 0.46 -0.15 0.24 0.64 0.64 \n", " -0.00 0.00 -1.50 -0.00 0.00 0.13 2.89 \n", "==============================================================\n" ] } ], "execution_count": 46 }, { "cell_type": "markdown", "id": "4ab00df1", "metadata": {}, "source": [ "To ease application of later steps, we are going to save a list of these optimized values for starting values of the root-finding procedure." ] }, { "cell_type": "code", "id": "666b5823", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:19.310008100Z", "start_time": "2026-04-07T16:31:19.301496200Z" } }, "source": [ "init_reg = list(estr.theta)" ], "outputs": [], "execution_count": 47 }, { "cell_type": "markdown", "id": "c3a8fcc8", "metadata": {}, "source": [ "### 13.3: Standardizing the mean outcome to the confounder distribution\n", "\n", "Now we can apply the g-computation algorithm (a way to evaluate the g-formula). To do this, we are going to create a copy of our data set. In that copy we are going to set `qsmk=1` for all observations. We will then save the design matrix. We will then repeat this process for `qsmk=0`." ] }, { "cell_type": "code", "id": "b3c91754", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:19.319517Z", "start_time": "2026-04-07T16:31:19.311008500Z" } }, "source": [ "# Copy of the data that we will update qsmk in\n", "dca = dc.copy()" ], "outputs": [], "execution_count": 48 }, { "cell_type": "code", "id": "0181be8b", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:19.334215Z", "start_time": "2026-04-07T16:31:19.319517Z" } }, "source": [ "# Setting qsmk to 1\n", "dca['qsmk'] = 1\n", "dca['qsmk_smkint'] = dca['qsmk'] * dca['smokeintensity']\n", "# Design matrix from qsmk=1 data\n", "X1 = dca[['I', 'qsmk', 'qsmk_smkint',\n", " 'sex', 'race', 'age', 'age_sq', \n", " 'educ_2', 'educ_3', 'educ_4', 'educ_5',\n", " 'smokeintensity', 'smokeintensity_sq',\n", " 'smokeyrs', 'smokeyrs_sq', \n", " 'exer_1', 'exer_2', 'active_1', 'active_2',\n", " 'wt71', 'wt71_sq']]" ], "outputs": [], "execution_count": 49 }, { "cell_type": "code", "id": "5409cd48", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:19.345778400Z", "start_time": "2026-04-07T16:31:19.336729Z" } }, "source": [ "# Setting qsmk to 0\n", "dca['qsmk'] = 0\n", "dca['qsmk_smkint'] = dca['qsmk'] * dca['smokeintensity']\n", "# Design matrix from qsmk=0 data\n", "X0 = dca[['I', 'qsmk', 'qsmk_smkint',\n", " 'sex', 'race', 'age', 'age_sq', \n", " 'educ_2', 'educ_3', 'educ_4', 'educ_5',\n", " 'smokeintensity', 'smokeintensity_sq',\n", " 'smokeyrs', 'smokeyrs_sq', \n", " 'exer_1', 'exer_2', 'active_1', 'active_2',\n", " 'wt71', 'wt71_sq']]" ], "outputs": [], "execution_count": 50 }, { "cell_type": "markdown", "id": "56d5b237", "metadata": {}, "source": [ "Now, we will use the design matrices to generate predicted outcomes (pseudo-outcomes) for each individual. The average causal effect can then be calculated from those pseudo-outcomes. \n", "\n", "The following is the corresponding estimating equations" ] }, { "cell_type": "code", "id": "05d5600d", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:19.359304Z", "start_time": "2026-04-07T16:31:19.345778400Z" } }, "source": [ "def psi_gcomp1(theta): \n", " # Dividing parameters into corresponding estimation equations\n", " rd, r1, r0 = theta[0], theta[1], theta[2] # Causal means\n", " beta = theta[3:] # Outcome model\n", " \n", " # Estimating the linear model\n", " ee_reg = ee_regression(theta=beta, # Outcome model\n", " X=X, y=y, # ... observed data\n", " model='linear') # ... linear model\n", " \n", " # Generating pseudo-outcome using the model\n", " y1hat = np.dot(X1, beta) # Predicted Y when qsmk=1\n", " y0hat = np.dot(X0, beta) # Predicted Y when qsmk=0\n", " \n", " # Causal means\n", " ee_r1 = y1hat - r1 # Causal mean for qsmk=1\n", " ee_r0 = y0hat - r0 # Causal mean for qsmk=0\n", " \n", " # Average causal effect\n", " ee_rd = np.ones(y.shape[0]) * ((r1 - r0) - rd)\n", " \n", " # Stacking the estimating equations and returning\n", " return np.vstack([ee_rd, ee_r1, ee_r0, ee_reg])" ], "outputs": [], "execution_count": 51 }, { "cell_type": "code", "id": "6b5ed64e", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:19.387772100Z", "start_time": "2026-04-07T16:31:19.361303800Z" } }, "source": [ "# M-estimator\n", "init_vals = [0., 0., 0.,] + init_reg\n", "estr = MEstimator(psi_gcomp1, init=init_vals)\n", "estr.estimate(solver='hybr', maxiter=5000)" ], "outputs": [], "execution_count": 52 }, { "cell_type": "code", "id": "24e1fab9", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:19.400099100Z", "start_time": "2026-04-07T16:31:19.389772800Z" } }, "source": [ "estr.print_results(subset=[0, 1, 2])" ], "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "==============================================================\n", " Estimation Method: M-estimator\n", "--------------------------------------------------------------\n", "No. Observations: 1566 | No. Parameters: 24\n", "Solving algorithm: hybr | Max Iterations: 5000\n", "Solving tolerance: 1e-09 | Allow P-Inverse: 1\n", "Derivative Method: approx | Deriv Approx: 1e-09\n", "Small N Correction: None | Distribution: Z-stat\n", "==============================================================\n", " Theta StdErr Z-score LCL UCL P-value S-value \n", "--------------------------------------------------------------\n", " 3.52 0.48 7.36 2.58 4.45 0.00 42.36 \n", " 5.27 0.44 12.12 4.42 6.13 0.00 109.95 \n", " 1.76 0.22 8.08 1.33 2.18 0.00 50.48 \n", "==============================================================\n" ] } ], "execution_count": 53 }, { "cell_type": "markdown", "id": "a53cab09", "metadata": {}, "source": [ "The first estimate is for the average causal effect (the second are the causal means under all quit smoking and all don't quit smoking). Here, the confidence intervals match the book, but we were able to avoid the computational complexity of the bootstrap (another advantage of the sandwich variance estimator). \n", "\n", "In the book, they report 3.5 (95% CI: 2.6, 4.5). The confidence interval in the book was generated via the bootstrap, so there is potential for random error in the estimates.\n", "\n", "Rather than implement g-computation by-hand, we can also use the built-in estimating equations. Here is an example of that" ] }, { "cell_type": "code", "id": "0ae94f30", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:19.419695Z", "start_time": "2026-04-07T16:31:19.411837900Z" } }, "source": [ "def psi_gcomp1a(theta):\n", " # Built-in g-formula estimating equation\n", " return ee_gformula(theta, \n", " y=y, X=X,\n", " X1=X1, X0=X0)" ], "outputs": [], "execution_count": 54 }, { "cell_type": "code", "id": "18322f47", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:19.440440400Z", "start_time": "2026-04-07T16:31:19.421154300Z" } }, "source": [ "# M-estimator\n", "init_vals = [0., 0., 0.,] + init_reg\n", "estr = MEstimator(psi_gcomp1a, init=init_vals)\n", "estr.estimate(solver='hybr', maxiter=5000)" ], "outputs": [], "execution_count": 55 }, { "cell_type": "code", "id": "c3e0135b", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:19.453316500Z", "start_time": "2026-04-07T16:31:19.440440400Z" } }, "source": [ "estr.print_results(subset=[0, 1, 2])" ], "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "==============================================================\n", " Estimation Method: M-estimator\n", "--------------------------------------------------------------\n", "No. Observations: 1566 | No. Parameters: 24\n", "Solving algorithm: hybr | Max Iterations: 5000\n", "Solving tolerance: 1e-09 | Allow P-Inverse: 1\n", "Derivative Method: approx | Deriv Approx: 1e-09\n", "Small N Correction: None | Distribution: Z-stat\n", "==============================================================\n", " Theta StdErr Z-score LCL UCL P-value S-value \n", "--------------------------------------------------------------\n", " 3.52 0.48 7.36 2.58 4.45 0.00 42.36 \n", " 5.27 0.44 12.12 4.42 6.13 0.00 109.95 \n", " 1.76 0.22 8.08 1.33 2.18 0.00 50.48 \n", "==============================================================\n" ] } ], "execution_count": 56 }, { "cell_type": "markdown", "id": "0ff2ba34-1ac7-447e-ba84-3d711c192d8a", "metadata": {}, "source": [ "which provides the same answers (as we would expect!)\n", "\n", "### Fine Point 13.2: A doubly robust estimator\n", "\n", "As a way to combine the estimators of the previous chapters, this aside in the book discusses construction of 'doubly robust' estimators. Namely, this estimator takes both the propensity score and outcome regression approaches, and then combines them in such a way that the resulting estimator is consistent when either of the two nuisance models is correctly specified. The Fine Point reviews a doubly robust estimator proposed by Bang & Robins. The doubly robust estimator considered here is different and is sometimes referred to as the canonical augmented inverse probability weighting (AIPW) estimator. Briefly, we apply the estimating functions for the nuisance models and then we combine these nuisance parameters via the following formula\n", "$$ \\hat{E}[Y^a] = n^{-1} \\sum_{i=1}^n \\frac{Y_i I(A_i = a)}{\\pi_{A=a}(W_i)} + m(a, W_i; \\beta) I(A_i) \\frac{\\pi_{A\\ne a}(W_i)}{\\pi_{A=a}(W_i)} $$\n", "which can be implemented manually with the following estimating functions" ] }, { "cell_type": "code", "id": "49d54406-70b0-4b55-81fd-4a54852029cc", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:19.482548200Z", "start_time": "2026-04-07T16:31:19.465328100Z" } }, "source": [ "# Refreshing design matrix for the propensity score model\n", "W = dc[['I', 'sex', 'race', 'age', 'age_sq', \n", " 'educ_2', 'educ_3', 'educ_4', 'educ_5',\n", " 'smokeintensity', 'smokeintensity_sq',\n", " 'smokeyrs', 'smokeyrs_sq', \n", " 'exer_1', 'exer_2', 'active_1', 'active_2',\n", " 'wt71', 'wt71_sq']]\n", "# Refreshing treatment array\n", "a = dc['qsmk']" ], "outputs": [], "execution_count": 57 }, { "cell_type": "code", "id": "bdb42ccb-c935-4c20-93ab-50510d1a5c91", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:19.492021800Z", "start_time": "2026-04-07T16:31:19.484552500Z" } }, "source": [ "def psi_aipw1(theta): \n", " # Dividing parameters into corresponding estimation equations\n", " ndim_X = X.shape[1]\n", " rd, r1, r0 = theta[0], theta[1], theta[2] # Causal means\n", " beta = theta[3: 3+ndim_X] # Outcome model\n", " alpha = theta[3+ndim_X:] # Outcome model\n", "\n", " # Estimating the propensity scores\n", " ee_ps = ee_regression(theta=alpha, # Estimate propensity scores\n", " X=W, y=a, # ... given observed A,W\n", " model='logistic') # ... with logit model\n", " pi = inverse_logit(np.dot(W, alpha)) # Get Pr(A = 1 | W)\n", " \n", " # Estimating the outcome model\n", " ee_reg = ee_regression(theta=beta, # Outcome model\n", " X=X, y=y, # ... observed data\n", " model='linear') # ... linear model\n", " y1 = np.dot(X1, beta) # Pseudo potential outcome\n", " y0 = np.dot(X0, beta) # Pseudo potential outcome\n", " \n", " # Causal means\n", " ee_r1 = (y*a/pi - y1*(a-pi)/pi) - r1 # Causal mean for qsmk=1\n", " ee_r0 = (y*(1-a)/(1-pi) + y0*(a-pi)/(1-pi)) - r0 # Causal mean for qsmk=0\n", " \n", " # Average causal effect\n", " ee_rd = np.ones(y.shape[0]) * ((r1 - r0) - rd)\n", " \n", " # Stacking the estimating equations and returning\n", " return np.vstack([ee_rd, ee_r1, ee_r0, ee_reg, ee_ps])" ], "outputs": [], "execution_count": 58 }, { "cell_type": "code", "id": "a413f0da-198e-413b-ad8a-3b9154b9af6b", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:19.595395Z", "start_time": "2026-04-07T16:31:19.493021300Z" } }, "source": [ "# M-estimator\n", "init_vals = [0., 0., 0., ] + [0., ]*X.shape[1] + [0.,]*W.shape[1]\n", "estr = MEstimator(psi_aipw1, init=init_vals)\n", "estr.estimate(solver='hybr', maxiter=5000)" ], "outputs": [], "execution_count": 59 }, { "cell_type": "code", "id": "f56f5c88-8f3b-4c46-ae06-551de8e102f7", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:19.627566700Z", "start_time": "2026-04-07T16:31:19.595395Z" } }, "source": [ "estr.print_results(subset=[0, 1, 2])" ], "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "==============================================================\n", " Estimation Method: M-estimator\n", "--------------------------------------------------------------\n", "No. Observations: 1566 | No. Parameters: 43\n", "Solving algorithm: hybr | Max Iterations: 5000\n", "Solving tolerance: 1e-09 | Allow P-Inverse: 1\n", "Derivative Method: approx | Deriv Approx: 1e-09\n", "Small N Correction: None | Distribution: Z-stat\n", "==============================================================\n", " Theta StdErr Z-score LCL UCL P-value S-value \n", "--------------------------------------------------------------\n", " 3.46 0.48 7.14 2.51 4.41 0.00 39.99 \n", " 5.22 0.44 11.82 4.36 6.09 0.00 104.64 \n", " 1.77 0.22 8.07 1.34 2.20 0.00 50.39 \n", "==============================================================\n" ] } ], "execution_count": 60 }, { "cell_type": "markdown", "id": "c6f54163-afc8-49f7-ac42-8c033c2d8511", "metadata": {}, "source": [ "Here, the AIPW results are similar to the previous g-computation and IPW results. \n", "\n", "We can also use the built-in functionality for the canonical AIPW estimator (`ee_aipw`) to do this same approach without all the manual coding." ] }, { "cell_type": "code", "id": "7fbe1eb5-b5dd-4c79-9162-79fbe038e54f", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:19.641933300Z", "start_time": "2026-04-07T16:31:19.628565200Z" } }, "source": [ "def psi_aipw2(theta):\n", " return ee_aipw(theta=theta, y=y, A=a, W=W, X=X, X1=X1, X0=X0)" ], "outputs": [], "execution_count": 61 }, { "cell_type": "code", "id": "24b327df-9840-4cf8-bda2-fa91c3f380cf", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:19.687335700Z", "start_time": "2026-04-07T16:31:19.643932900Z" } }, "source": [ "# M-estimator\n", "init_vals = [0., 0., 0., ] + [-1.5, ] + [0., ]*(X.shape[1]-1) + [0.,]*W.shape[1]\n", "estr = MEstimator(psi_aipw2, init=init_vals)\n", "estr.estimate(solver='hybr', maxiter=5000)" ], "outputs": [], "execution_count": 62 }, { "cell_type": "code", "id": "ddff362a-d519-43ba-87d0-853360719a89", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:19.698350100Z", "start_time": "2026-04-07T16:31:19.689339Z" } }, "source": [ "estr.print_results(subset=[0, 1, 2])" ], "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "==============================================================\n", " Estimation Method: M-estimator\n", "--------------------------------------------------------------\n", "No. Observations: 1566 | No. Parameters: 43\n", "Solving algorithm: hybr | Max Iterations: 5000\n", "Solving tolerance: 1e-09 | Allow P-Inverse: 1\n", "Derivative Method: approx | Deriv Approx: 1e-09\n", "Small N Correction: None | Distribution: Z-stat\n", "==============================================================\n", " Theta StdErr Z-score LCL UCL P-value S-value \n", "--------------------------------------------------------------\n", " 3.46 0.48 7.14 2.51 4.41 0.00 39.99 \n", " 5.22 0.44 11.82 4.36 6.09 0.00 104.64 \n", " 1.77 0.22 8.07 1.34 2.20 0.00 50.39 \n", "==============================================================\n" ] } ], "execution_count": 63 }, { "cell_type": "markdown", "id": "8dca05f4-f401-45ae-bbe6-52b08db04509", "metadata": {}, "source": [ "These results match the by-hand implementation, as expected.\n", "\n", "There is also an additional way to implement the AIPW estimator. This approach fits a weighted outcome model, where the weights are from IPW. This model is then used to predict values, like the standard g-computation algorithm. As shown below, this approach might be easier to implement than the others. This version is sometimes referred to as the weighted-regression AIPW" ] }, { "cell_type": "code", "id": "350d709f-2cd4-46ae-a88a-81ee6da7b187", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:19.721858200Z", "start_time": "2026-04-07T16:31:19.710852400Z" } }, "source": [ "def psi_aipw3(theta): \n", " # Dividing parameters into corresponding estimation equations\n", " ndim_X = X.shape[1]\n", " rd, r1, r0 = theta[0], theta[1], theta[2] # Causal means\n", " beta = theta[3: 3+ndim_X] # Outcome model\n", " alpha = theta[3+ndim_X:] # Outcome model\n", "\n", " # Estimating the propensity scores\n", " ee_ps = ee_regression(theta=alpha, # Estimate propensity scores\n", " X=W, y=a, # ... given observed A,W\n", " model='logistic') # ... with logit model\n", " pi = inverse_logit(np.dot(W, alpha)) # Get Pr(A = 1 | W)\n", " ipw = 1 / np.where(a == 1, pi, 1-pi) # Convert to IPW\n", " \n", " # Estimating the outcome model\n", " ee_reg = ee_regression(theta=beta, # Outcome model\n", " X=X, y=y, # ... observed data\n", " model='linear', # ... linear model\n", " weights=ipw) # ... weighted by IPW\n", " y1 = np.dot(X1, beta) # Pseudo potential outcome\n", " y0 = np.dot(X0, beta) # Pseudo potential outcome\n", " \n", " # Causal means\n", " ee_r1 = y1 - r1 # Causal mean for qsmk=1\n", " ee_r0 = y0 - r0 # Causal mean for qsmk=0\n", " \n", " # Average causal effect\n", " ee_rd = np.ones(y.shape[0]) * ((r1 - r0) - rd)\n", " \n", " # Stacking the estimating equations and returning\n", " return np.vstack([ee_rd, ee_r1, ee_r0, ee_reg, ee_ps])" ], "outputs": [], "execution_count": 64 }, { "cell_type": "code", "id": "bec6b950-2357-4054-87c9-fea7ba2ec1f7", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:19.874035200Z", "start_time": "2026-04-07T16:31:19.725422300Z" } }, "source": [ "# M-estimator\n", "init_vals = [0., 0., 0., ] + [0., ]*X.shape[1] + [0.,]*W.shape[1]\n", "estr = MEstimator(psi_aipw3, init=init_vals)\n", "estr.estimate(solver='hybr', maxiter=5000)" ], "outputs": [], "execution_count": 65 }, { "cell_type": "code", "id": "02f095f1-32a9-4c13-8c80-bba1899109a0", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:19.901600Z", "start_time": "2026-04-07T16:31:19.875539700Z" } }, "source": [ "estr.print_results(subset=[0, 1, 2])" ], "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "==============================================================\n", " Estimation Method: M-estimator\n", "--------------------------------------------------------------\n", "No. Observations: 1566 | No. Parameters: 43\n", "Solving algorithm: hybr | Max Iterations: 5000\n", "Solving tolerance: 1e-09 | Allow P-Inverse: 1\n", "Derivative Method: approx | Deriv Approx: 1e-09\n", "Small N Correction: None | Distribution: Z-stat\n", "==============================================================\n", " Theta StdErr Z-score LCL UCL P-value S-value \n", "--------------------------------------------------------------\n", " 3.43 0.48 7.16 2.49 4.36 0.00 40.13 \n", " 5.19 0.44 11.90 4.33 6.04 0.00 106.14 \n", " 1.76 0.22 8.06 1.33 2.19 0.00 50.19 \n", "==============================================================\n" ] } ], "execution_count": 66 }, { "cell_type": "markdown", "id": "da42b104", "metadata": {}, "source": [ "Note that these estimates differ slightly, but this is expected since it solves a different equation. While not relevant in this setting, the weighted-regression AIPW may be preferred over the canonical AIPW, since the former is guaranteed to be bounded in the parameter space while the latter is not.\n", "\n", "As detailed in Shook-Sa et al. *Biometrics* 2025, the M-estimation setup offers the additional feature that the AIPW estimators have doubly robust point *and* variance estimation. This is not always the case for variance estimation with AIPW (which is not desirable). \n", "\n", "## Chapter 14: G-Estimation of Structural Nested Models\n", "\n", "G-estimation differs from the previous approaches in what parameter it targets. Rather than the average causal effect, we will estimate the parameters of a structural nested model. The structural nested mean model is\n", "$$ E[Y^a - Y^{a=0} | A=a, W] = \\varphi_0 a$$\n", "Here, $\\varphi_0$ represents the difference by $a$. However, we can also study effect measure modification easily with structural nested models. Consider\n", "$$ E[Y^a - Y^{a=0} | A=a, W] = \\varphi_0 a + \\varphi_1 a V$$\n", "where $V \\in W$. Therefore, the structural nested model described effect measure modification by $V$. Importantly, structural nested models assume that we have correctly specified this model (hence the difference in interest parameters that occurs between methods).\n", "\n", "G-estimation is a bit less straightforward to understand compare to the previous methods. Note that we will be using the propensity score for estimation and we still assume the same identification conditions. See the book for a much more in-depth discussion.\n", "\n", "### 14.5 G-estimation\n", "\n", "For solve our g-estimator, we are going to use a different approach than the one described in the main text of the book. Instead, we are going to use the approach (the estimating equations) described in Technical Point 14.2. As mentioned there, this approach is equivalent to the process described in the main text." ] }, { "cell_type": "code", "id": "11ecd3fa", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:19.913514900Z", "start_time": "2026-04-07T16:31:19.902598700Z" } }, "source": [ "# Design matrix for propensity scores\n", "W = np.asarray(df[['I', 'sex', 'race', 'age', 'age_sq', \n", " 'educ_2', 'educ_3', 'educ_4', 'educ_5',\n", " 'smokeintensity', 'smokeintensity_sq',\n", " 'smokeyrs', 'smokeyrs_sq', \n", " 'exer_1', 'exer_2', 'active_1', 'active_2',\n", " 'wt71', 'wt71_sq']])\n", "# Design matrix for missing model\n", "X = np.asarray(df[['I', 'qsmk', 'sex', 'race', 'age', 'age_sq', \n", " 'educ_2', 'educ_3', 'educ_4', 'educ_5',\n", " 'smokeintensity', 'smokeintensity_sq',\n", " 'smokeyrs', 'smokeyrs_sq', \n", " 'exer_1', 'exer_2', 'active_1', 'active_2',\n", " 'wt71', 'wt71_sq']])\n", "# Design matrix for structural nested model\n", "snm = np.asarray(df[['I', ]])\n", "# Treatment array\n", "a = np.asarray(df['qsmk'])\n", "# Outcome array\n", "y = np.asarray(df['wt82_71'])\n", "# Missing indicator\n", "r = np.where(df['wt82_71'].isna(), 0, 1)" ], "outputs": [], "execution_count": 67 }, { "cell_type": "code", "id": "0e563f05", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:19.923795300Z", "start_time": "2026-04-07T16:31:19.916025600Z" } }, "source": [ "def psi_snm1(theta):\n", " # Dividing parameters into corresponding estimation equations\n", " alpha = theta[:snm.shape[1]] # SNM parameters\n", " beta = theta[snm.shape[1]: # Propensity score\n", " snm.shape[1]+W.shape[1]]\n", " gamma = theta[snm.shape[1]+W.shape[1]:] # Missing score\n", " \n", " # Estimating equation for IPMW\n", " ee_ms = ee_regression(theta=gamma, # Missing score\n", " X=X, y=r, # ... observed data\n", " model='logistic') # ... logit model\n", " pi_m = inverse_logit(np.dot(X, gamma)) # Predicted prob\n", " ipmw = r / pi_m # Missing weights\n", "\n", " # Estimating equations for PS \n", " ee_log = ee_regression(theta=beta, # Propensity score\n", " X=W, y=a, # ... observed data\n", " model='logistic', # ... logit model\n", " weights=ipmw) # ... weighted\n", " pi = inverse_logit(np.dot(W, beta)) # Predicted prob\n", " \n", " # H(psi) equation for linear models\n", " h_psi = y - np.dot(snm*a[:, None], alpha)\n", "\n", " # Estimating equation for the structural nested mean model\n", " ee_snm = ipmw*(h_psi[:, None] * (a - pi)[:, None] * snm).T\n", " # Setting rows with missing Y's as zero (no contribution)\n", " ee_snm = np.nan_to_num(ee_snm, copy=False, nan=0.)\n", "\n", " return np.vstack([ee_snm, ee_log, ee_ms])" ], "outputs": [], "execution_count": 68 }, { "cell_type": "code", "id": "6ac52feb", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:19.959371800Z", "start_time": "2026-04-07T16:31:19.925297300Z" } }, "source": [ "# M-estimator\n", "init_vals = [0., ] + init_ps + [0., ]*X.shape[1]\n", "estr = MEstimator(psi_snm1, init=init_vals)\n", "estr.estimate(solver='hybr', maxiter=5000)" ], "outputs": [], "execution_count": 69 }, { "cell_type": "code", "id": "25663d6a", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:19.971831800Z", "start_time": "2026-04-07T16:31:19.961371800Z" } }, "source": [ "estr.print_results(subset=[0, ])" ], "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "==============================================================\n", " Estimation Method: M-estimator\n", "--------------------------------------------------------------\n", "No. Observations: 1629 | No. Parameters: 40\n", "Solving algorithm: hybr | Max Iterations: 5000\n", "Solving tolerance: 1e-09 | Allow P-Inverse: 1\n", "Derivative Method: approx | Deriv Approx: 1e-09\n", "Small N Correction: None | Distribution: Z-stat\n", "==============================================================\n", " Theta StdErr Z-score LCL UCL P-value S-value \n", "--------------------------------------------------------------\n", " 3.45 0.47 7.35 2.53 4.36 0.00 42.21 \n", "==============================================================\n" ] } ], "execution_count": 70 }, { "cell_type": "markdown", "id": "e1a79513", "metadata": {}, "source": [ "The book reported 3.4 (95% CI: 2.5, 4.5), generating confidence interval by inverting the test results. As such, there is expected differences (different estimators are being used). As in the inverse probability weighting examples, these confidence intervals are expected due to be conservative due to the use of the 'GEE trick'. \n", "\n", "Now consider how we can use the built-in g-estimation functionality, `ee_gestimation_snmm`." ] }, { "cell_type": "code", "id": "9df3779e", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:19.994450500Z", "start_time": "2026-04-07T16:31:19.983727800Z" } }, "source": [ "def psi_snm1a(theta):\n", " # Dividing parameters into corresponding estimation equations\n", " alpha = theta[:snm.shape[1]+W.shape[1]] # SNM and PS\n", " gamma = theta[snm.shape[1]+W.shape[1]:] # Missing scores\n", " \n", " # Estimating equation for IPMW\n", " ee_ms = ee_regression(theta=gamma, # Missing score\n", " X=X, y=r, # ... observed data\n", " model='logistic') # ... logit model \n", " pi_m = inverse_logit(np.dot(X, gamma)) # Predicted prob\n", " ipmw = r / pi_m # Missing weights\n", "\n", " # Estimating equations for PS \n", " ee_snm = ee_gestimation_snmm(theta=alpha,\n", " y=y, A=a, \n", " W=W, V=snm,\n", " weights=ipmw)\n", " # Setting rows with missing Y's as zero (no contribution)\n", " ee_snm = np.nan_to_num(ee_snm, copy=False, nan=0.)\n", "\n", " return np.vstack([ee_snm, ee_ms])" ], "outputs": [], "execution_count": 71 }, { "cell_type": "code", "id": "64e33d42", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:20.036675700Z", "start_time": "2026-04-07T16:31:19.995954300Z" } }, "source": [ "# M-estimator\n", "init_vals = [0., ] + init_ps + [0., ]*X.shape[1]\n", "estr = MEstimator(psi_snm1a, init=init_vals)\n", "estr.estimate(solver='hybr', maxiter=5000)" ], "outputs": [], "execution_count": 72 }, { "cell_type": "code", "id": "d0ebdbfd", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:20.048551700Z", "start_time": "2026-04-07T16:31:20.038676400Z" } }, "source": [ "estr.print_results(subset=[0, ])" ], "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "==============================================================\n", " Estimation Method: M-estimator\n", "--------------------------------------------------------------\n", "No. Observations: 1629 | No. Parameters: 40\n", "Solving algorithm: hybr | Max Iterations: 5000\n", "Solving tolerance: 1e-09 | Allow P-Inverse: 1\n", "Derivative Method: approx | Deriv Approx: 1e-09\n", "Small N Correction: None | Distribution: Z-stat\n", "==============================================================\n", " Theta StdErr Z-score LCL UCL P-value S-value \n", "--------------------------------------------------------------\n", " 3.45 0.47 7.35 2.53 4.36 0.00 42.21 \n", "==============================================================\n" ] } ], "execution_count": 73 }, { "cell_type": "markdown", "id": "589dfe25", "metadata": {}, "source": [ "The structural nested model parameters match between implementations, as expected.\n", "\n", "### 14.6: Structural nested models with two or more parameters\n", "\n", "We now adapt the previous code to consider more than 2 parameters in the structural nested model. Thankfully, this is pretty straightforward for how we coded the estimating equations since `snm` is a global object." ] }, { "cell_type": "code", "id": "ba0ca7c4", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:20.066949900Z", "start_time": "2026-04-07T16:31:20.059055600Z" } }, "source": [ "snm = np.asarray(df[['I', 'smokeintensity']])" ], "outputs": [], "execution_count": 74 }, { "cell_type": "code", "id": "4849b70e", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:20.075402900Z", "start_time": "2026-04-07T16:31:20.067953400Z" } }, "source": [ "def psi_snm2(theta):\n", " # Dividing parameters into corresponding estimation equations\n", " alpha = theta[:snm.shape[1]] # SNM parameters\n", " beta = theta[snm.shape[1]: # Propensity score\n", " snm.shape[1]+W.shape[1]]\n", " gamma = theta[snm.shape[1]+W.shape[1]:] # Missing score\n", " \n", " # Estimating equation for IPMW\n", " ee_ms = ee_regression(theta=gamma, # Missing score\n", " X=X, y=r, # ... observed data\n", " model='logistic') # ... logit model\n", " pi_m = inverse_logit(np.dot(X, gamma)) # Predicted prob\n", " ipmw = r / pi_m # Missing weights\n", "\n", " # Estimating equations for PS \n", " ee_log = ee_regression(theta=beta, # Propensity score\n", " X=W, y=a, # ... observed data\n", " model='logistic', # ... logit model\n", " weights=ipmw) # ... weighted\n", " pi = inverse_logit(np.dot(W, beta)) # Predicted prob\n", " \n", " # H(psi) equation for linear models\n", " h_psi = y - np.dot(snm*a[:, None], alpha)\n", "\n", " # Estimating equation for the structural nested mean model\n", " ee_snm = ipmw*(h_psi[:, None] * (a - pi)[:, None] * snm).T\n", " # Setting rows with missing Y's as zero (no contribution)\n", " ee_snm = np.nan_to_num(ee_snm, copy=False, nan=0.)\n", "\n", " return np.vstack([ee_snm, ee_log, ee_ms])" ], "outputs": [], "execution_count": 75 }, { "cell_type": "code", "id": "744e9a11", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:20.115361Z", "start_time": "2026-04-07T16:31:20.076406400Z" } }, "source": [ "# M-estimator\n", "init_vals = [0., 0., ] + init_ps + [0., ]*X.shape[1]\n", "estr = MEstimator(psi_snm2, init=init_vals)\n", "estr.estimate(solver='hybr', maxiter=5000)" ], "outputs": [], "execution_count": 76 }, { "cell_type": "code", "id": "9f097670", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:20.128388500Z", "start_time": "2026-04-07T16:31:20.118368100Z" } }, "source": [ "estr.print_results(subset=[0, 1])" ], "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "==============================================================\n", " Estimation Method: M-estimator\n", "--------------------------------------------------------------\n", "No. Observations: 1629 | No. Parameters: 41\n", "Solving algorithm: hybr | Max Iterations: 5000\n", "Solving tolerance: 1e-09 | Allow P-Inverse: 1\n", "Derivative Method: approx | Deriv Approx: 1e-09\n", "Small N Correction: None | Distribution: Z-stat\n", "==============================================================\n", " Theta StdErr Z-score LCL UCL P-value S-value \n", "--------------------------------------------------------------\n", " 2.86 0.93 3.06 1.03 4.69 0.00 8.83 \n", " 0.03 0.05 0.66 -0.06 0.12 0.51 0.97 \n", "==============================================================\n" ] } ], "execution_count": 77 }, { "cell_type": "markdown", "id": "f413273c", "metadata": {}, "source": [ "which provides the same results as the book (2.86 and 0.03). Unlike the book, we provide confidence intervals. Inverting the test is not simple in this case, and the book does not report variance estimates from a bootstrapping procedure.\n", "\n", "Again, we can apply the built-in estimating equations for g-estimation. In truth, the new estimating equation is the same as the previous g-estimation version (since `snm` is a global object as called). Regardless, we provide again here." ] }, { "cell_type": "code", "id": "6092fe6b", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:20.147661900Z", "start_time": "2026-04-07T16:31:20.140895800Z" } }, "source": [ "def psi_snm2a(theta):\n", " # Dividing parameters into corresponding estimation equations\n", " alpha = theta[:snm.shape[1]+W.shape[1]] # SNM and PS\n", " gamma = theta[snm.shape[1]+W.shape[1]:] # Missing scores\n", " \n", " # Estimating equation for IPMW\n", " ee_ms = ee_regression(theta=gamma, # Missing score\n", " X=X, y=r, # ... observed data\n", " model='logistic') # ... logit model \n", " pi_m = inverse_logit(np.dot(X, gamma)) # Predicted prob\n", " ipmw = r / pi_m # Missing weights\n", "\n", " # Estimating equations for PS \n", " ee_snm = ee_gestimation_snmm(theta=alpha,\n", " y=y, A=a, \n", " W=W, V=snm,\n", " weights=ipmw)\n", " # Setting rows with missing Y's as zero (no contribution)\n", " ee_snm = np.nan_to_num(ee_snm, copy=False, nan=0.)\n", "\n", " return np.vstack([ee_snm, ee_ms])" ], "outputs": [], "execution_count": 78 }, { "cell_type": "code", "id": "019d5c2f", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:20.194350500Z", "start_time": "2026-04-07T16:31:20.150664600Z" } }, "source": [ "# M-estimator\n", "init_vals = [0., 0., ] + init_ps + [0., ]*X.shape[1]\n", "estr = MEstimator(psi_snm2a, init=init_vals)\n", "estr.estimate(solver='hybr', maxiter=5000)" ], "outputs": [], "execution_count": 79 }, { "cell_type": "code", "id": "12cb56e3", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:20.206529300Z", "start_time": "2026-04-07T16:31:20.196860900Z" } }, "source": [ "estr.print_results(subset=[0, 1])" ], "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "==============================================================\n", " Estimation Method: M-estimator\n", "--------------------------------------------------------------\n", "No. Observations: 1629 | No. Parameters: 41\n", "Solving algorithm: hybr | Max Iterations: 5000\n", "Solving tolerance: 1e-09 | Allow P-Inverse: 1\n", "Derivative Method: approx | Deriv Approx: 1e-09\n", "Small N Correction: None | Distribution: Z-stat\n", "==============================================================\n", " Theta StdErr Z-score LCL UCL P-value S-value \n", "--------------------------------------------------------------\n", " 2.86 0.93 3.06 1.03 4.69 0.00 8.83 \n", " 0.03 0.05 0.66 -0.06 0.12 0.51 0.97 \n", "==============================================================\n" ] } ], "execution_count": 80 }, { "cell_type": "markdown", "id": "a8d8ec22", "metadata": {}, "source": [ "This concludes chapter 14.\n", "\n", "## Chapter 16: Instrumental Variable Analysis\n", "\n", "Chapter 16 focuses on instrumental variable (IV) analysis. Here, we are going to utilize an instrument (high state cigarette prices, variable name: `highprice`) to estimate the effect of quitting smoking on weight gain.\n", "\n", "Here, we will reload the data set, since the rows we need to drop due to missing data differ from the previous examples." ] }, { "cell_type": "code", "id": "0737c485-f3c3-4fb7-8a3a-4a45d3c2afbe", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:20.232994500Z", "start_time": "2026-04-07T16:31:20.217594600Z" } }, "source": [ "df = pd.read_csv('data/nhefs.csv')\n", "df['highprice'] = (df['price82'] >= 1.5).astype('int')\n", "df.dropna(subset=['wt82', 'price82'], \n", " inplace=True)\n", "df = df[['highprice', 'qsmk', 'wt82_71', 'price82']].copy()\n", "df['I'] = 1" ], "outputs": [], "execution_count": 81 }, { "cell_type": "code", "id": "9f536b09-0af0-4db6-8966-04c826fd76c7", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:20.242990300Z", "start_time": "2026-04-07T16:31:20.235502200Z" } }, "source": [ "z = np.asarray(df['highprice'])\n", "a = np.asarray(df['qsmk'])\n", "y = np.asarray(df['wt82_71'])" ], "outputs": [], "execution_count": 82 }, { "cell_type": "markdown", "id": "8be4c0ef-45f5-4e0f-908b-dd019bb21a03", "metadata": {}, "source": [ "### 16.1: The three instrumental conditions\n", "\n", "As described in the book, we will first examine the relationship between our instrument $Z$ and the exposure $A$. Below we present estimating equations for $\\Pr(A=1 | Z=1) - \\Pr(A=1 | Z=0)$" ] }, { "cell_type": "code", "id": "81f181a8-4287-4084-8a5b-e9e0e45a0b2b", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:20.250613300Z", "start_time": "2026-04-07T16:31:20.242990300Z" } }, "source": [ "def psi_iv_strength(theta):\n", " ee_az1 = z*(a - theta[0])\n", " ee_az0 = (1-z)*(a - theta[1])\n", " ee_rd = np.ones(a.shape[0]) * (theta[0] - theta[1] - theta[2])\n", " return np.vstack([ee_az1, ee_az0, ee_rd])" ], "outputs": [], "execution_count": 83 }, { "cell_type": "code", "id": "86b860ef-4589-4e2b-b812-b895d88d0575", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:20.258459100Z", "start_time": "2026-04-07T16:31:20.250613300Z" } }, "source": [ "estr = MEstimator(psi_iv_strength, init=[0.5, 0.5, 0.])\n", "estr.estimate()" ], "outputs": [], "execution_count": 84 }, { "cell_type": "code", "id": "f770abbb-6101-4440-ac5e-bc59042e4476", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:20.295678300Z", "start_time": "2026-04-07T16:31:20.258459100Z" } }, "source": [ "estr.print_results()" ], "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "==============================================================\n", " Estimation Method: M-estimator\n", "--------------------------------------------------------------\n", "No. Observations: 1476 | No. Parameters: 3\n", "Solving algorithm: lm | Max Iterations: 5000\n", "Solving tolerance: 1e-09 | Allow P-Inverse: 1\n", "Derivative Method: approx | Deriv Approx: 1e-09\n", "Small N Correction: None | Distribution: Z-stat\n", "==============================================================\n", " Theta StdErr Z-score LCL UCL P-value S-value \n", "--------------------------------------------------------------\n", " 0.26 0.01 22.33 0.24 0.28 0.00 364.43 \n", " 0.20 0.06 3.15 0.07 0.32 0.00 9.27 \n", " 0.06 0.06 1.00 -0.06 0.19 0.32 1.65 \n", "==============================================================\n" ] } ], "execution_count": 85 }, { "cell_type": "markdown", "id": "612490de-9ef2-40fc-816f-c298e21075c2", "metadata": {}, "source": [ "These are the same (point) estimates as those reported in the book. Note that our results also provide the confidence intervals. Here, $Z$ is considered to be a *weak* instrument since the risk difference is 6.3%. Our results also show that the corresponding confidence intervals are also quite wide.\n", "\n", "### 16.2: The usual IV estimand\n", "\n", "The usual IV is defined as\n", "$$ \\beta = \\frac{E[Y \\mid Z=1] - E[Y \\mid Z=0]}{E[A \\mid Z=1] - E[A \\mid Z=0]} $$\n", "\n", "There are a few ways to implement this IV estimator with estimating equations. The first is to estimate each of the pieces and combine them as shown in the previous equation. The estimating equations for this approach are\n", "$$ \\sum_{i=1}^n \n", "\\begin{bmatrix}\n", " (\\alpha_1 - \\alpha_2) - \\beta(\\alpha_3 - \\alpha_4) \\\\\n", " Z_i (Y_i - \\alpha_1) \\\\\n", " (1-Z_i) \\times (Y_i - \\alpha_2) \\\\\n", " Z_i (A_i - \\alpha_3) \\\\\n", " (1-Z_i) \\times (A_i - \\alpha_4) \\\\\n", "\\end{bmatrix}\n", "= 0 $$\n", "where the first equation is the usual IV, and the following 4 are the $Z$ conditional means for $Y$ and $A$ respectively. This estimating equation can be implemented by" ] }, { "cell_type": "code", "id": "07a5066f-22fa-4065-8130-3be91926400b", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:20.307111700Z", "start_time": "2026-04-07T16:31:20.296683100Z" } }, "source": [ "def psi_usual_iv1(theta):\n", " ee_uiv = np.ones(y.shape[0]) * ((theta[1] - theta[2]) \n", " - theta[0]*(theta[3] - theta[4]))\n", " ee_yz1 = z*(y - theta[1]) # for E[Y | Z=1]\n", " ee_yz0 = (1-z)*(y - theta[2]) # for E[Y | Z=0]\n", " ee_az1 = z*(a - theta[3]) # for E[A | Z=1]\n", " ee_az0 = (1-z)*(a - theta[4]) # for E[A | Z=0]\n", " return np.vstack([ee_uiv, ee_yz1, ee_yz0, ee_az1, ee_az0])" ], "outputs": [], "execution_count": 86 }, { "cell_type": "code", "id": "6ae918b2-d735-4ae2-9381-c8cfe06709a5", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:20.318618300Z", "start_time": "2026-04-07T16:31:20.310111400Z" } }, "source": [ "estr = MEstimator(psi_usual_iv1, init=[0., 0., 0., 0.5, 0.5])\n", "estr.estimate()" ], "outputs": [], "execution_count": 87 }, { "cell_type": "code", "id": "e7b8ec6a-bd94-4ab8-95a5-5fe75e616b1c", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:20.340247200Z", "start_time": "2026-04-07T16:31:20.318618300Z" } }, "source": [ "estr.print_results(decimals=3)" ], "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "==============================================================\n", " Estimation Method: M-estimator\n", "--------------------------------------------------------------\n", "No. Observations: 1476 | No. Parameters: 5\n", "Solving algorithm: lm | Max Iterations: 5000\n", "Solving tolerance: 1e-09 | Allow P-Inverse: 1\n", "Derivative Method: approx | Deriv Approx: 1e-09\n", "Small N Correction: None | Distribution: Z-stat\n", "==============================================================\n", " Theta StdErr Z-score LCL UCL P-value S-value \n", "--------------------------------------------------------------\n", " 2.396 22.341 0.107 -41.391 46.183 0.915 0.129 \n", " 2.686 0.208 12.888 2.278 3.095 0.000 123.834 \n", " 2.536 1.444 1.756 -0.294 5.365 0.079 3.662 \n", " 0.258 0.012 22.328 0.235 0.280 0.000 364.433 \n", " 0.195 0.062 3.153 0.074 0.316 0.002 9.272 \n", "==============================================================\n" ] } ], "execution_count": 88 }, { "cell_type": "markdown", "id": "05fe8ad6-5f77-47ae-a5b3-b51a055bc7d9", "metadata": {}, "source": [ "The $Z$ conditional means match those reported in the book (2.686, 2.536, 0.2578, 0.1951), which gives us the same IV estimate as the book (2.4 kg).\n", "\n", "Another way is shown in Boos & Stefanski (2013). The usual IV can be expressed as the following pair of estimating equations\n", "$$ \\sum_{i=1}^n \n", "\\begin{bmatrix}\n", " (Y_i - A_i \\beta) \\times (Z_i - \\alpha) \\\\\n", " Z_i - \\alpha \\\\\n", "\\end{bmatrix}\n", "= 0 $$\n", "so there is only one nuisance parameter (the mean of $Z$) that needs to be estimated. The following implements this estimating equation by-hand" ] }, { "cell_type": "code", "id": "329cfc67-3879-456b-934a-b777489613c3", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:20.349518900Z", "start_time": "2026-04-07T16:31:20.341265800Z" } }, "source": [ "def psi_usual_iv2(theta):\n", " ee_uiv = (y - a*theta[0]) * (z - theta[1])\n", " ee_z = z - theta[1]\n", " return np.vstack([ee_uiv, ee_z])" ], "outputs": [], "execution_count": 89 }, { "cell_type": "code", "id": "a19a3d89-7809-43bd-aec9-8b5ed40d6e17", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:20.358751500Z", "start_time": "2026-04-07T16:31:20.349518900Z" } }, "source": [ "estr = MEstimator(psi_usual_iv2, init=[0., 0.5])\n", "estr.estimate()" ], "outputs": [], "execution_count": 90 }, { "cell_type": "code", "id": "466934d2-e46a-43a5-9a0e-d5f5ec1bd708", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:20.368268800Z", "start_time": "2026-04-07T16:31:20.358751500Z" } }, "source": [ "estr.print_results(decimals=3)" ], "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "==============================================================\n", " Estimation Method: M-estimator\n", "--------------------------------------------------------------\n", "No. Observations: 1476 | No. Parameters: 2\n", "Solving algorithm: lm | Max Iterations: 5000\n", "Solving tolerance: 1e-09 | Allow P-Inverse: 1\n", "Derivative Method: approx | Deriv Approx: 1e-09\n", "Small N Correction: None | Distribution: Z-stat\n", "==============================================================\n", " Theta StdErr Z-score LCL UCL P-value S-value \n", "--------------------------------------------------------------\n", " 2.396 22.341 0.107 -41.391 46.184 0.915 0.129 \n", " 0.972 0.004 227.288 0.964 0.981 0.000 inf \n", "==============================================================\n" ] } ], "execution_count": 91 }, { "cell_type": "markdown", "id": "913aedc6-2745-4175-b022-52a74d879597", "metadata": {}, "source": [ "There is also a built-in version of the usual IV estimator, shown below" ] }, { "cell_type": "code", "id": "7349ab6e-b123-4200-89b1-fbbb407b30df", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:20.386605400Z", "start_time": "2026-04-07T16:31:20.378781700Z" } }, "source": [ "def psi_usual_iv3(theta):\n", " return ee_iv_causal(theta, y=y, A=a, Z=z)" ], "outputs": [], "execution_count": 92 }, { "cell_type": "code", "id": "c88d9755-a4d2-4dfc-94ae-78446f02379a", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:20.396311200Z", "start_time": "2026-04-07T16:31:20.387608600Z" } }, "source": [ "estr = MEstimator(psi_usual_iv3, init=[0., 0.5])\n", "estr.estimate()" ], "outputs": [], "execution_count": 93 }, { "cell_type": "code", "id": "b35a301e-23bd-4a37-8017-0ad1a1b68bee", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:20.420393Z", "start_time": "2026-04-07T16:31:20.397311300Z" } }, "source": [ "estr.print_results(decimals=3)" ], "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "==============================================================\n", " Estimation Method: M-estimator\n", "--------------------------------------------------------------\n", "No. Observations: 1476 | No. Parameters: 2\n", "Solving algorithm: lm | Max Iterations: 5000\n", "Solving tolerance: 1e-09 | Allow P-Inverse: 1\n", "Derivative Method: approx | Deriv Approx: 1e-09\n", "Small N Correction: None | Distribution: Z-stat\n", "==============================================================\n", " Theta StdErr Z-score LCL UCL P-value S-value \n", "--------------------------------------------------------------\n", " 2.396 22.341 0.107 -41.391 46.184 0.915 0.129 \n", " 0.972 0.004 227.288 0.964 0.981 0.000 inf \n", "==============================================================\n" ] } ], "execution_count": 94 }, { "cell_type": "markdown", "id": "caae9f4c-2cb0-4f67-90f4-2f055d926a18", "metadata": {}, "source": [ "As expected, the built-in implementation provides the same result.\n", "\n", "In the book, another approach for estimation is also described: two-stage least squares (2SLS). 2SLS works by estimating two linear regression models. The first is for $A$ given $Z$. From that model, we can get the predicted value of $A$ given $Z$ and an intercept, denoted by $\\hat{A}$. Next, we can fit a model for $Y$ given $\\hat{A}$ and an intercept. The coefficient for $\\hat{A}$ from this model is the IV estimate. Below is code to implement this approach by-hand" ] }, { "cell_type": "code", "id": "11d6232e-cb91-4ccf-92cc-c0365fbdd059", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:20.432306300Z", "start_time": "2026-04-07T16:31:20.420393Z" } }, "source": [ "Z = np.asarray(df[['I', 'highprice']])" ], "outputs": [], "execution_count": 95 }, { "cell_type": "code", "id": "05c409fe-3a68-4a3b-b723-b7caf04bd425", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:20.441254200Z", "start_time": "2026-04-07T16:31:20.432306300Z" } }, "source": [ "def psi_2sls1(theta):\n", " beta = theta[:2]\n", " alpha = theta[2:]\n", "\n", " # First-stage regression\n", " ee_stage1 = ee_regression(theta=alpha, y=a, X=Z, model='linear')\n", "\n", " # Second-stage regression\n", " a_hat = np.dot(Z, alpha)\n", " A_hat = np.asarray([a_hat, np.ones(a.shape[0])]).T\n", " ee_stage2 = ee_regression(theta=beta, y=y, X=A_hat, model='linear')\n", "\n", " # Returning stacked estimating equations\n", " return np.vstack([ee_stage2, ee_stage1])" ], "outputs": [], "execution_count": 96 }, { "cell_type": "code", "id": "83587237-a0d5-4923-a22f-682a6998d8e5", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:20.455224Z", "start_time": "2026-04-07T16:31:20.442255500Z" } }, "source": [ "estr = MEstimator(psi_2sls1, init=[0., 0., 0., 0.])\n", "estr.estimate()" ], "outputs": [], "execution_count": 97 }, { "cell_type": "code", "id": "5c491c7d-001b-403b-b585-fe25935668bb", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:20.480031600Z", "start_time": "2026-04-07T16:31:20.456775800Z" } }, "source": "estr.print_results(decimals=3)", "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "==============================================================\n", " Estimation Method: M-estimator\n", "--------------------------------------------------------------\n", "No. Observations: 1476 | No. Parameters: 4\n", "Solving algorithm: lm | Max Iterations: 5000\n", "Solving tolerance: 1e-09 | Allow P-Inverse: 1\n", "Derivative Method: approx | Deriv Approx: 1e-09\n", "Small N Correction: None | Distribution: Z-stat\n", "==============================================================\n", " Theta StdErr Z-score LCL UCL P-value S-value \n", "--------------------------------------------------------------\n", " 2.396 22.343 0.107 -41.395 46.188 0.915 0.129 \n", " 2.068 5.734 0.361 -9.171 13.307 0.718 0.477 \n", " 0.195 0.062 3.153 0.074 0.316 0.002 9.272 \n", " 0.063 0.063 0.996 -0.061 0.186 0.319 1.648 \n", "==============================================================\n" ] } ], "execution_count": 98 }, { "cell_type": "markdown", "id": "c76cfab4-cd38-43a0-83c6-6d3adabb3f42", "metadata": {}, "source": [ "Here, the first element is the IV estimate (note how the `A_hat` matrix is built). This estimate matches the previous methods and the book. The book reports their 95% CI as −36.5 to 41.3, which differs slightly from what we have. This is because we are using a different variance estimator. The one reported in the book assumes homoskedasticity, whereas the empirical sandwich variance estimator does not.\n", "\n", "There is also a built-in procedure for 2SLS. The following is how that built-in procedure can be used to replicate the by-hand code" ] }, { "cell_type": "code", "id": "6a1aad24-7356-4f02-b6ae-267ef0fb5d4e", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:20.501204500Z", "start_time": "2026-04-07T16:31:20.490547600Z" } }, "source": [ "def psi_2sls2(theta):\n", " return ee_2sls(theta, y=df['wt82_71'], A=df['qsmk'], \n", " Z=df[['highprice', ]], W=df[['I', ]])" ], "outputs": [], "execution_count": 99 }, { "cell_type": "code", "id": "6cc2eaa6-97d5-4f02-a5f1-84624a898c46", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:20.562726200Z", "start_time": "2026-04-07T16:31:20.505202400Z" } }, "source": [ "estr = MEstimator(psi_2sls2, init=[0., 0., 0., 0.])\n", "estr.estimate()" ], "outputs": [], "execution_count": 100 }, { "cell_type": "code", "id": "0494914a-3f70-40f5-805e-1fb0b4ac5401", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:20.583287700Z", "start_time": "2026-04-07T16:31:20.565381200Z" } }, "source": "estr.print_results(subset=[0, ], decimals=3)", "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "==============================================================\n", " Estimation Method: M-estimator\n", "--------------------------------------------------------------\n", "No. Observations: 1476 | No. Parameters: 4\n", "Solving algorithm: lm | Max Iterations: 5000\n", "Solving tolerance: 1e-09 | Allow P-Inverse: 1\n", "Derivative Method: approx | Deriv Approx: 1e-09\n", "Small N Correction: None | Distribution: Z-stat\n", "==============================================================\n", " Theta StdErr Z-score LCL UCL P-value S-value \n", "--------------------------------------------------------------\n", " 2.396 22.336 0.107 -41.381 46.173 0.915 0.129 \n", "==============================================================\n" ] } ], "execution_count": 101 }, { "cell_type": "markdown", "id": "d60ac80e-4bf4-4c3b-aba7-d7a340ae1c2e", "metadata": {}, "source": [ "The built-in procedure matches the by-hand implementation, as expected. Note that the 2SLS implementation is more general and allows for one to adjust for other variables or use multiple instruments. Therefore, it is likely to be the preferred implementation over the usual IV expression.\n", "\n", "As mentioned in Technical Point 16.3, one can also use g-estimation with instrumental variables. In particular, 2SLS is a special case of g-estimation with an additive structural mean model. Therefore, we also illustrate how to use g-estimation with instrumental variables. Note that this g-estimator differs a bit from the one considered in Chapter 14. In particular, we will model the expected value of the *instrument* rather than the action variable. Otherwise the procedure remains the same. For g-estimation with instrumental variables, `delicatessen` provides the `ee_gestimation_snmm_iv` function. Here, `V` corresponds to the specified structural mean model, which will only reflect the affect of `A`." ] }, { "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:20.599497200Z", "start_time": "2026-04-07T16:31:20.585293900Z" } }, "cell_type": "code", "source": [ "def psi_gestr_iv(theta):\n", " return ee_gestimation_snmm_iv(theta=theta, y=df['wt82_71'], Z=df['highprice'],\n", " A=df['qsmk'], W=df[['I', ]], V=df[['I', ]])" ], "id": "7d9396df9a89a19a", "outputs": [], "execution_count": 102 }, { "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:20.625155200Z", "start_time": "2026-04-07T16:31:20.603507Z" } }, "cell_type": "code", "source": [ "estr = MEstimator(psi_gestr_iv, init=[0., 0.])\n", "estr.estimate()" ], "id": "445cca507bbb81e5", "outputs": [], "execution_count": 103 }, { "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:20.648589600Z", "start_time": "2026-04-07T16:31:20.627486300Z" } }, "cell_type": "code", "source": "estr.print_results(subset=[0, ], decimals=3)", "id": "11c7a284313906d", "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "==============================================================\n", " Estimation Method: M-estimator\n", "--------------------------------------------------------------\n", "No. Observations: 1476 | No. Parameters: 2\n", "Solving algorithm: lm | Max Iterations: 5000\n", "Solving tolerance: 1e-09 | Allow P-Inverse: 1\n", "Derivative Method: approx | Deriv Approx: 1e-09\n", "Small N Correction: None | Distribution: Z-stat\n", "==============================================================\n", " Theta StdErr Z-score LCL UCL P-value S-value \n", "--------------------------------------------------------------\n", " 2.396 22.341 0.107 -41.391 46.183 0.915 0.129 \n", "==============================================================\n" ] } ], "execution_count": 104 }, { "metadata": {}, "cell_type": "markdown", "source": [ "As we would expect, this matches the 2SLS estimator. The advantage of the g-estimator is that we can allow effects by exogenous variables to vary and we can estimate non-linear parameters (e.g., ratios). \n", "\n", "### 16.5: The three instrument conditions revisited\n", "\n", "Below is code to run the analysis at different thresholds for the 'high price' cut-off, as described in the book." ], "id": "f3aef522cbdf0ba" }, { "cell_type": "code", "id": "6845f5c8-9410-46e5-bf22-348d92c8cb2a", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:20.655353300Z", "start_time": "2026-04-07T16:31:20.648589600Z" } }, "source": [ "def psi_2sls3(theta):\n", " return ee_2sls(theta, y=df['wt82_71'], A=df['qsmk'], \n", " Z=z_new[:, None], W=df[['I', ]])" ], "outputs": [], "execution_count": 105 }, { "cell_type": "code", "id": "5d4f6101-3164-439d-bf1c-8ef8769a1999", "metadata": { "ExecuteTime": { "end_time": "2026-04-07T16:31:20.855170200Z", "start_time": "2026-04-07T16:31:20.655353300Z" } }, "source": [ "for c in [1.60, 1.70, 1.80, 1.90]:\n", " z_new = np.where(df['price82'] >= c, 1, 0)\n", " estr = MEstimator(psi_2sls3, init=[0., 0., 0., 0.])\n", " estr.estimate()\n", " ci = estr.confidence_intervals()[0, :]\n", " print(\"Cutoff: $\"+str(c))\n", " print(np.round([estr.theta[0], ci[0], ci[1]], 3))" ], "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Cutoff: $1.6\n", "[ 41.281 -278.122 360.685]\n", "Cutoff: $1.7\n", "[ -40.912 -428.146 346.322]\n", "Cutoff: $1.8\n", "[-21.103 -77.16 34.953]\n", "Cutoff: $1.9\n", "[-12.811 -54.941 29.318]\n" ] } ], "execution_count": 106 }, { "cell_type": "markdown", "id": "1c3edbe5-a0de-40b2-8094-b9bd499067be", "metadata": {}, "source": [ "These results are the same as those reported in the book (41.3, −40.9, −21.1, −12.8). Again confidence intervals may differ slightly due to differences between variance estimators. \n", "\n", "## Chapter 17: Causal Survival Analysis\n", "\n", "Replication of the following chapter 17 is not available yet.\n", "\n", "## References\n", "\n", "Hernán MA, & Robins JM. (2010). *Causal inference: What If*. Boca Raton: Chapman & Hall/CRC.\n", "\n", "Shook-Sa BE, Zivich PN, Lee C, Xue K, Ross RK, Edwards JK, Stringer JSA, & Cole SR. (2025). Double robust variance estimation with parametric working models. *Biometrics*, 81(2), ujaf054." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.7" } }, "nbformat": 4, "nbformat_minor": 5 }