{ "cells": [ { "cell_type": "markdown", "id": "33646d20", "metadata": {}, "source": [ "# Ross et al. (2024): Introduction to M-estimation\n", "\n", "The following is a replication of the cases described in Ross et al. (2024). The original paper provides a tutorial on M-estimation and provides several introductory examples. Examples are provided in the context of regression, standardization, and measurement error. Here, we recreate the cases described in the paper. For finer details, see the original publication.\n", "\n", "## Setup" ] }, { "cell_type": "code", "execution_count": 1, "id": "127d7517", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Versions\n", "NumPy: 2.3.5\n", "pandas: 2.3.3\n", "Delicatessen: 4.1\n" ] } ], "source": [ "import numpy as np\n", "import pandas as pd\n", "\n", "import delicatessen\n", "from delicatessen import MEstimator\n", "from delicatessen.estimating_equations import ee_regression, ee_rogan_gladen\n", "from delicatessen.utilities import inverse_logit\n", "\n", "print(\"Versions\")\n", "print(\"NumPy: \", np.__version__)\n", "print(\"pandas: \", pd.__version__)\n", "print(\"Delicatessen: \", delicatessen.__version__)" ] }, { "cell_type": "markdown", "id": "39733cb2", "metadata": {}, "source": [ "The following data is used for the first and second cases." ] }, { "cell_type": "code", "execution_count": 2, "id": "47ddfd6c", "metadata": {}, "outputs": [], "source": [ "# From Table 1\n", "d = pd.DataFrame()\n", "d['X'] = [0, 0, 0, 0, 1, 1, 1, 1] # X values\n", "d['W'] = [0, 0, 1, 1, 0, 0, 1, 1] # W values\n", "d['Y'] = [0, 1, 0, 1, 0, 1, 0, 1] # Y values\n", "d['n'] = [496, 74, 113, 25, 85, 15, 15, 3] # Counts\n", "d['intercept'] = 1 # Intercept term (always 1)\n", "\n", "# Expanding rows by n\n", "d = pd.DataFrame(np.repeat(d.values, # Converting tabled data\n", " d['n'], axis=0), # ... by replicating counts\n", " columns=d.columns) # ... into rows for each X,W,Y\n", "d = d[['intercept', 'X', 'W', 'Y']].copy() # Dropping extra column" ] }, { "cell_type": "markdown", "id": "4324be23", "metadata": {}, "source": [ "## Example 1: Logistic Regression\n", "\n", "For the first example, we fit a logistic regression model for the variable $Y$ given $X,W$. " ] }, { "cell_type": "code", "execution_count": 3, "id": "13a73472", "metadata": {}, "outputs": [], "source": [ "# Extracting arrays for easier coding later on\n", "X = np.asarray(d[['intercept', 'X', 'W']]) # Design matrix for regression\n", "y = np.asarray(d['Y']) # Outcome in regression" ] }, { "cell_type": "markdown", "id": "08cfae8f", "metadata": {}, "source": [ "For estimation, we can use the built-in estimating functions for select canonical regression, ``ee_regression``" ] }, { "cell_type": "code", "execution_count": 4, "id": "046f1c7a", "metadata": {}, "outputs": [], "source": [ "def psi(theta):\n", " return ee_regression(theta=theta, \n", " y=y, X=X, \n", " model='logistic')" ] }, { "cell_type": "code", "execution_count": 5, "id": "444f32ef", "metadata": {}, "outputs": [], "source": [ "estr = MEstimator(psi, init=[0, 0, 0,])\n", "estr.estimate()" ] }, { "cell_type": "code", "execution_count": 6, "id": "c8a01022", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CoefLCLUCL
Param
beta_0-1.89-2.13-1.66
beta_10.12-0.430.67
beta_20.36-0.110.83
\n", "
" ], "text/plain": [ " Coef LCL UCL\n", "Param \n", "beta_0 -1.89 -2.13 -1.66\n", "beta_1 0.12 -0.43 0.67\n", "beta_2 0.36 -0.11 0.83" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Formatting results into a nice table\n", "result = pd.DataFrame()\n", "result['Param'] = ['beta_0', 'beta_1', 'beta_2']\n", "result['Coef'] = estr.theta\n", "ci = estr.confidence_intervals()\n", "result['LCL'] = ci[:, 0]\n", "result['UCL'] = ci[:, 1]\n", "result.set_index('Param').round(2)" ] }, { "cell_type": "markdown", "id": "a8ea3739", "metadata": {}, "source": [ "These results match those reported in the paper. They also match the results if you were to fit this moel using `statsmodels` GLM instead.\n", "\n", "## Example 2: Estimating the marginal risk difference\n", "\n", "In this example, we build on the previous. Using a logistic model, we can generate predictions and use those predictions to estimate the marginal risk difference by $X$ via g-computation. Alternatively, we can use a logistic model to estimate propensity score and then estimate the marginal risk difference with inverse probability weighting. \n", "\n", "First, let's apply g-computation" ] }, { "cell_type": "code", "execution_count": 7, "id": "6b493f2a", "metadata": {}, "outputs": [], "source": [ "# Copies of data with policies applied\n", "d1 = d.copy()\n", "d1['X'] = 1\n", "X1 = np.asarray(d1[['intercept', 'X', 'W']])\n", "d0 = d.copy()\n", "d0['X'] = 0\n", "X0 = np.asarray(d0[['intercept', 'X', 'W']])" ] }, { "cell_type": "code", "execution_count": 8, "id": "e9fce370", "metadata": {}, "outputs": [], "source": [ "def psi(theta):\n", " # Dividing parameters into corresponding parts and labels from slides\n", " beta = theta[0:3] # Logistic model coefficients\n", " mu0, mu1 = theta[3], theta[4] # Causal risks\n", " delta1 = theta[5] # Causal contrast\n", "\n", " # Logistic regression model for outcome\n", " ee_logit = ee_regression(theta=beta,\n", " y=y, X=X,\n", " model='logistic')\n", "\n", " # Transforming logistic model coefficients into causal parameters\n", " y0_hat = inverse_logit(np.dot(X0, beta)) # Prediction under a=0\n", " y1_hat = inverse_logit(np.dot(X1, beta)) # Prediction under a=1\n", "\n", " # Estimating function for causal risk under a=1\n", " ee_r1 = y1_hat - mu1 # Simple mean\n", "\n", " # Estimating function for causal risk under a=0\n", " ee_r0 = y0_hat - mu0 # Simple mean\n", " \n", " # Estimating function for causal risk difference\n", " ee_rd = np.ones(d.shape[0])*((mu1 - mu0) - delta1)\n", "\n", " # Returning stacked estimating functions in order of parameters\n", " return np.vstack([ee_logit, # EF of logistic model\n", " ee_r0, # EF of causal risk a=0\n", " ee_r1, # EF of causal risk a=1\n", " ee_rd]) # EF of causal contrast" ] }, { "cell_type": "code", "execution_count": 9, "id": "0ea2e8bb", "metadata": {}, "outputs": [], "source": [ "estr = MEstimator(psi, init=[0, 0, 0, 0.5, 0.5, 0])\n", "estr.estimate()" ] }, { "cell_type": "code", "execution_count": 10, "id": "3c667d96", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CoefLCLUCL
Param
beta_0-1.89-2.13-1.66
beta_10.12-0.430.67
beta_20.36-0.110.83
mu_00.140.110.17
mu_10.150.090.22
delta0.01-0.060.09
\n", "
" ], "text/plain": [ " Coef LCL UCL\n", "Param \n", "beta_0 -1.89 -2.13 -1.66\n", "beta_1 0.12 -0.43 0.67\n", "beta_2 0.36 -0.11 0.83\n", "mu_0 0.14 0.11 0.17\n", "mu_1 0.15 0.09 0.22\n", "delta 0.01 -0.06 0.09" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Formatting results into a nice table\n", "result = pd.DataFrame()\n", "result['Param'] = ['beta_0', 'beta_1', 'beta_2', 'mu_0', 'mu_1', 'delta']\n", "result['Coef'] = estr.theta\n", "ci = estr.confidence_intervals()\n", "result['LCL'] = ci[:, 0]\n", "result['UCL'] = ci[:, 1]\n", "result.set_index('Param').round(2)" ] }, { "cell_type": "markdown", "id": "86fe06a6", "metadata": {}, "source": [ "These results match those reported in the publication (some differences in rounding may appear due to the chosen root-finding algorithm).\n", "\n", "Now consider estimating the marginal risk difference using inverse probability weighting. The following code implements this estimator by-hand" ] }, { "cell_type": "code", "execution_count": 11, "id": "12e678a7", "metadata": {}, "outputs": [], "source": [ "a = np.asarray(d['X'])\n", "W = np.asarray(d[['intercept', 'W']])" ] }, { "cell_type": "code", "execution_count": 12, "id": "e2906199", "metadata": {}, "outputs": [], "source": [ "def psi(theta):\n", " # Dividing parameters into corresponding parts and labels from slides\n", " alpha = theta[0:2] # Logistic model coefficients\n", " mu0, mu1 = theta[2], theta[3] # Causal risks\n", " delta1 = theta[4] # Causal contrast\n", "\n", " # Logistic regression model for propensity score\n", " ee_logit = ee_regression(theta=alpha, # Regression model\n", " y=a, # ... for exposure\n", " X=W, # ... given confounders\n", " model='logistic') # ... logistic model\n", "\n", " # Transforming logistic model coefficients into causal parameters\n", " pscore = inverse_logit(np.dot(W, alpha)) # Propensity score\n", " wt = d['X']/pscore + (1-d['X'])/(1-pscore) # Corresponding weights\n", "\n", " # Estimating function for causal risk under a=1\n", " ee_r1 = d['X']*d['Y']*wt - mu1 # Weighted conditional mean\n", " \n", " # Estimating function for causal risk under a=0\n", " ee_r0 = (1-d['X'])*d['Y']*wt - mu0 # Weighted conditional mean\n", " \n", " # Estimating function for causal risk difference\n", " ee_rd = np.ones(d.shape[0])*((mu1 - mu0) - delta1)\n", "\n", " # Returning stacked estimating functions in order of parameters\n", " return np.vstack([ee_logit, # EF of logistic model\n", " ee_r0, # EF of causal risk a=0\n", " ee_r1, # EF of causal risk a=1\n", " ee_rd]) # EF of causal contrast" ] }, { "cell_type": "code", "execution_count": 13, "id": "95c791b2", "metadata": {}, "outputs": [], "source": [ "estr = MEstimator(psi, init=[0, 0, 0.5, 0.5, 0])\n", "estr.estimate()" ] }, { "cell_type": "code", "execution_count": 14, "id": "e53cb2ca", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CoefLCLUCL
Param
alpha_0-1.74-1.95-1.53
alpha_1-0.30-0.830.24
mu_00.140.110.17
mu_10.150.090.22
delta0.01-0.060.08
\n", "
" ], "text/plain": [ " Coef LCL UCL\n", "Param \n", "alpha_0 -1.74 -1.95 -1.53\n", "alpha_1 -0.30 -0.83 0.24\n", "mu_0 0.14 0.11 0.17\n", "mu_1 0.15 0.09 0.22\n", "delta 0.01 -0.06 0.08" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Formatting results into a nice table\n", "result = pd.DataFrame()\n", "result['Param'] = ['alpha_0', 'alpha_1', 'mu_0', 'mu_1', 'delta']\n", "result['Coef'] = estr.theta\n", "ci = estr.confidence_intervals()\n", "result['LCL'] = ci[:, 0]\n", "result['UCL'] = ci[:, 1]\n", "result.set_index('Param').round(2)" ] }, { "cell_type": "markdown", "id": "c81ea84b", "metadata": {}, "source": [ "Again, these results match those reported in the publication\n", "\n", "## Example 3: Outcome misclassification\n", "\n", "For the final example, we will correct for mismeasurement of the outcome variable with the Rogan-Gladen estimator for the proportion. To estimate sensitivity and specificity for the outcome, we have access to an external data set (denoted by $R=0$). The following code loads the corresponding data form the paper and then applies the Rogan-Gladen estimator using the built-in estimating function, `ee_rogan_gladen`" ] }, { "cell_type": "code", "execution_count": 15, "id": "ae7d0a9e", "metadata": {}, "outputs": [], "source": [ "# Loading in data for the fusion example\n", "d = pd.DataFrame()\n", "d['R'] = [1, 1, 0, 0, 0, 0] # R or population indicator\n", "d['Y'] = [0, 0, 1, 1, 0, 0] # True outcome\n", "d['W'] = [1, 0, 1, 0, 1, 0] # Measured outcome\n", "d['n'] = [680, 270, 204, 38, 18, 71] # Counts\n", "d['intercept'] = 1 # Intercept is always 1\n", "\n", "# Expanding out data \n", "d = pd.DataFrame(np.repeat(d.values, d['n'], axis=0), # Expanding compact data frame\n", " columns=d.columns) # ... keeping column names\n", "d = d[['intercept', 'R', 'W', 'Y']].copy() # Dropping the n column\n", "\n", "# Converting to arrays to simplify process\n", "r = np.asarray(d['R'])\n", "w = np.asarray(d['W'])\n", "y = np.asarray(d['Y'])" ] }, { "cell_type": "code", "execution_count": 16, "id": "a89cb741", "metadata": {}, "outputs": [], "source": [ "def psi(theta):\n", " return ee_rogan_gladen(theta=theta, # Parameter of interest\n", " y=d['Y'], # ... correct measure\n", " y_star=d['W'], # ... mismeasure\n", " r=d['R']) # ... sample indicator" ] }, { "cell_type": "code", "execution_count": 17, "id": "a5a1df56", "metadata": {}, "outputs": [], "source": [ "estr = MEstimator(psi, init=[0.75, 0.75, 0.75, 0.75])\n", "estr.estimate()" ] }, { "cell_type": "code", "execution_count": 18, "id": "216cf37b", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CoefLCLUCL
Param
Corrected0.800.720.88
Mismeasured0.720.690.74
Sensitivity0.840.800.89
Specificity0.800.710.88
\n", "
" ], "text/plain": [ " Coef LCL UCL\n", "Param \n", "Corrected 0.80 0.72 0.88\n", "Mismeasured 0.72 0.69 0.74\n", "Sensitivity 0.84 0.80 0.89\n", "Specificity 0.80 0.71 0.88" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Formatting results into a nice table\n", "result = pd.DataFrame()\n", "result['Param'] = ['Corrected', 'Mismeasured', 'Sensitivity', 'Specificity']\n", "result['Coef'] = estr.theta\n", "ci = estr.confidence_intervals()\n", "result['LCL'] = ci[:, 0]\n", "result['UCL'] = ci[:, 1]\n", "result.set_index('Param').round(2)" ] }, { "cell_type": "markdown", "id": "fffd12f1", "metadata": {}, "source": [ "These results match those described in Ross et al.\n", "\n", "Here, we replicated the introduction to M-estimation tutorial provided in Ross et al. The basic ideas illustrated in that paper and replicated here can be extended in numerous directions, as showcased in the other applied examples provided.\n", "\n", "## References\n", "\n", "Ross RK, Zivich PN, Stringer JS, & Cole SR. (2024). M-estimation for common epidemiological measures: introduction and applied examples. *International Journal of Epidemiology*, 53(2), dyae030." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.7" } }, "nbformat": 4, "nbformat_minor": 5 }