{ "cells": [ { "cell_type": "markdown", "id": "32cacaf6-e57c-4f40-8f55-5d2f409a99bd", "metadata": {}, "source": [ "# Cole et al. (2022): Addressing Missing Outcome Data\n", "\n", "Cole et al. (2022) reviewed a way to address missing data with several different missing data mechanisms. This approach is commonly referred to as direct imputation or g-computation. This was done in the context of a randomized trial on 17P injection for the prevention of preterm birth. All data was described in a table in the paper, so there is no need to track down any data set. \n", "\n", "Here, we review selected examples and how g-computation for missing data can be implemented using `delicatessen`.\n", "\n", "## Setup" ] }, { "cell_type": "code", "execution_count": 1, "id": "e012b18f-b25d-4323-872a-09c9a92c13c1", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Versions\n", "NumPy: 2.3.5\n", "SciPy: 1.16.3\n", "pandas: 2.3.3\n", "Matplotlib: 3.10.8\n", "Delicatessen: 4.1\n" ] } ], "source": [ "import numpy as np\n", "import scipy as sp\n", "import pandas as pd\n", "import matplotlib as mpl\n", "import matplotlib.pyplot as plt\n", "\n", "import delicatessen as deli\n", "from delicatessen import MEstimator\n", "from delicatessen.estimating_equations import ee_regression, ee_glm\n", "from delicatessen.utilities import inverse_logit\n", "\n", "print(\"Versions\")\n", "print(\"NumPy: \", np.__version__)\n", "print(\"SciPy: \", sp.__version__)\n", "print(\"pandas: \", pd.__version__)\n", "print(\"Matplotlib: \", mpl.__version__)\n", "print(\"Delicatessen:\", deli.__version__)" ] }, { "cell_type": "code", "execution_count": 2, "id": "e25be103-9934-40e2-b19a-90be162effa5", "metadata": {}, "outputs": [], "source": [ "d = pd.DataFrame()\n", "d['short_cervix'] = [0]*215 + [0]*222 + [1]*186 + [1]*177\n", "d['17P'] = [0]*215 + [1]*222 + [0]*186 + [1]*177\n", "d['placebo'] = 1 - d['17P']\n", "d['preterm0'] = ([1]*15 + [0]*(215-15) + [1]*13 + [0]*(222-13) \n", " + [1]*21 + [0]*(186-21) + [1]*23 + [0]*(177-23))\n", "d['preterm1'] = ([1]*11 + [0]*(161-11) + [np.nan]*54\n", " + [1]*10 + [0]*(167-10) + [np.nan]*55\n", " + [1]*16 + [0]*(140-16) + [np.nan]*46\n", " + [1]*17 + [0]*(133-17) + [np.nan]*44)\n", "d['preterm2'] = ([1]*8 + [0]*(108-8) + [np.nan]*107\n", " + [1]*13 + [0]*(222-13) + [np.nan]*0\n", " + [1]*21 + [0]*(186-21) + [np.nan]*0\n", " + [1]*12 + [0]*(89-12) + [np.nan]*88)\n", "d['preterm3'] = ([1]*15 + [0]*(215-15) + [np.nan]*0\n", " + [1]*13 + [0]*(222-13) + [np.nan]*0\n", " + [1]*21 + [0]*(186-21) + [np.nan]*0\n", " + [1]*0 + [0]*0 + [np.nan]*177)\n", "d['preterm4'] = ([1]*15 + [0]*(100-15) + [np.nan]*115\n", " + [1]*7 + [0]*(216-7) + [np.nan]*6\n", " + [1]*21 + [0]*(186-21) + [np.nan]*0\n", " + [1]*12 + [0]*(81-12) + [np.nan]*96)\n", "d['C'] = 1" ] }, { "cell_type": "markdown", "id": "98c93fbf-477d-4ca3-8011-0c3d33c6254b", "metadata": {}, "source": [ "To start, we will estimate the risk ratio using the complete outcome data (`preterm0`). \n", "\n", "The first estimating function computes the risk ratio by computing each risk and then finding the ratio (or the difference on the log-scale)" ] }, { "cell_type": "code", "execution_count": 3, "id": "327b3537-1cad-464d-8794-a9ffdb70d03d", "metadata": {}, "outputs": [], "source": [ "def psi_rr(theta):\n", " a = np.asarray(d['placebo'])\n", " y = np.asarray(d['preterm0'])\n", " ee_r1 = a*(y - theta[1])\n", " ee_r0 = (1-a)*(y - theta[2])\n", " ee_rr = np.ones(y.shape[0]) * (np.log(theta[1]) - np.log(theta[2]) - theta[0])\n", " return np.vstack([ee_rr, ee_r1, ee_r0])" ] }, { "cell_type": "code", "execution_count": 4, "id": "d61c1845-f167-43b0-8e24-04d54c759be2", "metadata": {}, "outputs": [], "source": [ "estr = MEstimator(psi_rr, init=[0., 0.5, 0.5])\n", "estr.estimate()\n", "ci = estr.confidence_intervals()" ] }, { "cell_type": "code", "execution_count": 5, "id": "d678629a-8a62-408d-ad1a-53aa95e6cfba", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "RR: 0.9950124688279302\n", "95% CI: [0.64038277 1.54602818]\n" ] } ], "source": [ "print(\"RR: \", np.exp(estr.theta[0]))\n", "print(\"95% CI:\", np.exp(ci[0, :]))" ] }, { "cell_type": "markdown", "id": "0ca91bb2-e35d-4151-9ee1-d272fe73a55e", "metadata": {}, "source": [ "In the paper, the risk ratio is estimated using a log-binomial model. We can also do this easily via the `ee_glm` built-in estimating function. One caution is that the log-binomial can be a bit finicky with the starting values. In particular, we want to provide a good starting value for the intercept term (can be done using the mean of $Y$ in the data)." ] }, { "cell_type": "code", "execution_count": 6, "id": "8ffac7e7-8d05-404e-a44d-ea64de90d621", "metadata": {}, "outputs": [], "source": [ "def psi_lb(theta):\n", " return ee_glm(theta=theta, X=d[['C', 'placebo']], y=d['preterm0'], \n", " distribution='binomial', link='log')" ] }, { "cell_type": "code", "execution_count": 7, "id": "6604ab15-96e1-47a9-a524-b5d5aa470c41", "metadata": {}, "outputs": [], "source": [ "estr = MEstimator(psi_lb, init=[np.log(0.08), 0.])\n", "estr.estimate()\n", "ci = estr.confidence_intervals()" ] }, { "cell_type": "code", "execution_count": 8, "id": "4ff1f07c-de41-4da7-9fed-96fc8cc65533", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "RR: 0.9950124688279302\n", "95% CI: [0.64038273 1.54602829]\n" ] } ], "source": [ "print(\"RR: \", np.exp(estr.theta[1]))\n", "print(\"95% CI:\", np.exp(ci[1, :]))" ] }, { "cell_type": "markdown", "id": "31d3f96c-777b-49fc-863c-cf2491c9d5cd", "metadata": {}, "source": [ "which give the same result quite far into the shown decimal places (some differences would be expected further out due to floating point errors and tolerances). Having established the best case, let's turn our attention to the missing data examples.\n", "\n", "## Scenario A\n", "\n", "The first scenario induces non-informative missing outcome data (i.e., missing completely at random). Here, we expect a complete-case analysis and g-computation to both be unbiased. The following is the complete-case estimating function and the g-computation estimating function" ] }, { "cell_type": "code", "execution_count": 9, "id": "8951e885-d4d5-4716-895a-1e4aee92724a", "metadata": {}, "outputs": [], "source": [ "a = np.asarray(d['placebo'])\n", "r = np.where(np.isnan(d['preterm1']), 0, 1)\n", "y_no_nan = np.nan_to_num(d['preterm1'], nan=-999)" ] }, { "cell_type": "code", "execution_count": 10, "id": "b6b20610-9cba-4832-ab1b-c0cba087d6bd", "metadata": {}, "outputs": [], "source": [ "def psi_cc(theta):\n", " return ee_glm(theta=theta, X=d[['C', 'placebo']], y=y_no_nan, \n", " distribution='binomial', link='log', weights=r)" ] }, { "cell_type": "code", "execution_count": 11, "id": "2bff60ac-85ad-422d-93a0-e68c8131574a", "metadata": {}, "outputs": [], "source": [ "def psi_gcomp(theta):\n", " beta1 = theta[3:5]\n", " beta0 = theta[5:]\n", " X = np.asarray(d[['C', 'short_cervix']])\n", "\n", " # Outcome nuisance model for 17P=1\n", " ee_reg1 = ee_regression(theta=beta1, X=X,\n", " y=y_no_nan, model='logistic')\n", " ee_reg1 = ee_reg1 * r * a # Only those with observed Y contribute\n", "\n", " # Outcome nuisance model for 17P=0\n", " ee_reg0 = ee_regression(theta=beta0, X=X,\n", " y=y_no_nan, model='logistic')\n", " ee_reg0 = ee_reg0 * r * (1-a) # Only those with observed Y contribute\n", "\n", " # Computing risk ratios\n", " y1hat = inverse_logit(np.dot(X, beta1))\n", " y0hat = inverse_logit(np.dot(X, beta0))\n", " ee_r1 = y1hat - theta[1]\n", " ee_r0 = y0hat - theta[2]\n", " ee_rr = np.ones(X.shape[0]) * (np.log(theta[1]) - np.log(theta[2]) - theta[0])\n", "\n", " return np.vstack([ee_rr, ee_r1, ee_r0, ee_reg1, ee_reg0])" ] }, { "cell_type": "markdown", "id": "31724b29-2386-41ce-a9ff-17d51c598f8c", "metadata": {}, "source": [ "Note that g-computation involves fitting several regression models for $Y$ among the complete-cases. This process is what allows us to correct for missing data in informative scenarios. In settings with non-informative missing, we also expect the g-computation results to be more efficient (narrower confidence intervals). For further details on the estimator itself, see the corresponding publication.\n", "\n", "The following computes the complete-case results" ] }, { "cell_type": "code", "execution_count": 12, "id": "e4f799cc-6d0c-4fa4-ab40-b265c2d62b12", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "RR: 0.9966777408637874\n", "95% CI: [0.59915561 1.65794412]\n" ] } ], "source": [ "estr = MEstimator(psi_cc, init=[np.log(0.09), 0.])\n", "estr.estimate()\n", "ci = estr.confidence_intervals()\n", "\n", "print(\"RR: \", np.exp(estr.theta[1]))\n", "print(\"95% CI:\", np.exp(ci[1, :]))" ] }, { "cell_type": "markdown", "id": "25970e68-fdba-4c2d-a73c-14358f0dae15", "metadata": {}, "source": [ "As we would expect in this scenario, the complete-case results are similar to the full-data results. Note that the confidence intervals are wider than the full-data results." ] }, { "cell_type": "code", "execution_count": 13, "id": "7da16f61-e1d2-4a1b-9a03-3d7e0fcd6c59", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "RR: 0.9831422283742006\n", "95% CI: [0.59247203 1.63141648]\n" ] } ], "source": [ "init_vals = [0., 0.5, 0.5] + [0., ]*2 + [0., ]*2\n", "estr = MEstimator(psi_gcomp, init=init_vals)\n", "estr.estimate()\n", "ci = estr.confidence_intervals()\n", "\n", "print(\"RR: \", np.exp(estr.theta[0]))\n", "print(\"95% CI:\", np.exp(ci[0, :]))" ] }, { "cell_type": "markdown", "id": "2c230e60-4a25-43c3-bbe7-c3cce742da83", "metadata": {}, "source": [ "The g-computation results are similar. Here, the confidence intervals are quite similar.\n", "\n", "## Scenario B\n", "\n", "In scenario B, we consider informative missing data. Specifically, cervix length is related to missingness. Here, a complete-case analysis will be biased, but we would expect g-computation to correct for this bias. The following apply the estimating functions to this second scenario" ] }, { "cell_type": "code", "execution_count": 14, "id": "bd779fcd-5735-44fd-8af3-7ef6ce54dead", "metadata": {}, "outputs": [], "source": [ "a = np.asarray(d['placebo'])\n", "r = np.where(np.isnan(d['preterm2']), 0, 1)\n", "y_no_nan = np.nan_to_num(d['preterm2'], nan=-999)" ] }, { "cell_type": "code", "execution_count": 15, "id": "c9ed8962-5a37-46df-904e-fafec82905d0", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "RR: 1.2270748299319727\n", "95% CI: [0.73641679 2.04464736]\n" ] } ], "source": [ "estr = MEstimator(psi_cc, init=[np.log(0.09), 0.])\n", "estr.estimate()\n", "ci = estr.confidence_intervals()\n", "\n", "print(\"RR: \", np.exp(estr.theta[1]))\n", "print(\"95% CI:\", np.exp(ci[1, :]))" ] }, { "cell_type": "code", "execution_count": 16, "id": "4995acd7-04ea-4e82-8b34-39eed6c80579", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "RR: 0.9841727212256232\n", "95% CI: [0.5745667 1.68578503]\n" ] } ], "source": [ "init_vals = [0., 0.5, 0.5] + [0., ]*2 + [0., ]*2\n", "estr = MEstimator(psi_gcomp, init=init_vals)\n", "estr.estimate()\n", "ci = estr.confidence_intervals()\n", "\n", "print(\"RR: \", np.exp(estr.theta[0]))\n", "print(\"95% CI:\", np.exp(ci[0, :]))" ] }, { "cell_type": "markdown", "id": "81de068f-aca2-424a-84cb-202a329899f2", "metadata": {}, "source": [ "Here, we see the complete-case analysis indicates that there is a relationship between 17P and preterm birth. However, this (correctly) disappears when we adjust for missingness by cervix length. \n", "\n", "Unlike in the corresponding paper, we are able to avoid bootstrapping to estimate the variance using the sandwich instead. This makes this code much more efficient. This computational advantage of M-estimation is noted in the Discussion of Cole et al. (2022), but has not been shown elsewhere. Other scenarios are replicated in the paper but are not done here. However, the table data is provided above as `preterm3` and `preterm4`.\n", "\n", "## References\n", "\n", "Cole, S. R., Zivich, P. N., Edwards, J. K., Ross, R. K., Shook-Sa, B. E., Price, J. T., & Stringer, J. S. (2023). Missing outcome data in epidemiologic studies. *American Journal of Epidemiology*, 192(1), 6-10." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.7" } }, "nbformat": 4, "nbformat_minor": 5 }