{ "cells": [ { "cell_type": "markdown", "id": "6f850c92", "metadata": {}, "source": [ "# Cole et al. (2023): Fusion Designs\n", "\n", "The following is a replication of the cases described in Cole et al. (2023) and the rejoinder. The original paper considered fusion study designs, whereby multiple data sources are used together to address a single question. Examples are provided in the context of transporting a proportion, measurement error, and the joint occurrence of measurement error and transporting. In that paper, M-estimators are introduced and proposed as a general solution to estimation with fusion designs. Importantly, the empirical sandwich variance estimator allows for uncertainty estimation in this setting, unlike standard methods or approximations. Further, the empirical sandwich variance estimator is less computationally demanding than competing methods that provide appropriate statistical inference (e.g., bootstrapping, Monte Carlo methods). \n", "\n", "Here, we recreate the cases described in the paper. For finer details on the approaches, see the original publications.\n", "\n", "## Setup" ] }, { "cell_type": "code", "execution_count": 1, "id": "75b0db4f", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Versions\n", "NumPy: 2.3.5\n", "pandas: 2.3.3\n", "Delicatessen: 4.1\n" ] } ], "source": [ "import numpy as np\n", "import pandas as pd\n", "\n", "import delicatessen as deli\n", "from delicatessen import MEstimator\n", "from delicatessen.estimating_equations import ee_regression, ee_rogan_gladen\n", "from delicatessen.utilities import inverse_logit\n", "\n", "print(\"Versions\")\n", "print(\"NumPy: \", np.__version__)\n", "print(\"pandas: \", pd.__version__)\n", "print(\"Delicatessen: \", deli.__version__)" ] }, { "cell_type": "markdown", "id": "deaf49e4", "metadata": {}, "source": [ "## Case 1: Transporting the proportion\n", "\n", "The goal is to estimate $\\Pr(Y=1)$ in the target population. However, we only have the outcome measured in a sample of a secondary population ($S=1$). To estimate the proportion in the target population, we will transport from the secondary population to the target population conditional on a covariate, $W$. For estimation, we use inverse odds of sampling weights.\n", "\n", "First, we build the data set provided in Cole et al." ] }, { "cell_type": "code", "execution_count": 2, "id": "25c89f4f", "metadata": {}, "outputs": [], "source": [ "d = pd.DataFrame()\n", "d['Y'] = [0, 1, 0, 1] + [np.nan, np.nan]\n", "d['W'] = [0, 0, 1, 1] + [0, 1]\n", "d['n'] = [266, 67, 400, 267] + [333, 167]\n", "d['S'] = [0, 0, 0, 0] + [1, 1]\n", "\n", "# Expanding data set out \n", "d = pd.DataFrame(np.repeat(d.values, d['n'], axis=0), columns=d.columns)\n", "d['C'] = 1\n", "\n", "# Setting up for delicatessen\n", "y_no_nan = np.asarray(d['Y'].fillna(-999))\n", "s = np.asarray(d['S'])\n", "W = np.asarray(d[['C', 'W']])" ] }, { "cell_type": "markdown", "id": "dea2921e", "metadata": {}, "source": [ "For estimation, we stack the naive proportion, inverse odds weighted proportion, and sampling models together. " ] }, { "cell_type": "code", "execution_count": 3, "id": "0a6d6de5", "metadata": {}, "outputs": [], "source": [ "def psi(theta):\n", " # Subsetting the input parameters\n", " mu_w = theta[0] # Weighted proportion\n", " mu_n = theta[1] # Naive proportion\n", " beta = theta[2:] # Sampling model parameters\n", " \n", " # Sampling model for transporting\n", " ee_sm = ee_regression(beta, \n", " X=W, y=d['S'],\n", " model='logistic')\n", " \n", " # Calculating inverse odds of sampling weights\n", " pi_s = inverse_logit(np.dot(W, beta))\n", " iosw = (1-s) * pi_s / (1-pi_s)\n", " \n", " # Weighted mean\n", " ee_wprop = iosw * (y_no_nan - mu_w)\n", " \n", " # Naive mean\n", " ee_nprop = (1-s) * (y_no_nan - mu_n)\n", " \n", " # Returning the stacked estimating equations\n", " return np.vstack([ee_wprop, ee_nprop, ee_sm])" ] }, { "cell_type": "code", "execution_count": 4, "id": "cb3d81ae", "metadata": {}, "outputs": [], "source": [ "estr = MEstimator(psi, init=[0.5, 0.5, 0., 0.])\n", "estr.estimate()\n", "ci = estr.confidence_intervals()" ] }, { "cell_type": "code", "execution_count": 5, "id": "4729e9c4", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Weighted proportion: 0.27, (95% CI: 0.24, 0.30)\n", "Naive proportion: 0.33, (95% CI: 0.30, 0.36)\n" ] } ], "source": [ "fmt = '{:.2f}, (95% CI: {:.2f}, {:.2f})'\n", "print(\"Weighted proportion:\", fmt.format(estr.theta[0], ci[0, 0], ci[0, 1]))\n", "print(\"Naive proportion: \", fmt.format(estr.theta[1], ci[1, 0], ci[1, 1]))" ] }, { "cell_type": "markdown", "id": "00204705", "metadata": {}, "source": [ "\n", "## Case 2: Estimating a Misclassified Proportion\n", "\n", "The goal is to use external sensitivity and specificity data to correct for measurement error in the main sample. We do this by using the Rogan-Gladen correction\n", "$$ \\hat{\\mu} = \\frac{\\hat{\\mu}^* + \\widehat{Sp} - 1}{\\widehat{Se}+ \\widehat{Sp} - 1} $$\n", "where $\\mu$ is the corrected mean, $\\mu^*$ is the mismeasured mean, $Se$ is the sensitivity, and $Sp$ is the specificity. Hats indicate estimated quantities. \n", "\n", "First, we build the data set described in Cole et al." ] }, { "cell_type": "code", "execution_count": 6, "id": "3d7e564b", "metadata": {}, "outputs": [], "source": [ "# Compact data frame expression of data\n", "d = pd.DataFrame()\n", "d['R'] = [1, 1, 0, 0, 0, 0]\n", "d['Y'] = [np.nan, np.nan, 1, 1, 0, 0]\n", "d['Y_star'] = [1, 0, 1, 0, 1, 0]\n", "d['n'] = [680, 270, 204, 38, 18, 71]\n", "\n", "# Expanding data set out \n", "d = pd.DataFrame(np.repeat(d.values, d['n'], axis=0),\n", " columns=d.columns)\n", "d = d[['R', 'Y_star', 'Y']].copy()" ] }, { "cell_type": "markdown", "id": "bc742a54", "metadata": {}, "source": [ "For estimation, we use the built-in estimating equation for the Rogan-Gladen correction, `ee_rogan_gladen`" ] }, { "cell_type": "code", "execution_count": 7, "id": "8f811703", "metadata": {}, "outputs": [], "source": [ "def psi(theta):\n", " return ee_rogan_gladen(theta=theta, y=d['Y'],\n", " y_star=d['Y_star'], r=d['R'])" ] }, { "cell_type": "code", "execution_count": 8, "id": "5edd5380", "metadata": {}, "outputs": [], "source": [ "estr = MEstimator(psi, init=[0.5, 0.5, .75, .75])\n", "estr.estimate()\n", "ci = estr.confidence_intervals()" ] }, { "cell_type": "code", "execution_count": 9, "id": "f791d522", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Corrected proportion: 0.80, (95% CI: 0.72, 0.88)\n", "Naive proportion: 0.72, (95% CI: 0.69, 0.74)\n", "Sensitivity: 0.84, (95% CI: 0.80, 0.89)\n", "Specificity: 0.80, (95% CI: 0.71, 0.88)\n" ] } ], "source": [ "fmt = '{:.2f}, (95% CI: {:.2f}, {:.2f})'\n", "print(\"Corrected proportion:\", fmt.format(estr.theta[0], ci[0, 0], ci[0, 1]))\n", "print(\"Naive proportion: \", fmt.format(estr.theta[1], ci[1, 0], ci[1, 1]))\n", "print(\"Sensitivity: \", fmt.format(estr.theta[2], ci[2, 0], ci[2, 1]))\n", "print(\"Specificity: \", fmt.format(estr.theta[3], ci[3, 0], ci[3, 1]))" ] }, { "cell_type": "markdown", "id": "3e265cd7", "metadata": {}, "source": [ "These results match those provided in Cole et al.\n", "\n", "## Case 3: Estimating a Misclassified and Transported Proportion\n", "\n", "In the rejoinder, Cole et al. combine the previous two examples. Specifically, we now transport the proportion from the main sample and then correct for measurement error. \n", "\n", "First, we build the data provided in Cole et al." ] }, { "cell_type": "code", "execution_count": 10, "id": "216d9211", "metadata": {}, "outputs": [], "source": [ "# Compact data frame expression of data\n", "d = pd.DataFrame()\n", "d['Y_star'] = [0, 1, 0, 1] + [np.nan, np.nan] + [1, 0, 1, 0]\n", "d['Y'] = [np.nan, np.nan, np.nan, np.nan] + [np.nan, np.nan] + [1, 1, 0, 0]\n", "d['W'] = [0, 0, 1, 1] + [0, 1] + [np.nan, np.nan, np.nan, np.nan]\n", "d['n'] = [266, 67, 400, 267] + [333, 167] + [180, 20, 60, 240]\n", "d['S'] = [1, 1, 1, 1] + [2, 2] + [3, 3, 3, 3]\n", "\n", "# Expanding data set out \n", "d = pd.DataFrame(np.repeat(d.values, d['n'], axis=0), columns=d.columns)\n", "d['C'] = 1\n", "\n", "# Some extra data processing \n", "y_no_nan = np.asarray(d['Y'].fillna(-1))\n", "ystar_no_nan = np.asarray(d['Y_star'].fillna(-1))\n", "w_no_nan = np.asarray(d[['C', 'W']].fillna(-1))\n", "s1 = np.where(d['S'] == 1, 1, 0)\n", "s2 = np.where(d['S'] == 2, 1, 0)\n", "s3 = np.where(d['S'] == 3, 1, 0)" ] }, { "cell_type": "markdown", "id": "b43f5317-a2e4-426e-a9ce-933bfdbdf170", "metadata": {}, "source": [ "Next, we program the estimating functions described in the main paper. This is simply a combination of the previous two examples." ] }, { "cell_type": "code", "execution_count": 11, "id": "bfc960d3", "metadata": {}, "outputs": [], "source": [ "def psi(theta):\n", " # Subsetting the input parameters\n", " param = theta[:4] # Measurement error parameters\n", " beta = theta[4:] # Sampling model parameters\n", "\n", " # Sampling model for transporting\n", " ee_sm = ee_regression(beta, \n", " X=w_no_nan, \n", " y=s2,\n", " model='logistic')\n", " ee_sm = ee_sm * (1 - s3) # Only S=1 or S=2 contribute\n", " \n", " # Calculating inverse odds of sampling weights\n", " pi_s = inverse_logit(np.dot(w_no_nan, beta))\n", " # Note: iosw is the odds weight if S=1, zero if S=2 and 1 if S=3.\n", " # So S=2 don't contribute to measurement error model, the\n", " # naive mean among S=1 is reweighted appropriately, and S=3\n", " # observations all contribute equally (we can't estimate\n", " # weights for them since W was not measured)\n", " iosw = s1 * pi_s / (1-pi_s) + s3\n", "\n", " # Rogan-Gladen correction\n", " ee_rg = ee_rogan_gladen(param,\n", " y=y_no_nan,\n", " y_star=ystar_no_nan,\n", " r=s1,\n", " weights=iosw)\n", " ee_rg = ee_rg * (1 - s2) # Only S=1 or S=3 contribute\n", " \n", " # Returning the stacked estimating equations\n", " return np.vstack([ee_rg, ee_sm])" ] }, { "cell_type": "code", "execution_count": 12, "id": "b907c38a", "metadata": {}, "outputs": [], "source": [ "estr = MEstimator(psi, init=[0.5, 0.5, .75, .75, 0., 0.])\n", "estr.estimate(solver='lm')\n", "se = np.diag(estr.variance)**0.5" ] }, { "cell_type": "code", "execution_count": 13, "id": "503cef27", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Corrected weighted proportion: 0.097 (SE=0.038)\n", "Naive weighted proportion: 0.268 (SE=0.016)\n", "Sensitivity: 0.900 (SE=0.021)\n", "Specificity: 0.800 (SE=0.023)\n" ] } ], "source": [ "fmt = '{:.3f} (SE={:.3f})'\n", "print(\"Corrected weighted proportion:\", fmt.format(estr.theta[0], se[0]))\n", "print(\"Naive weighted proportion: \", fmt.format(estr.theta[1], se[1]))\n", "print(\"Sensitivity: \", fmt.format(estr.theta[2], se[2]))\n", "print(\"Specificity: \", fmt.format(estr.theta[3], se[3]))" ] }, { "cell_type": "markdown", "id": "976ee485", "metadata": {}, "source": [ "These results match those reported in the rejoinder.\n", "\n", "We replicated the cases from Cole et al. (2023) to showcase the basics of M-estimators for fusion designs and how they can be implemented in `delicatessen`.\n", "\n", "## References\n", "\n", "Cole SR, Edwards JK, Breskin A, Rosin S, Zivich PN, Shook-Sa BE, & Hudgens MG. (2023). \"Illustration of 2 Fusion Designs and Estimators\". *American Journal of Epidemiology*, 192(3), 467-474.\n", "\n", "Cole SR, Edwards JK, Breskin A, Rosin S, Zivich PN, Shook-Sa BE, & Hudgens MG. (2023). \"Rejoinder: Combining Information with Fusion Designs, and Roses by Other Names\". *American Journal of Epidemiology*, kwad084." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.7" } }, "nbformat": 4, "nbformat_minor": 5 }