Cole et al. (2023): Fusion Designs

The following is a replication of the cases described in Cole et al. (2023) and the rejoinder. The original paper considered fusion study designs, whereby multiple data sources are used together to address a single question. Examples are provided in the context of transporting a proportion, measurement error, and the joint occurrence of measurement error and transporting. In that paper, M-estimators are introduced and proposed as a general solution to estimation with fusion designs. Importantly, the empirical sandwich variance estimator allows for uncertainty estimation in this setting, unlike standard methods or approximations. Further, the empirical sandwich variance estimator is less computationally demanding than competing methods that provide appropriate statistical inference (e.g., bootstrapping, Monte Carlo methods).

Here, we recreate the cases described in the paper. For finer details on the approaches, see the original publications.

Setup

[1]:
import numpy as np
import pandas as pd

import delicatessen as deli
from delicatessen import MEstimator
from delicatessen.estimating_equations import ee_regression, ee_rogan_gladen
from delicatessen.utilities import inverse_logit

print("Versions")
print("NumPy:        ", np.__version__)
print("pandas:       ", pd.__version__)
print("Delicatessen: ", deli.__version__)
Versions
NumPy:         2.3.5
pandas:        2.3.3
Delicatessen:  4.1

Case 1: Transporting the proportion

The goal is to estimate \(\Pr(Y=1)\) in the target population. However, we only have the outcome measured in a sample of a secondary population (\(S=1\)). To estimate the proportion in the target population, we will transport from the secondary population to the target population conditional on a covariate, \(W\). For estimation, we use inverse odds of sampling weights.

First, we build the data set provided in Cole et al.

[2]:
d = pd.DataFrame()
d['Y'] = [0, 1, 0, 1] + [np.nan, np.nan]
d['W'] = [0, 0, 1, 1] + [0, 1]
d['n'] = [266, 67, 400, 267] + [333, 167]
d['S'] = [0, 0, 0, 0] + [1, 1]

# Expanding data set out
d = pd.DataFrame(np.repeat(d.values, d['n'], axis=0), columns=d.columns)
d['C'] = 1

# Setting up for delicatessen
y_no_nan = np.asarray(d['Y'].fillna(-999))
s = np.asarray(d['S'])
W = np.asarray(d[['C', 'W']])

For estimation, we stack the naive proportion, inverse odds weighted proportion, and sampling models together.

[3]:
def psi(theta):
    # Subsetting the input parameters
    mu_w = theta[0]    # Weighted proportion
    mu_n = theta[1]    # Naive proportion
    beta = theta[2:]   # Sampling model parameters

    # Sampling model for transporting
    ee_sm = ee_regression(beta,
                          X=W, y=d['S'],
                          model='logistic')

    # Calculating inverse odds of sampling weights
    pi_s = inverse_logit(np.dot(W, beta))
    iosw = (1-s) * pi_s / (1-pi_s)

    # Weighted mean
    ee_wprop = iosw * (y_no_nan - mu_w)

    # Naive mean
    ee_nprop = (1-s) * (y_no_nan - mu_n)

    # Returning the stacked estimating equations
    return np.vstack([ee_wprop, ee_nprop, ee_sm])
[4]:
estr = MEstimator(psi, init=[0.5, 0.5, 0., 0.])
estr.estimate()
ci = estr.confidence_intervals()
[5]:
fmt = '{:.2f}, (95% CI: {:.2f}, {:.2f})'
print("Weighted proportion:", fmt.format(estr.theta[0], ci[0, 0], ci[0, 1]))
print("Naive proportion:   ", fmt.format(estr.theta[1], ci[1, 0], ci[1, 1]))
Weighted proportion: 0.27, (95% CI: 0.24, 0.30)
Naive proportion:    0.33, (95% CI: 0.30, 0.36)

Case 2: Estimating a Misclassified Proportion

The goal is to use external sensitivity and specificity data to correct for measurement error in the main sample. We do this by using the Rogan-Gladen correction

\[\hat{\mu} = \frac{\hat{\mu}^* + \widehat{Sp} - 1}{\widehat{Se}+ \widehat{Sp} - 1}\]

where \(\mu\) is the corrected mean, \(\mu^*\) is the mismeasured mean, \(Se\) is the sensitivity, and \(Sp\) is the specificity. Hats indicate estimated quantities.

First, we build the data set described in Cole et al.

[6]:
# Compact data frame expression of data
d = pd.DataFrame()
d['R'] = [1, 1, 0, 0, 0, 0]
d['Y'] = [np.nan, np.nan, 1, 1, 0, 0]
d['Y_star'] = [1, 0, 1, 0, 1, 0]
d['n'] = [680, 270, 204, 38, 18, 71]

# Expanding data set out
d = pd.DataFrame(np.repeat(d.values, d['n'], axis=0),
                 columns=d.columns)
d = d[['R', 'Y_star', 'Y']].copy()

For estimation, we use the built-in estimating equation for the Rogan-Gladen correction, ee_rogan_gladen

[7]:
def psi(theta):
    return ee_rogan_gladen(theta=theta, y=d['Y'],
                           y_star=d['Y_star'], r=d['R'])
[8]:
estr = MEstimator(psi, init=[0.5, 0.5, .75, .75])
estr.estimate()
ci = estr.confidence_intervals()
[9]:
fmt = '{:.2f}, (95% CI: {:.2f}, {:.2f})'
print("Corrected proportion:", fmt.format(estr.theta[0], ci[0, 0], ci[0, 1]))
print("Naive proportion:    ", fmt.format(estr.theta[1], ci[1, 0], ci[1, 1]))
print("Sensitivity:         ", fmt.format(estr.theta[2], ci[2, 0], ci[2, 1]))
print("Specificity:         ", fmt.format(estr.theta[3], ci[3, 0], ci[3, 1]))
Corrected proportion: 0.80, (95% CI: 0.72, 0.88)
Naive proportion:     0.72, (95% CI: 0.69, 0.74)
Sensitivity:          0.84, (95% CI: 0.80, 0.89)
Specificity:          0.80, (95% CI: 0.71, 0.88)

These results match those provided in Cole et al.

Case 3: Estimating a Misclassified and Transported Proportion

In the rejoinder, Cole et al. combine the previous two examples. Specifically, we now transport the proportion from the main sample and then correct for measurement error.

First, we build the data provided in Cole et al.

[10]:
# Compact data frame expression of data
d = pd.DataFrame()
d['Y_star'] = [0, 1, 0, 1] + [np.nan, np.nan] + [1, 0, 1, 0]
d['Y'] = [np.nan, np.nan, np.nan, np.nan] + [np.nan, np.nan] + [1, 1, 0, 0]
d['W'] = [0, 0, 1, 1] + [0, 1] + [np.nan, np.nan, np.nan, np.nan]
d['n'] = [266, 67, 400, 267] + [333, 167] + [180, 20, 60, 240]
d['S'] = [1, 1, 1, 1] + [2, 2] + [3, 3, 3, 3]

# Expanding data set out
d = pd.DataFrame(np.repeat(d.values, d['n'], axis=0), columns=d.columns)
d['C'] = 1

# Some extra data processing
y_no_nan = np.asarray(d['Y'].fillna(-1))
ystar_no_nan = np.asarray(d['Y_star'].fillna(-1))
w_no_nan = np.asarray(d[['C', 'W']].fillna(-1))
s1 = np.where(d['S'] == 1, 1, 0)
s2 = np.where(d['S'] == 2, 1, 0)
s3 = np.where(d['S'] == 3, 1, 0)

Next, we program the estimating functions described in the main paper. This is simply a combination of the previous two examples.

[11]:
def psi(theta):
    # Subsetting the input parameters
    param = theta[:4]    # Measurement error parameters
    beta = theta[4:]     # Sampling model parameters

    # Sampling model for transporting
    ee_sm = ee_regression(beta,
                          X=w_no_nan,
                          y=s2,
                          model='logistic')
    ee_sm = ee_sm * (1 - s3)   # Only S=1 or S=2 contribute

    # Calculating inverse odds of sampling weights
    pi_s = inverse_logit(np.dot(w_no_nan, beta))
    # Note: iosw is the odds weight if S=1, zero if S=2 and 1 if S=3.
    #       So S=2 don't contribute to measurement error model, the
    #       naive mean among S=1 is reweighted appropriately, and S=3
    #       observations all contribute equally (we can't estimate
    #       weights for them since W was not measured)
    iosw = s1 * pi_s / (1-pi_s) + s3

    # Rogan-Gladen correction
    ee_rg = ee_rogan_gladen(param,
                            y=y_no_nan,
                            y_star=ystar_no_nan,
                            r=s1,
                            weights=iosw)
    ee_rg = ee_rg * (1 - s2)  # Only S=1 or S=3 contribute

    # Returning the stacked estimating equations
    return np.vstack([ee_rg, ee_sm])
[12]:
estr = MEstimator(psi, init=[0.5, 0.5, .75, .75, 0., 0.])
estr.estimate(solver='lm')
se = np.diag(estr.variance)**0.5
[13]:
fmt = '{:.3f} (SE={:.3f})'
print("Corrected weighted proportion:", fmt.format(estr.theta[0], se[0]))
print("Naive weighted proportion:    ", fmt.format(estr.theta[1], se[1]))
print("Sensitivity:                  ", fmt.format(estr.theta[2], se[2]))
print("Specificity:                  ", fmt.format(estr.theta[3], se[3]))
Corrected weighted proportion: 0.097 (SE=0.038)
Naive weighted proportion:     0.268 (SE=0.016)
Sensitivity:                   0.900 (SE=0.021)
Specificity:                   0.800 (SE=0.023)

These results match those reported in the rejoinder.

We replicated the cases from Cole et al. (2023) to showcase the basics of M-estimators for fusion designs and how they can be implemented in delicatessen.

References

Cole SR, Edwards JK, Breskin A, Rosin S, Zivich PN, Shook-Sa BE, & Hudgens MG. (2023). “Illustration of 2 Fusion Designs and Estimators”. American Journal of Epidemiology, 192(3), 467-474.

Cole SR, Edwards JK, Breskin A, Rosin S, Zivich PN, Shook-Sa BE, & Hudgens MG. (2023). “Rejoinder: Combining Information with Fusion Designs, and Roses by Other Names”. American Journal of Epidemiology, kwad084.