Cole et al. (2023): Fusion Designs

The following replicates the cases described in Cole et al. (2023) and its rejoinder. The original paper considered fusion study designs, in which multiple data sources are combined to address a single question. Examples are provided in the context of transporting a proportion, correcting for measurement error, and handling both simultaneously. The paper proposes M-estimators as a general solution to estimation in fusion designs. Importantly, the empirical sandwich variance estimator provides appropriate uncertainty estimation in this setting, unlike standard methods or approximations (such as the GEE trick for inverse probability weighting). Further, M-estimators are less computationally demanding than competing methods that do provide appropriate inference (e.g., bootstrapping, Monte Carlo methods).

Here, we recreate the cases described in the paper. For finer details, see the original publications.

Setup

[1]:
import numpy as np
import pandas as pd

import delicatessen
from delicatessen import MEstimator
from delicatessen.estimating_equations import ee_regression, ee_rogan_gladen
from delicatessen.utilities import inverse_logit

print("Versions")
print("NumPy:        ", np.__version__)
print("pandas:       ", pd.__version__)
print("Delicatessen: ", delicatessen.__version__)
Versions
NumPy:         1.25.2
pandas:        1.4.1
Delicatessen:  2.1

Case 1: Transporting the proportion

The goal is to estimate \(\Pr(Y=1)\) in the target population. However, the outcome is only measured in a sample of a secondary population (\(S=0\)); in the sample of the target population (\(S=1\)), only a covariate, \(W\), is measured. To estimate the proportion in the target population, we transport from the secondary population conditional on \(W\). For estimation, we use inverse odds of sampling weights.
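Concretely, each secondary-population observation is weighted by its estimated odds of membership in the target population. Writing \(\pi(W) = \Pr(S=1 \mid W)\), the weights implied by the code below are

\[IOSW_i = (1 - S_i) \frac{\pi(W_i)}{1 - \pi(W_i)}\]

so only \(S=0\) observations receive a nonzero weight.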

First, we build the data set described in Cole et al.

[2]:
d = pd.DataFrame()
d['Y'] = [0, 1, 0, 1] + [np.nan, np.nan]
d['W'] = [0, 0, 1, 1] + [0, 1]
d['n'] = [266, 67, 400, 267] + [333, 167]
d['S'] = [0, 0, 0, 0] + [1, 1]

# Expanding data set out
d = pd.DataFrame(np.repeat(d.values, d['n'], axis=0), columns=d.columns)
d['C'] = 1

# Setting up for delicatessen
y_no_nan = np.asarray(d['Y'].fillna(-999))
s = np.asarray(d['S'])
W = np.asarray(d[['C', 'W']])

For estimation, we stack the estimating functions for the inverse odds weighted proportion, the naive proportion, and the sampling model.
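As a sketch, with \(\theta = (\mu_w, \mu_n, \beta)\) and \(\pi(W_i) = \text{expit}(W_i^T \beta)\), the stacked estimating function implemented below is

\[\psi(Y_i, S_i, W_i; \theta) = \begin{pmatrix} (1 - S_i) \frac{\pi(W_i)}{1 - \pi(W_i)} (Y_i - \mu_w) \\ (1 - S_i)(Y_i - \mu_n) \\ \{S_i - \pi(W_i)\} W_i \end{pmatrix}\]

where the last row is the score function of the logistic sampling model.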

[3]:
def psi(theta):
    # Subsetting the input parameters
    mu_w = theta[0]    # Weighted proportion
    mu_n = theta[1]    # Naive proportion
    beta = theta[2:]   # Sampling model parameters

    # Sampling model for transporting
    ee_sm = ee_regression(beta,
                          X=W, y=d['S'],
                          model='logistic')

    # Calculating inverse odds of sampling weights
    pi_s = inverse_logit(np.dot(W, beta))
    iosw = (1-s) * pi_s / (1-pi_s)

    # Weighted mean
    ee_wprop = iosw * (y_no_nan - mu_w)

    # Naive mean
    ee_nprop = (1-s) * (y_no_nan - mu_n)

    # Returning the stacked estimating equations
    return np.vstack([ee_wprop, ee_nprop, ee_sm])
[4]:
estr = MEstimator(psi, init=[0.5, 0.5, 0., 0.])
estr.estimate()
ci = estr.confidence_intervals()
[5]:
fmt = '{:.2f}, 95% CI: {:.2f}, {:.2f}'
print("Weighted proportion:", fmt.format(estr.theta[0], ci[0, 0], ci[0, 1]))
print("Naive proportion:   ", fmt.format(estr.theta[1], ci[1, 0], ci[1, 1]))
fmt = "{:.3f}, {:.3f}"
print("Model coefficients: ", fmt.format(estr.theta[2], estr.theta[3]))
Weighted proportion: 0.27, 95% CI: 0.24, 0.30
Naive proportion:    0.33, 95% CI: 0.30, 0.36
Model coefficients:  -0.000, -1.385

Note that the numbers in this example differ from those reported in Cole et al. Because the corresponding data set was not provided in full in the publication, we instead adapt the data from case three here. Regardless, the estimating equation setup is the same between examples.
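As a quick sanity check (not part of the original paper), the weighted proportion can also be recovered by direct standardization: take the \(W\)-stratified outcome means in the secondary sample and average them over the target sample's observed \(W\) distribution.

# Sanity check (not from the paper): direct standardization over W
pr_y_w0 = 67 / (266 + 67)       # Pr(Y=1 | W=0, S=0)
pr_y_w1 = 267 / (400 + 267)     # Pr(Y=1 | W=1, S=0)
pr_w1 = 167 / (333 + 167)       # Pr(W=1 | S=1) in the target sample
mu_check = pr_y_w0 * (1 - pr_w1) + pr_y_w1 * pr_w1
print("{:.2f}".format(mu_check))  # 0.27, matching the weighted proportion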

Case 2: Estimating a Misclassified Proportion

The goal is to use external sensitivity and specificity data to correct for measurement error in the main sample. We do this using the Rogan-Gladen correction

\[\hat{\mu} = \frac{\hat{\mu}^* + \widehat{Sp} - 1}{\widehat{Se}+ \widehat{Sp} - 1}\]

where \(\mu\) is the corrected mean, \(\mu^*\) is the mismeasured mean, \(Se\) is the sensitivity, and \(Sp\) is the specificity. Hats indicate estimated quantities.
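For intuition, one way to write the corresponding stacked estimating functions for \(\theta = (\mu, \mu^*, Se, Sp)\), following the correction above (a sketch; see the delicatessen documentation for the exact form used by ee_rogan_gladen), is

\[\psi(Y_i, Y_i^*, R_i; \theta) = \begin{pmatrix} \mu (Se + Sp - 1) - (\mu^* + Sp - 1) \\ R_i (Y_i^* - \mu^*) \\ (1 - R_i) \, Y_i (Y_i^* - Se) \\ (1 - R_i)(1 - Y_i) \{(1 - Y_i^*) - Sp\} \end{pmatrix}\]

where \(R_i = 1\) indicates the main sample and \(R_i = 0\) the external validation data.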

First, we build the data set described in Cole et al.

[6]:
# Compact data frame expression of data
d = pd.DataFrame()
d['R'] = [1, 1, 0, 0, 0, 0]
d['Y'] = [np.nan, np.nan, 1, 1, 0, 0]
d['Y_star'] = [1, 0, 1, 0, 1, 0]
d['n'] = [680, 270, 204, 38, 18, 71]

# Expanding data set out
d = pd.DataFrame(np.repeat(d.values, d['n'], axis=0),
                 columns=d.columns)
d = d[['R', 'Y_star', 'Y']].copy()

For estimation, we use the built-in estimating equation for the Rogan-Gladen correction, ee_rogan_gladen.

[7]:
def psi(theta):
    return ee_rogan_gladen(theta=theta, y=d['Y'],
                           y_star=d['Y_star'], r=d['R'])
[8]:
estr = MEstimator(psi, init=[0.5, 0.5, .75, .75])
estr.estimate()
ci = estr.confidence_intervals()
[9]:
fmt = '{:.2f}, 95% CI: {:.2f}, {:.2f}'
print("Corrected proportion:", fmt.format(estr.theta[0], ci[0, 0], ci[0, 1]))
print("Naive proportion:    ", fmt.format(estr.theta[1], ci[1, 0], ci[1, 1]))
print("Sensitivity:         ", fmt.format(estr.theta[2], ci[2, 0], ci[2, 1]))
print("Specificity:         ", fmt.format(estr.theta[3], ci[3, 0], ci[3, 1]))
Corrected proportion: 0.80, 95% CI: 0.72, 0.88
Naive proportion:     0.72, 95% CI: 0.69, 0.74
Sensitivity:          0.84, 95% CI: 0.80, 0.89
Specificity:          0.80, 95% CI: 0.71, 0.88

These results match those provided in Cole et al.
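As a quick arithmetic check (not part of the original paper), the corrected point estimate can also be recovered in closed form from the raw counts.

# Closed-form check (not from the paper) using the raw counts above
mu_star = 680 / (680 + 270)     # naive proportion in the main sample
se = 204 / (204 + 38)           # sensitivity from the validation data
sp = 71 / (71 + 18)             # specificity from the validation data
mu_check = (mu_star + sp - 1) / (se + sp - 1)
print("{:.2f}".format(mu_check))  # 0.80, matching the corrected proportion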

Case 3: Estimating a Misclassified and Transported Proportion

In the rejoinder, Cole et al. combine the previous examples. Specifically, we transport the main sample and then correct for measurement error.
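As a sketch of the combined estimator implemented below, the Rogan-Gladen correction is applied to the transported (inverse odds weighted) mismeasured proportion, \(\hat{\mu}_w^*\), rather than the naive one:

\[\hat{\mu} = \frac{\hat{\mu}_w^* + \widehat{Sp} - 1}{\widehat{Se} + \widehat{Sp} - 1}\]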

First, we build the data set described in the rejoinder.

[10]:
# Compact data frame expression of data
d = pd.DataFrame()
d['Y_star'] = [0, 1, 0, 1] + [np.nan, np.nan] + [1, 0, 1, 0]
d['Y'] = [np.nan, np.nan, np.nan, np.nan] + [np.nan, np.nan] + [1, 1, 0, 0]
d['W'] = [0, 0, 1, 1] + [0, 1] + [np.nan, np.nan, np.nan, np.nan]
d['n'] = [266, 67, 400, 267] + [333, 167] + [180, 20, 60, 240]
d['S'] = [1, 1, 1, 1] + [2, 2] + [3, 3, 3, 3]

# Expanding data set out
d = pd.DataFrame(np.repeat(d.values, d['n'], axis=0), columns=d.columns)
d['C'] = 1

# Some extra data processing
y_no_nan = np.asarray(d['Y'].fillna(-1))
ystar_no_nan = np.asarray(d['Y_star'].fillna(-1))
w_no_nan = np.asarray(d[['C', 'W']].fillna(-1))
s1 = np.where(d['S'] == 1, 1, 0)
s2 = np.where(d['S'] == 2, 1, 0)
s3 = np.where(d['S'] == 3, 1, 0)
[11]:
def psi(theta):
    # Subsetting the input parameters
    param = theta[:4]    # Measurement error parameters
    beta = theta[4:]     # Sampling model parameters

    # Sampling model for transporting
    ee_sm = ee_regression(beta,
                          X=w_no_nan,
                          y=s2,
                          model='logistic')
    ee_sm = ee_sm * (1 - s3)   # Only S=1 or S=2 contribute

    # Calculating inverse odds of sampling weights
    pi_s = inverse_logit(np.dot(w_no_nan, beta))
    # Note: iosw is the odds weight if S=1, zero if S=2 and 1 if S=3.
    #       So S=2 don't contribute to measurement error model, the
    #       naive mean among S=1 is reweighted appropriately, and S=3
    #       observations all contribute equally (we can't estimate
    #       weights for them since W was not measured)
    iosw = s1 * pi_s / (1-pi_s) + s3

    # Rogan-Gladen correction
    ee_rg = ee_rogan_gladen(param,
                            y=y_no_nan,
                            y_star=ystar_no_nan,
                            r=s1,
                            weights=iosw)
    ee_rg = ee_rg * (1 - s2)  # Only S=1 or S=3 contribute

    # Returning the stacked estimating equations
    return np.vstack([ee_rg, ee_sm])
[12]:
estr = MEstimator(psi, init=[0.5, 0.5, .75, .75, 0., 0.])
estr.estimate(solver='lm')
se = np.diag(estr.variance)**0.5
[13]:
fmt = '{:.3f} (SE={:.3f})'
print("Corrected weighted proportion:", fmt.format(estr.theta[0], se[0]))
print("Naive weighted proportion:    ", fmt.format(estr.theta[1], se[1]))
print("Sensitivity:                  ", fmt.format(estr.theta[2], se[2]))
print("Specificity:                  ", fmt.format(estr.theta[3], se[3]))
fmt = "{:.3f}, {:.3f}"
print("Sampling model coefficients:  ", fmt.format(estr.theta[4], estr.theta[5]))
Corrected weighted proportion: 0.097 (SE=0.038)
Naive weighted proportion:     0.268 (SE=0.016)
Sensitivity:                   0.900 (SE=0.021)
Specificity:                   0.800 (SE=0.023)
Sampling model coefficients:   -0.000, -1.385

These results match those reported in the rejoinder. Note that the sampling model coefficients have the opposite sign of those in the paper: here the logistic model is for \(\Pr(S=2 \mid W)\) among the \(S \in \{1, 2\}\) observations, whereas the paper parameterizes the model in terms of the complementary probability. Since \(\text{logit}(1 - \pi) = -\text{logit}(\pi)\), flipping which level is modeled simply negates the coefficients.
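As before, the corrected weighted proportion can be checked in closed form from the estimates printed above.

# Closed-form check (not from the paper) using the estimates above
mu_w_star = 0.268                # inverse odds weighted naive proportion
se, sp = 0.900, 0.800            # sensitivity and specificity
mu_check = (mu_w_star + sp - 1) / (se + sp - 1)
print("{:.3f}".format(mu_check))  # 0.097, matching the corrected proportion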

Conclusion

We replicated the cases from Cole et al. (2023) to showcase the basics of M-estimators for fusion designs and how they are implemented in delicatessen.

References

Cole SR, Edwards JK, Breskin A, Rosin S, Zivich PN, Shook-Sa BE, & Hudgens MG. (2023). “Illustration of 2 Fusion Designs and Estimators”. American Journal of Epidemiology, 192(3), 467-474.

Cole SR, Edwards JK, Breskin A, Rosin S, Zivich PN, Shook-Sa BE, & Hudgens MG. (2023). “Rejoinder: Combining Information with Fusion Designs, and Roses by Other Names”. American Journal of Epidemiology, kwad084.