Cole et al. (2022): Addressing Missing Outcome Data
Cole et al. (2022) reviewed a way to address missing data with several different missing data mechanisms. This approach is commonly referred to as direct imputation or g-computation. This was done in the context of a randomized trial on 17P injection for the prevention of preterm birth. All data was described in a table in the paper, so there is no need to track down any data set.
Here, we review selected examples and how g-computation for missing data can be implemented using delicatessen.
Setup
[1]:
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import delicatessen as deli
from delicatessen import MEstimator
from delicatessen.estimating_equations import ee_regression, ee_glm
from delicatessen.utilities import inverse_logit
print("Versions")
print("NumPy: ", np.__version__)
print("SciPy: ", sp.__version__)
print("pandas: ", pd.__version__)
print("Matplotlib: ", mpl.__version__)
print("Delicatessen:", deli.__version__)
Versions
NumPy: 2.3.5
SciPy: 1.16.3
pandas: 2.3.3
Matplotlib: 3.10.8
Delicatessen: 4.1
[2]:
d = pd.DataFrame()
d['short_cervix'] = [0]*215 + [0]*222 + [1]*186 + [1]*177
d['17P'] = [0]*215 + [1]*222 + [0]*186 + [1]*177
d['placebo'] = 1 - d['17P']
d['preterm0'] = ([1]*15 + [0]*(215-15) + [1]*13 + [0]*(222-13)
+ [1]*21 + [0]*(186-21) + [1]*23 + [0]*(177-23))
d['preterm1'] = ([1]*11 + [0]*(161-11) + [np.nan]*54
+ [1]*10 + [0]*(167-10) + [np.nan]*55
+ [1]*16 + [0]*(140-16) + [np.nan]*46
+ [1]*17 + [0]*(133-17) + [np.nan]*44)
d['preterm2'] = ([1]*8 + [0]*(108-8) + [np.nan]*107
+ [1]*13 + [0]*(222-13) + [np.nan]*0
+ [1]*21 + [0]*(186-21) + [np.nan]*0
+ [1]*12 + [0]*(89-12) + [np.nan]*88)
d['preterm3'] = ([1]*15 + [0]*(215-15) + [np.nan]*0
+ [1]*13 + [0]*(222-13) + [np.nan]*0
+ [1]*21 + [0]*(186-21) + [np.nan]*0
+ [1]*0 + [0]*0 + [np.nan]*177)
d['preterm4'] = ([1]*15 + [0]*(100-15) + [np.nan]*115
+ [1]*7 + [0]*(216-7) + [np.nan]*6
+ [1]*21 + [0]*(186-21) + [np.nan]*0
+ [1]*12 + [0]*(81-12) + [np.nan]*96)
d['C'] = 1
To start, we will estimate the risk ratio using the complete outcome data (preterm0).
The first estimating function computes the risk ratio by computing each risk and then finding the ratio (or the difference on the log-scale)
[3]:
def psi_rr(theta):
a = np.asarray(d['placebo'])
y = np.asarray(d['preterm0'])
ee_r1 = a*(y - theta[1])
ee_r0 = (1-a)*(y - theta[2])
ee_rr = np.ones(y.shape[0]) * (np.log(theta[1]) - np.log(theta[2]) - theta[0])
return np.vstack([ee_rr, ee_r1, ee_r0])
[4]:
estr = MEstimator(psi_rr, init=[0., 0.5, 0.5])
estr.estimate()
ci = estr.confidence_intervals()
[5]:
print("RR: ", np.exp(estr.theta[0]))
print("95% CI:", np.exp(ci[0, :]))
RR: 0.9950124688279302
95% CI: [0.64038277 1.54602818]
In the paper, the risk ratio is estimated using a log-binomial model. We can also do this easily via the ee_glm built-in estimating function. One caution is that the log-binomial can be a bit finicky with the starting values. In particular, we want to provide a good starting value for the intercept term (can be done using the mean of \(Y\) in the data).
[6]:
def psi_lb(theta):
return ee_glm(theta=theta, X=d[['C', 'placebo']], y=d['preterm0'],
distribution='binomial', link='log')
[7]:
estr = MEstimator(psi_lb, init=[np.log(0.08), 0.])
estr.estimate()
ci = estr.confidence_intervals()
[8]:
print("RR: ", np.exp(estr.theta[1]))
print("95% CI:", np.exp(ci[1, :]))
RR: 0.9950124688279302
95% CI: [0.64038273 1.54602829]
which give the same result quite far into the shown decimal places (some differences would be expected further out due to floating point errors and tolerances). Having established the best case, let’s turn our attention to the missing data examples.
Scenario A
The first scenario induces non-informative missing outcome data (i.e., missing completely at random). Here, we expect a complete-case analysis and g-computation to both be unbiased. The following is the complete-case estimating function and the g-computation estimating function
[9]:
a = np.asarray(d['placebo'])
r = np.where(np.isnan(d['preterm1']), 0, 1)
y_no_nan = np.nan_to_num(d['preterm1'], nan=-999)
[10]:
def psi_cc(theta):
return ee_glm(theta=theta, X=d[['C', 'placebo']], y=y_no_nan,
distribution='binomial', link='log', weights=r)
[11]:
def psi_gcomp(theta):
beta1 = theta[3:5]
beta0 = theta[5:]
X = np.asarray(d[['C', 'short_cervix']])
# Outcome nuisance model for 17P=1
ee_reg1 = ee_regression(theta=beta1, X=X,
y=y_no_nan, model='logistic')
ee_reg1 = ee_reg1 * r * a # Only those with observed Y contribute
# Outcome nuisance model for 17P=0
ee_reg0 = ee_regression(theta=beta0, X=X,
y=y_no_nan, model='logistic')
ee_reg0 = ee_reg0 * r * (1-a) # Only those with observed Y contribute
# Computing risk ratios
y1hat = inverse_logit(np.dot(X, beta1))
y0hat = inverse_logit(np.dot(X, beta0))
ee_r1 = y1hat - theta[1]
ee_r0 = y0hat - theta[2]
ee_rr = np.ones(X.shape[0]) * (np.log(theta[1]) - np.log(theta[2]) - theta[0])
return np.vstack([ee_rr, ee_r1, ee_r0, ee_reg1, ee_reg0])
Note that g-computation involves fitting several regression models for \(Y\) among the complete-cases. This process is what allows us to correct for missing data in informative scenarios. In settings with non-informative missing, we also expect the g-computation results to be more efficient (narrower confidence intervals). For further details on the estimator itself, see the corresponding publication.
The following computes the complete-case results
[12]:
estr = MEstimator(psi_cc, init=[np.log(0.09), 0.])
estr.estimate()
ci = estr.confidence_intervals()
print("RR: ", np.exp(estr.theta[1]))
print("95% CI:", np.exp(ci[1, :]))
RR: 0.9966777408637874
95% CI: [0.59915561 1.65794412]
As we would expect in this scenario, the complete-case results are similar to the full-data results. Note that the confidence intervals are wider than the full-data results.
[13]:
init_vals = [0., 0.5, 0.5] + [0., ]*2 + [0., ]*2
estr = MEstimator(psi_gcomp, init=init_vals)
estr.estimate()
ci = estr.confidence_intervals()
print("RR: ", np.exp(estr.theta[0]))
print("95% CI:", np.exp(ci[0, :]))
RR: 0.9831422283742006
95% CI: [0.59247203 1.63141648]
The g-computation results are similar. Here, the confidence intervals are quite similar.
Scenario B
In scenario B, we consider informative missing data. Specifically, cervix length is related to missingness. Here, a complete-case analysis will be biased, but we would expect g-computation to correct for this bias. The following apply the estimating functions to this second scenario
[14]:
a = np.asarray(d['placebo'])
r = np.where(np.isnan(d['preterm2']), 0, 1)
y_no_nan = np.nan_to_num(d['preterm2'], nan=-999)
[15]:
estr = MEstimator(psi_cc, init=[np.log(0.09), 0.])
estr.estimate()
ci = estr.confidence_intervals()
print("RR: ", np.exp(estr.theta[1]))
print("95% CI:", np.exp(ci[1, :]))
RR: 1.2270748299319727
95% CI: [0.73641679 2.04464736]
[16]:
init_vals = [0., 0.5, 0.5] + [0., ]*2 + [0., ]*2
estr = MEstimator(psi_gcomp, init=init_vals)
estr.estimate()
ci = estr.confidence_intervals()
print("RR: ", np.exp(estr.theta[0]))
print("95% CI:", np.exp(ci[0, :]))
RR: 0.9841727212256232
95% CI: [0.5745667 1.68578503]
Here, we see the complete-case analysis indicates that there is a relationship between 17P and preterm birth. However, this (correctly) disappears when we adjust for missingness by cervix length.
Unlike in the corresponding paper, we are able to avoid bootstrapping to estimate the variance using the sandwich instead. This makes this code much more efficient. This computational advantage of M-estimation is noted in the Discussion of Cole et al. (2022), but has not been shown elsewhere. Other scenarios are replicated in the paper but are not done here. However, the table data is provided above as preterm3 and preterm4.
References
Cole, S. R., Zivich, P. N., Edwards, J. K., Ross, R. K., Shook-Sa, B. E., Price, J. T., & Stringer, J. S. (2023). Missing outcome data in epidemiologic studies. American Journal of Epidemiology, 192(1), 6-10.