delicatessen.estimating_equations.measurement.ee_regression_calibration

ee_regression_calibration(theta, beta, a, a_star, r, X=None, weights=None)

Estimating equation for regression calibration with external data for a mismeasured binary action. Regression calibration is a simple-to-implement method for correcting measurement error.

The general form of the estimating equations is

\[\begin{split}\sum_{i=1}^n \begin{bmatrix} (\beta^* / \gamma_0) - \beta \\ (1-R_i) \left\{ A_i - \left( \gamma_0 A_i^* + \gamma^T X_i \right) \right\} X_i^T \end{bmatrix} = 0\end{split}\]

where \(A\) is the gold-standard measurement of the action, \(A^*\) is the mismeasured version of the binary action, \(X\) is some additional covariates (including at least an intercept), and \(R\) indicates whether someone was in the validation set (\(R=0\) if in the validation set).

The first estimating equation is for the corrected coefficient for \(A\) on \(Y\), \(\beta\). This is done by dividing the coefficient for \(A^*\) on \(Y\), \(\beta^*\) (which comes from a model external to ee_regression_calibration), by \(\gamma_0\), which measures how predictive \(A^*\) is of \(A\) on the probability scale. The second estimating equation is used to estimate \(\gamma_0\) and \(\gamma\), the parameters of the calibration model, using a linear probability model. Notice that only the validation set contributes to the calibration model.
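As a rough illustration of the two equations described above, the following NumPy sketch evaluates the stacked estimating functions by hand for the intercept-only case. It is not the library's implementation: the function name `calibration_ee_sketch` is hypothetical, the calibration residual is taken to be the ordinary least-squares residual of the linear probability model, and the ordering of theta (corrected coefficient, then \(\gamma_0\), then the intercept) is assumed to match the examples on this page.

```python
import numpy as np

def calibration_ee_sketch(theta, beta_star, a, a_star, r):
    """Hand-rolled stand-in for the stacked estimating functions (intercept-only X)."""
    theta = np.asarray(theta, dtype=float)
    a_star = np.asarray(a_star, dtype=float)
    r = np.asarray(r, dtype=float)
    # A is missing outside the validation set; those rows are multiplied by
    # (1 - r) = 0 below, so the placeholder value 0 never affects the result.
    a = np.where(r == 0, np.asarray(a, dtype=float), 0.0)

    beta, gamma_0, gamma = theta[0], theta[1], theta[2:]
    n = a.shape[0]
    x = np.ones((n, 1))  # intercept-only covariates

    # First row: corrected coefficient, beta = beta_star / gamma_0
    ee_beta = np.full(n, beta_star / gamma_0 - beta)

    # Remaining rows: least-squares score of A on (A_star, X), validation set only
    design = np.column_stack([a_star, x])
    residual = a - (gamma_0 * a_star + x @ gamma)
    ee_gamma = ((1 - r) * residual)[:, None] * design

    return np.vstack([ee_beta, ee_gamma.T])  # shape (3, n)

# Toy check: 4 main-study rows (A missing) followed by 200 validation rows
counts = [90, 10, 20, 80]
a_star = np.concatenate([[0, 1, 0, 1], np.repeat([1, 0, 1, 0], counts)])
a = np.concatenate([[np.nan] * 4, np.repeat([1, 1, 0, 0], counts)])
r = np.concatenate([np.ones(4), np.zeros(200)])

beta_star = 0.97                      # arbitrary naive coefficient for illustration
gamma_0_hat = 90 / 110 - 10 / 90      # difference in P(A=1) between A_star=1 and A_star=0
intercept_hat = 10 / 90               # P(A=1 | A_star=0)
theta_hat = [beta_star / gamma_0_hat, gamma_0_hat, intercept_hat]

ee = calibration_ee_sketch(theta_hat, beta_star, a, a_star, r)
print(ee.sum(axis=0 if ee.ndim == 1 else 1))  # each row sums to (numerically) zero
```

At the solution, every row of the returned array sums to zero, which is the defining property of an M-estimator's estimating functions.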

Note

For the second place in theta (i.e., theta[1]), a starting value between 0.5 and 1 is recommended.

One caution for application of regression calibration is that it is only valid for non-differential measurement error. In cases of differential measurement error, methods like Multiple Imputation for Measurement Error (MIME) should be considered instead (Cole et al., 2006).

If X=None, then theta is a 1-by-3 array. Otherwise, theta is a 1-by-(2+`p`) array, where p is the dimension of \(X\).

Parameters
  • theta (ndarray, list, vector) – Theta consists of 2+`p` values.

  • beta (float, int, ndarray) – Coefficient to correct from a model fit outside of ee_regression_calibration. This coefficient should be for the main effect of a_star on the outcome. Notice that regression calibration only requires the coefficient to apply the correction (i.e., y is not needed for this estimating equation).

  • a (ndarray, list, vector) – 1-dimensional vector of n observed values. These are the gold-standard \(A\) measurements in the external sample. All values should be either 0 or 1, and be non-missing among those with \(R=0\).

  • a_star (ndarray, list, vector) – 1-dimensional vector of n observed values. These are the mismeasured values of \(A\) in the external and internal samples. All values should be either 0 or 1, and must be non-missing among those with \(R=0\).

  • r (ndarray, list, vector) – 1-dimensional vector of n indicators regarding whether an observation was part of the external validation data. The indicator should be 1 for observations in the main data and 0 for observations in the validation data.

  • X (ndarray, list, vector, None, optional) – 2-dimensional design matrix for the calibration model. Notice that this design matrix should not include a_star. Behind the scenes, a_star is added to this design matrix to make it easier to process the coefficients for the regression calibration step. Default is None, which automatically generates an intercept, so the calibration model a ~ a_star + 1 is fit by default.

  • weights (ndarray, list, vector, None, optional) – 1-dimensional vector of n weights. Default is None, which assigns a weight of 1 to all observations. Note that weights are only used in the calibration model fitting.

Returns

Returns a 3-by-n or (2+`p`)-by-n NumPy array evaluated for the input theta

Return type

array

Examples

Construction of an estimating equation(s) with ee_regression_calibration should be done similar to the following

>>> import numpy as np
>>> import pandas as pd
>>> from scipy.stats import logistic
>>> from delicatessen import MEstimator
>>> from delicatessen.estimating_equations import ee_regression
>>> from delicatessen.estimating_equations import ee_regression_calibration

Creating some example data for illustration

>>> d = pd.DataFrame()
>>> d['A_star'] = [0, 1, 0, 1] + [1, 0, 1, 0]
>>> d['A'] = [np.nan, np.nan, np.nan, np.nan] + [1, 1, 0, 0]
>>> d['Y'] = [0, 0, 1, 1] + [0, 1] + [np.nan, np.nan]
>>> d['n'] = [266, 67, 400, 267] + [90, 10, 20, 80]
>>> d['S'] = [1, 1, 1, 1] + [0, 0, 0, 0]
>>> d = pd.DataFrame(np.repeat(d.values, d['n'], axis=0), columns=d.columns)
>>> d['C'] = 1
>>> d1 = d.loc[d['S'] == 1].copy()

Suppose we are interested in the odds ratio for A on Y. In the main study, we only have a mismeasured version of A, indicated by A_star. As a starting point, we might examine A_star on Y, which can be done using a logistic regression model with delicatessen

>>> def psi(theta):
>>>     return ee_regression(theta=theta, y=d1['Y'],
>>>                          X=d1[['C', 'A_star']],
>>>                          model='logistic')
>>> estr = MEstimator(psi, init=[0, 0])
>>> estr.estimate()
>>> np.exp(estr.theta[1])  # Odds Ratio of interest

While we obtain a result, the corresponding odds ratio may be biased due to measurement error of A. Under the assumption of non-differential measurement error, we can use regression calibration (and our external data) to correct for the measurement error.

To implement regression calibration, we stack our previous estimating equations with the regression calibration estimating equations. Below is code that illustrates this process

>>> def psi(theta):
>>>     theta_calib = theta[:3]  # Calibration model parameters
>>>     theta_main = theta[3:]   # Naive model parameters
>>>     y_no_nan = np.where(d['Y'].isna(), -999, d['Y'])
>>>     r = np.asarray(d['S'])
>>>     # Naive regression model
>>>     ee_logit_star = ee_regression(theta=theta_main, y=y_no_nan,
>>>                                   X=d[['C', 'A_star']],
>>>                                   model='logistic')
>>>     ee_logit_star = ee_logit_star * r  # Only main-study contributes
>>>     # Regression calibration
>>>     ee_calib = ee_regression_calibration(theta=theta_calib,  # Calibration parameters
>>>                                          beta=theta_main[1], # Coefficient to correct
>>>                                          a=d['A'],           # True A
>>>                                          a_star=d['A_star'], # Mismeasured A
>>>                                          r=r)                # 1 = main study, 0 = validation
>>>     return np.vstack([ee_calib, ee_logit_star])
>>> estr = MEstimator(psi, init=[0, 0.75, 0, 0, 0])
>>> estr.estimate()

Note that 5 initial starting values are provided, 3 for the regression calibration equations and 2 for the main model. Further, note that the second parameter is non-zero. Providing a starting value of zero will result in a division by zero error. It is recommended to have the starting value between 0.5 and 1. Finally, note that the naive log-odds ratio is provided to ee_regression_calibration via the beta argument. This argument tells ee_regression_calibration which coefficient needs to be corrected.
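The layout of the starting values described above can be sketched with a small helper. The name `stacked_init` and its arguments are hypothetical (they are not part of delicatessen); the helper only assembles a list in the order used in this example, with the calibration parameters first and the main-model parameters last.

```python
def stacked_init(n_calib_covariates=1, n_main=2, gamma_0_start=0.75):
    """Build starting values: [beta, gamma_0, gamma..., main-model parameters...]."""
    if gamma_0_start == 0:
        # beta_star / gamma_0 is evaluated at the starting values, so zero is not usable
        raise ValueError("gamma_0_start must be non-zero")
    calib = [0.0, gamma_0_start] + [0.0] * n_calib_covariates
    main = [0.0] * n_main
    return calib + main

init = stacked_init()  # intercept-only calibration model, two main-model parameters
print(init)
```

With the defaults, this reproduces the five starting values used in the example above.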

Inspecting the parameter estimates

>>> estr.theta[0]                    # Corrected log-odds ratio
>>> estr.theta[1]                    # Calibration factor for the coefficient
>>> estr.theta[-1]                   # Uncorrected log-odds ratio (for this setup)
>>> estr.theta[-1] / estr.theta[1]   # Equivalent to estr.theta[0]

While one could calculate the corrected odds ratio by hand, the sandwich variance estimator provides a consistent estimate of the variance in this context. The provided estimating equation method can be easily extended to account for measurement error of conditional odds ratios, parameters of generalized linear models, or marginal structural models.
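For this particular example, the by-hand point estimate can be sketched directly from the cell counts in the example data. This sketch assumes the saturated models used here (a single binary regressor plus an intercept), where the logistic regression coefficient equals the log of the 2-by-2 table odds ratio and the linear probability model slope equals a difference in proportions. The variable names are illustrative only, and no variance is computed.

```python
import numpy as np

# Main study counts, indexed as n_{a_star, y}, taken from the example data
n_00, n_10, n_01, n_11 = 266, 67, 400, 267

# Naive log-odds ratio for A_star on Y
log_or_naive = np.log((n_11 * n_00) / (n_10 * n_01))

# Validation counts, indexed as m_{a_star, a}, taken from the example data
m_11, m_01, m_10, m_00 = 90, 10, 20, 80

# Calibration slope: P(A=1 | A_star=1) - P(A=1 | A_star=0)
gamma_0 = m_11 / (m_11 + m_10) - m_01 / (m_01 + m_00)

# Regression calibration: divide the naive coefficient by gamma_0
log_or_corrected = log_or_naive / gamma_0

print(log_or_naive, gamma_0, log_or_corrected, np.exp(log_or_corrected))
```

These quantities correspond to estr.theta[-1], estr.theta[1], and estr.theta[0] above. The hand calculation gives no standard errors, which is what stacking the estimating equations and using the sandwich variance estimator provides.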

Weighted calibration models can be estimated by specifying the optional weights argument (weights are not used to calibrate the coefficient).

References

Boe LA, Lumley T, & Shaw PA. (2024). Practical Considerations for Sandwich Variance Estimation in 2-Stage Regression Settings. American Journal of Epidemiology, 193(5), 798-810.

Cole SR, Chu H, & Greenland S. (2006). Multiple-imputation for measurement-error correction. International Journal of Epidemiology, 35(4), 1074-1081.

Cole SR, Jacobson LP, Tien PC, Kingsley L, Chmiel JS, & Anastos K. (2010). Using marginal structural measurement-error models to estimate the long-term effect of antiretroviral therapy on incident AIDS or death. American Journal of Epidemiology, 171(1), 113-122.

Rosner B, Willett WC, & Spiegelman D. (1989). Correction of logistic regression relative risk estimates and confidence intervals for systematic within‐person measurement error. Statistics in Medicine, 8(9), 1051-1069.