delicatessen.estimating_equations.causal.ee_gformula

ee_gformula(theta, y, X, X1, X0=None, force_continuous=False)

Estimating equations for the g-formula (also called g-computation). The parameter of interest can be either the mean under a single policy (i.e., plan of action) or the mean difference between two policies. This is accomplished by providing the estimating equation with the observed data (X, y) and the same data under the action plan(s) (X1, and optionally X0).

The stack of estimating equations is

\[\begin{split}\sum_{i=1}^n \begin{bmatrix} \left\{ g({X_i^*}^T \beta) - \theta_1 \right\} \\ \left\{ Y_i - g(X_i^T \beta) \right\} X_i \end{bmatrix} = 0\end{split}\]

where the first equation is the mean under the specified plan, with the plan setting the values of the action \(A\) (e.g., exposure, treatment, vaccination, etc.), and the second equation is the outcome regression model. Here, \(g\) indicates a transformation function: for linear regression, \(g\) is the identity function, while logistic regression uses the inverse-logit function. By default, ee_gformula detects whether y is binary (i.e., all zeros and ones) and, if so, applies logistic regression.
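This detection amounts to checking whether every observed outcome value is zero or one. A minimal sketch of such a check (the library's internal rule may differ; y stands for the outcome vector):

>>> import numpy as np
>>> is_binary = np.isin(np.asarray(y), [0, 1]).all()  # True -> logistic regression (unless force_continuous=True)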

Note

This variation includes 1+`b` parameters, where the first parameter is the causal mean, and the remainder are the parameters for the regression model.

Alternatively, a causal mean difference is estimated when X0 is specified. A common example is the average causal effect, where the plans are all-action-one versus all-action-zero. The stack of estimating equations then becomes

\[\begin{split}\sum_{i=1}^n \begin{bmatrix} (\theta_1 - \theta_2) - \theta_0 \\ g({X_i^1}^T \beta) - \theta_1 \\ g({X_i^0}^T \beta) - \theta_2 \\ \left\{ Y_i - g(X_i^T \beta) \right\} X_i \end{bmatrix} = 0\end{split}\]

Note

This variation includes 3+`b` parameters, where the first parameter is the causal mean difference, the second is the causal mean under plan 1, the third is the causal mean under plan 0, and the remainder are the parameters for the regression model.
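For intuition, this three-parameter stack can be written out in a few lines of NumPy. The following is a minimal conceptual sketch, assuming a binary outcome (inverse-logit link) and NumPy-array inputs; it is not delicatessen's internal implementation:

>>> import numpy as np
>>> def gformula_stack(theta, y, X, X1, X0):
>>>     # unpack: mean difference, mean under plan 1, mean under plan 0, regression coefficients
>>>     t0, t1, t2, beta = theta[0], theta[1], theta[2], np.asarray(theta[3:])
>>>     expit = lambda u: 1 / (1 + np.exp(-u))            # inverse-logit transformation
>>>     ee_diff = np.ones(y.shape[0]) * ((t1 - t2) - t0)  # (theta_1 - theta_2) - theta_0
>>>     ee_m1 = expit(X1 @ beta) - t1                     # mean under plan 1
>>>     ee_m0 = expit(X0 @ beta) - t2                     # mean under plan 0
>>>     ee_reg = (y - expit(X @ beta)) * X.T              # regression score equations, b-by-n
>>>     return np.vstack([ee_diff, ee_m1, ee_m0, ee_reg]) # (3+b)-by-n stack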

Parameters
  • theta (ndarray, list, vector) – Theta consists of 1+`b` values if X0 is None, and 3+`b` values if X0 is not None.

  • y (ndarray, list, vector) – 1-dimensional vector of n observed values.

  • X (ndarray, list, vector) – 2-dimensional vector of n observed values for b variables.

  • X1 (ndarray, list, vector) – 2-dimensional vector of n observed values for b variables under the action plan.

  • X0 (ndarray, list, vector, None, optional) – 2-dimensional vector of n observed values for b variables under the second action plan. This argument is optional and should be specified if the causal mean difference between two action plans is of interest.

  • force_continuous (bool, optional) – Option to force the use of linear regression despite detection of a binary variable.

Returns

Returns a (1+`b`)-by-n NumPy array if X0 is None, or a (3+`b`)-by-n NumPy array if X0 is specified.

Return type

array

Examples

Construction of the estimating equations with ee_gformula should be done similarly to the following

>>> import numpy as np
>>> import pandas as pd
>>> from delicatessen import MEstimator
>>> from delicatessen.estimating_equations import ee_gformula

Some generic confounded data

>>> n = 200
>>> d = pd.DataFrame()
>>> d['W'] = np.random.binomial(1, p=0.5, size=n)                            # confounder
>>> d['A'] = np.random.binomial(1, p=(0.25 + 0.5*d['W']), size=n)            # action depends on W
>>> d['Ya0'] = np.random.binomial(1, p=(0.75 - 0.5*d['W']), size=n)          # potential outcome under A=0
>>> d['Ya1'] = np.random.binomial(1, p=(0.75 - 0.5*d['W'] - 0.1*1), size=n)  # potential outcome under A=1
>>> d['Y'] = (1-d['A'])*d['Ya0'] + d['A']*d['Ya1']                           # observed outcome (causal consistency)
>>> d['C'] = 1                                                               # intercept column

In the first example, we will estimate the causal mean had everyone been set to A=1. Therefore, the optional argument X0 is left as None. Before creating the estimating equation, we need to do some data prep. First, we will create an interaction term between A and W in the original data. Then we will generate a copy of the data and update the values of A to be all 1.

>>> d['AW'] = d['A']*d['W']
>>> d1 = d.copy()
>>> d1['A'] = 1
>>> d1['AW'] = d1['A']*d1['W']

Having set up our data, we can now define the psi function.

>>> def psi(theta):
>>>     return ee_gformula(theta,
>>>                        y=d['Y'],
>>>                        X=d[['C', 'A', 'W', 'AW']],
>>>                        X1=d1[['C', 'A', 'W', 'AW']])

Notice that y corresponds to the observed outcomes, X corresponds to the observed covariate data, and X1 corresponds to the covariate data under the action plan.
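As a quick optional check, evaluating psi at any candidate theta returns the stacked contributions described in the Returns section; with 1+4=5 parameters and n=200 observations, the result is a 5-by-200 array (one row per estimating equation, one column per observation)

>>> vals = psi(np.array([0.5, 0., 0., 0., 0.]))
>>> vals.shape   # (5, 200)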

Now we can call the M-estimator. Since we are estimating the causal mean as well as the regression parameters, the length of the initial values needs to correspond with this. Our regression model consists of 4 coefficients, so we need 1+4=5 initial values. When the outcome is binary (as it is in this example), we can be nice to the optimizer and give it a starting value of 0.5 for the causal mean (since 0.5 is the middle of that parameter's range).

>>> estr = MEstimator(psi, init=[0.5, 0., 0., 0., 0.])
>>> estr.estimate(solver='lm')
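Since M-estimators are defined by the stacked estimating equations summing to zero at the solution, an optional sanity check is that the sums are near zero at the estimates

>>> np.abs(psi(estr.theta).sum(axis=1)).max()   # approximately zero at the solution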

Inspecting the parameter estimates, variance, and 95% confidence intervals

>>> estr.theta
>>> estr.variance
>>> estr.confidence_intervals()

The causal mean is

>>> estr.theta[0]
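By the first estimating equation, this is the average of the model's predicted outcomes under the plan. As a manual check, we can reproduce it with plain NumPy (a sketch that applies the inverse-logit by hand, since the outcome here is binary)

>>> beta = estr.theta[1:]                          # logistic regression coefficients
>>> X1mat = np.asarray(d1[['C', 'A', 'W', 'AW']], dtype=float)
>>> np.mean(1 / (1 + np.exp(-X1mat @ beta)))       # matches estr.theta[0]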

Continuing from the previous example, let’s say we wanted to estimate the average causal effect. Therefore, we want to contrast two plans (all A=1 versus all A=0). As before, we need to create the reference data for X0

>>> d0 = d.copy()
>>> d0['A'] = 0
>>> d0['AW'] = d0['A']*d0['W']

Having set up our data, we can now define the psi function.

>>> def psi(theta):
>>>     return ee_gformula(theta,
>>>                        y=d['Y'],
>>>                        X=d[['C', 'A', 'W', 'AW']],
>>>                        X1=d1[['C', 'A', 'W', 'AW']],
>>>                        X0=d0[['C', 'A', 'W', 'AW']])

Notice that y corresponds to the observed outcomes, X corresponds to the observed covariate data, X1 corresponds to the covariate data under A=1, and X0 corresponds to the covariate data under A=0. Here, we need 3+4=7 starting values, since there are two additional parameters from the previous example. For the difference, a starting value of 0 is generally a good choice. Since Y is binary, we again provide 0.5 as starting values for the causal means

>>> estr = MEstimator(psi, init=[0., 0.5, 0.5, 0., 0., 0., 0.])
>>> estr.estimate(solver='lm')

Inspecting the parameter estimates

>>> estr.theta[0]    # causal mean difference of 1 versus 0
>>> estr.theta[1]    # causal mean under X1
>>> estr.theta[2]    # causal mean under X0
>>> estr.theta[3:]   # logistic regression coefficients
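Note that, by the first stacked equation, the mean difference equals the difference of the two causal means

>>> estr.theta[1] - estr.theta[2]   # equals estr.theta[0] by construction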

References

Snowden JM, Rose S, & Mortimer KM. (2011). Implementation of G-computation on a simulated data set: demonstration of a causal inference technique. American Journal of Epidemiology, 173(7), 731-738.

Hernán MA, & Robins JM. (2006). Estimating causal effects from epidemiological data. Journal of Epidemiology & Community Health, 60(7), 578-586.