delicatessen.estimating_equations.causal.ee_gformula
- ee_gformula(theta, y, X, X1, X0=None, force_continuous=False)
Estimating equations for the g-formula (or g-computation). The parameter of interest can either be the mean under a single policy or plan of action, or the mean difference between two policies. This is accomplished by providing the estimating equation with the observed data (X, y), and the same data under the action(s) (X1, and optionally X0). The stack of estimating equations is
\[\begin{split}\sum_{i=1}^n \begin{bmatrix} \left\{ g({X_i^*}^T \beta) - \theta_1 \right\} \\ \left\{ Y_i - g(X_i^T \beta) \right\} X_i \end{bmatrix} = 0\end{split}\]
where the first equation is the mean under the specified plan, with the plan setting the values of the action \(A\) (e.g., exposure, treatment, vaccination, etc.), and the second equation is the outcome regression model. Here, \(g\) indicates a transformation function: for linear regression, \(g\) is the identity function, while logistic regression uses the inverse-logit function. By default, ee_gformula detects whether y consists entirely of binary values (zero or one), and applies logistic regression if so.

Note
This variation includes 1+`b` parameters, where the first parameter is the causal mean, and the remainder are the parameters for the regression model.
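Conceptually, solving the regression block and then plugging the fitted coefficients into the first block recovers the familiar two-step g-formula estimate. A minimal numpy sketch of that logic on simulated data (all variable names and the data-generating mechanism below are illustrative, not part of the package):

```python
import numpy as np

def expit(z):
    return 1 / (1 + np.exp(-z))

# Simulated confounded data (illustrative only)
rng = np.random.default_rng(0)
n = 500
W = rng.binomial(1, 0.5, n)
A = rng.binomial(1, 0.25 + 0.5 * W, n)
Y = rng.binomial(1, expit(-0.5 + 0.8 * A - 1.0 * W), n)

X = np.column_stack([np.ones(n), A, W])   # observed design matrix
X1 = X.copy()
X1[:, 1] = 1                              # plan: everyone set to A=1

# Step 1: fit the logistic outcome model by Newton-Raphson
beta = np.zeros(X.shape[1])
for _ in range(25):
    p = expit(X @ beta)
    score = X.T @ (Y - p)                 # same form as the regression block
    hess = X.T @ (X * (p * (1 - p))[:, None])
    beta = beta + np.linalg.solve(hess, score)

# Step 2: plug-in g-formula mean -- average prediction under the plan
theta1 = expit(X1 @ beta).mean()
```

Stacking both steps into one estimating-equation system, as ee_gformula does, additionally yields a sandwich variance for theta1 that accounts for the estimation of beta.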
Alternatively, a causal mean difference is estimated when X0 is specified. A common example is the average causal effect, where the plans are all-action-one versus all-action-zero. In that case, the stack consists of the following four estimating equations
\[\begin{split}\sum_{i=1}^n \begin{bmatrix} (\theta_1 - \theta_2) - \theta_0 \\ g({X_i^1}^T \beta) - \theta_1 \\ g({X_i^0}^T \beta) - \theta_2 \\ \left\{ Y_i - g(X_i^T \beta) \right\} X_i \end{bmatrix} = 0\end{split}\]
Note
This variation includes 3+`b` parameters, where the first parameter is the causal mean difference, the second is the causal mean under plan 1, the third is the causal mean under plan 0, and the remainder are the parameters for the regression model.
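As a rough illustration of what solving this stacked system looks like, the four blocks can be hand-coded and passed to a generic root-finder (the data and all names below are simulated and illustrative; this is not the package's implementation):

```python
import numpy as np
from scipy.optimize import root

def expit(z):
    return 1 / (1 + np.exp(-z))

# Simulated confounded data (illustrative only)
rng = np.random.default_rng(1)
n = 400
W = rng.binomial(1, 0.5, n)
A = rng.binomial(1, 0.25 + 0.5 * W, n)
Y = rng.binomial(1, expit(0.2 + 0.6 * A - 0.9 * W), n)

X = np.column_stack([np.ones(n), A, W])
X1, X0 = X.copy(), X.copy()
X1[:, 1], X0[:, 1] = 1, 0                  # all-A=1 and all-A=0 plans

def stacked_psi(theta):
    t0, t1, t2, beta = theta[0], theta[1], theta[2], theta[3:]
    return np.concatenate([
        [(t1 - t2) - t0],                  # causal mean difference
        [np.mean(expit(X1 @ beta)) - t1],  # mean under plan A=1
        [np.mean(expit(X0 @ beta)) - t2],  # mean under plan A=0
        X.T @ (Y - expit(X @ beta)),       # logistic regression score
    ])

sol = root(stacked_psi, x0=np.array([0., 0.5, 0.5, 0., 0., 0.]))
ace = sol.x[0]                             # estimated mean difference
```

MEstimator does essentially this root-finding step, but additionally computes the empirical sandwich variance from the same stacked function.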
- Parameters
theta (ndarray, list, vector) – Theta consists of 1+`b` values if X0 is None, and 3+`b` values if X0 is not None.
y (ndarray, list, vector) – 1-dimensional vector of n observed values.
X (ndarray, list, vector) – 2-dimensional vector of n observed values for b variables.
X1 (ndarray, list, vector) – 2-dimensional vector of n observed values for b variables under the action plan.
X0 (ndarray, list, vector, None, optional) – 2-dimensional vector of n observed values for b variables under the separate action plan. This second argument is optional and should be specified if the causal mean difference between two action plans is of interest.
force_continuous (bool, optional) – Option to force the use of linear regression despite detection of a binary variable.
- Returns
Returns a (1+`b`)-by-n NumPy array if X0 is None, or a (3+`b`)-by-n NumPy array if X0 is not None
- Return type
array
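The per-observation layout of that return can be sketched in plain numpy. This hand-rolled version (illustrative only, not the package's implementation) builds the (1+`b`)-by-n array for the single-plan case, evaluated at an arbitrary theta:

```python
import numpy as np

def expit(z):
    return 1 / (1 + np.exp(-z))

# Arbitrary simulated data (illustrative only)
rng = np.random.default_rng(2)
n, b = 100, 3
X = np.column_stack([np.ones(n),
                     rng.binomial(1, 0.5, n),
                     rng.normal(size=n)])
Y = rng.binomial(1, 0.5, n)
X1 = X.copy()
X1[:, 1] = 1                              # plan: set the action column to 1

def psi_single_plan(theta):
    mu, beta = theta[0], np.asarray(theta[1:])
    ee_mean = expit(X1 @ beta) - mu       # 1-by-n block for the causal mean
    ee_reg = (Y - expit(X @ beta)) * X.T  # b-by-n block for the regression
    return np.vstack([ee_mean, ee_reg])

contributions = psi_single_plan([0.5, 0., 0., 0.])
```

Each column holds one observation's contributions; MEstimator sums the columns to form the estimating equations and uses the outer products for the sandwich variance.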
Examples
Construction of an estimating equation with ee_gformula should be done similar to the following

>>> import numpy as np
>>> import pandas as pd
>>> from delicatessen import MEstimator
>>> from delicatessen.estimating_equations import ee_gformula
Some generic confounded data
>>> n = 200
>>> d = pd.DataFrame()
>>> d['W'] = np.random.binomial(1, p=0.5, size=n)
>>> d['A'] = np.random.binomial(1, p=(0.25 + 0.5*d['W']), size=n)
>>> d['Ya0'] = np.random.binomial(1, p=(0.75 - 0.5*d['W']), size=n)
>>> d['Ya1'] = np.random.binomial(1, p=(0.75 - 0.5*d['W'] - 0.1*1), size=n)
>>> d['Y'] = (1-d['A'])*d['Ya0'] + d['A']*d['Ya1']
>>> d['C'] = 1
In the first example, we will estimate the causal mean had everyone been set to A=1. Therefore, the optional argument X0 is left as None. Before creating the estimating equation, we need to do some data prep. First, we will create an interaction term between A and W in the original data. Then we will generate a copy of the data and update the values of A to be all 1.

>>> d['AW'] = d['A']*d['W']
>>> d1 = d.copy()
>>> d1['A'] = 1
>>> d1['AW'] = d1['A']*d1['W']
Having setup our data, we can now define the psi function.
>>> def psi(theta):
>>>     return ee_gformula(theta,
>>>                        y=d['Y'],
>>>                        X=d[['C', 'A', 'W', 'AW']],
>>>                        X1=d1[['C', 'A', 'W', 'AW']])
Notice that y corresponds to the observed outcomes, X corresponds to the observed covariate data, and X1 corresponds to the covariate data under the action plan.

Now we can call the M-estimator. Since we are estimating both the causal mean and the regression parameters, the length of the initial values needs to correspond with this. Our regression model consists of 4 coefficients, so we need 1+4=5 initial values. When the outcome is binary (like it is in this example), we can be nice to the optimizer and give it a starting value of 0.5 for the causal mean (since 0.5 is in the middle of that distribution).
>>> estr = MEstimator(psi, init=[0.5, 0., 0., 0., 0.])
>>> estr.estimate(solver='lm')
Inspecting the parameter estimates, variance, and 95% confidence intervals
>>> estr.theta
>>> estr.variance
>>> estr.confidence_intervals()
The causal mean is
>>> estr.theta[0]
Continuing from the previous example, let’s say we wanted to estimate the average causal effect. Therefore, we want to contrast two plans (all A=1 versus all A=0). As before, we need to create the reference data for X0

>>> d0 = d.copy()
>>> d0['A'] = 0
>>> d0['AW'] = d0['A']*d0['W']
Having setup our data, we can now define the psi function.
>>> def psi(theta):
>>>     return ee_gformula(theta,
>>>                        y=d['Y'],
>>>                        X=d[['C', 'A', 'W', 'AW']],
>>>                        X1=d1[['C', 'A', 'W', 'AW']],
>>>                        X0=d0[['C', 'A', 'W', 'AW']])
Notice that y corresponds to the observed outcomes, X corresponds to the observed covariate data, X1 corresponds to the covariate data under A=1, and X0 corresponds to the covariate data under A=0. Here, we need 3+4=7 starting values, since there are two additional parameters compared to the previous example. For the difference, a starting value of 0 is generally a good choice. Since Y is binary, we again provide 0.5 as the starting values for the causal means

>>> estr = MEstimator(psi, init=[0., 0.5, 0.5, 0., 0., 0., 0.])
>>> estr.estimate(solver='lm')
Inspecting the parameter estimates
>>> estr.theta[0]    # causal mean difference of 1 versus 0
>>> estr.theta[1]    # causal mean under X1
>>> estr.theta[2]    # causal mean under X0
>>> estr.theta[3:]   # regression coefficients
References
Snowden JM, Rose S, & Mortimer KM. (2011). Implementation of G-computation on a simulated data set: demonstration of a causal inference technique. American Journal of Epidemiology, 173(7), 731-738.
Hernán MA, & Robins JM. (2006). Estimating causal effects from epidemiological data. Journal of Epidemiology & Community Health, 60(7), 578-586.