delicatessen.utilities.aggregate_efuncs
- aggregate_efuncs(est_funcs, group)
Aggregate estimating function contributions from the individual-level to the specified group-level. This function is intended to simplify estimation with grouped or clustered data. Briefly, the input matrix of estimating function contributions is collapsed from n unit-level contributions into m group-level contributions under the working assumption that observations within groups are independent (see the note below). This function should be used whenever observations are not independent and there is a group-level ID variable for appropriate statistical inference.
This function is intended to be called after the estimating functions have been stacked, but before they are returned in a psi function. See the examples below for details on the intended use.
Note
Here, an independent working correlation structure is assumed.
An independent working correlation structure is assumed for two reasons: it simplifies computation, and it avoids relying on an additional assumption. First, without needing to specify a more detailed structure, the aggregation of observations is straightforward for an arbitrary set of estimating functions, so this procedure works with any general input matrix of estimating function contributions. Second, as described in Pepe & Anderson (1994) and Pan et al. (2000), use of non-diagonal working correlation matrices (i.e., options other than independence) relies on an additional assumption that may not hold. When this assumption does not hold, point estimates may be biased. Given that the empirical sandwich variance estimator is consistent under misspecification of the working correlation structure, the philosophy of this utility is to maintain unbiased point estimation at the potential cost of some statistical efficiency (relative to the case where the correlation structure is correctly specified and the additional assumption holds).
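The collapse from n unit-level columns to m group-level columns can be sketched in plain NumPy. This is an illustrative stand-in and not the library's implementation: it assumes the group-level contribution is the sum of the unit-level contributions within each group, and the helper name `collapse_by_group` and the parameter values in `theta` are hypothetical.

```python
import numpy as np


def collapse_by_group(est_funcs, group):
    """Hypothetical helper: sum the columns of a p-by-n array within each group."""
    est_funcs = np.atleast_2d(np.asarray(est_funcs, dtype=float))  # shape (p, n)
    group = np.asarray(group).ravel()                               # shape (n,)
    if group.shape[0] != est_funcs.shape[1]:
        raise ValueError("group must contain one identifier per unit (length n)")
    # ids: sorted unique group labels; idx: column position of each unit's group
    ids, idx = np.unique(group, return_inverse=True)
    # n-by-m indicator matrix: entry (i, j) is 1 when unit i belongs to group j
    indicator = (idx[:, None] == np.arange(ids.shape[0])[None, :]).astype(float)
    # (p, n) @ (n, m) -> (p, m): within-group sums for every estimating function
    return est_funcs @ indicator


y = np.array([1, -1, 0, 3, 2, -3, -2, -1, 1, 0], dtype=float)
group = [1, 1, 1, 2, 2, 3, 4, 4, 5, 6]

# Two stacked estimating functions (mean and variance), evaluated at
# assumed parameter values chosen only for illustration
theta = [0.0, 3.0]
unit_level = np.vstack([y - theta[0],
                        (y - theta[0]) ** 2 - theta[1]])  # shape (2, 10)
group_level = collapse_by_group(unit_level, group)         # shape (2, 6)
print(group_level.shape)
```

Because each unit belongs to exactly one group, the row totals are unchanged by the collapse. Under this sketch, the solution to the estimating equations is therefore the same before and after aggregation, and only the number of contributions entering the variance calculation differs.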
- Parameters
est_funcs (ndarray, list, vector) – Input p-by-n matrix to collapse into a p-by-m matrix, where n is the number of units and m is the number of groups.
group (ndarray, list, vector) – A vector of length n designating the group identifiers for each unit-level observation.
- Returns
Returns a p-by-m NumPy array
- Return type
array
Examples
Using the aggregate_efuncs utility for grouped or clustered data
>>> import numpy as np
>>> from delicatessen import MEstimator
>>> from delicatessen.estimating_equations import ee_mean
>>> from delicatessen.utilities import aggregate_efuncs
Some generic data for clustered observations
>>> y = [1, -1, 0, 3, 2, -3, -2, -1, 1, 0]
>>> group = [1, 1, 1, 2, 2, 3, 4, 4, 5, 6]
Here, we are interested in estimating the mean of Y. There are 10 observations, but these observations only come from 6 unique groups (clusters). Therefore, we aggregate the estimating functions for the mean of Y by the group IDs. To apply this, we (1) compute the estimating functions at the unit-level, (2) aggregate the contributions at the group level (using aggregate_efuncs), and (3) return the group-level estimating functions to MEstimator (or GMMEstimator). The following blocks of code illustrate this process
>>> def psi(theta):
...     ee_ind_level = ee_mean(theta=theta, y=y)
...     ee_group_level = aggregate_efuncs(ee_ind_level, group=group)
...     return ee_group_level
>>> estr = MEstimator(psi, init=[0., ])
>>> estr.estimate()
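For the mean, the effect of aggregation on the empirical sandwich variance can be worked out directly. The following NumPy-only sketch uses the same data and the general bread-and-meat form of the sandwich estimator. It is not delicatessen's internal code, and the helper name `mean_sandwich_variance` is hypothetical. The derivative of each contribution with respect to the mean is supplied analytically here, which is an assumption made to keep the example self-contained.

```python
import numpy as np

y = np.array([1, -1, 0, 3, 2, -3, -2, -1, 1, 0], dtype=float)
group = np.array([1, 1, 1, 2, 2, 3, 4, 4, 5, 6])
theta_hat = y.mean()  # solves sum(y - theta) = 0 with or without aggregation


def mean_sandwich_variance(efuncs, neg_derivs):
    """Hypothetical helper: sandwich variance for a single parameter.

    efuncs     : contributions evaluated at theta_hat, one per unit or group
    neg_derivs : minus the derivative of each contribution with respect to theta
    """
    k = efuncs.shape[0]              # number of contributions (n or m)
    bread = neg_derivs.sum() / k
    meat = (efuncs ** 2).sum() / k
    return meat / bread ** 2 / k


# Unit-level: 10 contributions, each with derivative -1
var_unit = mean_sandwich_variance(y - theta_hat, np.ones(y.shape[0]))

# Group-level: 6 contributions, each the sum of its members' contributions
_, idx = np.unique(group, return_inverse=True)
efuncs_group = np.bincount(idx, weights=y - theta_hat)  # within-group sums
sizes = np.bincount(idx).astype(float)                  # minus derivative = group size
var_group = mean_sandwich_variance(efuncs_group, sizes)

print(var_unit, var_group)  # approximately 0.30 and 0.44
```

In this sketch the point estimate is 0 in both cases, while the variance rises from about 0.30 to about 0.44 once the within-group sums replace the unit-level contributions.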
By aggregating the estimating functions before providing them to MEstimator, we change the effective sample size and modify the inputs to the empirical sandwich variance estimator. This aggregation can be done for more than one parameter and helps to simplify inference for grouped or clustered data.
References
Pepe MS & Anderson GL. (1994). A cautionary note on inference for marginal regression models with longitudinal data and general correlated response data. Communications in Statistics - Simulation and Computation, 23, 939-951.
Pan W, Louis TA, & Connett JE. (2000). A note on marginal linear regression with correlated response data. The American Statistician, 54(3), 191-195.