Handling Clustered Data

Historically, one of the main uses of estimating equations has been to handle clustered data. This use was popularized by Liang & Zeger (1986). While delicatessen relies on estimating equations for other tasks, it can also be used to handle clustered data. This tutorial reviews how clustered observations can be handled using built-in delicatessen functionality.

Setup

[1]:

import numpy as np
import scipy as sp
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt

import delicatessen as deli
from delicatessen import MEstimator
from delicatessen.estimating_equations import ee_regression
from delicatessen.utilities import aggregate_efuncs

print("Versions")
print("NumPy:       ", np.__version__)
print("SciPy:       ", sp.__version__)
print("pandas:      ", pd.__version__)
print("Matplotlib:  ", mpl.__version__)
print("Delicatessen:", deli.__version__)

Versions
NumPy:        2.3.5
SciPy:        1.16.3
pandas:       2.3.3
Matplotlib:   3.10.8
Delicatessen: 4.1

This tutorial uses data from the High School and Beyond study from 1982. This data set comes from the mlmRev R package; see that package for details. In this example, we will conduct a simple regression analysis of students' mathematical achievement scores on several individual- and school-level factors. Here, clustering is assumed to occur at the school level and is considered an incidental feature (i.e., school-specific coefficients are not of interest).

The following code loads this data (saved as a .csv file):

[2]:

d = pd.read_csv("data/hsb82.csv")
d['intercept'] = 1
# Note: despite its name, this indicator is coded 1 for males and 0 for females
d['female'] = np.where(d['sx'] == 'Female', 0, 1)

[3]:

y = np.asarray(d['mAch'])  # Math achievement score
X = np.asarray(d[['intercept', 'female', 'cses']])
g = np.asarray(d['school'])

To begin, consider what happens if we ignore the clustering. For this, we can fit a linear regression model using the built-in ee_regression function (as illustrated elsewhere):

[4]:

def psi_i(theta):
    return ee_regression(theta=theta, y=y, X=X, model='linear')

[5]:

estr_i = MEstimator(psi_i, init=[10, 0, 0])
estr_i.estimate()

[6]:

estr_i.print_results()

==============================================================
              Estimation Method: M-estimator
--------------------------------------------------------------
No. Observations:        7185 | No. Parameters:              3
Solving algorithm:         lm | Max Iterations:           5000
Solving tolerance:      1e-09 | Allow P-Inverse:             1
Derivative Method:     approx | Deriv Approx:            1e-09
Small N Correction:      None | Distribution:           Z-stat
==============================================================
   Theta   StdErr  Z-score      LCL      UCL  P-value  S-value
--------------------------------------------------------------
   12.01     0.11   114.12    11.80    12.21     0.00      inf
    1.57     0.16     9.93     1.26     1.88     0.00    74.75
    2.14     0.12    17.71     1.90     2.38     0.00   230.81
==============================================================

From this model, we see that being male and having a higher SES relative to your school’s average SES were both positively associated with math achievement scores.

Clustering by School

The previous results, particularly the inferential statistics (standard errors, Z-scores, confidence intervals, P-values, S-values), are all premised on the observations being independent. However, we might be skeptical of this assumption. In particular, students in the same school may be more similar to each other than to students from different schools. From a certain perspective, we can think about these correlated observations as each contributing ‘less than 1 unit’s worth’ of information to our model. We can use estimating equations and the sandwich variance to handle this challenge.

To do this, we will essentially collapse the estimating functions from \(n\), the number of units, to \(m\), the number of schools. So, this changes our sample size (and thus all asymptotics will be based on \(m\) and not \(n\) anymore). How we collapse observations is determined by something called the ‘working correlation matrix’. This matrix stipulates how observations are correlated (and is something we assume beforehand). The good news is that the sandwich variance is robust to misspecification of this working correlation matrix.
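As a rough sketch of what this collapsing means in the simplest case of an independent working correlation matrix, the cluster-level estimating functions are sums of the unit-level ones. The notation below is introduced here for illustration and is not taken from the delicatessen documentation: \(\psi\) is the unit-level estimating function, \(O_{ij}\) is the data for unit \(i\) in school \(j\), \(n_j\) is the number of units in school \(j\), and \(B_m\) and \(F_m\) denote the 'bread' and 'filling' of the sandwich.

```latex
\begin{align*}
\psi_j^{*}(\theta) &= \sum_{i=1}^{n_j} \psi(O_{ij}; \theta), \qquad j = 1, \dots, m \\
\sum_{j=1}^{m} \psi_j^{*}(\hat{\theta}) &= \sum_{j=1}^{m} \sum_{i=1}^{n_j} \psi(O_{ij}; \hat{\theta}) = 0 \\
V_m(\hat{\theta}) &= B_m(\hat{\theta})^{-1} F_m(\hat{\theta}) \left\{ B_m(\hat{\theta})^{-1} \right\}^{T} \\
B_m(\hat{\theta}) &= \frac{1}{m} \sum_{j=1}^{m} \left\{ -\frac{\partial \psi_j^{*}(\hat{\theta})}{\partial \theta} \right\}, \qquad
F_m(\hat{\theta}) = \frac{1}{m} \sum_{j=1}^{m} \psi_j^{*}(\hat{\theta}) \, \psi_j^{*}(\hat{\theta})^{T}
\end{align*}
```

The second line shows that the root \(\hat{\theta}\) is the same one that solves the unit-level estimating equations, since the overall sum is unchanged. What changes is the filling: its outer products are now taken over school-level sums, so cross-products between units in the same school enter the variance, which is then estimated by \(V_m(\hat{\theta}) / m\).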

Within delicatessen, the collapsing of estimating functions from \(n\) to \(m\) can be done by the aggregate_efuncs utility function. This function takes a given estimating function and adds together observations within the same cluster, as defined by the group argument. Note that this function only supports the ‘independent’ working correlation matrix. While this might be a known misspecification (in this and other clustering settings), this choice was made for two reasons: (1) this approach is more flexible and easily generalizes to arbitrary input estimating functions, and (2) non-diagonal working correlation matrices rely on an additional assumption that may not hold, and they will produce biased point estimates when it doesn’t. The second point is detailed further in Pepe & Anderson (1994) and Pan et al. (2000). The independent working correlation matrix doesn’t rely on this assumption, so it won’t produce biased point estimates when the assumption fails. The downside of this choice is a loss of efficiency: when the additional assumption does hold, a correctly specified non-independent working correlation matrix would give smaller standard errors. Despite this downside, the flexibility and robustness offered by this approach seem preferable. Therefore, only the independent working correlation matrix was made available.

The following code uses the aggregate_efuncs function to condense the previous estimating functions:
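To make the idea of adding together estimating functions within a cluster concrete, here is a minimal NumPy-only sketch. It is not delicatessen's implementation: the helper name `sum_within_clusters` and the toy arrays are hypothetical, and the sketch assumes the estimating functions are stored with one row per parameter and one column per unit.

```python
import numpy as np

def sum_within_clusters(efuncs, group):
    """Sum unit-level estimating functions within each cluster.

    efuncs : array of shape (p, n), one column per unit (assumed layout)
    group  : array of shape (n,), cluster label for each unit
    Returns an array of shape (p, m), one column per unique cluster label,
    with columns ordered by the sorted unique labels.
    """
    efuncs = np.atleast_2d(np.asarray(efuncs, dtype=float))
    labels, inverse = np.unique(np.asarray(group), return_inverse=True)
    # indicator[i, j] is 1 when unit i belongs to cluster j, else 0
    indicator = (inverse[:, None] == np.arange(labels.shape[0])[None, :])
    return efuncs @ indicator.astype(float)

# Toy example: 2 parameters, 5 units, 3 clusters
efuncs = np.array([[1., 2., 3., 4., 5.],
                   [10., 20., 30., 40., 50.]])
group = np.array(['a', 'b', 'a', 'c', 'b'])
collapsed = sum_within_clusters(efuncs, group)
print(collapsed)
```

The overall sum across columns is the same before and after collapsing, which is why solving the collapsed estimating equations gives the same root as solving the unit-level ones.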

[7]:

def psi_s(theta):
    return aggregate_efuncs(psi_i(theta), group=g)

[8]:

estr_s = MEstimator(psi_s, init=[10, 0, 0])
estr_s.estimate()

[9]:

estr_s.print_results()

==============================================================
              Estimation Method: M-estimator
--------------------------------------------------------------
No. Observations:         160 | No. Parameters:              3
Solving algorithm:         lm | Max Iterations:           5000
Solving tolerance:      1e-09 | Allow P-Inverse:             1
Derivative Method:     approx | Deriv Approx:            1e-09
Small N Correction:      None | Distribution:           Z-stat
==============================================================
   Theta   StdErr  Z-score      LCL      UCL  P-value  S-value
--------------------------------------------------------------
   12.01     0.26    45.31    11.49    12.52     0.00      inf
    1.57     0.31     5.03     0.96     2.19     0.00    20.96
    2.14     0.13    16.76     1.89     2.39     0.00   207.06
==============================================================

In the results output, we can see several changes. First, in the metadata, the number of observations drops from 7185 to 160. This is because, while there were 7185 students in the study, these students came from only 160 different schools. Second, the standard errors are larger than in the previous case: they more than doubled for the intercept (0.11 to 0.26) and roughly doubled for the sex indicator (0.16 to 0.31), while the standard error for cses changed only slightly (0.12 to 0.13). This in turn changes the Z-scores, confidence intervals, P-values, and S-values. Larger standard errors are what we would expect with clustered data where there is some correlation between observations. While our inferential results changed, the point estimates did not. This is because, under the independent working correlation matrix, summing the estimating functions within schools leaves their overall sum, and hence its root, unchanged.

That concludes this example of how to handle clustered data with delicatessen. While only shown in the context of linear regression, the aggregate_efuncs function handles condensing any user-specified estimating functions.
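The same pattern can be reproduced on simulated data without delicatessen. The sketch below is an illustration only: the data-generating process, seed, and the helper `sandwich_se` are all made up for this example. It fits a linear model by least squares and then computes the sandwich standard errors twice, once from unit-level estimating functions and once from their within-cluster sums.

```python
import numpy as np

rng = np.random.default_rng(2024)
m, k = 50, 20                          # 50 clusters with 20 units each
group = np.repeat(np.arange(m), k)
u = rng.normal(scale=2.0, size=m)      # shared cluster-level effect
x = rng.normal(size=m * k)             # unit-level covariate
y = 1.0 + 0.5 * x + u[group] + rng.normal(size=m * k)
X = np.column_stack([np.ones(m * k), x])

# Least squares solves sum_i (y_i - X_i beta) X_i = 0
beta = np.linalg.solve(X.T @ X, X.T @ y)

# Unit-level estimating functions evaluated at beta, shape (n, p)
psi = (y - X @ beta)[:, None] * X

# Sum the estimating functions within each cluster, shape (m, p)
indicator = (group[:, None] == np.arange(m)[None, :]).astype(float)
psi_cluster = indicator.T @ psi

def sandwich_se(psi_rows, bread_total):
    # Covariance = A^{-1} (sum_k psi_k psi_k^T) A^{-T}, where A is the sum
    # of the negative derivatives of psi (X^T X for this linear model)
    filling = psi_rows.T @ psi_rows
    bread_inv = np.linalg.inv(bread_total)
    cov = bread_inv @ filling @ bread_inv.T
    return np.sqrt(np.diag(cov))

bread = X.T @ X
se_unit = sandwich_se(psi, bread)             # treats units as independent
se_cluster = sandwich_se(psi_cluster, bread)  # treats clusters as independent

print("beta:      ", beta)
print("SE unit:   ", se_unit)
print("SE cluster:", se_cluster)
```

Both versions share the same `beta`, since collapsing does not change the total of the estimating functions. In this simulation, the shared cluster effect makes the cluster-based standard error for the intercept larger than the unit-based one.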

References

Liang KY, & Zeger SL. (1986). Longitudinal data analysis using generalized linear models. Biometrika, 73(1), 13-22.

Pan W, Louis TA, & Connett JE. (2000). A note on marginal linear regression with correlated response data. The American Statistician, 54(3), 191-195.

Pepe MS, & Anderson GL. (1994). A cautionary note on inference for marginal regression models with longitudinal data and general correlated response data. Communications in Statistics-Simulation and Computation, 23, 939-951.