delicatessen.estimating_equations.regression.ee_additive_regression

ee_additive_regression(theta, X, y, specifications, model, weights=None, offset=None)

Estimating equation for Generalized Additive Models (GAMs). GAMs are an extension of generalized linear models that allow for more flexible specification of the relationships between continuous covariates and the outcome. This flexibility is accomplished via splines. To further control the flexibility, the spline terms are penalized.

Note

The implemented GAM uses L2-penalization. This penalization only applies to the generated spline terms (i.e., penalization decreases the ‘wiggliness’ of the estimated relationships).

The estimating equation for a generalized additive linear regression model is

\[\sum_{i=1}^n \left\{ (Y_i - f(X_i)^T \theta) f(X_i) - \lambda \theta \right\} = 0\]

where \(\lambda\) is the penalty term.

While this looks similar to Ridge regression, there are two important differences: the function \(f()\) and how \(\lambda\) is defined. First, the function \(f()\) denotes a general vector function. For spline terms, this function defines the basis functions for the splines (set via the specifications parameter). For non-spline terms, this is the identity function (i.e., no changes are made to the input). This setup allows for terms to be selectively modeled using splines (e.g., categorical features are not modeled using splines). Second, the penalty term, \(\lambda\), is non-zero only for the elements of \(\theta\) that correspond to spline parameters (i.e., only the spline terms are penalized). This is distinct from default Ridge regression, which penalizes all terms in the model.
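To make this concrete, below is a conceptual NumPy sketch of the left-hand side of this estimating equation. This is not the library's internal implementation; Xa stands for an already-expanded design matrix (i.e., \(f(X)\)) and lam for a hypothetical per-parameter penalty vector that is zero for non-spline columns.

>>> import numpy as np
>>> def penalized_score(theta, Xa, y, lam):
>>>     # sum_i (Y_i - f(X_i)^T theta) f(X_i), minus the element-wise penalty
>>>     resid = y - Xa @ theta
>>>     return Xa.T @ resid - lam * theta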

Note

Originally, GAMs were implemented via splines with a knot at each unique value of \(X\). More recently, GAMs use a more moderate number of knots to improve computational efficiency. Both versions can be implemented by ee_additive_regression through setting the knot locations.

Here, \(\theta\) is a 1-by-(b+k) array, where b is the number of distinct covariates included as part of X and k is the number of distinct spline basis functions. For example, if X is an n-by-2 matrix with a 10-knot natural spline for the second column in X, then \(\theta\) will be a 1-by-(2+9) array. The code is general to allow for an arbitrary number of X variables and spline knots.
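Because the number of generated basis functions depends on the spline options, a simple way to obtain this count in practice is from the shape of the generated additive design matrix. A minimal sketch, using a hypothetical two-column input with a 10-knot natural spline on the second column:

>>> import numpy as np
>>> from delicatessen.utilities import additive_design_matrix
>>> X = np.hstack([np.ones((100, 1)),                     # intercept column
>>>                np.random.uniform(-5, 5, (100, 1))])   # continuous column
>>> spec = [None, {"knots": [-4, -3, -2, -1, 0, 1, 2, 3, 4, 4.5]}]
>>> Xa = additive_design_matrix(X=X, specifications=spec)
>>> Xa.shape[1]   # b + k, the number of values theta must contain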

Parameters
  • theta (ndarray, list, vector) – Theta in this case consists of b+k values. Number of values should match the number of columns in the additive design matrix.

  • X (ndarray, list, vector) – 2-dimensional vector of n observed values for b variables.

  • y (ndarray, list, vector) – 1-dimensional vector of n observed values.

  • specifications (list, dict, None) – A list of dictionaries that define the hyperparameters for the splines (e.g., knot locations, strength of penalty). For terms that should not have splines, None should be specified instead (see examples below). Each dictionary supports the following keys:

      – knots (list): controls the positions of the knots; knots are placed at the given locations. There is no default, so knots must be specified by the user.

      – natural (bool): controls whether to generate natural (restricted) or unrestricted splines. Default is True, which corresponds to natural splines.

      – power (float): controls the power to raise the spline terms to. Default is 3, which corresponds to cubic splines.

      – penalty (float): the penalty term (\(\lambda\)) applied to each corresponding spline basis term. Default is 0, which applies no penalty to the spline basis terms.

  • model (str) – Type of regression model to estimate. Options are 'linear' (linear regression), 'logistic' (logistic regression), and 'poisson' (Poisson regression).

  • weights (ndarray, list, vector, None, optional) – 1-dimensional vector of n weights. Default is None, which assigns a weight of 1 to all observations.

  • offset (ndarray, list, vector, None, optional) – A 1-dimensional offset to be included in the model. Default is None, which applies no offset term.

Returns

Returns a (b+k)-by-n NumPy array evaluated for the input theta.

Return type

array

Examples

Construction of the estimating equations with ee_additive_regression should be done similarly to the following

>>> import numpy as np
>>> import pandas as pd
>>> from scipy.stats import logistic
>>> from delicatessen import MEstimator
>>> from delicatessen.estimating_equations import ee_additive_regression
>>> from delicatessen.utilities import additive_design_matrix, regression_predictions

Some generic data to estimate a generalized additive model

>>> n = 2000
>>> data = pd.DataFrame()
>>> x = np.random.uniform(-5, 5, size=n)
>>> data['X'] = x
>>> data['Y1'] = np.exp(np.sin(x + 0.5)) + np.abs(x) + np.random.normal(scale=1.5, size=n)
>>> data['Y2'] = np.random.binomial(n=1, p=logistic.cdf(np.exp(np.sin(x+0.5)) + np.abs(x)), size=n)
>>> data['Y3'] = np.random.poisson(lam=np.exp(np.exp(np.sin(x+0.5)) + np.abs(x)), size=n)
>>> data['C'] = 1

Note that C here is set to all 1’s. This will be the intercept in the regression. Further, notice that the relationship between X and the various Y’s is not linear.

The design matrix for linear regression would be X = np.asarray(data[['C', 'X']]). As the intercept is a constant, we only want spline terms to be applied to the 'X' column. To define the spline specifications, we create the following list

>>> specs = [None, {"knots": [-4, -3, -2, -1, 0, 1, 2, 3, 4], "penalty": 10}]

This tells ee_additive_regression to not generate a spline term for the first column in the input design matrix, and to generate a default spline with knots at the specified locations and a penalty of 10 for the second column. Internally, the design matrix processing is done by the additive_design_matrix utility function. We can see what the output of that function looks like via

>>> Xa_design = additive_design_matrix(X=np.asarray(data[['C', 'X']]), specifications=specs)

The output is the additive design matrix: the original columns plus the generated spline basis columns. Use of the additive_design_matrix utility will be demonstrated later for generating predictions from the estimated parameters.

Now we define psi, or the stacked estimating equations.

>>> def psi(theta):
>>>     x, y = data[['C', 'X']], data['Y1']
>>>     return ee_additive_regression(theta=theta, X=x, y=y, model='linear', specifications=specs)

Calling the M-estimator. Note that the number of input initial values depends on the number of splines being generated. To easily determine the number of initial values, we can use the shape of the previously generated design matrix (Xa_design)

>>> n_params = Xa_design.shape[1]
>>> estr = MEstimator(stacked_equations=psi, init=[0., ]*n_params)
>>> estr.estimate()

Inspecting the parameter estimates, variance, and confidence intervals

>>> estr.theta
>>> estr.variance
>>> estr.confidence_intervals()

While all these estimates can be easily inspected, interpreting them is not easy. Instead, we can generate predictions from the GAM and plot the estimated regression line. To do this, we will first create a new design matrix where 'X' is evenly spaced over a range of values

>>> p = pd.DataFrame()
>>> p['X'] = np.linspace(-5, 5, 200)
>>> p['C'] = 1
>>> Xa_pred = additive_design_matrix(X=np.asarray(p[['C', 'X']]), specifications=specs)

To generate the predicted values of Y (and the corresponding confidence intervals), we do the following

>>> yhat = regression_predictions(Xa_pred, theta=estr.theta, covariance=estr.variance)

For further details, see the regression_predictions utility function documentation.
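As a quick illustration, here is a minimal sketch of plotting these predictions; it assumes the n-by-4 output layout (point estimate, variance, lower confidence limit, upper confidence limit) described in the regression_predictions documentation:

>>> import matplotlib.pyplot as plt
>>> plt.fill_between(p['X'], yhat[:, 2], yhat[:, 3], alpha=0.25)   # 95% CI band
>>> plt.plot(p['X'], yhat[:, 0])                                   # estimated regression line
>>> plt.scatter(data['X'], data['Y1'], s=5, alpha=0.2)             # observed data
>>> plt.show()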

Other optional specifications are available for the spline terms. Here, we will specify an unrestricted quadratic spline with a penalty of 5.5 for the second column of the design matrix.

>>> specs = [None, {"knots": [-4, -2, 0, 2, 4], "natural": False, "power": 2, "penalty": 5.5}]

See the documentation of additive_design_matrix for additional examples of how to specify the additive design matrix and corresponding splines.

Lastly, knots could be placed at each unique observation via

>>> specs = [None, {"knots": np.unique(data['X']), "penalty": 500}]

Note that the penalty is increased here (as the number of knots has dramatically increased).

A GAM for a binary outcome (i.e., logistic regression) can be implemented as follows

>>> specs = [None, {"knots": [-4, -3, -2, -1, 0, 1, 2, 3, 4], "penalty": 10}]
>>> Xa_design = additive_design_matrix(X=np.asarray(data[['C', 'X']]), specifications=specs)
>>> n_params = Xa_design.shape[1]
>>> def psi(theta):
>>>     x, y = data[['C', 'X']], data['Y2']
>>>     return ee_additive_regression(theta=theta, X=x, y=y, model='logistic', specifications=specs)
>>> estr = MEstimator(stacked_equations=psi, init=[0.]*n_params)
>>> estr.estimate(solver='lm', maxiter=5000)
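
Predictions from this logistic GAM are on the log-odds scale. Below is a minimal sketch of converting them to probabilities with the inverse_logit utility, reusing Xa_pred from the linear example and again assuming the n-by-4 prediction layout:

>>> from delicatessen.utilities import inverse_logit
>>> yhat = regression_predictions(Xa_pred, theta=estr.theta, covariance=estr.variance)
>>> prob = inverse_logit(yhat[:, 0])       # predicted probabilities
>>> prob_lcl = inverse_logit(yhat[:, 2])   # transformed lower confidence limit
>>> prob_ucl = inverse_logit(yhat[:, 3])   # transformed upper confidence limit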

A GAM for count outcomes (i.e., Poisson regression) can be implemented as follows

>>> specs = [None, {"knots": [-4, -3, -2, -1, 0, 1, 2, 3, 4], "penalty": 10}]
>>> Xa_design = additive_design_matrix(X=np.asarray(data[['C', 'X']]), specifications=specs)
>>> n_params = Xa_design.shape[1]
>>> def psi(theta):
>>>     x, y = data[['C', 'X']], data['Y3']
>>>     return ee_additive_regression(theta=theta, X=x, y=y, model='poisson', specifications=specs)
>>> estr = MEstimator(stacked_equations=psi, init=[0.]*n_params)
>>> estr.estimate(solver='lm', maxiter=5000)

Weighted models can be estimated by specifying the optional weights argument.
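For instance, a minimal sketch of a weighted linear GAM, where 'w' is a hypothetical column of observation weights:

>>> data['w'] = np.random.uniform(0.5, 2.0, size=n)   # hypothetical weights
>>> def psi(theta):
>>>     x, y, w = data[['C', 'X']], data['Y1'], data['w']
>>>     return ee_additive_regression(theta=theta, X=x, y=y, model='linear',
>>>                                   specifications=specs, weights=w)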

References

Fu WJ. (2003). Penalized estimating equations. Biometrics, 59(1), 126-132.

Hastie TJ. (2017). Generalized additive models. In Statistical models in S (pp. 249-307). Routledge.

Marx BD, & Eilers PH. (1998). Direct generalized additive modeling with penalized likelihood. Computational Statistics & Data Analysis, 28(2), 193-209.

Wild CJ, & Yee TW. (1996). Additive extensions to generalized estimating equation methods. Journal of the Royal Statistical Society: Series B (Methodological), 58(4), 711-725.