delicatessen.estimating_equations.regression.ee_glm

ee_glm(theta, X, y, distribution, link, hyperparameter=None, weights=None, offset=None)

Estimating equation for regression with a generalized linear model (GLM). Unlike ee_regression, this functionality supports generic distribution and link specifications. The general estimating equation for the outcome \(Y_i\) with the design matrix \(X_i\) is

\[\sum_{i=1}^n \left\{ Y_i - g^{-1}(X_i^T \theta) \right\} \times \frac{D(\theta)}{v(\theta)} X_i = 0\]

where \(g\) is the link function, \(g^{-1}\) is the inverse link function, \(D(\theta)\) is the derivative of the inverse link function with respect to \(\theta\), and \(v(\theta)\) is the variance function for the specified distribution.
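To make the pieces of this formula concrete, here is a minimal numpy sketch (an illustration only, not the library's implementation) of the per-observation contributions for the normal-distribution, identity-link case, where \(g^{-1}\) is the identity, \(D(\theta) = 1\), and \(v(\theta) = 1\):

```python
import numpy as np

def glm_score_identity_normal(theta, X, y):
    """Per-observation estimating function contributions for a
    normal-distribution, identity-link GLM (g^{-1} is the identity,
    D(theta) = 1, v(theta) = 1). Returns a b-by-n array."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    residual = y - X @ theta            # Y_i - g^{-1}(X_i^T theta)
    return (residual[:, None] * X).T    # scale each row X_i by its residual

# At the least-squares solution, the contributions sum to zero,
# which is exactly the root-finding condition of the M-estimator
X = np.array([[1., 0.], [1., 1.], [1., 2.], [1., 3.]])
y = np.array([1., 3., 2., 5.])
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(glm_score_identity_normal(theta_hat, X, y).sum(axis=1))
```

Summing the returned b-by-n array over observations and setting the result to zero is what the root-finder solves.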

Note

Some distributions (e.g., negative-binomial, gamma) involve an additional parameter. These parameters are estimated using additional parameter-specific estimating equations.

Here, \(\theta\) is a 1-by-b array, where b is the number of distinct covariates included as part of X, and the values correspond to the coefficients in the regression model. For example, if X is an n-by-3 matrix, then \(\theta\) will be a 1-by-3 array. The code is general to allow for an arbitrary number of elements in X.

Parameters
  • theta (ndarray, list, vector) – Theta in this case consists of b values. Therefore, the number of initial values should match the number of columns in X. This can easily be implemented via [0, ] * X.shape[1]. Note that the negative-binomial and gamma distributions require one additional starting value (see the Examples below).

  • X (ndarray, list, vector) – 2-dimensional vector of n observed values for b variables.

  • y (ndarray, list, vector) – 1-dimensional vector of n observed values.

  • distribution (str) – Distribution for the generalized linear model. Options are: 'normal' (alias: 'gaussian'), 'binomial' (aliases: 'bernoulli', 'bin'), 'poisson', 'gamma', 'inverse_normal' (alias: 'inverse_gaussian'), 'negative_binomial' (alias: 'nb'), and 'tweedie'.

  • link (str) – Link function for the generalized linear model. Options are: 'identity', 'log', 'logistic' (alias: 'logit'), 'probit', 'cauchit' (alias: 'cauchy'), 'loglog', 'cloglog', 'inverse', and 'square_root' (alias: 'sqrt').

  • hyperparameter (None, int, float) – Hyperparameter specification. Default is None. This option is only used by the tweedie distribution. It is ignored by all other distributions.

  • weights (ndarray, list, vector, None, optional) – 1-dimensional vector of n weights. Default is None, which assigns a weight of 1 to all observations.

  • offset (ndarray, list, vector, None, optional) – A 1-dimensional offset to be included in the model. Default is None, which applies no offset term.

Note

Link and distribution combinations are not checked for their validity. Some pairings may not converge or may produce nonsensical results. Please check that the distribution-link combination you are using is valid.

Returns

Returns a b-by-n NumPy array evaluated for the input theta.

Return type

array

Examples

Construction of the estimating equation(s) with ee_glm should be done similar to the following

>>> import numpy as np
>>> import pandas as pd
>>> from delicatessen import MEstimator
>>> from delicatessen.estimating_equations import ee_glm

Some generic data to estimate the regression models

>>> d = pd.DataFrame()
>>> d['X'] = [1, -1, 0, 1, 2, 1, -2, -1, 0, 3, -3, 1, 1, -1, -1, -2, 2, 0, -1, 0]
>>> d['Z'] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
>>> d['Y1'] = [1, 2, 4, 5, 2, 3, 1, 1, 3, 4, 2, 3, 7, 8, 2, 2, 1, 4, 2, 1]
>>> d['Y2'] = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0]
>>> d['C'] = 1
>>> X = d[['C', 'X', 'Z']]  # design matrix used hereafter

To start, we will demonstrate a GLM with a normal distribution and identity link. This GLM is equivalent to linear regression. Defining psi, or the stacked estimating equations

>>> def psi(theta):
>>>     return ee_glm(theta, X=X, y=d['Y1'],
>>>                   distribution='normal', link='identity')

Calling the M-estimator (note that init requires 3 values, since X.shape[1] is 3).

>>> estr = MEstimator(stacked_equations=psi, init=[0., 0., 0.,])
>>> estr.estimate()

Inspecting the parameter estimates, variance, and confidence intervals

>>> estr.theta
>>> estr.variance
>>> estr.confidence_intervals()

Next, we will show a GLM with a binomial distribution and log link. This GLM can be used for binary data and estimates log risk ratios. So, one may prefer this model for interpretability over logistic regression (GLM with binomial distribution and logit link).

>>> def psi(theta):
>>>     return ee_glm(theta, X=X, y=d['Y2'],
>>>                   distribution='binomial', link='log')

Calling the M-estimator (note that init requires 3 values, since X.shape[1] is 3).

>>> estr = MEstimator(stacked_equations=psi, init=[-0.9, 0., 0.,])
>>> estr.estimate()

Notice that the root-finding procedure may wander off to implausible values if poor starting values are given for the log-binomial GLM. This is because the predicted probabilities from the log-binomial GLM are not bounded. Providing starting values close to the truth (or changing link functions) can help alleviate these issues. Bounded alternatives for the binomial distribution include the 'logit', 'cauchy', 'probit', and 'loglog' links.
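The boundedness issue can be seen directly from the inverse link functions themselves. The following numpy sketch (illustrative only, independent of ee_glm) shows that the inverse logit maps any linear predictor into (0, 1), whereas the inverse log link can produce values above 1, which are invalid probabilities:

```python
import numpy as np

eta = np.linspace(-3., 3., 7)       # a range of linear predictor values

inv_logit = 1 / (1 + np.exp(-eta))  # inverse logit: always within (0, 1)
inv_log = np.exp(eta)               # inverse log link: unbounded above

print(inv_logit.max())              # strictly below 1
print(inv_log.max())                # exceeds 1, an invalid probability
```

This is why a GLM with a bounded inverse link tends to be numerically better behaved, even though the log link is what yields log risk ratios.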

The negative-binomial and gamma distributions for GLM have an additional parameter that is estimated. Therefore, both of these distribution specifications require X.shape[1] + 1 input starting values. Here, we illustrate a gamma distribution and log link GLM

>>> def psi(theta):
>>>     return ee_glm(theta, X=X, y=d['Y1'],
>>>                   distribution='gamma', link='log')

Calling the M-estimator (note that init requires 4 values, since X.shape[1] is 3 and the gamma distribution has an additional parameter).

>>> estr = MEstimator(stacked_equations=psi, init=[0., 0., 0., 0.])
>>> estr.estimate()

Note that delicatessen appropriately incorporates the estimation of the additional parameter for the negative-binomial and gamma distributions. This is unlike some statistical software, which estimates this parameter but does not incorporate the uncertainty from its estimation. This may explain differences you encounter across software (the delicatessen implementation is preferable, as it more honestly expresses the uncertainty).
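The mechanism behind this propagation of uncertainty is the sandwich variance applied to the full stack of estimating equations. As a toy sketch (not delicatessen's internals), the following stacks a mean and a variance estimating equation and computes the sandwich by hand; because both parameters sit in one stack, the variance of each estimate accounts for the other's estimation:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.5, size=500)
n = y.shape[0]

# Closed-form roots of the two stacked estimating equations
mu_hat = y.mean()
var_hat = ((y - mu_hat) ** 2).mean()
theta = np.array([mu_hat, var_hat])

# Stacked estimating functions evaluated at theta: a 2-by-n array,
# analogous to the b-by-n array returned by ee_glm
ee = np.vstack([y - theta[0],
                (y - theta[0]) ** 2 - theta[1]])

# Bread: minus the averaged derivative of ee with respect to theta
bread = np.array([[1.0, 0.0],
                  [2 * (y - theta[0]).mean(), 1.0]])
# Meat: averaged outer product of the stacked contributions
meat = (ee @ ee.T) / n
sandwich = np.linalg.inv(bread) @ meat @ np.linalg.inv(bread).T / n
print(sandwich[0, 0])  # variance estimate for the mean parameter
```

delicatessen builds the same bread-and-meat construction automatically from whatever equations are stacked, which is how the extra negative-binomial or gamma parameter enters the variance of the regression coefficients.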

Finally, the tweedie distribution for GLM is a generalization of the Poisson and gamma distributions. Unlike the negative-binomial and gamma distributions, it has a fixed (i.e., not estimated) hyperparameter bounded to be \(>0\). Setting the tweedie hyperparameter to 1 recovers the Poisson distribution, setting it to 2 recovers the gamma distribution, and setting it to 3 recovers the inverse-normal distribution. However, the hyperparameter can also be set to any other positive value. Here, we illustrate a tweedie distribution that is between a Poisson and gamma distribution.

>>> def psi(theta):
>>>     return ee_glm(theta, X=X, y=d['Y1'],
>>>                   distribution='tweedie', link='log',
>>>                   hyperparameter=1.5)

Calling the M-estimator (note that init requires 3 values, since X.shape[1] is 3).

>>> estr = MEstimator(stacked_equations=psi, init=[0., 0., 0.])
>>> estr.estimate()

Notice that the tweedie distribution does not estimate an additional parameter, unlike the gamma distribution GLM described previously.
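The equivalences above can be seen through the Tweedie variance function, \(v(\mu) = \mu^p\), where \(p\) is the hyperparameter (a standard fact about the Tweedie family, sketched here in plain numpy rather than via ee_glm):

```python
import numpy as np

def tweedie_variance(mu, p):
    """Tweedie variance function v(mu) = mu ** p for power parameter p."""
    return np.asarray(mu, dtype=float) ** p

mu = np.array([0.5, 1.0, 2.0, 4.0])
print(tweedie_variance(mu, 1.0))  # Poisson: variance equals the mean
print(tweedie_variance(mu, 2.0))  # gamma: variance proportional to mean squared
print(tweedie_variance(mu, 1.5))  # the in-between case used in the example above
```

Because \(p\) only rescales the variance function, there is no extra parameter to solve for, consistent with the three starting values used in the tweedie example.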

References

Boos DD, & Stefanski LA. (2013). M-estimation (estimating equations). In Essential Statistical Inference (pp. 297-337). Springer, New York, NY.

Hilbe JM. (2011). Negative Binomial Regression. Cambridge University Press.

Nakashima E. (1997). Some methods for estimation in a Negative-Binomial model. Annals of the Institute of Statistical Mathematics, 49, 101-115.