delicatessen.estimating_equations.regression.ee_robust_regression

ee_robust_regression(theta, X, y, model, k, loss='huber', weights=None, upper=None, lower=None, offset=None)

Estimating equations for (unscaled) robust regression. Robust linear regression is robust to outlying observations of the outcome variable. Currently, only linear regression is supported by ee_robust_regression. The estimating equation is

\[\sum_{i=1}^n f_k(Y_i - X_i^T \theta) X_i = 0\]

where \(f_k(x)\) is the corresponding robust loss function. Options for the loss function include: Huber, Tukey’s biweight, Andrew’s Sine, and Hampel. See robust_loss_function for further details on the loss functions for the robust mean.
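As a concrete illustration, the Huber version of \(f_k\) simply bounds the residual at \(\pm k\). A minimal NumPy sketch (an illustration of the mathematical form, not the library's internal implementation) is:

```python
import numpy as np

def huber_fk(residual, k):
    # Huber's f_k: identity for |r| <= k, truncated at -k or +k otherwise
    return np.clip(residual, -k, k)

huber_fk(np.array([-5.0, -0.5, 2.0]), k=1.345)  # array([-1.345, -0.5, 1.345])
```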

Note

The estimating equation is not differentiable everywhere for some loss functions. Therefore, it is assumed that no observations occur exactly at the non-differentiable points. For truly continuous \(Y\), the probability of that occurring is zero.

Here, \(\theta\) is a 1-by-b array, which corresponds to the coefficients in the regression model, where b is the number of distinct covariates included as part of X. For example, if X is an n-by-3 matrix, then \(\theta\) will be a 1-by-3 array. The code is general to allow for an arbitrary number of elements in X.

Parameters
  • theta (ndarray, list, vector) – Theta in this case consists of b values. Therefore, the number of initial values should match the number of columns in X. This can easily be implemented via [0, ] * X.shape[1].

  • X (ndarray, list, vector) – 2-dimensional matrix of n observed values for b variables.

  • y (ndarray, list, vector) – 1-dimensional vector of n observed values.

  • model (str) – Type of regression model to estimate. Options include: 'linear' (linear regression).

  • k (int, float) – Tuning or hyperparameter for the chosen loss function. Notice that the choice of hyperparameter should depend on the chosen loss function.

  • loss (str, optional) – Robust loss function to use. Default is 'huber'. Options include 'andrew', 'hampel', 'huber', 'tukey'.

  • weights (ndarray, list, vector, None, optional) – 1-dimensional vector of n weights. Default is None, which assigns a weight of 1 to all observations.

  • lower (int, float, None, optional) – Lower parameter for the Hampel loss function. This parameter does not impact the other loss functions. Default is None.

  • upper (int, float, None, optional) – Upper parameter for the Hampel loss function. This parameter does not impact the other loss functions. Default is None.

  • offset (ndarray, list, vector, None, optional) – A 1-dimensional offset to be included in the model. Default is None, which applies no offset term.
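As noted for theta above, starting values can be generated directly from the number of columns of X. A small sketch with a hypothetical design matrix:

```python
import numpy as np

# Hypothetical design matrix: an intercept column plus two covariates
X = np.hstack([np.ones((5, 1)), np.zeros((5, 2))])

# One starting value per column of X
init = [0., ] * X.shape[1]  # [0.0, 0.0, 0.0]
```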

Returns

Returns a b-by-n NumPy array evaluated for the input theta.

Return type

array
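To make the b-by-n return shape concrete, the following sketch (assuming the Huber loss and a linear model; an illustration, not the library's internal code) computes the per-observation contributions \(f_k(Y_i - X_i^T \theta) X_i\), which the M-estimator then sums over the n axis:

```python
import numpy as np

def robust_linear_contributions(theta, X, y, k):
    # Residuals under the linear model, passed through Huber's f_k
    resid = y - X @ theta
    fk = np.clip(resid, -k, k)
    # b-by-n array: column i holds f_k(r_i) * x_i for observation i
    return fk * X.T

X = np.array([[1., 0.], [1., 1.], [1., 2.]])
y = np.array([0., 10., 2.])
robust_linear_contributions(np.zeros(2), X, y, k=1.345).shape  # (2, 3)
```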

Examples

Construction of the estimating equations with ee_robust_regression should be done similarly to the following

>>> import numpy as np
>>> import pandas as pd
>>> from delicatessen import MEstimator
>>> from delicatessen.estimating_equations import ee_robust_regression

Some generic data to estimate a robust linear regression model

>>> n = 100
>>> data = pd.DataFrame()
>>> data['X'] = np.random.normal(size=n)
>>> data['Z'] = np.random.normal(size=n)
>>> data['Y'] = 0.5 + 2*data['X'] - 1*data['Z'] + np.random.normal(loc=0, scale=3, size=n)
>>> data['C'] = 1
>>> X = data[['C', 'X', 'Z']]
>>> y = data['Y']

Note that C here is set to all 1’s. This will be the intercept in the regression.

Defining psi, or the stacked estimating equations for Huber’s robust regression

>>> def psi(theta):
>>>     return ee_robust_regression(theta=theta, X=X, y=y,
>>>                                 model='linear', k=1.345, loss='huber')

Calling the M-estimator procedure (note that init has 3 values now, since X.shape[1] is 3).

>>> estr = MEstimator(stacked_equations=psi, init=[0., 0., 0.])
>>> estr.estimate()

Inspecting the parameter estimates, variance, and confidence intervals

>>> estr.theta
>>> estr.variance
>>> estr.confidence_intervals()

Weighted models can be estimated by specifying the optional weights argument.
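Conceptually, the weights scale each observation's contribution, i.e. \(\sum_{i=1}^n w_i f_k(Y_i - X_i^T \theta) X_i = 0\). A minimal NumPy sketch of this weighted estimating function (Huber loss; illustrative only, not the library's code):

```python
import numpy as np

def weighted_robust_contributions(theta, X, y, k, weights):
    # Each observation's contribution is scaled by its weight w_i
    fk = np.clip(y - X @ theta, -k, k)
    return (weights * fk) * X.T  # b-by-n array of weighted contributions

X = np.array([[1., 0.], [1., 1.]])
y = np.array([3., -3.])
w = np.array([1., 0.])  # a weight of 0 removes the second observation entirely
weighted_robust_contributions(np.zeros(2), X, y, k=1.345, weights=w)
```

In practice, a weighted model is fit by passing the weight vector via the weights argument of ee_robust_regression inside psi.
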

References

Andrews DF. (1974). A robust method for multiple linear regression. Technometrics, 16(4), 523-531.

Beaton AE & Tukey JW (1974). The fitting of power series, meaning polynomials, illustrated on band-spectroscopic data. Technometrics, 16(2), 147-185.

Boos DD, & Stefanski LA. (2013). M-estimation (estimating equations). In Essential Statistical Inference (pp. 297-337). Springer, New York, NY.

Hampel FR. (1971). A general qualitative definition of robustness. The Annals of Mathematical Statistics, 42(6), 1887-1896.

Huber PJ. (1964). Robust Estimation of a Location Parameter. The Annals of Mathematical Statistics, 35(1), 73–101.

Huber PJ, Ronchetti EM. (2009) Robust Statistics 2nd Edition. Wiley. pgs 98-100