Cameron & Trivedi (1998): Modeling Count Data

In the Cameron & Trivedi book, they are concerned with the number of recreational boating trips to Lake Somerville, Texas in 1980. A regression model is used to model how various factors (e.g., perceived facility quality, water-skiing, costs) relate to the number of trips. Here, the outcome is count data. Two competing regression models for count data are Poisson and Negative-Binomial models. Both of these models are illustrated.

Setup

[1]:

import numpy as np
import scipy as sp
import pandas as pd

import delicatessen as deli
from delicatessen import MEstimator
from delicatessen.estimating_equations import ee_glm

print("NumPy version:       ", np.__version__)
print("SciPy version:       ", sp.__version__)
print("Pandas version:      ", pd.__version__)
print("Delicatessen version:", deli.__version__)

NumPy version:        2.3.5
SciPy version:        1.16.3
Pandas version:       2.3.3
Delicatessen version: 4.2

[5]:

d = pd.read_csv("data/fishingdemand.csv")

# Converting text to numbers for NumPy
d['ski'] = np.where(d['ski'] == 'yes', 1, 0)
d['userfee'] = np.where(d['userfee'] == 'yes', 1, 0)
d['intercept'] = 1

The following code prepares the data for fitting the corresponding regression models.

[6]:

X = np.asarray(d[['intercept', 'quality', 'ski', 'income',
                  'userfee', 'costC', 'costS', 'costH']])
y = np.asarray(d['trips'])

Poisson Regression

For the first model, we will use Poisson regression. In delicatessen, the log-Poisson model can be fit using ee_regression(..., model='poisson'). However, this same model can be fit using ee_glm, which is the approach taken here. While not shown, Generalized Linear Models allow for the use of other link function (e.g., the identity function to model the differences in counts rather than the differences on the log-scale).

The following code fits the Poisson regression model.

[7]:

def psi_poisson(theta):
    return ee_glm(theta=theta, X=X, y=y, distribution='poisson', link='log')

[8]:

init_vals = [0., ] * X.shape[1]
estr = MEstimator(psi_poisson, init=init_vals)
estr.estimate()

[10]:

estr.print_results(decimals=3)

==============================================================
              Estimation Method: M-estimator
--------------------------------------------------------------
No. Observations:         659 | No. Parameters:              8
Solving algorithm:         lm | Max Iterations:           5000
Solving tolerance:      1e-09 | Allow P-Inverse:             1
Derivative Method:     approx | Deriv Approx:            1e-09
Small N Correction:      None | Distribution:           Z-stat
==============================================================
   Theta   StdErr  Z-score      LCL      UCL  P-value  S-value
--------------------------------------------------------------
   0.265    0.432    0.613   -0.583    1.113    0.540    0.889
   0.472    0.049    9.657    0.376    0.567    0.000   70.879
   0.418    0.194    2.157    0.038    0.798    0.031    5.012
  -0.111    0.050   -2.213   -0.210   -0.013    0.027    5.216
   0.898    0.247    3.638    0.414    1.382    0.000   11.828
  -0.003    0.015   -0.233   -0.032    0.025    0.815    0.294
  -0.043    0.012   -3.625   -0.066   -0.020    0.000   11.756
   0.036    0.009    3.850    0.018    0.055    0.000   13.046
==============================================================

These results are the same as those reported in the R package AER.

A concern with the previous model is over-dispersion. The Poisson model assumes that the mean is equal to the variance. Over-dispersion is when the variance is larger than the mean. This violation of an assumption of the Poisson model can lead to invalid statistical inference. Luckily the empirical sandwich variance estimator is robust to this assumption. However, there are other ways over-dispersion can be addressed.

Negative Binomial

One alternative to address over-dispersion is using the Negative-Binomial model. The Negative-Binomial model is a generalization of the Poisson model that allows the mean and variance to differ. To allow for the mean and variance to differ, the Negative-Binomial model introduces an additional parameter, referred to as the dispersion parameter.

The Negative-Binomial model is fit using ee_glm in delicatessen. One key change is that the number of parameters in this model is 1 more than the previous model. Therefore, we need to adjust init_vals accordingly. The following code applies the Negative-Binomial model.

[11]:

def psi_negbin(theta):
    return ee_glm(theta=theta, X=X, y=y, distribution='negative_binomial', link='log')

[13]:

init_vals = [0., ] * X.shape[1] + [1., ]
estr = MEstimator(psi_negbin, init=init_vals)
estr.estimate()

[14]:

estr.print_results(decimals=3)

==============================================================
              Estimation Method: M-estimator
--------------------------------------------------------------
No. Observations:         659 | No. Parameters:              9
Solving algorithm:         lm | Max Iterations:           5000
Solving tolerance:      1e-09 | Allow P-Inverse:             1
Derivative Method:     approx | Deriv Approx:            1e-09
Small N Correction:      None | Distribution:           Z-stat
==============================================================
   Theta   StdErr  Z-score      LCL      UCL  P-value  S-value
--------------------------------------------------------------
  -1.122    0.296   -3.788   -1.702   -0.541    0.000   12.685
   0.722    0.058   12.379    0.608    0.836    0.000  114.503
   0.612    0.200    3.058    0.220    1.004    0.002    8.811
  -0.026    0.055   -0.475   -0.133    0.081    0.634    0.656
   0.669    0.320    2.089    0.041    1.297    0.037    4.768
   0.048    0.028    1.697   -0.007    0.103    0.090    3.479
  -0.093    0.015   -6.389   -0.121   -0.064    0.000   32.483
   0.039    0.018    2.190    0.004    0.074    0.029    5.130
   0.316    0.135    2.340    0.051    0.580    0.019    5.697
==============================================================

The coefficients in this table match those given in AER. We see every factor appears to be related in some way besides income. Here, the final parameter in the output is for the dispersion factor. We see that it is different from 1 so there is evidence of over-dispersion in this example.

Note that delicatessen treats this as being estimated (because it is being estimated). In principle this is the preferred way to apply negative-binomial regression models. However, some software treats this extra parameter as known (or fixed) which can lead to differences in the standard error estimates.

References

Cameron AC & Trivedi PK. (1998). Regression Analysis of Count Data. Cambridge: Cambridge University Press.