{ "cells": [ { "metadata": { "collapsed": true }, "cell_type": "markdown", "source": [ "# Cameron & Trivedi (1998): Modeling Count Data\n", "\n", "In the Cameron & Trivedi book, they are concerned with the number of recreational boating trips to Lake Somerville, Texas in 1980. A regression model is used to model how various factors (e.g., perceived facility quality, water-skiing, costs) relate to the number of trips. Here, the outcome is count data. Two competing regression models for count data are Poisson and Negative-Binomial models. Both of these models are illustrated.\n", "\n", "## Setup" ], "id": "d8533290b67e5dfe" }, { "metadata": { "ExecuteTime": { "end_time": "2026-04-10T12:15:22.329539100Z", "start_time": "2026-04-10T12:15:21.610004Z" } }, "cell_type": "code", "source": [ "import numpy as np\n", "import scipy as sp\n", "import pandas as pd\n", "\n", "import delicatessen as deli\n", "from delicatessen import MEstimator\n", "from delicatessen.estimating_equations import ee_glm\n", "\n", "print(\"NumPy version: \", np.__version__)\n", "print(\"SciPy version: \", sp.__version__)\n", "print(\"Pandas version: \", pd.__version__)\n", "print(\"Delicatessen version:\", deli.__version__)" ], "id": "888e56cad8f9d332", "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "NumPy version: 2.3.5\n", "SciPy version: 1.16.3\n", "Pandas version: 2.3.3\n", "Delicatessen version: 4.2\n" ] } ], "execution_count": 1 }, { "metadata": { "ExecuteTime": { "end_time": "2026-04-10T12:18:28.796392800Z", "start_time": "2026-04-10T12:18:28.777009400Z" } }, "cell_type": "code", "source": [ "d = pd.read_csv(\"data/fishingdemand.csv\")\n", "\n", "# Converting text to numbers for NumPy\n", "d['ski'] = np.where(d['ski'] == 'yes', 1, 0)\n", "d['userfee'] = np.where(d['userfee'] == 'yes', 1, 0)\n", "d['intercept'] = 1" ], "id": "a9ad6957b2123609", "outputs": [], "execution_count": 5 }, { "metadata": {}, "cell_type": "markdown", "source": "The following code prepares the data for fitting the corresponding regression models.", "id": "4df44ad6368bd8b1" }, { "metadata": { "ExecuteTime": { "end_time": "2026-04-10T12:19:39.326102200Z", "start_time": "2026-04-10T12:19:39.295753600Z" } }, "cell_type": "code", "source": [ "X = np.asarray(d[['intercept', 'quality', 'ski', 'income',\n", " 'userfee', 'costC', 'costS', 'costH']])\n", "y = np.asarray(d['trips'])" ], "id": "9a85665c3d33ef29", "outputs": [], "execution_count": 6 }, { "metadata": {}, "cell_type": "markdown", "source": [ "## Poisson Regression\n", "\n", "For the first model, we will use Poisson regression. In `delicatessen`, the log-Poisson model can be fit using `ee_regression(..., model='poisson')`. However, this same model can be fit using `ee_glm`, which is the approach taken here. While not shown, Generalized Linear Models allow for the use of other link function (e.g., the identity function to model the differences in counts rather than the differences on the log-scale).\n", "\n", "The following code fits the Poisson regression model." ], "id": "52a18586a5298271" }, { "metadata": { "ExecuteTime": { "end_time": "2026-04-10T12:19:52.858852600Z", "start_time": "2026-04-10T12:19:52.838872200Z" } }, "cell_type": "code", "source": [ "def psi_poisson(theta):\n", " return ee_glm(theta=theta, X=X, y=y, distribution='poisson', link='log')" ], "id": "5b4f78659464491f", "outputs": [], "execution_count": 7 }, { "metadata": { "ExecuteTime": { "end_time": "2026-04-10T12:20:22.758778900Z", "start_time": "2026-04-10T12:20:22.720730900Z" } }, "cell_type": "code", "source": [ "init_vals = [0., ] * X.shape[1]\n", "estr = MEstimator(psi_poisson, init=init_vals)\n", "estr.estimate()" ], "id": "16a2f9b004a209a9", "outputs": [], "execution_count": 8 }, { "metadata": { "ExecuteTime": { "end_time": "2026-04-10T12:20:39.149455200Z", "start_time": "2026-04-10T12:20:39.118435700Z" } }, "cell_type": "code", "source": "estr.print_results(decimals=3)", "id": "3cb7f196af4c8b9d", "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "==============================================================\n", " Estimation Method: M-estimator\n", "--------------------------------------------------------------\n", "No. Observations: 659 | No. Parameters: 8\n", "Solving algorithm: lm | Max Iterations: 5000\n", "Solving tolerance: 1e-09 | Allow P-Inverse: 1\n", "Derivative Method: approx | Deriv Approx: 1e-09\n", "Small N Correction: None | Distribution: Z-stat\n", "==============================================================\n", " Theta StdErr Z-score LCL UCL P-value S-value \n", "--------------------------------------------------------------\n", " 0.265 0.432 0.613 -0.583 1.113 0.540 0.889 \n", " 0.472 0.049 9.657 0.376 0.567 0.000 70.879 \n", " 0.418 0.194 2.157 0.038 0.798 0.031 5.012 \n", " -0.111 0.050 -2.213 -0.210 -0.013 0.027 5.216 \n", " 0.898 0.247 3.638 0.414 1.382 0.000 11.828 \n", " -0.003 0.015 -0.233 -0.032 0.025 0.815 0.294 \n", " -0.043 0.012 -3.625 -0.066 -0.020 0.000 11.756 \n", " 0.036 0.009 3.850 0.018 0.055 0.000 13.046 \n", "==============================================================\n" ] } ], "execution_count": 10 }, { "metadata": {}, "cell_type": "markdown", "source": [ "These results are the same as those reported in the R package `AER`.\n", "\n", "A concern with the previous model is over-dispersion. The Poisson model assumes that the mean is equal to the variance. Over-dispersion is when the variance is larger than the mean. This violation of an assumption of the Poisson model can lead to invalid statistical inference. Luckily the empirical sandwich variance estimator is robust to this assumption. However, there are other ways over-dispersion can be addressed.\n", "\n", "## Negative Binomial\n", "\n", "One alternative to address over-dispersion is using the Negative-Binomial model. The Negative-Binomial model is a generalization of the Poisson model that allows the mean and variance to differ. To allow for the mean and variance to differ, the Negative-Binomial model introduces an additional parameter, referred to as the dispersion parameter.\n", "\n", "The Negative-Binomial model is fit using `ee_glm` in `delicatessen`. One key change is that the number of parameters in this model is 1 more than the previous model. Therefore, we need to adjust `init_vals` accordingly. The following code applies the Negative-Binomial model." ], "id": "2a8938488cd1b7c4" }, { "metadata": { "ExecuteTime": { "end_time": "2026-04-10T12:23:27.392377700Z", "start_time": "2026-04-10T12:23:27.377642900Z" } }, "cell_type": "code", "source": [ "def psi_negbin(theta):\n", " return ee_glm(theta=theta, X=X, y=y, distribution='negative_binomial', link='log')" ], "id": "28861e97a3d264f8", "outputs": [], "execution_count": 11 }, { "metadata": { "ExecuteTime": { "end_time": "2026-04-10T12:24:09.368696800Z", "start_time": "2026-04-10T12:24:08.866541100Z" } }, "cell_type": "code", "source": [ "init_vals = [0., ] * X.shape[1] + [1., ]\n", "estr = MEstimator(psi_negbin, init=init_vals)\n", "estr.estimate()" ], "id": "3b64e22dc1217af0", "outputs": [], "execution_count": 13 }, { "metadata": { "ExecuteTime": { "end_time": "2026-04-10T12:24:10.994581800Z", "start_time": "2026-04-10T12:24:10.965017300Z" } }, "cell_type": "code", "source": "estr.print_results(decimals=3)", "id": "e2abaff655f2b3a3", "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "==============================================================\n", " Estimation Method: M-estimator\n", "--------------------------------------------------------------\n", "No. Observations: 659 | No. Parameters: 9\n", "Solving algorithm: lm | Max Iterations: 5000\n", "Solving tolerance: 1e-09 | Allow P-Inverse: 1\n", "Derivative Method: approx | Deriv Approx: 1e-09\n", "Small N Correction: None | Distribution: Z-stat\n", "==============================================================\n", " Theta StdErr Z-score LCL UCL P-value S-value \n", "--------------------------------------------------------------\n", " -1.122 0.296 -3.788 -1.702 -0.541 0.000 12.685 \n", " 0.722 0.058 12.379 0.608 0.836 0.000 114.503 \n", " 0.612 0.200 3.058 0.220 1.004 0.002 8.811 \n", " -0.026 0.055 -0.475 -0.133 0.081 0.634 0.656 \n", " 0.669 0.320 2.089 0.041 1.297 0.037 4.768 \n", " 0.048 0.028 1.697 -0.007 0.103 0.090 3.479 \n", " -0.093 0.015 -6.389 -0.121 -0.064 0.000 32.483 \n", " 0.039 0.018 2.190 0.004 0.074 0.029 5.130 \n", " 0.316 0.135 2.340 0.051 0.580 0.019 5.697 \n", "==============================================================\n" ] } ], "execution_count": 14 }, { "metadata": {}, "cell_type": "markdown", "source": [ "The coefficients in this table match those given in `AER`. We see every factor appears to be related in some way besides `income`. Here, the final parameter in the output is for the dispersion factor. We see that it is different from `1` so there is evidence of over-dispersion in this example.\n", "\n", "Note that `delicatessen` treats this as being estimated (because it is being estimated). In principle this is the preferred way to apply negative-binomial regression models. However, some software treats this extra parameter as known (or fixed) which can lead to differences in the standard error estimates.\n", "\n", "## References\n", "\n", "Cameron AC & Trivedi PK. (1998). *Regression Analysis of Count Data*. Cambridge: Cambridge University Press." ], "id": "88f80a9180101217" } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.6" } }, "nbformat": 4, "nbformat_minor": 5 }