{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "e80fd6f2-d4d9-45f9-8ff8-7dd36792bd09",
   "metadata": {},
   "source": [
    "# Mroz 1987: Regression and Instrumental Variables\n",
    "\n",
    "Mroz (1987) explored the (mis)specification of statistical models using 1975 labor data on married women. These data are used for a number of examples in the book *Econometric Analysis of Cross Section and Panel Data* by Jeffrey Wooldridge. Here, some of those examples are replicated using `delicatessen`.\n",
    "\n",
    "## Setup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "cbf368ec-24f4-4231-afe0-503d5403446c",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Versions\n",
      "NumPy:         2.3.5\n",
      "SciPy:         1.16.3\n",
      "pandas:        2.3.3\n",
      "Delicatessen:  4.1\n"
     ]
    }
   ],
   "source": [
    "import numpy as np\n",
    "import scipy as sp\n",
    "import pandas as pd\n",
    "\n",
    "import delicatessen as deli\n",
    "from delicatessen import MEstimator\n",
    "from delicatessen.estimating_equations import ee_2sls, ee_regression\n",
    "\n",
    "print(\"Versions\")\n",
    "print(\"NumPy:        \", np.__version__)\n",
    "print(\"SciPy:        \", sp.__version__)\n",
    "print(\"pandas:       \", pd.__version__)\n",
    "print(\"Delicatessen: \", deli.__version__)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "c02095fd-fc78-41d1-8ecb-96bd581747dd",
   "metadata": {},
   "outputs": [],
   "source": [
    "d = pd.read_csv('data/mroz.csv').dropna()\n",
    "d['intercept'] = 1"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a8fd56e7-d2c4-4136-9ed0-170268c35e0c",
   "metadata": {},
   "source": [
    "## Chapter 4: The Single-Equation Linear Model and OLS Estimation\n",
    "\n",
    "Here, a simple linear model for the log-transformed wage (`lwage`) is fit as a function of labor market experience (`exper`), experience squared (`expersq`), years of schooling (`educ`), age (`age`), number of kids younger than 6 years old (`kidslt6`), and number of kids 6-18 years old (`kidsge6`). Fitting this model is easily done using the built-in `ee_regression` functionality."
   ]
  },
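  {
   "cell_type": "markdown",
   "id": "ols-estimating-equations-sketch",
   "metadata": {},
   "source": [
    "As a brief sketch of what is solved in this section (the notation here is ours and is not taken from the book), let $Y_i$ denote `lwage` and $X_i$ the vector of design matrix columns, including the intercept, for observation $i$. For the linear model, the M-estimator is the $\\hat{\\beta}$ that solves\n",
    "\n",
    "$$\\sum_{i=1}^{n} (Y_i - X_i^T \\beta) X_i = 0$$\n",
    "\n",
    "This is the usual least squares score equation, so the point estimates coincide with OLS. The variance is estimated by the empirical sandwich variance estimator, which for this model takes the form\n",
    "\n",
    "$$\\widehat{Var}(\\hat{\\beta}) = \\frac{1}{n} B_n^{-1} F_n B_n^{-1}, \\quad B_n = \\frac{1}{n} \\sum_{i=1}^{n} X_i X_i^T, \\quad F_n = \\frac{1}{n} \\sum_{i=1}^{n} (Y_i - X_i^T \\hat{\\beta})^2 X_i X_i^T$$"
   ]
  },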
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "bd17e16a-cdd6-4933-83ad-c9e2e3f5a5ac",
   "metadata": {},
   "outputs": [],
   "source": [
    "design_matrix = ['intercept', 'exper', 'expersq', 'educ', 'age', 'kidslt6', 'kidsge6']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "2c8b0808-5515-404b-bcfa-e31afd6dd774",
   "metadata": {},
   "outputs": [],
   "source": [
    "def psi_lm(theta):\n",
    "    return ee_regression(theta, \n",
    "                         X=d[design_matrix], \n",
    "                         y=d['lwage'], \n",
    "                         model='linear')  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "dea08961-bc9d-4031-933b-a879c5ed5e54",
   "metadata": {},
   "outputs": [],
   "source": [
    "estr = MEstimator(psi_lm, init=[0., ]*7)\n",
    "estr.estimate()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "83fa36e1-cf7e-4480-a1f8-399a950e01c0",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Est</th>\n",
       "      <th>SE</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>label</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>intercept</th>\n",
       "      <td>-0.421</td>\n",
       "      <td>0.316</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>exper</th>\n",
       "      <td>0.040</td>\n",
       "      <td>0.015</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>expersq</th>\n",
       "      <td>-0.001</td>\n",
       "      <td>0.000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>educ</th>\n",
       "      <td>0.108</td>\n",
       "      <td>0.014</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>age</th>\n",
       "      <td>-0.001</td>\n",
       "      <td>0.006</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>kidslt6</th>\n",
       "      <td>-0.061</td>\n",
       "      <td>0.105</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>kidsge6</th>\n",
       "      <td>-0.015</td>\n",
       "      <td>0.029</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "             Est     SE\n",
       "label                  \n",
       "intercept -0.421  0.316\n",
       "exper      0.040  0.015\n",
       "expersq   -0.001  0.000\n",
       "educ       0.108  0.014\n",
       "age       -0.001  0.006\n",
       "kidslt6   -0.061  0.105\n",
       "kidsge6   -0.015  0.029"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "r = pd.DataFrame()\n",
    "r['label'] = design_matrix\n",
    "r['Est'] = estr.theta\n",
    "r['SE'] = np.diag(estr.variance)**0.5\n",
    "r.set_index('label').round(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8d7dd412-2e6b-4075-a511-c79251b9842e",
   "metadata": {},
   "source": [
    "These results match those provided in the book. Note that the standard error (SE) reported here corresponds to the heteroskedasticity-robust standard error reported in the book.\n",
    "\n",
    "## Chapter 5: Instrumental Variables Estimation of Single-Equation Linear Models\n",
    "\n",
    "The next chapter considers instrumental variable estimation using two-stage least squares (2SLS). In particular, we are interested in the effect of education (`educ`) on log-transformed wages (`lwage`). Here, we will account for labor market experience (`exper` and `expersq`) in both stages. The instruments in this setting are mother's education (`motheduc`), father's education (`fatheduc`), and husband's education (`huseduc`).\n",
    "\n",
    "We will apply the 2SLS estimator using the built-in `ee_2sls` function."
   ]
  },
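  {
   "cell_type": "markdown",
   "id": "tsls-estimating-equations-sketch",
   "metadata": {},
   "source": [
    "The estimating equations behind 2SLS with a single action variable can be sketched as follows. The notation is ours and is an assumption about presentation, not a quote of the `ee_2sls` documentation. Let $A_i$ be the action (`educ`), $Z_i$ the instruments, $W_i$ the covariates including the intercept, and $R_i = (Z_i^T, W_i^T)^T$. The first stage regresses the action on the instruments and covariates,\n",
    "\n",
    "$$\\sum_{i=1}^{n} (A_i - R_i^T \\delta) R_i = 0$$\n",
    "\n",
    "and the second stage replaces $A_i$ with its predicted value, $\\hat{A}_i = R_i^T \\delta$,\n",
    "\n",
    "$$\\sum_{i=1}^{n} (Y_i - \\beta \\hat{A}_i - W_i^T \\alpha) (\\hat{A}_i, W_i^T)^T = 0$$\n",
    "\n",
    "Solving both sets of equations jointly lets the sandwich variance estimator carry the uncertainty in $\\delta$ through to the SE for $\\beta$."
   ]
  },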
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "b1f1d3ac-1b4e-4761-8d0f-0f06463ebe68",
   "metadata": {},
   "outputs": [],
   "source": [
    "def psi_2sls(theta):\n",
    "    return ee_2sls(theta,\n",
    "                   y=d['lwage'],\n",
    "                   A=d['educ'],\n",
    "                   Z=d[['motheduc', 'fatheduc', 'huseduc']],\n",
    "                   W=d[['intercept', 'exper', 'expersq']])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "6ca6e04e-922a-4147-b748-d80f06d484ad",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Starting values: 1 for educ + 3 for W (second stage) + 3*2 for Z and W (first stage)\n",
    "init_vals = [0., ] + [0., ]*3 + [0., ]*3*2\n",
    "estr = MEstimator(psi_2sls, init=init_vals)\n",
    "estr.estimate()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "9ad66d25-c043-4eee-9595-e35a73e40a3e",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Est</th>\n",
       "      <th>SE</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>label</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>educ</th>\n",
       "      <td>0.080</td>\n",
       "      <td>0.022</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>intercept</th>\n",
       "      <td>-0.187</td>\n",
       "      <td>0.300</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>exper</th>\n",
       "      <td>0.043</td>\n",
       "      <td>0.015</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>expersq</th>\n",
       "      <td>-0.001</td>\n",
       "      <td>0.000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "             Est     SE\n",
       "label                  \n",
       "educ       0.080  0.022\n",
       "intercept -0.187  0.300\n",
       "exper      0.043  0.015\n",
       "expersq   -0.001  0.000"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "r = pd.DataFrame()\n",
    "r['label'] = ['educ', 'intercept', 'exper', 'expersq']\n",
    "r['Est'] = estr.theta[:4]\n",
    "r['SE'] = np.diag(estr.variance)[:4]**0.5\n",
    "r.set_index('label').round(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "81ebff6c-1acf-4d3f-93be-2257a3f70f46",
   "metadata": {},
   "source": [
    "Again, these match the output provided in the book. Note that `ee_2sls` orders its output slightly differently from the book: the coefficient for the action variable (`educ`) comes first, followed by the coefficients for the covariates in `W`.\n",
    "\n",
    "## Example from `OneSampleMR`\n",
    "\n",
    "As a final use of the Mroz data, we replicate the example with two action variables (education and labor market experience) from the `OneSampleMR` documentation, provided [here](https://remlapmot.github.io/OneSampleMR/articles/f-statistic-comparison.html). In this case, age and the numbers of kids (`kidslt6` and `kidsge6`) serve as the instruments.\n",
    "\n",
    "Currently, `ee_2sls` does not allow for multiple action variables. Therefore, we instead code up the 2SLS estimator using the basic regression functions. Briefly, we fit two models in the first stage (one for `educ` and one for `exper`). Using the predicted values from these models, we then fit the second-stage model for `lwage`."
   ]
  },
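  {
   "cell_type": "markdown",
   "id": "two-action-tsls-equations-sketch",
   "metadata": {},
   "source": [
    "The stacked estimating equations for this setup can be sketched as follows, in our own notation chosen to mirror the parameter names used in the code. Let $A_{1i}$ be `educ`, $A_{2i}$ be `exper`, and $Z_i$ the instrument design vector (intercept, `age`, `kidslt6`, `kidsge6`). The two first-stage models are\n",
    "\n",
    "$$\\sum_{i=1}^{n} (A_{1i} - Z_i^T \\alpha) Z_i = 0, \\quad \\sum_{i=1}^{n} (A_{2i} - Z_i^T \\beta) Z_i = 0$$\n",
    "\n",
    "and the second-stage model uses the predicted values in place of the observed action variables,\n",
    "\n",
    "$$\\sum_{i=1}^{n} (Y_i - \\hat{X}_i^T \\gamma) \\hat{X}_i = 0, \\quad \\hat{X}_i = (1, Z_i^T \\alpha, Z_i^T \\beta)^T$$\n",
    "\n",
    "so $\\gamma$ holds the intercept, the coefficient for `educ`, and the coefficient for `exper`, in that order. This gives 3 + 4 + 4 = 11 parameters in total."
   ]
  },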
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "35a7de3b-cd3f-4e59-a342-b23826e10b81",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Instrument design matrix\n",
    "Z = d[['intercept', 'age', 'kidslt6', 'kidsge6']]"
   ]
  },
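  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "c311e32d-9aa5-42d5-b1ac-43280be128da",
   "metadata": {},
   "outputs": [],
   "source": [
    "def psi(theta):\n",
    "    # Second-stage parameters, ordered as the columns of Xhat: intercept, educ, exper\n",
    "    gamma = theta[:3]\n",
    "    # First-stage parameters for educ (one per column of Z)\n",
    "    alpha = theta[3:3+4]\n",
    "    # First-stage parameters for exper (one per column of Z)\n",
    "    beta = theta[3+4:]\n",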
    "    \n",
    "    # First-stage regression for education\n",
    "    ee_s1a = ee_regression(alpha, X=Z, y=d['educ'], \n",
    "                           model='linear')  \n",
    "    a_hat = np.dot(Z, alpha)\n",
    "\n",
    "    # First-stage regression for experience\n",
    "    ee_s1b = ee_regression(beta, X=Z, y=d['exper'], \n",
    "                           model='linear')\n",
    "    b_hat = np.dot(Z, beta)\n",
    "\n",
    "    # Second-stage regression for log(wage)\n",
    "    Xhat = np.c_[np.asarray(d['intercept']), a_hat, b_hat]\n",
    "    ee_2s = ee_regression(gamma, X=Xhat, y=d['lwage'], \n",
    "                           model='linear')\n",
    "\n",
    "    return np.vstack([ee_2s, ee_s1a, ee_s1b])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "de6f5f61-2fa1-4206-8daa-e5e450870183",
   "metadata": {},
   "outputs": [],
   "source": [
    "init_vals = [0., ]*3 + [0., ]*4*2\n",
    "estr = MEstimator(psi, init=init_vals)\n",
    "estr.estimate()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "ab408fed-a3e8-4b32-be83-3bef0e9cf853",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Est</th>\n",
       "      <th>SE</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>label</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>intercept</th>\n",
       "      <td>-0.360</td>\n",
       "      <td>1.105</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>educ</th>\n",
       "      <td>0.106</td>\n",
       "      <td>0.088</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>exper</th>\n",
       "      <td>0.016</td>\n",
       "      <td>0.008</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "             Est     SE\n",
       "label                  \n",
       "intercept -0.360  1.105\n",
       "educ       0.106  0.088\n",
       "exper      0.016  0.008"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "r = pd.DataFrame()\n",
    "r['label'] = ['intercept', 'educ', 'exper']\n",
    "r['Est'] = estr.theta[:3]\n",
    "r['SE'] = np.diag(estr.variance)[:3]**0.5\n",
    "r.set_index('label').round(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "abc76d5d-0068-4bb6-b495-e0f4633176a2",
   "metadata": {},
   "source": [
    "These outputs mostly match those provided in the documentation. However, note that the SEs differ because a different variance estimator is used here.\n",
    "\n",
    "## References\n",
    "\n",
    "Mroz, T. A. (1987). The sensitivity of an empirical model of married women's hours of work to economic and statistical assumptions. *Econometrica* 55(4), 765-799.\n",
    "\n",
    "Wooldridge, J. M. (2010). *Econometric analysis of cross section and panel data*. MIT Press."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}