Mroz 1987: Regression and Instrumental Variables

Mroz (1987) explored (mis)specification of statistical models using 1975 labor data on married women. These data were used for a number of examples in the book Econometric Analysis of Cross Section and Panel Data by Jeffrey Wooldridge. Some of those examples are replicated here.

Setup

[1]:
import numpy as np
import scipy as sp
import pandas as pd

import delicatessen as deli
from delicatessen import MEstimator
from delicatessen.estimating_equations import ee_2sls, ee_regression

print("Versions")
print("NumPy:        ", np.__version__)
print("SciPy:        ", sp.__version__)
print("pandas:       ", pd.__version__)
print("Delicatessen: ", deli.__version__)
Versions
NumPy:         2.3.5
SciPy:         1.16.3
pandas:        2.3.3
Delicatessen:  4.1
[2]:
d = pd.read_csv('data/mroz.csv').dropna()
d['intercept'] = 1

Chapter 4: The Single-Equation Linear Model and OLS Estimation

Here, a simple model for the log-transformed wage (lwage) is fit as a function of labor market experience (exper) and its square (expersq), years of schooling (educ), age (age), number of kids under 6 years old (kidslt6), and number of kids 6-18 years old (kidsge6). Fitting this model is easily done using the built-in ee_regression functionality.

[3]:
design_matrix = ['intercept', 'exper', 'expersq', 'educ', 'age', 'kidslt6', 'kidsge6']
[4]:
def psi_lm(theta):
    return ee_regression(theta,
                         X=d[design_matrix],
                         y=d['lwage'],
                         model='linear')
[5]:
estr = MEstimator(psi_lm, init=[0., ]*7)
estr.estimate()
[6]:
r = pd.DataFrame()
r['label'] = design_matrix
r['Est'] = estr.theta
r['SE'] = np.diag(estr.variance)**0.5
r.set_index('label').round(3)
[6]:
             Est     SE
label
intercept -0.421  0.316
exper      0.040  0.015
expersq   -0.001  0.000
educ       0.108  0.014
age       -0.001  0.006
kidslt6   -0.061  0.105
kidsge6   -0.015  0.029

These results match those provided in the book. Note that the reported standard error (SE) here corresponds to the heteroskedasticity-robust standard error reported in the book.
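To make the robust standard error concrete, the following is a minimal NumPy sketch of the sandwich variance for a linear model. It uses simulated data rather than the Mroz data, all variable names are illustrative, and it assumes the familiar HC0 form of the estimator; it is not meant to reproduce delicatessen's internals.

```python
import numpy as np

# Simulated data (hypothetical, NOT the Mroz data) with heteroskedastic errors
rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 0.5 * x + rng.normal(size=n) * (1 + np.abs(x))

# Point estimates solve the estimating equation sum_i x_i (y_i - x_i' beta) = 0
beta = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta

# Sandwich variance: bread^{-1} meat bread^{-T} / n
bread = X.T @ X / n                          # average outer product of x_i
meat = (X * resid[:, None]**2).T @ X / n     # average of e_i^2 * x_i x_i'
bread_inv = np.linalg.inv(bread)
sandwich = bread_inv @ meat @ bread_inv.T / n
se_robust = np.sqrt(np.diag(sandwich))

# Model-based (homoskedastic) standard errors for comparison
sigma2 = resid @ resid / (n - X.shape[1])
se_model = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))

print("Estimates:", beta.round(3))
print("Robust SE:", se_robust.round(3))
print("Model SE: ", se_model.round(3))
```

The robust and model-based standard errors generally differ when the error variance depends on the covariates, as it does in this simulation.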

Chapter 5: Instrumental Variables Estimation of Single-Equation Linear Models

The next chapter considers instrumental variable estimation using 2-stage least squares (2SLS). In particular, we are interested in the effect of education (educ) on log-transformed wages (lwage). Here, we will account for labor market experience (exper) and its square (expersq) in both stages. The instruments in this setting are mother’s education (motheduc), father’s education (fatheduc), and husband’s education (huseduc).

We will apply the 2SLS estimator using the built-in ee_2sls function.

[7]:
def psi_2sls(theta):
    return ee_2sls(theta,
                   y=d['lwage'],
                   A=d['educ'],
                   Z=d[['motheduc', 'fatheduc', 'huseduc']],
                   W=d[['intercept', 'exper', 'expersq']])
[8]:
init_vals = [0., ] + [0., ]*3 + [0., ]*3*2
estr = MEstimator(psi_2sls, init=init_vals)
estr.estimate()
[9]:
r = pd.DataFrame()
r['label'] = ['educ', 'intercept', 'exper', 'expersq']
r['Est'] = estr.theta[:4]
r['SE'] = np.diag(estr.variance)[:4]**0.5
r.set_index('label').round(3)
[9]:
             Est     SE
label
educ       0.080  0.022
intercept -0.187  0.300
exper      0.043  0.015
expersq   -0.001  0.000

Again, these match the output provided in the book. Note that the order of the output of ee_2sls differs slightly from the book: the coefficient for educ comes first, followed by the coefficients for the covariates in W.
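For intuition about what the two stages do, here is a minimal NumPy sketch of 2SLS with a single action variable. The data are simulated (not the Mroz data), the variable names only mirror the roles of A, Z, and W above, and the coefficient values are arbitrary assumptions chosen for illustration.

```python
import numpy as np

# Simulated data (hypothetical): u is an unmeasured confounder of A and y
rng = np.random.default_rng(1)
n = 5000
W = np.column_stack([np.ones(n), rng.normal(size=n)])   # intercept + covariate
Z = rng.normal(size=(n, 3))                              # three instruments
u = rng.normal(size=n)
A = Z @ np.array([1.0, 0.8, 0.6]) + 0.4 * W[:, 1] + u + rng.normal(size=n)
y = 0.1 * A + W @ np.array([-0.2, 0.05]) + u + rng.normal(size=n)

# Stage 1: regress the action variable on the instruments and covariates
ZW = np.column_stack([Z, W])
delta = np.linalg.lstsq(ZW, A, rcond=None)[0]
A_hat = ZW @ delta

# Stage 2: regress the outcome on the predicted action and the covariates
X2 = np.column_stack([A_hat, W])
beta_2sls = np.linalg.lstsq(X2, y, rcond=None)[0]

# Naive OLS, which ignores the confounding by u
beta_ols = np.linalg.lstsq(np.column_stack([A, W]), y, rcond=None)[0]

print("2SLS coefficient for A:", beta_2sls[0].round(3))
print("OLS coefficient for A: ", beta_ols[0].round(3))
```

In this sketch the stage 1 vector delta has one entry per column of Z and W. That would be consistent with the six extra starting values passed to the estimator above (three instruments plus three covariates), although the exact internal ordering used by ee_2sls is not shown here.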

Example from OneSampleMR

As a final use of the Mroz data, we replicate the example with two action variables (education and labor force experience) from the OneSampleMR documentation. In this case, age and the number of kids (kidslt6 and kidsge6) serve as the instruments.

Currently, ee_2sls does not allow for multiple action variables. Therefore, we instead code up the 2SLS estimator using the basic regression functions. Briefly, we fit two models in the first stage (one for educ and one for exper). Using the predicted values from these models, we then fit the second-stage model for lwage.
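The two-stage logic described above can be checked outside of the M-estimation framework. The following NumPy sketch uses simulated data (not the Mroz data) with hypothetical action variables a and b, and compares the stage-by-stage fit against the closed-form projection estimator. It covers point estimates only and does not address the variance.

```python
import numpy as np

# Simulated data (hypothetical): two action variables sharing a confounder u
rng = np.random.default_rng(2)
n = 5000
Z = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])  # intercept + 3 instruments
u = rng.normal(size=n)
a = Z @ np.array([0.0, 1.0, 0.5, 0.0]) + u + rng.normal(size=n)
b = Z @ np.array([0.0, 0.0, 0.5, 1.0]) - u + rng.normal(size=n)
y = 0.3 + 0.2 * a - 0.1 * b + u + rng.normal(size=n)

# Stage 1: one regression per action variable, both on the same Z
alpha = np.linalg.lstsq(Z, a, rcond=None)[0]
beta = np.linalg.lstsq(Z, b, rcond=None)[0]
Xhat = np.column_stack([np.ones(n), Z @ alpha, Z @ beta])

# Stage 2: regress the outcome on the intercept and both predicted actions
gamma = np.linalg.lstsq(Xhat, y, rcond=None)[0]

# Closed form: gamma = (X' P_Z X)^{-1} X' P_Z y, with P_Z the projection onto Z
X = np.column_stack([np.ones(n), a, b])
PzX = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]
gamma_closed = np.linalg.solve(PzX.T @ X, PzX.T @ y)

print("Two-stage:  ", gamma.round(3))
print("Closed form:", gamma_closed.round(3))
```

The second-stage coefficients come out in the column order of Xhat (intercept, then a, then b), so any labels attached to them need to follow that same order.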

[10]:
# Instrument design matrix
Z = d[['intercept', 'age', 'kidslt6', 'kidsge6']]
[11]:
def psi(theta):
    gamma = theta[:3]
    alpha = theta[3:3+4]
    beta = theta[3+4:]

    # First-stage regression for education
    ee_s1a = ee_regression(alpha, X=Z, y=d['educ'],
                           model='linear')
    a_hat = np.dot(Z, alpha)

    # First-stage regression for experience
    ee_s1b = ee_regression(beta, X=Z, y=d['exper'],
                           model='linear')
    b_hat = np.dot(Z, beta)

    # Second-stage regression for log(wage)
    Xhat = np.c_[np.asarray(d['intercept']), a_hat, b_hat]
    ee_2s = ee_regression(gamma, X=Xhat, y=d['lwage'],
                          model='linear')

    return np.vstack([ee_2s, ee_s1a, ee_s1b])
[12]:
init_vals = [0., ]*3 + [0., ]*4*2
estr = MEstimator(psi, init=init_vals)
estr.estimate()
[13]:
r = pd.DataFrame()
r['label'] = ['intercept', 'educ', 'exper']  # same order as the columns of Xhat
r['Est'] = estr.theta[:3]
r['SE'] = np.diag(estr.variance)[:3]**0.5
r.set_index('label').round(3)
[13]:
             Est     SE
label
intercept -0.360  1.105
educ       0.106  0.088
exper      0.016  0.008

These outputs mostly match those provided in the documentation: the point estimates agree, but the SEs differ due to the use of a different variance estimator.

References

Mroz, T. A. (1987). The sensitivity of an empirical model of married women’s hours of work to economic and statistical assumptions. Econometrica 55(4), 765-799.

Wooldridge, J. M. (2010). Econometric analysis of cross section and panel data. MIT Press.