{ "cells": [ { "cell_type": "markdown", "id": "1838c8e6", "metadata": {}, "source": [ "# Custom Estimating Equations\n", "\n", "While there are many alternative libraries for regression, causal inference, pharmacokinetic models, a key advantages of `delicatessen` is the provided flexibility in how you can specify estimating equations. Specifically, `delicatessen` allows low-level access to the estimating equations, which allows you develop your own novel combinations. However, building your own stack of estimating equations can be complicated when first starting out. Here, I provide an overview and tips for how to build your own estimating equations using `delicatessen`. Understanding this process will unleash the true power of estimating equations in your quantitative research.\n", "\n", "In general, it will be best if you find a paper or book that directly provides the mathematical expression of an estimating equation(s) for you. Alternatively, if you can find the score function or gradient for a regression model, that is the corresponding estimating equation. This section does *not* address how to derive your own estimating equation(s). Rather, this guide focuses on how to translate an estimating equation into code that is compatible with `delicatessen`, as `delicatessen` assumes you are giving it a valid estimating equation. For some guidance on deriving the mathematical expressions for your own estimating equations, see Ross et al. (2024)." ] }, { "cell_type": "code", "execution_count": 1, "id": "8cf88eb2-873f-43a1-9a12-0aa46c9f80db", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd \n", "from scipy.stats import logistic\n", "\n", "from delicatessen import MEstimator\n", "from delicatessen.estimating_equations import ee_regression\n", "from delicatessen.utilities import inverse_logit" ] }, { "cell_type": "markdown", "id": "f097954b-c8d6-40eb-8ddf-1aff4493fdcd", "metadata": {}, "source": [ "## Building from scratch\n", "\n", "To begin, we will go through the case of building an estimating equation completely from scratch. To do this, we will go through an example with linear regression. First, we have the estimating equation (which is the score function) given in Boos & Stefanski (2013) and Ross et al. (2024).\n", "\n", "$$ \\sum_{i=1}^{n} (Y_i - X_i \\beta^T) X_i^T = 0 $$\n", "\n", "where $Y$ is the outcome and $X$ is the design matrix. \n", "\n", "We will demonstrate using the following simulated data set" ] }, { "cell_type": "code", "execution_count": 2, "id": "39296f1d", "metadata": {}, "outputs": [], "source": [ "np.random.seed(8091421)\n", "n = 200\n", "d = pd.DataFrame()\n", "d['X'] = np.random.normal(size=n)\n", "d['W'] = np.random.normal(size=n)\n", "d['C'] = 1\n", "d['Y'] = 5 + d['X'] + np.random.normal(size=n)" ] }, { "cell_type": "markdown", "id": "d7cea190", "metadata": {}, "source": [ "Here, we are interested in the model $Y = \\beta_0 + \\beta_1 X + \\epsilon$. To help see how to code the estimating equation for this regression mode, it is helpful to write-out explicitly (instead of using the matrix algebra notation from above). Here the estimating equation is\n", "\n", "$$ \\sum_{i=1}^{n} \n", "\\begin{bmatrix}\n", " \\left[ Y_i - (\\beta_0 + \\beta_1 X_i)\\right] \\times 1 \\\\\n", " \\left[ Y_i - (\\beta_0 + \\beta_1 X_i)\\right] \\times X_i \\\\\n", "\\end{bmatrix}\n", "= 0 $$\n", "where the first equation is for $\\beta_0$ and the second is for $\\beta_1$.\n", "\n", "To program this estimating equation, we have a few options (as suggested by the multiple mathematical expressions we have). First, we can build the estimating equation using a for-loop where each `i` piece will be stacked together. While this for-loop approach will be 'slow', it is often a good strategy to implement a for-loop version that is easier to debug first (unless you are a linear algebra wizard!).\n", "\n", "Below calculates the estimating equation for each `i` in the for-loop. This function returns a stacked array of each `i` observation as a 3-by-n array. That array can then be passed to the `MEstimator`" ] }, { "cell_type": "code", "execution_count": 3, "id": "00a58866", "metadata": {}, "outputs": [], "source": [ "def psi(theta):\n", " # Transforming to arrays\n", " X = np.asarray(d[['C', 'X', 'W']]) # Design matrix\n", " y = np.asarray(d['Y']) # Dependent variable\n", " beta = np.asarray(theta)[:, None] # Parameters\n", " n = X.shape[0] # Number of observations\n", "\n", " # Where to store each of the resulting estimating functions\n", " est_vals = []\n", "\n", " # Looping through each observation from 1 to n\n", " for i in range(n):\n", " v_i = (y[i] - np.dot(X[i], beta)) * X[i]\n", " est_vals.append(v_i)\n", "\n", " # returning 3-by-n NumPy array\n", " return np.asarray(est_vals).T" ] }, { "cell_type": "markdown", "id": "cf7c2c9b", "metadata": {}, "source": [ "`delicatessen` is not picky about the particulars of the estimating functions. One only needs to create a function that has a single argument (i.e., `theta`). That argument needs to return a `len(theta)` by $n$ NumPy array. Below is code that calls our custom estimating equation" ] }, { "cell_type": "code", "execution_count": 4, "id": "310e0ceb", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ 5.08271932e+00, 9.68128991e-01, -1.27507410e-04])" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "estr = MEstimator(psi, init=[0., 0., 0.])\n", "estr.estimate()\n", "estr.theta" ] }, { "cell_type": "markdown", "id": "c52bf2fe", "metadata": {}, "source": [ "for which the coefficients match the coefficients from a ordinary least squares model (variance estimates may differ, since by default most OLS software use the inverse of the information matrix to estimate the variance rather than the empirical sandwich variance estimator).\n", "\n", "Here, we can vectorize the operations. The advantage of the vectorized-form is that it will run much faster. With some careful experimentation, the following is a vectorized version. Remember that `delicatessen` is expecting a 3-by-n array to be given by the `psi` function in this example. Failure to provide this is a common mistake when building custom estimating equations." ] }, { "cell_type": "code", "execution_count": 5, "id": "b6d834eb", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ 5.08271932e+00, 9.68128991e-01, -1.27507410e-04])" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def psi(theta):\n", " X = np.asarray(d[['C', 'X', 'W']]) # Design matrix\n", " y = np.asarray(d['Y'])[:, None] # Dependent variable\n", " beta = np.asarray(theta)[:, None] # Parameters\n", " return ((y - np.dot(X, beta)) * X).T # Computes all estimating functions\n", "\n", "\n", "estr = MEstimator(psi, init=[0., 0., 0.])\n", "estr.estimate()\n", "estr.theta" ] }, { "cell_type": "markdown", "id": "9c701ab8", "metadata": {}, "source": [ "This code provides the same results. Vectorizing (even parts of an estimating equation) can help to improve run-times if you find the root-finding step to be taking a long time.\n", "\n", "## Building with basics\n", "\n", "Instead of building everything from scratch, you can also piece together built-in estimating equations with your custom estimating equations code. To demonstrate this, we will go through an example with inverse probability weighting.\n", "\n", "The inverse probability weighting estimator consists of four estimating equations: the difference between the weighted means, the weighted mean under $A=1$, the weighted mean under $A=0$, and the propensity score model. We can express this mathematically as\n", "\n", "$$\n", "\\sum_{i=1}^n\n", "\\begin{bmatrix}\n", " (\\theta_1 - \\theta_2) - \\theta_0 \\\\\n", " \\frac{A_i \\times Y_i}{\\pi_i(W_i, \\alpha)} - \\theta_1 \\\\\n", " \\frac{(1-A_i) \\times Y_i}{1-\\pi_i(W_i, \\alpha)} - \\theta_2 \\\\\n", " (A_i - \\text{expit}(W_i^T \\alpha)) W_i\n", "\\end{bmatrix}\n", "= 0\n", "$$\n", "where $A$ is the action of interest, $Y$ is the outcome of interest, and $W$ is the set of confounding variables.\n", "\n", "Rather than re-code the logistic regression model (to estimate the propensity scores), we will use the built-in logistic regression functionality. Below is a stacked estimating equation for the inverse probability weighting estimator above" ] }, { "cell_type": "code", "execution_count": 6, "id": "57f30dc7", "metadata": {}, "outputs": [], "source": [ "def psi(theta):\n", " # Ensuring correct typing\n", " W = np.asarray(d['C', 'W']) # Design matrix of confounders\n", " A = np.asarray(d['A']) # Action\n", " y = np.asarray(y) # Outocome\n", " beta = theta[3:] # Regression parameters\n", "\n", " # Estimating propensity score\n", " preds_reg = ee_regression(theta=beta, # Built-in regression\n", " X=W, # Plug-in covariates for X\n", " y=A, # Plug-in treatment for Y\n", " model='logistic') # Specify logistic\n", " # Estimating weights\n", " pi = inverse_logit(np.dot(W, beta)) # Pr(A|W) using delicatessen.utilities\n", "\n", " # Calculating Y(a=1)\n", " ya1 = (A * y) / pi - theta[1] # i's contribution is (AY) / \\pi\n", "\n", " # Calculating Y(a=0)\n", " ya0 = ((1-A) * y) / (1-pi) - theta[2] # i's contribution is ((1-A)Y) / (1-\\pi)\n", "\n", " # Calculating Y(a=1) - Y(a=0) (using np.ones to ensure a 1-by-n array)\n", " ate = np.ones(y.shape[0]) * (theta[1] - theta[2]) - theta[0]\n", "\n", " # Output (3+b)-by-n stacked array\n", " return np.vstack((ate, # theta[0] is for the ATE\n", " ya1[None, :], # theta[1] is for R1\n", " ya0[None, :], # theta[2] is for R0\n", " preds_reg)) # theta[3:] is for the regression coefficients" ] }, { "cell_type": "markdown", "id": "0fa23650", "metadata": {}, "source": [ "This example demonstrates how estimating equations can easily be stacked together using `delicatessen`. Specifically, both built-in and user-specified functions can be specified together seamlessly. All it requires is specifying both in the estimating equation and returning a stacked array of the estimates.\n", "\n", "One important piece to note here is that the returned array needs to be in the *same* order as the theta's are input. As done here, all the `theta` values after the 3rd are for the propensity score model. Therefore, the propensity score model values are last in the returned stack. Returning the values in a different order than input is a common mistake when programming your own. ``delicatessen`` currently cannot detect this issue, but it will often result in the root-finding procedure failing and returning the starting value(s).\n", "\n", "## Handling `np.nan`\n", "\n", "Sometimes, `np.nan` will show up in your data set. However, `delicatessen` does not naturally handle `np.nan` (as there are many options on how to handle missing data, and `delicatessen` by design does not assume how the user would like missing data to be handled). When there are `np.nan`'s detected in the output estimating equations, `delicatessen` will return an error (by design). The following section discusses how to handle `np.nan` with `delicatessen`.\n", "\n", "In the first case, we will consider handling `np.nan` with a built-in estimating equation. When trying to fit a regression model where there are ``np.nan``'s present, the missing values will be set to a placeholder value and their contributions will be manually removed using an indicator function for missingness. This process is equivalent to a complete-case analysis. Below is an example using the built-in logistic regression estimating equations:" ] }, { "cell_type": "code", "execution_count": 7, "id": "5c439050", "metadata": {}, "outputs": [], "source": [ "d = pd.DataFrame()\n", "d['X'] = np.random.normal(size=100)\n", "y = np.random.binomial(n=1, p=0.5 + 0.01 * d['X'], size=100)\n", "d['y'] = np.where(np.random.binomial(n=1, p=0.9, size=100), y, np.nan)\n", "d['C'] = 1\n", "\n", "X = np.asarray(d[['C', 'X']])\n", "y = np.asarray(d['y'])\n", "r = np.where(d['y'].isna(), 0, 1)\n", "y_no_nan = np.asarray(d['y'].fillna(-999))" ] }, { "cell_type": "code", "execution_count": 8, "id": "1f3c209d", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0.03331353, 0.27781908])" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def psi(theta):\n", " # Estimating logistic model values with filled-in Y's\n", " a_model = ee_regression(theta,\n", " X=X, y=y_no_nan,\n", " model='logistic')\n", " # Setting contributions with missing to zero manually\n", " a_model = a_model * r\n", " return a_model\n", "\n", "\n", "estr = MEstimator(psi, init=[0, 0, ])\n", "estr.estimate()\n", "estr.theta" ] }, { "cell_type": "markdown", "id": "c9f5067a", "metadata": {}, "source": [ "If the contribution to the estimating function with missing Y's had not been included, the optimized points would have been `nan`. Alternatively, we could have used `numpy.nan_to_num`. However, this method to handling `nan` does not work with automatic differentiation (i.e., when `deriv_method='exact'`).\n", "\n", "As a second example, we will consider estimating the mean with missing data and correcting for informative missing by inverse probability weighting. To reduce random error, this example uses 1000 observations. Here, we set `nan`'s to be zero's prior to subtracting off the mean using `np.where`. This is shown below:" ] }, { "cell_type": "code", "execution_count": 9, "id": "c3e1b975", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([5.04578442, 0.96731873, 0.97044598])" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Generating data\n", "n = 1000\n", "d = pd.DataFrame()\n", "d['X'] = np.random.normal(size=n)\n", "y = 5 + d['X'] + np.random.normal(size=n)\n", "d['y'] = np.where(np.random.binomial(n=1, p=logistic.cdf(1 + d['X']), size=n), y, np.nan)\n", "d['C'] = 1\n", "\n", "X = np.asarray(d[['C', 'X']])\n", "y = np.asarray(d['y'])\n", "r = np.asarray(np.where(d['y'].isna(), 0, 1))\n", "\n", "\n", "def psi(theta):\n", " # Estimating logistic model values\n", " a_model = ee_regression(theta[1:], X=X, y=r,\n", " model='logistic')\n", " pi = inverse_logit(np.dot(X, theta[1:]))\n", "\n", " y_w = np.where(r, y / pi, 0) - theta[0] # nan-to-zero then subtract off\n", " return np.vstack((y_w[None, :],\n", " a_model))\n", "\n", "\n", "estr = MEstimator(psi, init=[0, 0, 0])\n", "estr.estimate()\n", "estr.theta" ] }, { "cell_type": "markdown", "id": "e7ecd26e", "metadata": {}, "source": [ "Our estimate for the mean (i.e., `theta[0]`) close to the truth (i.e., 5). \n", "\n", "\n", "## Wrapping estimators\n", "\n", "If you find yourself using `delicatessen` for the same tasks, you may want to develop wrapper functions. This is a process I use for my own work and other software libraries I have developed. To illustrate the idea behind wrapping up `delicatessen`, we consider the previous missing data example. We will create a new class object, which will organize and process the data for us. This data will then be passed to `MEstimator` for estimation" ] }, { "cell_type": "code", "execution_count": 10, "id": "4cdcd329-f5b1-4386-8029-0628009269c0", "metadata": {}, "outputs": [], "source": [ "class IPMW:\n", " def __init__(self, data, outcome):\n", " # Storing data inputs\n", " self.data = data.copy()\n", " self.outcome = outcome\n", " \n", " # Calculating missing indicator\n", " self.missing_indicator = np.where(self.data[self.outcome].isna(), 0, 1)\n", " self.y_no_nan = self.data[self.outcome].fillna(-999)\n", "\n", " # Empty storage for later\n", " self.M_dmatrix = None\n", "\n", " def missing_model(self, design_matrix):\n", " self.M_dmatrix = self.data[design_matrix]\n", "\n", " def estimate(self, decimals=3):\n", " def psi(theta):\n", " # Estimating logistic model values\n", " a_model = ee_regression(theta[1:], \n", " X=self.M_dmatrix, \n", " y=self.missing_indicator,\n", " model='logistic')\n", " pi = inverse_logit(np.dot(self.M_dmatrix, theta[1:]))\n", " \n", " y_w = np.where(self.missing_indicator, self.y_no_nan / pi, 0) - theta[0]\n", " return np.vstack((y_w[None, :],\n", " a_model))\n", "\n", " # M-estimator\n", " estr = MEstimator(psi, init=[0, 0, 0])\n", " estr.estimate()\n", "\n", " print(\"The estimated mean is\", np.round(estr.theta[0], decimals=decimals))" ] }, { "cell_type": "code", "execution_count": 11, "id": "175ef6bb-7594-419a-b959-fcae570148e9", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The estimated mean is 5.046\n" ] } ], "source": [ "ipmw = IPMW(d, outcome='y')\n", "ipmw.missing_model(['C', 'X'])\n", "ipmw.estimate()" ] }, { "cell_type": "markdown", "id": "d30b1970-1a23-4147-b7b5-fa13089979e6", "metadata": {}, "source": [ "This illustrates how you can package `delicatessen` for your needs (or your software library).\n", "\n", "## Common Mistakes\n", "\n", "Here is a list of common mistakes when programming your own estimating functions, most of which we have done ourselves:\n", "\n", "1. The `psi` function doesn't return a NumPy array.\n", "2. The `psi` function returns the wrong shape. Remember, it should be a $b$-by-$n$ NumPy array!\n", "3. The `psi` function is summing over $n$. `delicatessen` needs to perform the sum internally (in order to compute the bread and filling), so do not sum over $n$ in `psi`!\n", "4. The `theta` values and `b` *must* be in the same order. If `theta[0]` is the mean, the 1st row of the returned array better be the estimating function for that mean!\n", "5. Automatic differentiation with `np.nan_to_num`. This will result in the bread matrix having `nan` values.\n", "6. Trying to use a SciPy function with `deriv_method='exact'` (only some functionalities are currently supported. please open an issue on GitHub if you have one you would like to see added).\n", "\n", "If you still have trouble, please open an issue at\n", "pzivich/Delicatessen\n", "This will help me to add other common mistakes here and improve the documentation for custom estimating equations.\n", "\n", "### Additional Examples\n", "\n", "For further examples on how to use `delicatessen` with custom estimating equations, see the Applied Examples page." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.4" } }, "nbformat": 4, "nbformat_minor": 5 }