{
 "cells": [
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "# Hardy-Weinberg Equilibrium\n",
    "\n",
    "Hardy-Weinberg equilibrium is the condition that the allele and genotypes frequencies in a population remain constant between generations, absent of other evolutionary influences. Another way to express this statement is that in a randomly mating (large) population with no selection or mutation, if the frequency of an allele $D$ is the proportion $\\rho$ then the expected frequency of heterozygous individuals (those with a $D$ and $R$ allele) is $2\\rho(1-\\rho)$. To check whether a population is in Hardy-Weinberg equilibrium, we can compare this expected allele proportion to the observed proportion with heterozygous alleles. Therefore, we want to compute $\\gamma = \\pi - 2\\rho(1-\\rho)$, where $\\pi$ is the observed proportion. While $\\delta$ is easy to compute, the variance is less straightforward. In general, one needs to apply the delta method to combine the variance estimates for $\\rho$ and $\\pi$, which involves computing the derivatives. An alternative is estimating equations, which automates the delta method and provides a variance estimate without needing to do any by-hand calculations.\n",
    "\n",
    "To illustrate how estimating equations simplify estimation of whether a population is in Hardy-Weinberg equilibrium, we provide an application using data collected by AS Wiener on the M-N blood group locus among white individuals in New York City white between 1931 and 1969. Here, we consider whether the M-N blood group is in Hardy-Weinberg equilibrium.\n",
    "\n",
    "## Setup"
   ],
   "id": "11bed29977deeb28"
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2026-04-13T21:19:52.262936200Z",
     "start_time": "2026-04-13T21:19:51.462363900Z"
    }
   },
   "cell_type": "code",
   "source": [
    "import numpy as np\n",
    "import scipy as sp\n",
    "import pandas as pd\n",
    "import delicatessen as deli\n",
    "from delicatessen import MEstimator\n",
    "\n",
    "print(\"Versions\")\n",
    "print(\"NumPy:       \", np.__version__)\n",
    "print(\"SciPy:       \", sp.__version__)\n",
    "print(\"Pandas:      \", pd.__version__)\n",
    "print(\"Delicatessen:\", deli.__version__)"
   ],
   "id": "28012174eda02a36",
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Versions\n",
      "NumPy:        2.3.5\n",
      "SciPy:        1.16.3\n",
      "Pandas:       2.3.3\n",
      "Delicatessen: 4.2\n"
     ]
    }
   ],
   "execution_count": 2
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "The data used here can be found in Table 4.3 of *The Evaluation of Forensic DNA Evidence*. This tabled data is reproduced below and converted into individual-level data",
   "id": "a26b2ee0d373f751"
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2026-04-13T21:38:25.967436400Z",
     "start_time": "2026-04-13T21:38:25.955077500Z"
    }
   },
   "cell_type": "code",
   "source": [
    "# Table 4.3 — Wiener data\n",
    "table = pd.DataFrame()\n",
    "table['S'] = [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6]\n",
    "table['A'] = [1., 0.5, 0.] * 6\n",
    "table['H'] = [0, 1, 0] * 6\n",
    "table['n'] = [71, 116, 49, 132, 232, 97, 166, 289, 127,\n",
    "              1037, 1623, 608, 287, 481, 186, 158, 249, 93]\n",
    "assert 6001 == np.sum(table['n'])\n",
    "\n",
    "# Expanding to individual level\n",
    "d = pd.DataFrame(np.repeat(table.values, table['n'], axis=0), columns=table.columns)"
   ],
   "id": "25679613d1014546",
   "outputs": [],
   "execution_count": 12
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "## Overall\n",
    "\n",
    "Using this data, we can then determine whether Hardy-Weinberg equilibrium holds. To begin, we will ignore that these data come from different convenience samples. The estimating equations for testing whether a population is in Hardy-Weinberg equilibrium are\n",
    "$$ \\psi(O_i; \\theta) =\n",
    "\\begin{bmatrix}\n",
    "  A_i - \\rho \\\\\n",
    "  H_i - \\pi \\\\\n",
    "  \\pi - 2\\rho(1 - \\rho) - \\gamma\n",
    "\\end{bmatrix}\n",
    "$$\n",
    "where $A_i$ is the allele frequency for individual $i$ and $H_i$ is an indicator whether an individual is heterozygous. The final estimating function is for the Hardy-Weinberg equilibrium comparison, which follows from some algebra. The code provided below implements these estimating functions with `delicatessen`"
   ],
   "id": "662b6515d71d1966"
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2026-04-13T21:35:26.048128300Z",
     "start_time": "2026-04-13T21:35:26.027141Z"
    }
   },
   "cell_type": "code",
   "source": [
    "alleles = np.asarray(d['A'])\n",
    "heterozygote = np.asarray(d['H'])"
   ],
   "id": "b21232733a367773",
   "outputs": [],
   "execution_count": 8
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2026-04-13T21:37:19.453573500Z",
     "start_time": "2026-04-13T21:37:19.434608500Z"
    }
   },
   "cell_type": "code",
   "source": [
    "def psi_delta(theta):\n",
    "    # Dividing up input parameters\n",
    "    rho, pi, gamma = theta\n",
    "\n",
    "    # Estimating function for mean of MM group\n",
    "    ee_p = alleles - rho\n",
    "\n",
    "    # Estimating function for proportion of MN group\n",
    "    ee_h = heterozygote - pi\n",
    "\n",
    "    # Estimating function for Hardy-Weinberg equilibrium\n",
    "    ef_d = pi - 2*rho*(1-rho) - gamma * np.ones(alleles.shape[0])\n",
    "\n",
    "    # Stacking the estimating functions together\n",
    "    return np.vstack([ee_p, ee_h, ef_d])"
   ],
   "id": "f4824820dc1e8d29",
   "outputs": [],
   "execution_count": 9
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2026-04-13T21:51:24.277090300Z",
     "start_time": "2026-04-13T21:51:24.262048700Z"
    }
   },
   "cell_type": "code",
   "source": [
    "estr = MEstimator(psi_delta, init=[0.5, 0.5, 0.])\n",
    "estr.estimate()"
   ],
   "id": "540a8172b9a5ce84",
   "outputs": [],
   "execution_count": 18
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2026-04-13T21:51:28.875490300Z",
     "start_time": "2026-04-13T21:51:28.839547500Z"
    }
   },
   "cell_type": "code",
   "source": "estr.print_results(decimals=3)",
   "id": "36b24272317afe4a",
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "==============================================================\n",
      "              Estimation Method: M-estimator\n",
      "--------------------------------------------------------------\n",
      "No. Observations:        6001 | No. Parameters:              3\n",
      "Solving algorithm:         lm | Max Iterations:           5000\n",
      "Solving tolerance:      1e-09 | Allow P-Inverse:             1\n",
      "Derivative Method:     approx | Deriv Approx:            1e-09\n",
      "Small N Correction:      None | Distribution:           Z-stat\n",
      "==============================================================\n",
      "   Theta   StdErr  Z-score      LCL      UCL  P-value  S-value \n",
      "--------------------------------------------------------------\n",
      "   0.558    0.005  123.599    0.549    0.566    0.000      inf \n",
      "   0.498    0.006   77.196    0.486    0.511    0.000      inf \n",
      "   0.005    0.006    0.766   -0.008    0.017    0.443    1.173 \n",
      "==============================================================\n"
     ]
    }
   ],
   "execution_count": 19
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "Here, the last coefficient is near zero which indicates that the population is in Hardy-Weinberg equilibrium (provided the other assumptions for this conceptual model all hold). From the output, we can see we get confidence intervals and a corresponding P-value for this directly. This example highlights how estimating equations simplify statistical inference for transformation of different parameters.\n",
    "\n",
    "## Strata-Specific\n",
    "\n",
    "Given these results, we might also be interested in whether Hardy-Weinberg equilibrium holds for all the convenience samples. The following code applies a stratified version of the previous estimator. For stratification, we only need to multiply by an indicator of the strata. Effectively, this will zero out the contribution of any individuals outside of the strata. The following code applies this process"
   ],
   "id": "501969c94b140e5f"
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2026-04-13T21:52:22.849410900Z",
     "start_time": "2026-04-13T21:52:22.835751400Z"
    }
   },
   "cell_type": "code",
   "source": [
    "def psi_delta_s(theta):\n",
    "    # Setup items for a loop through the estimating functions\n",
    "    start, end = 0, 3\n",
    "    est_funcs = []\n",
    "\n",
    "    # Loop over the specific subgroups\n",
    "    for s in [1, 2, 3, 4, 5, 6]:\n",
    "        # Indicator if in corresponding strata\n",
    "        strata = np.where(d['S'] == s, 1, 0)\n",
    "        # Estimating function for corresponding parameters and strata\n",
    "        theta_s = theta[start: end]\n",
    "        ef_s = psi_delta(theta=theta_s) * strata\n",
    "        # Storage and updating for next steps\n",
    "        est_funcs.append(ef_s)\n",
    "        start = start + 3\n",
    "        end = end + 3\n",
    "\n",
    "    # Stacking the estimating functions together\n",
    "    return np.vstack(est_funcs)"
   ],
   "id": "93eab0b223dca55d",
   "outputs": [],
   "execution_count": 22
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2026-04-13T21:52:24.714599600Z",
     "start_time": "2026-04-13T21:52:24.668511400Z"
    }
   },
   "cell_type": "code",
   "source": [
    "estr = MEstimator(psi_delta_s, init=[0.5, 0.5, 0.]*6)\n",
    "estr.estimate()"
   ],
   "id": "9b009c582ccc2ae8",
   "outputs": [],
   "execution_count": 23
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2026-04-13T21:52:49.387793300Z",
     "start_time": "2026-04-13T21:52:49.354002400Z"
    }
   },
   "cell_type": "code",
   "source": "estr.print_results(subset=[2, 5, 8, 11, 14, -1], decimals=3)",
   "id": "63e198eacd1fcca4",
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "==============================================================\n",
      "              Estimation Method: M-estimator\n",
      "--------------------------------------------------------------\n",
      "No. Observations:        6001 | No. Parameters:             18\n",
      "Solving algorithm:         lm | Max Iterations:           5000\n",
      "Solving tolerance:      1e-09 | Allow P-Inverse:             1\n",
      "Derivative Method:     approx | Deriv Approx:            1e-09\n",
      "Small N Correction:      None | Distribution:           Z-stat\n",
      "==============================================================\n",
      "   Theta   StdErr  Z-score      LCL      UCL  P-value  S-value \n",
      "--------------------------------------------------------------\n",
      "  -0.004    0.032   -0.128   -0.067    0.059    0.898    0.155 \n",
      "   0.006    0.023    0.265   -0.039    0.052    0.791    0.338 \n",
      "  -0.001    0.021   -0.058   -0.042    0.039    0.954    0.068 \n",
      "   0.005    0.009    0.611   -0.012    0.022    0.541    0.886 \n",
      "   0.010    0.016    0.612   -0.022    0.041    0.540    0.888 \n",
      "   0.006    0.022    0.294   -0.037    0.050    0.769    0.379 \n",
      "==============================================================\n"
     ]
    }
   ],
   "execution_count": 25
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "Again these strata-specific results indicate that Hardy-Weinberg equilibrium holds. There is a challenge here. Namely, that reporting the confidence intervals for this full collection of parameters is not quite appropriate. If we consider statistical inference for *all six parameters*, then the reported confidence intervals will be too narrow (i.e., coverage will be below the implied 95%). To correct for this, we can instead compute the confidence bands. The following code computes the corresponding confidence bands for these six parameters",
   "id": "d3753b8bd276aa6"
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2026-04-13T21:57:03.058068400Z",
     "start_time": "2026-04-13T21:57:02.934331600Z"
    }
   },
   "cell_type": "code",
   "source": "estr.confidence_bands(subset=[2, 5, 8, 11, 14, -1], seed=198041)",
   "id": "2a8e6848497aedd4",
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[-0.08903632,  0.08077719],\n",
       "       [-0.0547735 ,  0.06704523],\n",
       "       [-0.05548282,  0.05310035],\n",
       "       [-0.0173577 ,  0.02785834],\n",
       "       [-0.03229401,  0.05188822],\n",
       "       [-0.05136222,  0.06426222]])"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "execution_count": 26
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "As expected, the confidence bands are wider. Overall our conclusions of Hardy-Weinberg equilibrium remain the same.\n",
    "\n",
    "## References\n",
    "\n",
    "Hardy, G. H. (1908). Mendelian proportions in a mixed population. *Science*, 28(706), 49-50.\n",
    "\n",
    "National Research Council (US) Committee on DNA Forensic Science: An Update. The Evaluation of Forensic DNA Evidence. Washington (DC): National Academies Press (US); 1996. 4, Population Genetics. Available from: https://www.ncbi.nlm.nih.gov/books/NBK232608/\n",
    "\n",
    "Weir, B. S. (1996). *Genetic data analysis II*. Sunderland. MA: Sinauer Associates, 161-173."
   ],
   "id": "a479cc5dd1d775df"
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}