{ "cells": [ { "metadata": {}, "cell_type": "markdown", "source": [ "# Spielman et al. (1993): Transmission/Disequilibrium Test\n", "\n", "In Spielman et al., the Transmission/Disequilibrium Test (TDT) is introduced to assess whether a genetic marker results in a population-level association. This design works by restricting to heterozygous parents, who have one copy of each allele. Under no association, each allele should be equally likely to be transmitted to offspring. If an allele was instead associated with disease, then transmission to affected children should happen more often than chance. Spielman et al. apply this approach is the context of Type 1 diabetes (insulin-dependent diabetes mellitus). Here, we re-examine their application using estimating equations\n", "\n", "## Setup" ], "id": "7c371f633cdcf8f7" }, { "cell_type": "code", "id": "initial_id", "metadata": { "collapsed": true, "ExecuteTime": { "end_time": "2026-04-20T18:46:45.889195700Z", "start_time": "2026-04-20T18:46:45.179415200Z" } }, "source": [ "import numpy as np\n", "import scipy as sp\n", "import pandas as pd\n", "import delicatessen as deli\n", "from delicatessen import MEstimator\n", "from delicatessen.utilities import inverse_logit, aggregate_efuncs\n", "\n", "print(\"Versions\")\n", "print(\"NumPy: \", np.__version__)\n", "print(\"SciPy: \", sp.__version__)\n", "print(\"Pandas: \", pd.__version__)\n", "print(\"Delicatessen:\", deli.__version__)" ], "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Versions\n", "NumPy: 2.3.5\n", "SciPy: 1.16.3\n", "Pandas: 2.3.3\n", "Delicatessen: 4.2\n" ] } ], "execution_count": 1 }, { "metadata": { "ExecuteTime": { "end_time": "2026-04-20T18:46:45.901521700Z", "start_time": "2026-04-20T18:46:45.889195700Z" } }, "cell_type": "code", "source": "y = np.asarray([1, ]*78 + [0, ]*46)", "id": "54f849ca869c8df", "outputs": [], "execution_count": 2 }, { "metadata": {}, "cell_type": "markdown", "source": [ "The corresponding data is available from Table 5 of their paper. In particular, there are 78 of 124 disease-associated alleles transmitted. To apply the TDT method, this data is sufficient.\n", "\n", "The TDT compares the proportion of transmitted candidate allele versus the expected. Here, there are only two alleles, so the expected transmission is 0.50. TDT compares the proportion observed in the data to this expectation.\n", "\n", "For estimating equations, we simply need to estimate the proportion of transmitted alleles. Then we can use the sandwich variance estimator for inference. There are several ways to estimate the proportion. The simplest is to use the following estimating function\n", "$$ \\psi(O_i; \\rho) = Y_i - \\rho $$\n", "where $\\rho$ is the proportion and $Y_i$ is an indicator for the candidate allele. The following code applies this estimator" ], "id": "3bd5b52e199dd1b8" }, { "metadata": { "ExecuteTime": { "end_time": "2026-04-20T18:46:45.919060300Z", "start_time": "2026-04-20T18:46:45.901521700Z" } }, "cell_type": "code", "source": [ "def psi(theta):\n", " return y - theta[0]" ], "id": "617ebe1dc782e433", "outputs": [], "execution_count": 3 }, { "metadata": { "ExecuteTime": { "end_time": "2026-04-20T18:46:45.930075200Z", "start_time": "2026-04-20T18:46:45.919060300Z" } }, "cell_type": "code", "source": [ "estr = MEstimator(psi, init=[0.5])\n", "estr.estimate()" ], "id": "e3b8618aa85056f4", "outputs": [], "execution_count": 4 }, { "metadata": {}, "cell_type": "markdown", "source": "Now we can conduct the statistical test comparing the observed proportion to the expected proportion. There are several ways to do this. The easiest is to compute the corresponding P-value. The following code applies this process", "id": "3046fbc306e09a77" }, { "metadata": {}, "cell_type": "markdown", "source": "Then we can aslo....", "id": "698338172b77ec" }, { "metadata": { "ExecuteTime": { "end_time": "2026-04-20T18:46:45.946691500Z", "start_time": "2026-04-20T18:46:45.930075200Z" } }, "cell_type": "code", "source": "estr.p_values(null=0.5)", "id": "cda576c2e3a16571", "outputs": [ { "data": { "text/plain": [ "array([0.00293528])" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "execution_count": 5 }, { "metadata": {}, "cell_type": "markdown", "source": [ "This P-value indicates the allele transmission is quite different from the expectation of 0.5, which would then indicate that this allele is related to Type 1 Diabetes.\n", "\n", "Rather than a P-value, we can also compute S-values, which is the 'surprisal' value. It translates the P-value into how many heads in a row (with a fair coin) would need to occur. This conversion can be done in `delicatessen`" ], "id": "dff0e485bd9abc5f" }, { "metadata": { "ExecuteTime": { "end_time": "2026-04-20T18:46:45.958195400Z", "start_time": "2026-04-20T18:46:45.946691500Z" } }, "cell_type": "code", "source": "estr.s_values(null=0.5)", "id": "a7f619f300a356fb", "outputs": [ { "data": { "text/plain": [ "array([8.41228684])" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "execution_count": 6 }, { "metadata": {}, "cell_type": "markdown", "source": [ "So, the comparison we see is as unlikely as getting more than 8 heads in a row with a fair coin. That's pretty unlikely (if you don't think so, try flipping a coin until you get 8 heads in a row).\n", "\n", "As a final option, we can also look at the confidence intervals. " ], "id": "c187bdad84d274e0" }, { "metadata": { "ExecuteTime": { "end_time": "2026-04-20T18:46:45.976863200Z", "start_time": "2026-04-20T18:46:45.958195400Z" } }, "cell_type": "code", "source": "estr.confidence_intervals()", "id": "70daf53da8123ba0", "outputs": [ { "data": { "text/plain": [ "array([[0.54400821, 0.71405631]])" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "execution_count": 7 }, { "metadata": {}, "cell_type": "markdown", "source": [ "The confidence intervals are quite wide, but are fairly above 0.5.\n", "\n", "While not a concern here, the previous estimating function can have poor performance when the true $\\rho$ is near the boundaries of the parameter space. To ensure our estimate is bounded, we can modify the previous estimating function so that it is bounded in the parameter space. We do this via the inverse-logistic function" ], "id": "687423e5bdcf4c85" }, { "metadata": { "ExecuteTime": { "end_time": "2026-04-20T18:46:46.096945700Z", "start_time": "2026-04-20T18:46:46.076681200Z" } }, "cell_type": "code", "source": [ "def psi(theta):\n", " return y - inverse_logit(theta[0])" ], "id": "5332bfb5a8417ef9", "outputs": [], "execution_count": 8 }, { "metadata": { "ExecuteTime": { "end_time": "2026-04-20T18:46:46.140004100Z", "start_time": "2026-04-20T18:46:46.123979600Z" } }, "cell_type": "code", "source": [ "estr = MEstimator(psi, init=[0.])\n", "estr.estimate()" ], "id": "7a1615679a5627a7", "outputs": [], "execution_count": 9 }, { "metadata": { "ExecuteTime": { "end_time": "2026-04-20T18:46:46.166896300Z", "start_time": "2026-04-20T18:46:46.147370700Z" } }, "cell_type": "code", "source": "estr.p_values(null=0.)", "id": "bf4700b920825503", "outputs": [ { "data": { "text/plain": [ "array([0.00450337])" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "execution_count": 10 }, { "metadata": { "ExecuteTime": { "end_time": "2026-04-20T18:46:46.179975700Z", "start_time": "2026-04-20T18:46:46.171804600Z" } }, "cell_type": "code", "source": "estr.s_values(null=0.)", "id": "762e101851275d3a", "outputs": [ { "data": { "text/plain": [ "array([7.79478032])" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "execution_count": 11 }, { "metadata": { "ExecuteTime": { "end_time": "2026-04-20T18:46:46.201777Z", "start_time": "2026-04-20T18:46:46.186939900Z" } }, "cell_type": "code", "source": "inverse_logit(estr.confidence_intervals())", "id": "9467d157cd5d207e", "outputs": [ { "data": { "text/plain": [ "array([[0.54083528, 0.7093912 ]])" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "execution_count": 12 }, { "metadata": {}, "cell_type": "markdown", "source": [ "These results are quite similar to the previous. However, note that we are now testing whether the transformed parameter is equal to zero (since $\\text{logit}(0.5) = 0$. This is generally to be expected in this example since the proportion seems to be 'far' from the boundary points of 0 or 1.\n", "\n", "## Clustering\n", "\n", "An issue not discussed in Spielman et al. is the impact of potential clustering. These alleles came from 57 parents. Since some parents had multiple children, we might be concerned about clustering by parents. In particular, the previous data does not constitute completely independent observations. As such, the effective sample size is likely lower than the 124 from before. Therefore, the variance estimate is likely too small (and thus our statistical inference is not correct). TDT does not handle this clustering naturally, but estimating equations provide an easy solution to this problem.\n", "\n", "Unfortunately, Spielman et al. does not indicate which data came from the same parents. Therefore, we instead simulate a group variable. This following analysis should be viewed as illustrative of the approach and should not be interpreted." ], "id": "a7e9724f5e82043d" }, { "metadata": { "ExecuteTime": { "end_time": "2026-04-20T18:46:46.329098Z", "start_time": "2026-04-20T18:46:46.230908900Z" } }, "cell_type": "code", "source": [ "np.random.seed(123454321)\n", "family = list(range(1, 58)) + list(range(1, 58)) + [1, 1, 1, 1, 2, 2, 3, 3, 4, 4]\n", "np.random.shuffle(family)" ], "id": "6eb64e0cbbb5ab92", "outputs": [], "execution_count": 13 }, { "metadata": { "ExecuteTime": { "end_time": "2026-04-20T18:46:46.369768600Z", "start_time": "2026-04-20T18:46:46.356902800Z" } }, "cell_type": "code", "source": [ "def psi_individual(theta):\n", " return y - inverse_logit(theta[0])" ], "id": "969e43804536d116", "outputs": [], "execution_count": 14 }, { "metadata": { "ExecuteTime": { "end_time": "2026-04-20T18:46:46.389401400Z", "start_time": "2026-04-20T18:46:46.376916800Z" } }, "cell_type": "code", "source": [ "def psi_cluster(theta):\n", " return aggregate_efuncs(psi_individual(theta), group=family)" ], "id": "fcf7576e8ef23a43", "outputs": [], "execution_count": 15 }, { "metadata": { "ExecuteTime": { "end_time": "2026-04-20T18:46:46.402618900Z", "start_time": "2026-04-20T18:46:46.389401400Z" } }, "cell_type": "code", "source": [ "estr = MEstimator(psi_cluster, init=[0.])\n", "estr.estimate()" ], "id": "53fc2401f13380eb", "outputs": [], "execution_count": 16 }, { "metadata": { "ExecuteTime": { "end_time": "2026-04-20T18:46:46.427087600Z", "start_time": "2026-04-20T18:46:46.406756200Z" } }, "cell_type": "code", "source": "estr.p_values(null=0.)", "id": "a46807f982030728", "outputs": [ { "data": { "text/plain": [ "array([0.00361566])" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "execution_count": 17 }, { "metadata": { "ExecuteTime": { "end_time": "2026-04-20T18:46:46.470450100Z", "start_time": "2026-04-20T18:46:46.427087600Z" } }, "cell_type": "code", "source": "estr.s_values(null=0.)", "id": "dd3fde9425141731", "outputs": [ { "data": { "text/plain": [ "array([8.11152533])" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "execution_count": 18 }, { "metadata": { "ExecuteTime": { "end_time": "2026-04-20T18:46:46.488983400Z", "start_time": "2026-04-20T18:46:46.473449700Z" } }, "cell_type": "code", "source": "inverse_logit(estr.confidence_intervals())", "id": "36da902f0222dc56", "outputs": [ { "data": { "text/plain": [ "array([[0.54298989, 0.70759864]])" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "execution_count": 19 }, { "metadata": {}, "cell_type": "markdown", "source": [ "Here, the cluster-adjusted analysis did not change the results much. This is to be expected since we cannot link the alleles together and randomly assigned groups here. In practice, clustering may make a bigger impact on statistical inference.\n", "\n", "## References\n", "\n", "Spielman RS, McGinnis RE, & Ewens WJ (1993). Transmission test for linkage disequilibrium: the insulin gene region and insulin-dependent diabetes mellitus (IDDM). *American Journal of Human Genetics*, 52(3), 506." ], "id": "6f224484443e7d5b" } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.6" } }, "nbformat": 4, "nbformat_minor": 5 }