Ding (2024) Chapter 25: Mendelian Randomization

Mendelian Randomization is an alternative identification and estimation strategy. It uses genetic variants as instrumental variables for the effect of an exposure on an outcome. Importantly, it rests on the standard assumptions of instrumental variable analysis, so it is only as valid as those assumptions.

Mendelian Randomization with individual-level data reduces to a Two-Stage Least Squares problem, which is illustrated with delicatessen in other examples. However, many Mendelian Randomization analyses are based on summary-level data. Here, we use delicatessen to apply methods for summary-level data. The provided example comes from Peng Ding’s book A First Course in Causal Inference. Using data from the R package mr.raps (bmi.sbp), the results from 3 genome-wide association studies (GWAS) are used to assess the effect of body mass index (BMI) on systolic blood pressure (SBP). For various reasons, this ‘effect’ is likely ill-defined. As such, this example should only be viewed as an illustration of how Mendelian Randomization analyses can be done using delicatessen.

Setup

[1]:

import numpy as np
import scipy as sp
import pandas as pd
import delicatessen as deli
from delicatessen import MEstimator
from delicatessen.estimating_equations import ee_regression

print("Versions")
print("NumPy:       ", np.__version__)
print("SciPy:       ", sp.__version__)
print("Pandas:      ", pd.__version__)
print("Delicatessen:", deli.__version__)

Versions
NumPy:        2.3.5
SciPy:        1.16.3
Pandas:       2.3.3
Delicatessen: 4.2

[3]:

d = pd.read_csv("data/mr_bmi_sbp.csv")
d['I'] = 1

Here, we analyze the summary estimates and their estimated variances using Egger regression. Egger regression fits a linear model, with an intercept, for the estimated instrument-outcome associations (beta.outcome) as a function of the estimated instrument-exposure associations (beta.exposure). The slope of this model is the estimate of the effect of the exposure on the outcome. A weighted linear regression is used, where the weights are the inverse of the estimated variances of the instrument-outcome associations (1/se.outcome**2).
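To make the weighted regression concrete, the estimating equations it solves can be written out. The notation here is introduced only for illustration and is not taken from the book: for genetic variant j, let \hat{\Gamma}_j denote beta.outcome, \hat{\gamma}_j denote beta.exposure, and \hat{\sigma}_{Y,j} denote se.outcome, with \theta_0 the intercept and \theta_1 the slope. Under those assumptions, the estimates are the solution to:

```latex
\sum_{j=1}^{J} \frac{1}{\hat{\sigma}_{Y,j}^{2}}
\left( \hat{\Gamma}_j - \theta_0 - \theta_1 \hat{\gamma}_j \right)
\begin{pmatrix} 1 \\ \hat{\gamma}_j \end{pmatrix} = 0
```

These are the usual score equations for a weighted linear model, with each variant contributing a residual scaled by its weight.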

Fitting this model can be done using the ee_regression functionality in delicatessen as follows:

[6]:

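def psi(theta):
    # Estimating functions for the weighted linear regression of the
    # instrument-outcome associations on an intercept ('I') and the
    # instrument-exposure associations, weighted by inverse variance
    return ee_regression(theta, y=d['beta.outcome'],
                         X=d[['I', 'beta.exposure']],
                         weights=1/(d['se.outcome']**2),
                         model='linear')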

[8]:

estr = MEstimator(psi, init=[0, 0])
estr.estimate()
estr.print_results(decimals=4)

==============================================================
              Estimation Method: M-estimator
--------------------------------------------------------------
No. Observations:         160 | No. Parameters:              2
Solving algorithm:         lm | Max Iterations:           5000
Solving tolerance:      1e-09 | Allow P-Inverse:             1
Derivative Method:     approx | Deriv Approx:            1e-09
Small N Correction:      None | Distribution:           Z-stat
==============================================================
   Theta   StdErr  Z-score      LCL      UCL  P-value  S-value
--------------------------------------------------------------
  0.0001   0.0020   0.0566  -0.0038   0.0040   0.9549   0.0666
  0.3173   0.1069   2.9678   0.1078   0.5268   0.0030   8.3812
==============================================================

In the output, the first row is the intercept and the second row is the slope for beta.exposure. These point estimates match those provided in Section 25.3 of the book. The standard error estimates differ, however, because delicatessen uses the empirical sandwich variance estimator for inference. The statistical conclusions we would draw remain consistent between the approaches.
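To illustrate why the standard errors can differ, the following is a minimal sketch that computes both a model-based and an empirical sandwich variance for the same weighted least squares fit using only NumPy. It rests on two assumptions: the data are synthetic (hypothetical values, not the bmi.sbp data), and the comparison is against the model-based standard errors of a conventional weighted least squares fit, which the text above does not state explicitly.

```python
import numpy as np

# Hypothetical synthetic summary statistics (NOT the bmi.sbp data)
rng = np.random.default_rng(0)
n = 160
beta_exposure = rng.normal(0.03, 0.01, size=n)   # instrument-exposure estimates
se_outcome = rng.uniform(0.01, 0.03, size=n)     # SEs of instrument-outcome estimates
beta_outcome = 0.3 * beta_exposure + rng.normal(0.0, se_outcome)

X = np.column_stack([np.ones(n), beta_exposure])  # intercept and slope design
y = beta_outcome
w = 1.0 / se_outcome**2                           # inverse-variance weights

# Weighted least squares point estimates (closed form)
XtWX = X.T @ (w[:, None] * X)
theta = np.linalg.solve(XtWX, X.T @ (w * y))
resid = y - X @ theta

# Model-based variance with an estimated dispersion parameter
sigma2 = np.sum(w * resid**2) / (n - X.shape[1])
var_model = sigma2 * np.linalg.inv(XtWX)

# Empirical sandwich variance: bread^{-1} meat bread^{-T} / n
scores = (w * resid)[:, None] * X                 # per-variant estimating functions
bread = XtWX / n
meat = scores.T @ scores / n
bread_inv = np.linalg.inv(bread)
var_sandwich = bread_inv @ meat @ bread_inv.T / n

se_model = np.sqrt(np.diag(var_model))
se_sandwich = np.sqrt(np.diag(var_sandwich))
print("Theta:       ", theta)
print("SE (model):  ", se_model)
print("SE (sandwich):", se_sandwich)
```

Both approaches share the same point estimates. Only the variance calculation changes, so differences in the standard errors reflect the choice of variance estimator rather than a different fitted model.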

References

Ding P. (2024). A First Course in Causal Inference. Chapman and Hall/CRC.