delicatessen.estimating_equations.survival.ee_plogit

ee_plogit(theta, X, t, delta, S=None, unique_times=None, weights=None)

Estimating equation for pooled logistic regression with discrete-time survival data. One way to model survival data is to use survival stacking, where survival data is discretized into intervals. These intervals are expanded into a ‘long’ data set, where each row corresponds to a unique person-period. This standard implementation of pooled logistic regression can be computationally expensive. This estimating equation utilizes a re-expression that does not depend on the creation of a long data set. See the references for further details.

The estimating equations for the implementation that does not require a long data set are

\[\begin{split}\sum_{i=1}^n \sum_{k \in \mathcal{K}} \left\{ I(T_i^* > s_{k-1}) \left( Y_{i,k} - \hat{Y}_{i,k} \right) \begin{bmatrix} \mathbb{X}_i^T \\ \mathbb{S}_{i,k}^T \\ \end{bmatrix} \right\} = 0\end{split}\]

where \(T_i^*\) is the observed (continuous time), \(Y_{i,k} = I(s_{k-1} < T_i^* \le s_{k}) \Delta_i\) is the indicator that the event happened in interval \(s_k\), \(\hat{Y}_{i,k} = \text{expit}(\mathbb{X}_i \theta_x^T + \mathbb{S}_{i,k} \theta_s^T)\) is the predicted probability of the event for interval \(k\) where \(\theta_x\) are the coefficients for the covariates and \(\theta_s\) are the coefficients for the times, \(\mathbb{X}\) is the design matrix for baseline covariates, and \(\mathbb{S}\) is the design matrix for time.

Note

Time-varying covariates are not currently supported for ee_plogit.

Here, \(\mathbb{S}\) controls how the hazard is allowed to vary over time and is specified by the S argument. The default is None, which automatically uses disjoint indicators. This approach is highly flexible and makes no parametric constraints on the functional form for the underlying (discrete-time) hazard function. When specifying S, a p-by-K matrix is expected, where p is the number of parameters and K is the number of discrete time points (as detected in the data). See the examples for advice on constructing this design matrix.

Here, \(\theta\) is a 1-by-(b`+`K) array, which corresponds to the coefficients in the corresponding pooled logistic regression model and b is the distinct covariates included as part of X and K is the number of unique time intervals in t. Note that K varies by whether and how S is specified. See examples for further details.

Parameters
  • theta (ndarray, list, vector) – theta consists of b`+1 values. Therefore, initial values should consist of the same number as the number of columns present in ``X` plus 1. This can easily be implemented via [0, ] * X.shape[1] + [0, ]. Note that if using an exponential model, only b values need to be provided.

  • X (ndarray, list, vector) – 2-dimensional vector of n observed values for b variables.

  • t (ndarray, list, vector) – 1-dimensional vector of n observed times.

  • delta (ndarray, list, vector) – 1-dimensional vector of n values indicating whether the time was an event or censoring.

  • S (ndarray, list, vector, None, optional) – Design matrix for how the (discrete-time) hazard can vary over time. Default is None, which uses a disjoint indicator matrix for time. Here, a computational trick described in Zivich et al. is used to reduce the computational burden. As such, the time design matrix is only dimension K* which corresponds to the number of unique event times. If providing a matrix, it is expected to have dimensions p-by-K, where p is the number of parameters (columns) and K is the number of unique times with a one-unit increase.

  • unique_times (ndarray, list, vector, None, optional) – This argument is only used when S=None, otherwise it is ignored. This function allows a user to restrict the automatic disjoint indicator matrix to a subset of times. This feature is only intended to be used when evaluating a stratified pooled logistic model and to maintain the computational trick described in Zivich et al. See examples for further details.

  • weights (ndarray, list, vector, None, optional) – 1-dimensional vector of n weights. Default is None, which assigns a weight of 1 to all observations.

Returns

Returns a (b`+`K)-by-n NumPy array evaluated for the input theta.

Return type

array

Examples

Construction of a estimating equation(s) with ee_plogit should be done similar to the following

>>> import numpy as np
>>> import pandas as pd
>>> from scipy.stats import logistic
>>> from delicatessen import MEstimator
>>> from delicatessen.estimating_equations import ee_plogit
>>> from delicatessen.data import load_breast_cancer

Here, we will illustrate pooled logistic regression with breast cancer from the Middlesex Hospital in July 1987. This data can be loaded as follows

>>> d = pd.DataFrame(load_breast_cancer(), columns=['d', 't', 'statin'])

To start, we will demonstrate pooled logistic regression where time is modeled using the automatically generated design matrix with disjoint indicators. First, we will assess how many unique event times exist in this data. This can easily be done as follows

>>> unique_event_times = list(np.unique(d.loc[d['d'] == 1, 't']))

Next, we specify the estimating function and our data.

>>> def psi(theta):
>>>     return ee_plogit(theta=theta, X=d[['statin', ]], delta=d['d'], t=d['t'])

Next, M-estimator procedure can be called to estimate theta. Note that init requires X.shape[1] plus len(unique_event_times) starting values. To help speed up the parameter solving process, it can be helpful to specify the intercept as a negative value (-3 as below). Note that the time coefficients follows after the covariate coefficients in the inputs.

>>> inits = [0., ] + [-3., ] + [0., ]*(len(unique_times) - 1)
>>> estr = MEstimator(stacked_equations=psi, init=inits)
>>> estr.estimate()

Inspecting the covariate parameter estimates and variance

>>> estr.theta[0]
>>> estr.variance[0, 0]

As described elsewhere, the covariate coefficients can be broadly interpreted as (approximations of that) hazard ratios from a Cox proportional hazards model.

Pooled logistic also allows one to specify parametric functional forms for how the discrete-time hazard evolves over time. Some of these specifications correspond with other well-known survival models. Here, we will model time as linear, which is analogous to a Gompertz model. Here, we need to manually create S, which can be done using the following process

>>> t_steps = np.asarray(range(1, np.max(d['t'])+1))
>>> intercept = np.ones(t_steps.shape)[:, None]
>>> s_matrix = np.concatenate([intercept, t_steps[:, None], ], axis=1)

The first step creates an array of times corresponding to one-unit increases in time. The second creates the intercept term. Finally, we concatenate the intercept column and the linear term columns together into the design matrix for time. Now we can specify and fit the corresponding pooled logistic model. Note that only 3 starting values need to be provided, since the design matrix for time has two terms.

>>> def psi(theta):
>>>     return ee_plogit(theta=theta, X=d[['statin', ]], delta=d['d'], t=d['t'],
>>>                               S=s_matrix)
>>> inits = [0., ] + [-3., 0.]
>>> estr = MEstimator(stacked_equations=psi, init=inits)
>>> estr.estimate()

Weighted models can be estimated by specifying the optional weights argument. This can be a set of weights fixed as baseline, provided as a 1D array (like other regression models). Unlike other regression models, ee_plogit also allows for weights to be time-varying (i.e., a 2D array can be input). Note that the dimensions of the 2D array need to correspond to the time intervals of \(\mathbb{S}\), so it should be n-by-K dimensions.

References

Abbott RD. (1985). Logistic regression in survival analysis. American Journal of Epidemiology, 121(3), 465-471.

D’Agostino RB, Lee ML, Belanger AJ, Cupples LA, Anderson K, & Kannel WB. (1990). Relation of pooled logistic regression to time dependent Cox regression analysis: the Framingham Heart Study. Statistics in Medicine, 9(12), 1501-1515.

Murray EJ, Caniglia EC, & Petito LC. (2021). Causal survival analysis: a guide to estimating intention-to-treat and per-protocol effects from randomized clinical trials with non-adherence. Research Methods in Medicine & Health Sciences, 2(1), 39-49.

Zivich PN, Cole SR, Shook-Sa BE, DeMonte JB, & Edwards JK. (2025). Estimating equations for survival analysis with pooled logistic regression. arXiv:2504.13291

Zivich PN, Klose M, DeMonte JB, Shook-Sa BE, Cole SR, & Edwards JK. (2026). An Improved Pooled Logistic Regression Implementation. Epidemiology, In-Press.