delicatessen.estimating_equations.survival.ee_plogit
- ee_plogit(theta, X, t, delta, S=None, unique_times=None, weights=None)
Estimating equation for pooled logistic regression with discrete-time survival data. One way to model survival data is to use survival stacking, where survival data is discretized into intervals. These intervals are expanded into a ‘long’ data set, where each row corresponds to a unique person-period. This standard implementation of pooled logistic regression can be computationally expensive. This estimating equation utilizes a re-expression that does not depend on the creation of a long data set. See the references for further details.
The estimating equations for the implementation that does not require a long data set are
\[\begin{split}\sum_{i=1}^n \sum_{k \in \mathcal{K}} \left\{ I(T_i^* > s_{k-1}) \left( Y_{i,k} - \hat{Y}_{i,k} \right) \begin{bmatrix} \mathbb{X}_i^T \\ \mathbb{S}_{i,k}^T \\ \end{bmatrix} \right\} = 0\end{split}\]where \(T_i^*\) is the observed (continuous time), \(Y_{i,k} = I(s_{k-1} < T_i^* \le s_{k}) \Delta_i\) is the indicator that the event happened in interval \(s_k\), \(\hat{Y}_{i,k} = \text{expit}(\mathbb{X}_i \theta_x^T + \mathbb{S}_{i,k} \theta_s^T)\) is the predicted probability of the event for interval \(k\) where \(\theta_x\) are the coefficients for the covariates and \(\theta_s\) are the coefficients for the times, \(\mathbb{X}\) is the design matrix for baseline covariates, and \(\mathbb{S}\) is the design matrix for time.
Note
Time-varying covariates are not currently supported for
ee_plogit.Here, \(\mathbb{S}\) controls how the hazard is allowed to vary over time and is specified by the
Sargument. The default isNone, which automatically uses disjoint indicators. This approach is highly flexible and makes no parametric constraints on the functional form for the underlying (discrete-time) hazard function. When specifyingS, a p-by-K matrix is expected, where p is the number of parameters and K is the number of discrete time points (as detected in the data). See the examples for advice on constructing this design matrix.Here, \(\theta\) is a 1-by-(b`+`K) array, which corresponds to the coefficients in the corresponding pooled logistic regression model and b is the distinct covariates included as part of
Xand K is the number of unique time intervals int. Note that K varies by whether and howSis specified. See examples for further details.- Parameters
theta (ndarray, list, vector) – theta consists of b`+1 values. Therefore, initial values should consist of the same number as the number of columns present in ``X` plus 1. This can easily be implemented via
[0, ] * X.shape[1] + [0, ]. Note that if using an exponential model, only b values need to be provided.X (ndarray, list, vector) – 2-dimensional vector of n observed values for b variables.
t (ndarray, list, vector) – 1-dimensional vector of n observed times.
delta (ndarray, list, vector) – 1-dimensional vector of n values indicating whether the time was an event or censoring.
S (ndarray, list, vector, None, optional) – Design matrix for how the (discrete-time) hazard can vary over time. Default is
None, which uses a disjoint indicator matrix for time. Here, a computational trick described in Zivich et al. is used to reduce the computational burden. As such, the time design matrix is only dimension K* which corresponds to the number of unique event times. If providing a matrix, it is expected to have dimensions p-by-K, where p is the number of parameters (columns) and K is the number of unique times with a one-unit increase.unique_times (ndarray, list, vector, None, optional) – This argument is only used when
S=None, otherwise it is ignored. This function allows a user to restrict the automatic disjoint indicator matrix to a subset of times. This feature is only intended to be used when evaluating a stratified pooled logistic model and to maintain the computational trick described in Zivich et al. See examples for further details.weights (ndarray, list, vector, None, optional) – 1-dimensional vector of n weights. Default is
None, which assigns a weight of 1 to all observations.
- Returns
Returns a (b`+`K)-by-n NumPy array evaluated for the input
theta.- Return type
array
Examples
Construction of a estimating equation(s) with
ee_plogitshould be done similar to the following>>> import numpy as np >>> import pandas as pd >>> from scipy.stats import logistic >>> from delicatessen import MEstimator >>> from delicatessen.estimating_equations import ee_plogit >>> from delicatessen.data import load_breast_cancer
Here, we will illustrate pooled logistic regression with breast cancer from the Middlesex Hospital in July 1987. This data can be loaded as follows
>>> d = pd.DataFrame(load_breast_cancer(), columns=['d', 't', 'statin'])
To start, we will demonstrate pooled logistic regression where time is modeled using the automatically generated design matrix with disjoint indicators. First, we will assess how many unique event times exist in this data. This can easily be done as follows
>>> unique_event_times = list(np.unique(d.loc[d['d'] == 1, 't']))
Next, we specify the estimating function and our data.
>>> def psi(theta): >>> return ee_plogit(theta=theta, X=d[['statin', ]], delta=d['d'], t=d['t'])
Next, M-estimator procedure can be called to estimate
theta. Note thatinitrequiresX.shape[1]pluslen(unique_event_times)starting values. To help speed up the parameter solving process, it can be helpful to specify the intercept as a negative value (-3as below). Note that the time coefficients follows after the covariate coefficients in the inputs.>>> inits = [0., ] + [-3., ] + [0., ]*(len(unique_times) - 1) >>> estr = MEstimator(stacked_equations=psi, init=inits) >>> estr.estimate()
Inspecting the covariate parameter estimates and variance
>>> estr.theta[0] >>> estr.variance[0, 0]
As described elsewhere, the covariate coefficients can be broadly interpreted as (approximations of that) hazard ratios from a Cox proportional hazards model.
Pooled logistic also allows one to specify parametric functional forms for how the discrete-time hazard evolves over time. Some of these specifications correspond with other well-known survival models. Here, we will model time as linear, which is analogous to a Gompertz model. Here, we need to manually create
S, which can be done using the following process>>> t_steps = np.asarray(range(1, np.max(d['t'])+1)) >>> intercept = np.ones(t_steps.shape)[:, None] >>> s_matrix = np.concatenate([intercept, t_steps[:, None], ], axis=1)
The first step creates an array of times corresponding to one-unit increases in time. The second creates the intercept term. Finally, we concatenate the intercept column and the linear term columns together into the design matrix for time. Now we can specify and fit the corresponding pooled logistic model. Note that only 3 starting values need to be provided, since the design matrix for time has two terms.
>>> def psi(theta): >>> return ee_plogit(theta=theta, X=d[['statin', ]], delta=d['d'], t=d['t'], >>> S=s_matrix)
>>> inits = [0., ] + [-3., 0.] >>> estr = MEstimator(stacked_equations=psi, init=inits) >>> estr.estimate()
Weighted models can be estimated by specifying the optional
weightsargument. This can be a set of weights fixed as baseline, provided as a 1D array (like other regression models). Unlike other regression models,ee_plogitalso allows for weights to be time-varying (i.e., a 2D array can be input). Note that the dimensions of the 2D array need to correspond to the time intervals of \(\mathbb{S}\), so it should be n-by-K dimensions.References
Abbott RD. (1985). Logistic regression in survival analysis. American Journal of Epidemiology, 121(3), 465-471.
D’Agostino RB, Lee ML, Belanger AJ, Cupples LA, Anderson K, & Kannel WB. (1990). Relation of pooled logistic regression to time dependent Cox regression analysis: the Framingham Heart Study. Statistics in Medicine, 9(12), 1501-1515.
Murray EJ, Caniglia EC, & Petito LC. (2021). Causal survival analysis: a guide to estimating intention-to-treat and per-protocol effects from randomized clinical trials with non-adherence. Research Methods in Medicine & Health Sciences, 2(1), 39-49.
Zivich PN, Cole SR, Shook-Sa BE, DeMonte JB, & Edwards JK. (2025). Estimating equations for survival analysis with pooled logistic regression. arXiv:2504.13291
Zivich PN, Klose M, DeMonte JB, Shook-Sa BE, Cole SR, & Edwards JK. (2026). An Improved Pooled Logistic Regression Implementation. Epidemiology, In-Press.