Papers by Mark Van Der Laan

To GEE or Not to GEE
Epidemiology, Jul 1, 2010
Two modeling approaches are commonly used to estimate the associations between neighborhood characteristics and individual-level health outcomes in multilevel studies (subjects within neighborhoods). Random effects models (or mixed models) use maximum likelihood estimation. Population average models typically use a generalized estimating equation (GEE) approach. These methods are used in place of basic regression approaches because the health of residents in the same neighborhood may be correlated, thus violating independence assumptions made by traditional regression procedures. This violation particularly affects estimates of variability. Though the literature appears to favor the mixed-model approach, little theoretical guidance has been offered to justify this choice. In this paper, we review the assumptions behind the estimates and inference provided by these two approaches. We propose a perspective that treats regression models for what they are in most circumstances: reasonable approximations of some true underlying relationship. We argue in general that mixed models involve unverifiable assumptions on the data-generating distribution, which lead to potentially misleading estimates and biased inference. We conclude that the estimating-equation approach of population average models provides a more useful approximation of the truth.
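
As a rough illustration of the two approaches compared in this paper, the sketch below fits a population average model via GEE and a random intercept mixed model to the same simulated neighborhood-clustered data using statsmodels; the variable names and data-generating process are invented for the example.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_nbhd, n_per = 50, 20  # 50 neighborhoods, 20 residents each (hypothetical)

nbhd = np.repeat(np.arange(n_nbhd), n_per)
poverty = np.repeat(rng.normal(size=n_nbhd), n_per)       # neighborhood-level exposure
u = np.repeat(rng.normal(scale=0.5, size=n_nbhd), n_per)  # shared neighborhood effect
y = 1.0 + 0.3 * poverty + u + rng.normal(size=n_nbhd * n_per)
df = pd.DataFrame({"y": y, "poverty": poverty, "nbhd": nbhd})

# Population average model: GEE with an exchangeable working correlation.
# The robust (sandwich) standard errors remain valid even if the working
# correlation structure is misspecified.
gee = smf.gee("y ~ poverty", groups="nbhd", data=df,
              cov_struct=sm.cov_struct.Exchangeable()).fit()

# Random effects (mixed) model: a random intercept per neighborhood;
# inference leans on the assumed random-effects structure being correct.
mixed = smf.mixedlm("y ~ poverty", df, groups=df["nbhd"]).fit()

print("GEE:  ", gee.params["poverty"], gee.bse["poverty"])
print("Mixed:", mixed.params["poverty"], mixed.bse["poverty"])
```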

Estimators for the value of the optimal dynamic treatment rule with application to criminal justice interventions
The International Journal of Biostatistics, Jun 6, 2022
Given an (optimal) dynamic treatment rule, it may be of interest to evaluate that rule – that is, to ask the causal question: what is the expected outcome had every subject received treatment according to that rule? In this paper, we study the performance of estimators that approximate the true value of: (1) an a priori known dynamic treatment rule; (2) the true, unknown optimal dynamic treatment rule (ODTR); and (3) an estimated ODTR, a so-called “data-adaptive parameter,” whose true value depends on the sample. Using simulations of point-treatment data, we specifically investigate: (1) the impact of increasingly data-adaptive estimation of nuisance parameters and/or of the ODTR on performance; (2) the potential for improved efficiency and bias reduction through the use of semiparametric efficient estimators; and (3) the importance of sample splitting based on the cross-validated targeted maximum likelihood estimator (CV-TMLE) for accurate inference. In the simulations considered, there was very little cost and many benefits to using CV-TMLE to estimate the value of the true and estimated ODTR; importantly, and in contrast to non-cross-validated estimators, the performance of CV-TMLE was maintained even when highly data-adaptive algorithms were used to estimate both nuisance parameters and the ODTR. In addition, we apply these estimators for the value of the rule to the “Interventions” study, an ongoing randomized controlled trial, to identify whether assigning cognitive behavioral therapy (CBT) to criminal justice-involved adults with mental illness using an ODTR significantly reduces the probability of recidivism, compared to assigning CBT in a non-individualized way.
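
For the first and simplest of these targets — the value of an a priori known rule — a minimal sketch under a point-treatment RCT might use an unadjusted inverse-probability-weighted estimator, as below; this is only a baseline comparator, not the CV-TMLE the paper recommends, and the data-generating mechanism is invented.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

W = rng.normal(size=n)               # baseline covariate
A = rng.binomial(1, 0.5, size=n)     # randomized treatment, g(1 | W) = 0.5
# Treatment helps when W > 0 and harms when W < 0 (hypothetical mechanism).
Y = rng.binomial(1, 1 / (1 + np.exp(-(0.5 * A * np.sign(W) - 0.2))))

def rule(w):
    """An a priori specified dynamic treatment rule: treat iff W > 0."""
    return (w > 0).astype(int)

d = rule(W)
g = 0.5  # known randomization probability
# IPW estimator of the rule's value E[Y_d]: upweight subjects whose
# observed treatment happens to agree with the rule.
value_hat = np.mean((A == d) * Y / g)
print(f"Estimated value of the rule: {value_hat:.3f}")
```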

The optimal dynamic treatment rule superlearner: considerations, performance, and application to criminal justice interventions
The International Journal of Biostatistics, Jun 16, 2022
The optimal dynamic treatment rule (ODTR) framework offers an approach for understanding which kinds of patients respond best to specific treatments – in other words, treatment effect heterogeneity. Recently, there has been a proliferation of methods for estimating the ODTR. One such method is an extension of the SuperLearner algorithm – an ensemble method to optimally combine candidate algorithms, extensively used in prediction problems – to ODTRs. Following the “causal roadmap,” we causally and statistically define the ODTR and provide an introduction to estimating it using the ODTR SuperLearner. Additionally, we highlight practical choices when implementing the algorithm, including the choice of candidate algorithms, metalearners to combine the candidates, and risk functions to select the best combination of algorithms. Using simulations, we illustrate how estimating the ODTR using this SuperLearner approach can uncover treatment effect heterogeneity more effectively than traditional approaches based on fitting a parametric regression of the outcome on the treatment, covariates, and treatment-covariate interactions. We investigate the implications of choices in implementing an ODTR SuperLearner at various sample sizes. Our results show the advantages of: (1) including a combination of both flexible machine learning algorithms and simple parametric estimators in the library of candidate algorithms; (2) using an ensemble metalearner to combine candidates rather than selecting only the best-performing candidate; and (3) using the mean outcome under the rule as a risk function. Finally, we apply the ODTR SuperLearner to the “Interventions” study, an ongoing randomized controlled trial, to identify which justice-involved adults with mental illness benefit most from cognitive behavioral therapy to reduce criminal re-offending.
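
A toy rendering of the third design choice above — the mean outcome under the rule as the risk function — might look like the following, where a small library of fixed candidate rules is scored by inverse-probability-weighted estimates of each rule's value and the best one selected (the "discrete" version of the selection step; the ensemble metalearner, rule library, and data-generating process are simplifications of ours).

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4000
W = rng.normal(size=n)
A = rng.binomial(1, 0.5, size=n)                  # randomized treatment
Y = rng.binomial(1, 1 / (1 + np.exp(-(A * W))))   # heterogeneous effect

# Candidate rules (hypothetical library): always treat, never treat, treat iff W > 0.
candidates = {
    "treat_all": lambda w: np.ones_like(w, dtype=int),
    "treat_none": lambda w: np.zeros_like(w, dtype=int),
    "treat_if_W_pos": lambda w: (w > 0).astype(int),
}

# In the full algorithm, candidate rules are fit on training folds and their
# values estimated on validation folds; here the candidates are fixed, so we
# simply estimate each rule's value E[Y_d] by IPW (known g = 0.5).
values = {name: np.mean((A == d(W)) * Y / 0.5) for name, d in candidates.items()}
best = max(values, key=values.get)
print(values, "-> selected:", best)
```
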
Targeted Learning
Springer eBooks, 2012

Electronic Journal of Statistics, 2015
In binary classification problems, the area under the ROC curve (AUC) is commonly used to evaluate the performance of a prediction model. Often, it is combined with cross-validation in order to assess how the results will generalize to an independent data set. In order to evaluate the quality of an estimate for cross-validated AUC, we obtain an estimate of its variance. For massive data sets, the process of generating a single performance estimate can be computationally expensive. Additionally, when using a complex prediction method, the process of cross-validating a predictive model on even a relatively small data set can still require a large amount of computation time. Thus, in many practical settings, the bootstrap is a computationally intractable approach to variance estimation. As an alternative to the bootstrap, we demonstrate a computationally efficient influence-curve-based approach to obtaining a variance estimate for cross-validated AUC.
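
The influence-curve approach can be sketched for a single (non-cross-validated) AUC using the standard first-order influence function of the two-sample U-statistic; the cross-validated version in the paper averages such fold-specific quantities. The implementation below (names ours) ignores ties.

```python
import numpy as np

def auc_ic_variance(scores, labels):
    """Influence-curve-based variance estimate for the empirical AUC.

    Uses the influence function of the two-sample U-statistic
    AUC = P(score_case > score_control), ignoring ties.
    """
    scores, labels = np.asarray(scores), np.asarray(labels)
    n = len(labels)
    q = labels.mean()                                   # P(case)
    cases, controls = scores[labels == 1], scores[labels == 0]
    auc = np.mean(cases[:, None] > controls[None, :])

    # F0(w): fraction of control scores below w; S1(w): fraction of case scores above w.
    F0 = np.searchsorted(np.sort(controls), scores, side="left") / len(controls)
    S1 = 1 - np.searchsorted(np.sort(cases), scores, side="right") / len(cases)

    ic = np.where(labels == 1, (F0 - auc) / q, (S1 - auc) / (1 - q))
    return auc, np.mean(ic ** 2) / n                    # variance of the AUC estimate

rng = np.random.default_rng(3)
y = rng.binomial(1, 0.3, size=2000)
s = rng.normal(loc=y, scale=1.0)   # scores mildly separate the classes
auc, var = auc_ic_variance(s, y)
print(f"AUC = {auc:.3f}, SE = {np.sqrt(var):.4f}")
```

This recovers the familiar two-component variance (variability of the control CDF evaluated at case scores, plus variability of the case survival function at control scores) at a cost linear in n after sorting, rather than the repeated refitting a bootstrap would require.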

arXiv (Cornell University), Jun 8, 2017
Estimating the impact of exposures occurring at the cluster level is of substantial interest. Community randomized trials are often applied to learn about real-world implementation, sustainability, and direct and spill-over effects of interventions with proven individual-level efficacy. Likewise, the literature on the impact of neighborhood exposures on health and well-being continues to grow. To estimate the effect of a cluster-level exposure, we present two targeted maximum likelihood estimators (TMLEs), which harness the hierarchical data structure to reduce bias and variance in an observational setting and increase efficiency in a trial setting. The first TMLE is developed under a non-parametric causal model, which allows for arbitrary interactions between individuals within a cluster. The second TMLE is developed under a causal sub-model, which restricts the dependence of each individual's outcome on the baseline covariates of others. Simulations are used to compare the alternative TMLEs and illustrate the potential gains from incorporating the full hierarchical data structure during estimation, while avoiding unwarranted assumptions. Unlike common approaches (such as random effects modeling or generalized estimating equations), our approach is doubly robust, semi-parametric, and efficient. We illustrate our approach with an applied example estimating the association between a community-based Test-and-Treat strategy and cumulative HIV incidence.
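
As a brief formal sketch of the estimand (our notation — a plausible reading of the setup rather than a quotation from the paper): with cluster-level exposure A_j, baseline covariates W_j = (W_{j1}, …, W_{jN_j}), and individual outcomes Y_{ji} for clusters j = 1, …, J, a natural cluster-level target parameter is the counterfactual mean of the within-cluster average outcome:

```latex
\Psi(P) \;=\; \mathbb{E}\!\left[\, \mathbb{E}\!\left( \frac{1}{N_j}\sum_{i=1}^{N_j} Y_{ji} \;\middle|\; A_j = a,\; W_j \right) \right].
```

Roughly speaking, the sub-model mentioned above would further restrict each individual's outcome regression to depend on W_j primarily through that individual's own covariates, which is what yields the second, potentially more efficient TMLE.
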
* For simplicity, we have been considering the timescale to be in months. Depending on our scientific question and the data resolution, we might be interested in shorter or longer intervals. If our time interval were days, then an intervention to start by day 30 (i.e., within 1 month) is a stochastic intervention. Alternatively, we could consider an intervention to initiate therapy on each day or not. For further discussion of longitudinal treatment regimes, see the Appendix.
* Under the Neyman-Rubin framework, we would assume the existence of the potential outcomes Y_a for all exposures a ∈ A.

Many statistical methods exist that can be used to learn a predictor based on observed data. Examples include decision trees, neural networks, support vector regression, least angle regression, Logic Regression, and the Deletion/Substitution/Addition algorithm. The optimal algorithm for prediction will vary depending on the underlying data-generating distribution. In this article, we introduce a "super learner," a prediction algorithm that applies any set of candidate learners and uses cross-validation to select among them. Theory shows that asymptotically the super learner performs essentially as well as or better than any of the candidate learners. We briefly present the theory behind the super learner, before providing an example based on research aimed at predicting the in vitro phenotypic susceptibility of the HIV virus to antiretroviral drugs based on viral mutations. We apply the super learner to predict susceptibility to one protease inhibitor, nelfinavir, using a set of database-derived nonpolymorphic treatment-selected protease mutations.
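
A minimal sketch of the super learner's two core steps — cross-validated candidate predictions, then a metalearner that weights the candidates — might look like the following, with scikit-learn candidates and a non-negative least squares metalearner chosen for illustration (the paper's library and loss may differ).

```python
import numpy as np
from scipy.optimize import nnls
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 5))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=500)

candidates = [LinearRegression(),
              RandomForestRegressor(n_estimators=100, random_state=0)]

# Step 1: cross-validated predictions for every candidate learner.
Z = np.zeros((len(y), len(candidates)))
for train, valid in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    for k, learner in enumerate(candidates):
        Z[valid, k] = learner.fit(X[train], y[train]).predict(X[valid])

# Step 2: metalearner -- non-negative least squares weights fit on the
# cross-validated predictions, normalized to sum to one.
w, _ = nnls(Z, y)
w = w / w.sum()

# Step 3: refit each candidate on all data; the super learner prediction
# is the weighted combination of the refitted candidates.
fitted = [c.fit(X, y) for c in candidates]
def super_learner_predict(Xnew):
    return sum(wk * c.predict(Xnew) for wk, c in zip(w, fitted))

print("metalearner weights:", np.round(w, 3))
```

Because the weights are chosen on held-out predictions, the ensemble cannot do asymptotically worse than the best single candidate, which is the oracle property the abstract refers to.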

The Methods for Improving Reproductive Health in Africa (MIRA) trial is a recently completed randomized trial that investigated the effect of diaphragm and lubricant gel use in reducing HIV infection among susceptible women. 5,045 women were randomly assigned to either the active treatment arm or the control arm. Additionally, all subjects in both arms received intensive condom counselling and provision, the "gold standard" HIV prevention barrier method. There was much lower reported condom use in the intervention arm than in the control arm, making it difficult to answer important public health questions based solely on the intention-to-treat analysis. We adapt an analysis technique from causal inference to estimate the "direct effects" of assignment to the diaphragm arm, adjusting for condom use in an appropriate sense. Issues raised in the MIRA trial apply to other trials of HIV prevention methods, some of which are currently being conducted or designed.

arXiv (Cornell University), Oct 26, 2017
The assumption that no subject's exposure affects another subject's outcome, known as the no-interference assumption, has long held a foundational position in the study of causal inference. However, this assumption may be violated in many settings, and in recent years has been relaxed considerably. Often this has been achieved either with the aid of a known underlying network, or under the assumption that the population can be partitioned into separate groups, between which there is no interference, and within which each subject's outcome may be affected by all the other subjects in the group via the proportion exposed (the stratified interference assumption). In this paper, we instead consider a complete interference setting, in which each subject affects every other subject's outcome. In particular, we make the stratified interference assumption for a single group consisting of the entire sample. This can occur when the exposure is a shared resource whose efficacy is modified by the number of subjects among whom it is shared. We show that a targeted maximum likelihood estimator for the i.i.d. setting can be used to estimate a class of causal parameters that includes direct effects and overall effects under certain interventions. This estimator remains doubly robust and semiparametric efficient, and continues to allow for the incorporation of machine learning under our model. We conduct a simulation study, and present results from a data application in which we study the effect of a nurse-based triage system on the outcomes of patients receiving HIV care in Kenyan health clinics.
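
A tiny simulation of the complete-interference setting described — one group consisting of the entire sample, with each outcome depending on the subject's own exposure and on the overall proportion exposed — might look like this (the functional form and all names are invented).

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1000

W = rng.normal(size=n)
A = rng.binomial(1, 0.4, size=n)
prop_exposed = A.mean()  # the shared-resource channel: it affects everyone

# Stratified interference for a single group = the whole sample: each outcome
# depends on the subject's own exposure and on the proportion exposed overall
# (e.g., a shared resource diluted as more subjects draw on it).
p = 1 / (1 + np.exp(-(-0.5 + 0.8 * A - 0.6 * A * prop_exposed + 0.3 * W)))
Y = rng.binomial(1, p)

# A "direct effect"-style contrast holds the proportion exposed fixed while
# switching the subject's own exposure; shown here as a plug-in from the
# known truth, purely for intuition about the target parameter.
p1 = 1 / (1 + np.exp(-(-0.5 + 0.8 - 0.6 * prop_exposed + 0.3 * W)))
p0 = 1 / (1 + np.exp(-(-0.5 + 0.3 * W)))
print("direct-effect contrast at observed coverage:", np.mean(p1 - p0))
```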

Nonparametric Efficient Estimation with Current Status Data and Right-Censored Data Structures When Observing a Marker at the Censoring Time
We study nonparametric estimation with two types of data structures. In the first data structure we observe n i.i.d. copies of (C, N(C)), where N is a counting process and C a random monitoring time. In the second data structure we observe n i.i.d. copies of (C ∧ T, I(T ≤ C), N(C ∧ T)), where N is a counting process with a final jump at time T (e.g., death). This data structure includes observing right-censored data on T and a marker variable at the censoring time. In these data structures, easy-to-compute estimators, namely (weighted) Pool-Adjacent-Violators estimators for the unobservable time variables, and the Kaplan-Meier estimator for the time T until the final observable event, are available. These estimators ignore seemingly important information in the data. The actual nonparametric maximum likelihood estimator (NPMLE) uses all the data, but is very hard to compute. In this paper we prove that, at most data-generating distributions, the ad hoc estimators yield asymptotically efficient estimators of √n-estimable parameters, and we explain why the NPMLE is more complex at these data-generating distributions. The results, and a simulation for a special case in van der Laan, Jewell, Peterson (1997), strongly suggest that the practical performance of the proposed simple estimators is better than that of the NPMLE at these data-generating distributions.
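
For the simplest related ingredient — ordinary current status data (C, Δ = I(T ≤ C)) — the NPMLE of F(c) = P(T ≤ c) is exactly the isotonic regression of the indicators Δ on the monitoring times C, i.e., the Pool-Adjacent-Violators solution. A runnable sketch on simulated data (parameter choices ours):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(6)
n = 2000
T = rng.exponential(scale=2.0, size=n)   # unobserved event times
C = rng.uniform(0, 6, size=n)            # random monitoring times
delta = (T <= C).astype(float)           # current status indicator I(T <= C)

# The NPMLE of F(c) = P(T <= c) is the monotone-nondecreasing (isotonic)
# regression of delta on C -- the Pool-Adjacent-Violators solution.
iso = IsotonicRegression(y_min=0.0, y_max=1.0, increasing=True)
F_hat = iso.fit_transform(C, delta)      # fitted values, aligned with C

order = np.argsort(C)
# Compare the NPMLE to the true F at a few monitoring times.
for c in [1.0, 2.0, 4.0]:
    i = order[np.searchsorted(C[order], c)]
    print(f"F_hat({c}) = {F_hat[i]:.3f},  true F({c}) = {1 - np.exp(-c / 2):.3f}")
```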

This is the detailed technical report that accompanies the paper "Analyzing Direct Effects in Randomized Trials with Secondary Interventions: An Application to HIV Prevention Trials" (an unpublished, technical report version of which is available online at http://www.bepress.com/ucbbiostat/paper223). The version here gives full details of the models for the time-dependent analysis, and presents further results in the data analysis section. The Methods for Improving Reproductive Health in Africa (MIRA) trial is a recently completed randomized trial that investigated the effect of diaphragm and lubricant gel use in reducing HIV infection among susceptible women. 5,045 women were randomly assigned to either the active treatment arm or the control arm. Additionally, all subjects in both arms received intensive condom counselling and provision, the "gold standard" HIV prevention barrier method. There was much lower reported condom use in the intervention arm than in the control arm, making it difficult to answer important public health questions based solely on the intention-to-treat analysis. We adapt an analysis technique from causal inference to estimate the "direct effects" of assignment to the diaphragm arm, adjusting for condom use in an appropriate sense. Issues raised in the MIRA trial apply to other trials of HIV prevention methods, some of which are currently being conducted or designed.

The present article discusses and compares multiple testing procedures (MTPs) for controlling Type I error rates defined as tail probabilities for the number (gFWER) and proportion (TPPFP) of false positives among the rejected hypotheses. Specifically, we consider the gFWER- and TPPFP-controlling MTPs proposed recently by Lehmann & Romano (2004) and in a series of four articles by Dudoit et al. (2004), van der Laan et al. (2004b,a), and Pollard & van der Laan (2004). The Lehmann & Romano (2004) procedures are marginal, in the sense that they are based solely on the marginal distributions of the test statistics, i.e., on cutoff rules for the corresponding unadjusted p-values. In contrast, the procedures discussed in our previous articles take into account the joint distribution of the test statistics and apply to general data-generating distributions, i.e., dependence structures among test statistics. The gFWER-controlling common-cutoff and common-quantile procedures of Dudoit et al. (2004) and Pollard & van der Laan (2004) are based on the distributions of maxima of test statistics and minima of unadjusted p-values, respectively. For a suitably chosen initial FWER-controlling procedure, the gFWER- and TPPFP-controlling augmentation multiple testing procedures (AMTPs) of van der Laan et al. (2004a) can also take into account the joint distribution of the test statistics. Given a gFWER-controlling procedure, we also propose AMTPs for controlling tail probability error rates, Pr(g(V_n, R_n) > q), for arbitrary functions g(V_n, R_n) of the number of false positives V_n and the number of rejected hypotheses R_n. The different gFWER- and TPPFP-controlling procedures are compared in a simulation study, where the tests concern the components of the mean vector of a multivariate Gaussian data-generating distribution. Among the notable findings are the substantial power gains achieved by joint procedures compared to marginal procedures.
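
The augmentation idea is simple enough to sketch: given any initial FWER-controlling procedure, additionally rejecting the k hypotheses with the next-smallest p-values controls gFWER(k) = Pr(V > k) at the same level. The sketch below uses Bonferroni as the initial procedure purely for brevity; the joint procedures discussed in the paper would typically be preferred.

```python
import numpy as np
from scipy.stats import norm

def gfwer_augmentation(pvals, alpha=0.05, k=2):
    """Augmentation MTP for gFWER(k) control.

    Start from an initial FWER-controlling procedure (Bonferroni here)
    and additionally reject the k hypotheses with the next-smallest
    p-values. If the initial procedure controls FWER at level alpha,
    the augmented procedure controls Pr(V > k) at level alpha.
    """
    pvals = np.asarray(pvals)
    m = len(pvals)
    initial = pvals <= alpha / m                  # Bonferroni rejections
    order = np.argsort(pvals)
    extra = [i for i in order if not initial[i]][:k]
    rejected = initial.copy()
    rejected[extra] = True
    return np.flatnonzero(rejected)

rng = np.random.default_rng(7)
# 90 true nulls, 10 alternatives (hypothetical simulation).
z = np.concatenate([rng.normal(size=90), rng.normal(loc=3, size=10)])
p = 2 * norm.sf(np.abs(z))
print("rejections:", gfwer_augmentation(p, alpha=0.05, k=2))
```
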
Beyond the Cox Hazard Ratio: A Targeted Learning Approach to Survival Analysis in a Cardiovascular Outcome Trial Application
Statistics in Biopharmaceutical Research, Apr 3, 2023
Data-Adaptive Estimation in Cluster Randomized Trials
Springer series in statistics, 2018
In randomized trials, adjustment for measured covariates during the analysis can reduce variance and increase power. To avoid misleading inference, the analysis plan must be pre-specified. However, it is often unclear a priori which baseline covariates (if any) should be included in the analysis. This results in an important challenge: the need to learn from the data to realize precision gains, but to do so in a pre-specified and rigorous way that maintains valid statistical inference. This challenge is especially prominent in cluster randomized trials (CRTs), which often have limited numbers of independent units (e.g., communities, clinics, or schools) and many potential adjustment variables.

The Sample Average Treatment Effect
Springer series in statistics, 2018
In cluster randomized trials (CRTs), the study units usually are not a simple random sample from some clearly defined target population. Instead, the target population tends to be hypothetical or ill-defined, and the selection of study units tends to be systematic, driven by logistical and practical considerations. As a result, the population average treatment effect (PATE) may be neither well defined nor easily interpretable. In contrast, the sample average treatment effect (SATE) is the mean difference in the counterfactual outcomes for the study units. The sample parameter is easily interpretable and arguably the most relevant when the study units are not sampled from some specific super-population of interest. Furthermore, in most settings the sample parameter will be estimated more efficiently than the population parameter.
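
In symbols (a standard formulation, not quoted from the chapter), with counterfactual outcomes Y_i(1) and Y_i(0) for the n study units:

```latex
\psi_{\mathrm{SATE}} \;=\; \frac{1}{n}\sum_{i=1}^{n}\bigl(Y_i(1) - Y_i(0)\bigr),
\qquad
\psi_{\mathrm{PATE}} \;=\; \mathbb{E}\bigl[\,Y(1) - Y(0)\,\bigr].
```

Because the SATE conditions on the realized study units, its estimators avoid the between-unit sampling variability that inflates the variance of PATE estimators, which is the efficiency point made above.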

There is mixed evidence of the effectiveness of interventions operating on a large scale. Although the lack of consistent results is generally attributed to problems of implementation or governance of the program, the failure to find a statistically significant effect (or the success of finding one) may be due to choices made in the evaluation. To demonstrate the potential limitations and pitfalls of the usual analytic methods used for estimating causal effects, we apply the first half of a roadmap for causal inference to a pre-post evaluation of a community-level, national nutrition program. Selection into the program was non-random and strongly associated with the pre-treatment (lagged) outcome. Using structural causal models (SCMs), directed acyclic graphs (DAGs), and simulated data, we demonstrate that a post-treatment estimand controls for confounding by the lagged outcome but not by possible unmeasured confounders. Two separate difference-in-differences estimands have the potential to adjust for a certain type of unmeasured confounding, but introduce bias if the additional assumptions they require are not met. Our results reveal an important issue of identifiability when estimating the causal effect of a program with pre-post observational data. A careful appraisal of the assumptions underlying the causal model is imperative before committing to a statistical model and progressing to estimation.
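
A toy version of the simulation logic described — selection driven by an unmeasured, time-invariant confounder that also drives the lagged outcome — can make the contrast concrete; all coefficients and the data-generating process below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 100_000

U = rng.normal(size=n)                    # unmeasured, time-invariant confounder
Y0 = U + rng.normal(size=n)               # pre-treatment (lagged) outcome
# Non-random selection into the program, driven by U (and hence strongly
# associated with the lagged outcome Y0).
A = rng.binomial(1, 1 / (1 + np.exp(-(-0.5 - 1.0 * U))))
tau = 0.3                                 # true program effect
Y1 = U + 0.2 + tau * A + rng.normal(size=n)  # post outcome, common time trend 0.2

# Post-treatment estimand, adjusting for the lagged outcome: controls for
# confounding through Y0 but remains biased by the unmeasured U.
X = np.column_stack([np.ones(n), A, Y0])
beta = np.linalg.lstsq(X, Y1, rcond=None)[0]
print("post estimand (adjusts for Y0):", round(beta[1], 3))

# Difference-in-differences estimand: removes the additive, time-invariant
# confounding by U here because the equal-trends assumption holds by design;
# it would be biased under a different data-generating process.
did = (Y1[A == 1] - Y0[A == 1]).mean() - (Y1[A == 0] - Y0[A == 0]).mean()
print("difference-in-differences:", round(did, 3), "  true effect:", tau)
```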

The NPMLE in the Uniform Doubly Censored Current Status Data Model
In biostatistical applications interest often focuses on the estimation of the distribution of the time T between two consecutive events. If the initial event time is observed and the subsequent event time is only known to be larger or smaller than an observed point in time, then the data are described by the well understood singly censored current status model, also known as interval censored data, case I. Jewell, Malani and Vittinghoff (1994) extended this current status model by allowing the initial time to be unobserved, with its distribution over an observed interval [A,B] known to be uniform; the data are referred to as doubly censored current status data. These authors used this model to handle applications in AIDS partner studies, focusing on the nonparametric maximum likelihood estimator (NPMLE) of the distribution function, G, of T. The model is a submodel of the current status model, but G is essentially the derivative of the distribution function of interest, F, in the current status model. In this paper we establish that the NPMLE of G is uniformly consistent and that the resulting estimators of √n-estimable parameters are efficient. We propose an iterative weighted Pool-Adjacent-Violators algorithm to compute the NPMLE of G. The rate of convergence of the NPMLE of F is also established.

Biometrics, Apr 3, 2019
The assumption that no subject's exposure affects another subject's outcome, known as the no-interference assumption, has long held a foundational position in the study of causal inference. However, this assumption may be violated in many settings, and in recent years has been relaxed considerably. Often this has been achieved either with the aid of a known underlying network, or under the assumption that the population can be partitioned into separate groups, between which there is no interference, and within which each subject's outcome may be affected by all the other subjects in the group via the proportion exposed (the stratified interference assumption). In this paper, we instead consider a complete interference setting, in which each subject affects every other subject's outcome. In particular, we make the stratified interference assumption for a single group consisting of the entire sample. This can occur when the exposure is a shared resource whose efficacy is modified by the number of subjects among whom it is shared. We show that a targeted maximum likelihood estimator for the i.i.d. setting can be used to estimate a class of causal parameters that includes direct effects and overall effects under certain interventions. This estimator remains doubly robust and semiparametric efficient, and continues to allow for the incorporation of machine learning under our model. We conduct a simulation study, and present results from a data application in which we study the effect of a nurse-based triage system on the outcomes of patients receiving HIV care in Kenyan health clinics.

The International Journal of Biostatistics, May 1, 2016
In social and health sciences, many research questions involve understanding the causal effect of a longitudinal treatment on mortality (or time-to-event outcomes in general). Often, treatment status may change in response to past covariates that are risk factors for mortality, and in turn, treatment status may also affect such subsequent covariates. In these situations, Marginal Structural Models (MSMs), introduced by Robins (1997, “Marginal structural models,” Proceedings of the American Statistical Association, Section on Bayesian Statistical Science, 1–10), are well-established and widely used tools to account for time-varying confounding. In particular, an MSM can be used to specify the intervention-specific counterfactual hazard function, i.e. the hazard for the outcome of a subject in an ideal experiment where he/she was assigned to follow a given intervention on their treatment variables. The parameters of this hazard MSM are traditionally estimated using Inverse Probability Weighted estimation (Robins 1999, “Marginal structural models versus structural nested models as tools for causal inference,” in Statistical Models in Epidemiology: The Environment and Clinical Trials).
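
The IPW fitting step for a discrete-time hazard MSM is commonly implemented as a weighted pooled logistic regression; a minimal long-format sketch (the data layout, column names, and models are all invented for illustration) follows.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(9)
n, T = 2000, 3

# Long-format data: covariate L_t affects treatment A_t and the discrete-time
# hazard of death, and A_t affects future L (time-varying confounding).
rows = []
for i in range(n):
    L, dead = rng.normal(), 0
    for t in range(T):
        if dead:
            break
        A = rng.binomial(1, 1 / (1 + np.exp(-(-0.5 + 0.8 * L))))
        haz = 1 / (1 + np.exp(-(-2.0 + 0.7 * L - 0.5 * A)))
        dead = rng.binomial(1, haz)
        rows.append(dict(id=i, t=t, L=L, A=A, Y=dead))
        L = 0.5 * L - 0.3 * A + rng.normal(scale=0.5)
df = pd.DataFrame(rows)

# Fit the treatment mechanism and build stabilized inverse-probability
# weights: w_i(t) = prod_{s<=t} P(A_s | s) / P(A_s | L_s, s).
denom = smf.glm("A ~ L + C(t)", df, family=sm.families.Binomial()).fit()
num = smf.glm("A ~ C(t)", df, family=sm.families.Binomial()).fit()
pd_, pn_ = denom.predict(df), num.predict(df)
df["ratio"] = np.where(df["A"] == 1, pn_ / pd_, (1 - pn_) / (1 - pd_))
df["w"] = df.groupby("id")["ratio"].cumprod()

# Weighted pooled logistic regression approximates the hazard MSM
# lambda(t | a) for the intervention-specific counterfactual hazard.
msm = smf.glm("Y ~ A + C(t)", df, family=sm.families.Binomial(),
              freq_weights=np.asarray(df["w"])).fit()
print("MSM log-hazard coefficient for A:", round(msm.params["A"], 3))
```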