We propose seeded binary segmentation for large-scale changepoint detection problems. We construct a deterministic set of background intervals, called seeded intervals, in which single changepoint candidates are searched for. The final selection of changepoints based on these candidates can be done in various ways, adapted to the problem at hand. The method is thus easy to adapt to many changepoint problems, ranging from univariate to high-dimensional. Compared to the recently popular random background intervals, seeded intervals lead to reproducibility and much faster computations. For the univariate Gaussian change-in-mean set-up, the methodology is shown to be asymptotically minimax optimal when paired with appropriate selection criteria. We demonstrate near-linear runtimes and competitive finite-sample estimation performance. Furthermore, we illustrate the versatility of our method in high-dimensional settings.
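As an illustration of the deterministic construction, here is a minimal Python sketch of seeded intervals under assumed conventions: a decay parameter in (1/2, 1) sets the interval length at each scale, and the intervals at each scale have evenly spread, overlapping start points. The function name and defaults are ours, not the paper's.

```python
# A minimal sketch of deterministic seeded-interval construction (assumptions:
# decay = 1/sqrt(2), minimal interval length 2, even spacing of start points).
import math

def seeded_intervals(n, decay=1 / 2 ** 0.5):
    """Deterministic seeded intervals on {0, ..., n-1} as (start, end) pairs."""
    intervals = []
    depth = math.ceil(math.log(n, 1 / decay))              # number of scales
    for k in range(1, depth + 1):
        length = max(2, math.ceil(n * decay ** (k - 1)))   # interval length at scale k
        n_k = 2 * math.ceil((1 / decay) ** (k - 1)) - 1    # number of intervals at scale k
        shift = (n - length) / max(n_k - 1, 1)             # evenly spaced start points
        for i in range(n_k):
            start = round(i * shift)
            intervals.append((start, start + length))
    return sorted(set(intervals))

print(seeded_intervals(100)[:3])   # a few of the largest-scale intervals
```

Each resulting interval would then be scanned for a single changepoint candidate, e.g. by a CUSUM statistic, before the final selection step.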
An implementation of the functional gradient descent algorithm (boosting) for optimizing general risk functions, utilizing component-wise (penalised) least squares estimates or regression trees as base-learners for fitting generalized linear, additive and interaction models to potentially high-dimensional data.
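Since the description concerns componentwise least-squares boosting, a minimal Python sketch of that core loop may help. This illustrates the generic algorithm, not the mboost package itself (which is R); the defaults mstop = 100 and nu = 0.1 are common choices, not prescriptions.

```python
# A minimal sketch of componentwise L2 boosting with linear base-learners,
# assuming the columns of X are centred; nu is the step length, mstop the
# number of boosting iterations. Predictions are offset + X @ coef.
import numpy as np

def l2_boost(X, y, mstop=100, nu=0.1):
    n, p = X.shape
    coef = np.zeros(p)
    offset = y.mean()
    resid = y - offset
    for _ in range(mstop):
        # fit every univariate least-squares base-learner to the residuals
        betas = X.T @ resid / (X ** 2).sum(axis=0)
        sse = ((resid[:, None] - X * betas) ** 2).sum(axis=0)
        j = sse.argmin()                      # best-fitting component
        coef[j] += nu * betas[j]              # shrunken update of one coefficient
        resid -= nu * betas[j] * X[:, j]      # step along the negative gradient
    return offset, coef
```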
We propose a new method for stationary nonlinear time series analysis which dynamically combines models, either parametric or nonparametric, by using mixture probabilities from so-called variable length Markov chains. The approach is very general and flexible: it can be used for modelling conditional means, conditional variances or conditional densities given the previous lagged values, and the methodology can be applied to dynamically combine almost any kind of model. Parameter estimation (finite- or infinite-dimensional) and model selection can be done in a fully data-driven way. We demonstrate the predictive power of the method on finite-sample data, and we present an asymptotic consistency result.
Given data sampled from a number of variables, one is often interested in the underlying causal relationships in the form of a directed acyclic graph. In the general case, without interventions on some of the variables, it is only possible to identify the graph up to its Markov equivalence class. However, in some situations one can find the true causal graph just from observational data, for example in structural equation models with additive noise and nonlinear edge functions. Most current methods for achieving this rely on nonparametric independence tests. One of the problems there is that the null hypothesis is independence, which is what one would like to get evidence for. We take a different approach in our work by using a penalized likelihood as a score for model selection. This is practically feasible in many settings and has the advantage of yielding a natural ranking of the candidate models. Under smoothness assumptions on the probability density space, we prove consistency of the penalized maximum likelihood estimator. We also present empirical results for simulated scenarios and real two-dimensional data sets (cause-effect pairs), where we obtain results similar to other state-of-the-art methods.
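For the two-variable case, the score-based idea can be made concrete with a toy sketch. Here a polynomial regression stands in for the nonparametric fit and a BIC-type term for the paper's penalty; both are simplifying assumptions, and the function is ours, not the paper's.

```python
# A toy penalized-likelihood score for a cause-effect pair, assuming additive
# Gaussian noise; polynomial fit and BIC-type penalty are illustrative stand-ins.
import numpy as np

def direction_score(x, y, degree=4):
    """Penalized Gaussian log-likelihood for the model y = f(x) + noise."""
    n = len(x)
    resid = y - np.polyval(np.polyfit(x, y, degree), x)     # stand-in for a smoother
    # Gaussian log-likelihood of the noise plus a Gaussian marginal for the cause
    loglik = -0.5 * n * (np.log(2 * np.pi * resid.var()) + 1)
    loglik += -0.5 * n * (np.log(2 * np.pi * x.var()) + 1)
    return loglik - 0.5 * (degree + 2) * np.log(n)          # BIC-type penalty

# comparing direction_score(x, y) with direction_score(y, x) ranks the two
# candidate graphs x -> y and y -> x by their penalized likelihood
```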
We study the problem of causal structure learning with essentially no assumptions on the functional relationships and noise. We develop DAG-FOCI, a computationally fast algorithm for this setting that is based on the FOCI variable selection algorithm. DAG-FOCI outputs the set of parents of a response variable of interest. We provide theoretical guarantees for our procedure when the underlying graph does not contain any (undirected) cycle containing the response variable of interest. Furthermore, in the absence of this assumption, we give a conservative guarantee against false positive causal claims when the set of parents is identifiable. We demonstrate the applicability of DAG-FOCI on simulated data as well as a real dataset from computational biology.
We present a graph-based technique for estimating sparse covariance matrices and their inverses from high-dimensional data. The method is based on learning a directed acyclic graph (DAG) and estimating the parameters of a multivariate Gaussian distribution based on that DAG. For inferring the underlying DAG we use the PC-algorithm [27], and for estimating the DAG-based covariance matrix and its inverse we use a Cholesky decomposition approach which provides a positive (semi-)definite sparse estimate. We present a consistency result in the high-dimensional framework and compare our method with the Glasso [12, 8, 2] on simulated and real data.
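A compact sketch of the second step, assuming the DAG is already known and its nodes are in topological order (in the paper the DAG would be estimated by the PC-algorithm; the `parents` encoding is our hypothetical input format).

```python
# A minimal sketch of DAG-based covariance estimation, assuming centred
# columns of X and parents[j] listing the parent indices of node j.
import numpy as np

def dag_covariance(X, parents):
    n, p = X.shape
    B = np.zeros((p, p))              # row j: regression coefficients of node j on its parents
    d = np.zeros(p)                   # residual (noise) variances
    for j in range(p):
        pa = parents[j]
        if pa:
            beta, *_ = np.linalg.lstsq(X[:, pa], X[:, j], rcond=None)
            B[j, pa] = beta
            resid = X[:, j] - X[:, pa] @ beta
        else:
            resid = X[:, j]
        d[j] = resid.var()
    A = np.eye(p) - B
    precision = A.T @ (A / d[:, None])    # (I - B)^T D^{-1} (I - B)
    return np.linalg.inv(precision), precision
```

Because the precision matrix is built as (I − B)ᵀ D⁻¹ (I − B), a Cholesky-type factorisation, the estimate is positive (semi-)definite by construction and inherits its sparsity pattern from the DAG.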
Figure: histograms showing the empirical distribution of scores (left) and margins (right) for the leukemia dataset (AML/ALL distinction), based on 1,000 bootstrap replicates with permuted response variables.
In this paper we describe a nonparametric GARCH model of first order and propose a simple iterative algorithm for its estimation from data. We provide a theoretical justification for this algorithm and give examples of its application to stationary time series data exhibiting stochastic volatility. We observe that our nonparametric procedure often gives better estimates of the unobserved latent volatility process than parametric GARCH(1,1) modelling, particularly when asymmetries are present in the data. We show how the basic iterative idea may be extended to more complex time series models combining ARMA or GARCH features of possibly higher order.
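The iterative idea can be sketched as follows, with simplifying assumptions: a crude rolling-variance initialisation instead of a parametric GARCH(1,1) fit, and a plain Nadaraya-Watson smoother for the bivariate regression of squared returns on the lagged state. These are our illustrative choices, not the paper's exact ones.

```python
# An illustrative sketch for X_t = sigma_t * Z_t with sigma_t^2 = f(X_{t-1}, sigma_{t-1}^2):
# alternate between smoothing X_t^2 on the lagged state and updating the volatility path.
import numpy as np

def nw_smooth(query, points, values, bw):
    """Bivariate Nadaraya-Watson regression estimate at the query points."""
    d2 = ((query[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    w = np.exp(-0.5 * d2 / bw ** 2)
    return (w * values).sum(1) / w.sum(1)

def np_garch(x, n_iter=5, bw=0.5):
    n = len(x)
    sig2 = np.convolve(x ** 2, np.ones(20) / 20, mode="same")  # crude initial volatility
    for _ in range(n_iter):
        z = np.column_stack([x[:-1], sig2[:-1]])   # predictors (X_{t-1}, sigma^2_{t-1})
        # re-estimate f by smoothing squared returns on the lagged state,
        # then update the latent volatility path under the new estimate
        sig2[1:] = nw_smooth(z, z, x[1:] ** 2, bw)
        sig2 = np.clip(sig2, 1e-8, None)
    return sig2
```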
Inferring causal relationships or related associations from observational data can be invalidated by the existence of hidden confounding. We focus on a high-dimensional linear regression setting where the measured covariates are affected by hidden confounding, and propose the Doubly Debiased Lasso estimator for individual components of the regression coefficient vector. The proposed method simultaneously corrects the bias due to estimation of the high-dimensional parameters and the bias caused by the hidden confounding. We establish its asymptotic normality and also prove that it is efficient in the Gauss-Markov sense. The validity of our methodology relies on a dense confounding assumption, i.e. that every confounding variable affects many covariates. The finite-sample performance is illustrated with an extensive simulation study and a genomic application.
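The deconfounding half of such methods can be hinted at with a sketch of a spectral "trim" transform, which caps the top singular values of X before a lasso fit. This omits the debiasing step entirely, and the median cap level is our simplifying choice, so it should be read as a rough illustration only.

```python
# A rough sketch of spectral deconfounding: cap the top singular values of X
# to weaken dense hidden confounding, then run an ordinary lasso (the
# debiasing step of the Doubly Debiased Lasso is omitted here for brevity).
import numpy as np
from sklearn.linear_model import Lasso

def trim_transform(X):
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    tau = np.median(s)                              # cap singular values at their median
    return U @ np.diag(np.minimum(s, tau) / s) @ U.T

def deconfounded_lasso(X, y, alpha=0.1):
    F = trim_transform(X)
    return Lasso(alpha=alpha).fit(F @ X, F @ y).coef_
```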
We study the blockwise bootstrap of Künsch (1989) for a statistic which estimates a parameter of the entire distribution of a stationary time series. Because such a statistic is not symmetric in the observations, one should not simply resample blocks of the original data. When the parameter is the spectral distribution function or an ARMA parameter, the statistic is a symmetric function of all shifts of the suitably extended sample. Then we can resample blocks of shift indices, and the theory is basically the same as for a symmetric statistic. In other cases the statistic is a symmetric function of m-tuples of consecutive data, where m increases with the sample size. Then one can resample blocks of these m-tuples, but the increasing m makes the theory more delicate. We show validity of the bootstrap in two generic examples of spectral estimators, thereby extending results of Politis and Romano (1992).
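The blocks-of-m-tuples idea admits a short sketch; here m = 2 and the statistic is the lag-1 autocovariance of a centred series, with block length and replicate count as arbitrary illustrative choices.

```python
# A minimal sketch of bootstrapping a statistic that is a symmetric function
# of m consecutive observations (here lag-1 autocovariance, m = 2, mean-zero
# series): resample blocks of m-tuples rather than blocks of raw values.
import numpy as np

def tuple_block_bootstrap(x, m=2, block_len=20, n_boot=500, seed=0):
    rng = np.random.default_rng(seed)
    tuples = np.lib.stride_tricks.sliding_window_view(x, m)   # all m-tuples
    n_tuples = len(tuples)
    n_blocks = max(1, n_tuples // block_len)
    stats = []
    for _ in range(n_boot):
        starts = rng.integers(0, n_tuples - block_len + 1, size=n_blocks)
        sample = np.concatenate([tuples[s:s + block_len] for s in starts])
        stats.append(np.mean(sample[:, 0] * sample[:, 1]))    # lag-1 autocovariance
    return np.array(stats)
```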
We present a statistical perspective on boosting. Special emphasis is given to estimating potentially complex parametric or nonparametric models, including generalized linear and additive models as well as regression models for survival analysis. Concepts of degrees of freedom and corresponding Akaike or Bayesian information criteria, particularly useful for regularization and variable selection in high-dimensional covariate spaces, are discussed as well. The practical aspects of boosting procedures for fitting statistical models are illustrated by means of the dedicated open-source software package mboost. This package implements functions which can be used for model fitting, prediction and variable selection. It is flexible, allowing for the implementation of new boosting algorithms optimizing user-specified loss functions.
We present a tutorial and new publicly available computational tools for variable length Markov chains (VLMC). VLMCs are Markov chains with the additional attractive structure that their memories depend on a variable number of lagged values, depending on what the actual past (the lagged values) looks like. They form a very flexible class of tree-structured models for categorical time series. Fitting VLMCs from data is a non-trivial computational task. We provide an efficient implementation of the so-called context algorithm which requires only O(n log(n)) operations. The implementation, which is publicly available, includes additional important new features and options: diagnostics, goodness of fit, simulation and bootstrap, residuals, and tuning of the context algorithm. Our tutorial is presented with a version in R which is available from the Comprehensive R Archive Network (CRAN 1997 ff.). The exposition is self-contained, gives rigorous and partly new mathematical descriptions, and is illustrated by analyzing a DNA sequence from the Epstein-Barr virus.
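A toy Python version of the pruning idea behind the context algorithm (not the package's O(n log n) implementation): grow all contexts up to a maximal depth, then keep a context only when its transition distribution gains enough, in a Kullback-Leibler sense, over its parent context. The data structures and the cutoff rule are our simplifications.

```python
# A toy sketch of context-tree pruning for a categorical sequence; a context
# is the tuple of the d most recent symbols (oldest first), and its parent is
# the same context with the oldest symbol dropped.
from collections import defaultdict
import math

def fit_context_tree(seq, max_depth=5, cutoff=3.0):
    counts = defaultdict(lambda: defaultdict(int))
    for t in range(max_depth, len(seq)):
        for d in range(max_depth + 1):
            counts[tuple(seq[t - d:t])][seq[t]] += 1

    def prob(ctx, a):
        return counts[ctx][a] / sum(counts[ctx].values())

    def gain(ctx):
        # N(ctx) times the KL divergence of the context's transition
        # probabilities from those of its parent; always non-negative
        return sum(n * math.log(prob(ctx, a) / prob(ctx[1:], a))
                   for a, n in counts[ctx].items() if n > 0)

    # keep the contexts whose divergence gain exceeds the cutoff
    return {ctx for ctx in counts if ctx and gain(ctx) >= cutoff}

print(fit_context_tree("0101101011010110", max_depth=3))
```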
When testing multiple hypotheses simultaneously, a quantity of interest is the number m0 of true null hypotheses. We present a general framework for finding upper probabilistic bounds for m0, that is, estimates m̂0 with the property P(m̂0 ≥ m0) ≥ 1 − α for a prespecified level α. Moreover, m̂0 can be used for novel estimates of type I errors in multiple testing, such as the false discovery rate. Control of the family-wise error rate emerges as a special case in our framework but suffers from vanishing power for a large number of tested hypotheses. We present a different estimate such that the ability to detect true non-null hypotheses increases with the number of tested hypotheses. A detailed algorithm is provided. The method is valid under general and unknown dependence between the test statistics. We develop the method primarily for multiple testing of associations between random variables. The method is illustrated with simulation studies and applications to microarray data.
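For orientation, a well-known simpler point estimate of m0, due to Storey, can be written in a few lines. This is explicitly not the paper's permutation-based upper bound (which additionally controls the coverage probability); it merely shows what estimating m0 from the p-value distribution looks like.

```python
# Storey-type point estimate of the number of true nulls m0: p-values above
# a threshold lam come mostly from true nulls, which are uniform on [0, 1].
import numpy as np

def storey_m0(pvals, lam=0.5):
    pvals = np.asarray(pvals)
    return min(len(pvals), (pvals > lam).sum() / (1.0 - lam))
```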
Microarray experiments generate large datasets with expression values for thousands of genes, but no more than a few dozen samples. A challenging task with these data is to reveal groups of co-regulated genes whose collective expression is strongly associated with an outcome variable of interest. To find these groups, we suggest the use of supervised clustering algorithms: these are procedures which use external information about the response variables for clustering the genes. We present Pelora, an algorithm based on penalized logistic regression, that combines gene selection, supervision, gene clustering and sample classification in a single step. With an empirical study on six different microarray datasets, we show that Pelora identifies gene clusters whose expression centroids have excellent predictive potential and yield results that are superior to state-of-the-art classification methods based on single genes. Thus, our clusters can be beneficial in medical diagnostics and prognostics, but they can also be very useful for functional genomics by providing insights into gene function and regulation.
We consider the model selection problem in the class of stationary variable length Markov chains (VLMC) on a finite space. The processes in this class are still Markovian of higher order, but with memory of variable length. Various aims in selecting a VLMC can be formalized with different non-equivalent risks, such as final prediction error or expected Kullback-Leibler information. We consider the asymptotic behavior of different risk functions and show how they can generally be estimated with the same resampling strategy. Such estimated risks then yield new model selection rules: in the special case of classical higher-order full Markov chains we obtain a better proposal than the AIC criterion, which has been suggested in the past. Attacking the model selection problem also yields a proposal for tuning Rissanen's context algorithm, which can be used for estimating the minimal state space and, in turn, the whole probability structure of a VLMC.
Motivation: Microarray experiments generate large datasets with expression values for thousands of genes but no more than a few dozen samples. Accurate supervised classification of tissue samples in such high-dimensional problems is difficult but often crucial for successful diagnosis and treatment. A promising way to meet this challenge is by using boosting in conjunction with decision trees. Results: We demonstrate that the generic boosting algorithm needs some modifications to become an accurate classifier in the context of gene expression data. In particular, we present a feature preselection method, a more robust boosting procedure and a new approach for multi-categorical problems. This allows for slight to drastic increases in performance and yields competitive results on several publicly available datasets.
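The preselection step lends itself to a small sketch: rank genes by a simple two-sample separation score and keep the top k before boosting. The score, the cutoff k and the sklearn stump-based booster are illustrative stand-ins for the paper's exact choices.

```python
# A minimal sketch of feature preselection followed by boosting with stumps;
# the signal-to-noise score and k = 200 are illustrative assumptions.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def preselect(X, y, k=200):
    mu0, mu1 = X[y == 0].mean(0), X[y == 1].mean(0)
    s0, s1 = X[y == 0].std(0), X[y == 1].std(0)
    score = np.abs(mu0 - mu1) / (s0 + s1 + 1e-12)    # per-gene separation score
    return np.argsort(score)[::-1][:k]

def boosted_classifier(X, y, k=200):
    keep = preselect(X, y, k)
    clf = GradientBoostingClassifier(n_estimators=100, max_depth=1)  # stumps
    return clf.fit(X[:, keep], y), keep
```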