Papers by Ricardo Fraiman
Scientific Reports, 2021
Using a new probabilistic approach we model the relationship between sequences of auditory stimuli generated by stochastic chains and the electroencephalographic (EEG) data acquired while 19 participants were exposed to those stimuli. The structure of the chains generating the stimuli is characterized by rooted and labeled trees whose leaves, henceforth called contexts, represent the sequences of past stimuli governing the choice of the next stimulus. A classical conjecture claims that the brain assigns probabilistic models to samples of stimuli. If this is true, then the context tree generating the sequence of stimuli should be encoded in the brain activity. Using an innovative statistical procedure we show that this context tree can effectively be extracted from the EEG data, thus giving support to the classical conjecture.
TEST
We address the problem of testing for the invariance of a probability measure under the action of a group of linear transformations. We propose a procedure based on one-dimensional projections, justified using a variant of the Cramér-Wold theorem. Our test procedure is powerful, computationally efficient, and circumvents the curse of dimensionality. It includes, as special cases, tests for exchangeability and sign-invariant exchangeability. We compare our procedure with some previous proposals for these cases in a small simulation study. Our methods extend to the case of infinite-dimensional spaces (multivariate functional data). The paper concludes with two real-data examples.
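As a rough illustration of the projection idea (a sketch, not the paper's procedure), sign-invariance can be probed by comparing a one-dimensional projection of the sample with the same projection of a randomly sign-flipped copy. The random direction `u` and the two-sample Kolmogorov-Smirnov test are assumptions of this sketch, and the p-value is only indicative because the two samples share the same underlying points.

```python
import numpy as np
from scipy.stats import ks_2samp

def sign_invariance_pvalue(X, rng):
    # Project the sample and a randomly sign-flipped copy onto one random
    # direction; under sign-invariance both projections have the same law.
    n, d = X.shape
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)                      # random unit direction
    signs = rng.choice([-1.0, 1.0], size=(n, d))
    return ks_2samp(X @ u, (signs * X) @ u).pvalue

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 3))               # N(0, I) is sign-invariant
p = sign_invariance_pvalue(X, rng)
```

A large p-value is consistent with sign-invariance; repeating over many directions (as the paper's projection approach suggests) would give a proper multiple-testing procedure.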
Nonparametric Regression Based on Discretely Sampled Curves
DOAJ (DOAJ: Directory of Open Access Journals), Feb 1, 2020
TEST
The lens depth of a point has recently been extended to general metric spaces, which is not the case for most depths. It is defined as the probability of being included in the intersection of two random balls centred at two random points X and Y, both with radius d(X, Y). We prove that, on a separable and complete metric space, the level sets of the empirical lens depth based on an iid sample converge, in the Painlevé-Kuratowski sense, to their population counterparts. We also prove that, restricted to compact sets, the empirical level sets and their boundaries are consistent estimators, in Hausdorff distance, of their population counterparts, and we analyse two real-life examples.
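A minimal sketch of the empirical lens depth, written here for Euclidean data although the paper works in general metric spaces (swapping `np.linalg.norm` for any metric `d` gives the general case):

```python
import numpy as np
from itertools import combinations

def lens_depth(x, sample):
    # Fraction of pairs (X_i, X_j) whose lens
    # B(X_i, d_ij) ∩ B(X_j, d_ij) contains x (Euclidean metric here).
    pts = np.asarray(sample)
    pairs = list(combinations(range(len(pts)), 2))
    inside = 0
    for i, j in pairs:
        dij = np.linalg.norm(pts[i] - pts[j])
        if (np.linalg.norm(x - pts[i]) <= dij
                and np.linalg.norm(x - pts[j]) <= dij):
            inside += 1
    return inside / len(pairs)

rng = np.random.default_rng(1)
sample = rng.standard_normal((100, 2))
center_depth = lens_depth(sample.mean(axis=0), sample)      # deep point
outlier_depth = lens_depth(np.array([10.0, 10.0]), sample)  # shallow point
```

Central points get high depth and far-away points get depth near zero, which is what makes the level sets of this function usable as central regions.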
Journal of the Royal Statistical Society Series B: Statistical Methodology
Using some extensions of a theorem of Heppes on finitely supported discrete probability measures, we address the problems of classification and testing based on projections. In particular, when the support of the distributions is known in advance (as, for instance, for multivariate Bernoulli distributions), a single suitably chosen projection determines the distribution. Several applications of these results are considered.
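The known-support phenomenon can be made concrete for multivariate Bernoulli data: with the illustrative direction u = (1, 2, 4, ...), the projection is injective on {0,1}^d, so the law of the single projection determines the joint law. The specific direction is an assumption of this sketch, not necessarily the projection the paper constructs.

```python
import numpy as np
from collections import Counter

# For {0,1}^d-valued data, u = (1, 2, 4, ..., 2^(d-1)) maps each support
# point to a distinct integer, so the distribution of the single
# projection <u, X> determines the joint distribution.
d = 3
u = 2 ** np.arange(d)                            # (1, 2, 4): binary encoding

rng = np.random.default_rng(2)
X = (rng.random((1000, d)) < 0.4).astype(int)    # iid Bernoulli(0.4) coordinates

proj = X @ u                                      # one-dimensional projection
# Recover the joint empirical law from the projected one by decoding bits.
recovered = {tuple((code >> k) & 1 for k in range(d)): cnt
             for code, cnt in Counter(proj.tolist()).items()}
direct = Counter(map(tuple, X.tolist()))
```

Decoding the projected counts reproduces the joint empirical distribution exactly, illustrating why one projection suffices when the support is known.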
arXiv (Cornell University), Nov 21, 2022
Several measures of non-convexity (departures from convexity) have been introduced in the literature, both for sets and functions. Some of them are of a geometric nature, while others are more topological. We address the statistical analysis of some of these measures of non-convexity of a set S, by dealing with their estimation based …
Semi-supervised learning
arXiv (Cornell University), Sep 17, 2017

arXiv (Cornell University), Jun 27, 2022
According to a well-known theorem of Cramér and Wold, if P and Q are two Borel probability measures on R^d whose projections P_L, Q_L onto each line L in R^d satisfy P_L = Q_L, then P = Q. Our main result is that, if P and Q are both elliptical distributions, then, to show that P = Q, it suffices merely to check that P_L = Q_L for a certain set of (d^2 + d)/2 lines L. Moreover, (d^2 + d)/2 is optimal. The class of elliptical distributions contains the Gaussian distributions as well as many other multivariate distributions of interest. Our theorem contrasts with other variants of the Cramér-Wold theorem in that no assumption is made about the finiteness of moments of P and Q. We use our results to derive a statistical test for equality of elliptical distributions, and carry out a small simulation study of the test, comparing it with other tests from the literature. We also give an application to learning (binary classification), again illustrated with a small simulation.
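A back-of-the-envelope sketch of a projection-based comparison: for d = 4 the theorem needs (d² + d)/2 = 10 lines. The random lines, per-line Kolmogorov-Smirnov tests, and Bonferroni correction below are stand-ins for the paper's specific construction, which prescribes a particular set of lines.

```python
import numpy as np
from scipy.stats import ks_2samp

d = 4
n_lines = (d * d + d) // 2                 # (d^2 + d)/2 = 10 lines for d = 4

rng = np.random.default_rng(3)
P_sample = rng.standard_normal((400, d))   # two samples from the same law
Q_sample = rng.standard_normal((400, d))

# Stand-in: random lines, per-line two-sample KS tests, Bonferroni at 0.05.
pvals = []
for _ in range(n_lines):
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)
    pvals.append(ks_2samp(P_sample @ u, Q_sample @ u).pvalue)
reject = min(pvals) < 0.05 / n_lines
```

With samples from the same elliptical law, the corrected test rarely rejects; replacing one sample's scatter matrix would drive some of the line-wise p-values toward zero.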

Canadian Journal of Statistics
Starting with Tukey's pioneering work in the 1970s, the notion of depth in statistics has been widely extended, especially in the last decade. Such extensions include those to high-dimensional data, functional data, and manifold-valued data. In particular, in the learning paradigm, the depth-depth method has become a useful technique. In this article, we extend the lens depth to the case of data in metric spaces and study its main properties. We also introduce, for Riemannian manifolds, the weighted lens depth, which is nothing more than a lens depth for a weighted version of the Riemannian distance. To build it, we replace the geodesic distance on the manifold with the Fermat distance, which has the important property of taking into account the density of the data together with the geodesic distance. Next, we illustrate our results with some simulations and also on some interesting real datasets, including pattern recognition in phylogenetic trees, using the d…
Complex models are frequently employed to describe physical and mechanical phenomena. In this setting we have an input X in a general space, and an output Y = f(X), where f is a very complicated function whose computational cost for every new input is very high. We are given two sets of observations of X, S1 and S2, of different sizes, such that only f(S1) is available. We tackle the problem of selecting a subset S3 ⊂ S2 of smaller size on which to run the complex model f, such that the empirical distribution of f(S3) is close to that of f(S1). We suggest three algorithms to solve this problem and show their efficiency using simulated datasets and the Airfoil Self-Noise data set.
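A hedged one-dimensional sketch of the selection problem (not one of the paper's three algorithms): greedily grow S3 so that its empirical law stays close to that of S1 in input space, measured here by the 1-Wasserstein distance, in the hope that the match transfers through f. The distributions, budget, and greedy criterion are all assumptions of this illustration.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(4)
S1 = rng.normal(0.0, 1.0, 200)        # inputs on which f was already run
S2 = rng.normal(0.0, 1.5, 500)        # candidates; running f on all is too costly
m = 50                                 # budget: size of S3

# Greedily add the candidate that keeps the empirical law of the growing
# subset closest to that of S1 (matching in input space, not through f).
chosen, remaining = [], list(range(len(S2)))
for _ in range(m):
    best = min(remaining,
               key=lambda i: wasserstein_distance(S1, S2[chosen + [i]]))
    chosen.append(best)
    remaining.remove(best)
S3 = S2[chosen]
```

The selected subset tracks the law of S1 far better than S2 as a whole does, which is the property one wants before spending the budget of m expensive evaluations of f.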

Electronic Journal of Statistics
We study the problem of estimating the surface area of the boundary ∂S of a sufficiently smooth set S ⊂ R^d when the available information is only a finite subset Xn ⊂ S. We propose two estimators. The first makes use of the Devroye-Wise support estimator and is based on Crofton's formula, which, roughly speaking, states that the (d − 1)-dimensional surface area of a smooth enough set is proportional to the mean number of intersections with randomly chosen lines. For that purpose, we propose an estimator of the number of intersections of such lines with the support, based on the Devroye-Wise support estimator. The second surface area estimator makes use of the α-convex hull of Xn, denoted by Cα(Xn). More precisely, it is the (d − 1)-dimensional surface area of Cα(Xn), denoted by |Cα(Xn)|_{d−1}, which is proven to converge to the (d − 1)-dimensional surface area of ∂S. Moreover, |Cα(Xn)|_{d−1} can be computed using Crofton's formula. Our results depend on the Hausdorff distance between S and Xn for the Devroye-Wise estimator, and on the Hausdorff distance between ∂S and ∂Cα(Xn) for the second estimator. MSC: Primary 62G05; secondary 62G20.
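Crofton's formula is easy to sanity-check by Monte Carlo. The sketch below assumes the set is known (unlike the paper's setting, where only Xn is observed) and recovers the perimeter of the unit circle from random lines:

```python
import numpy as np

# Cauchy-Crofton in the plane: length(∂S) = (1/2) ∫∫ n(θ, p) dp dθ, where
# n(θ, p) counts intersections of the line {x cos θ + y sin θ = p} with ∂S.
# Monte Carlo version for the unit circle (true answer: 2π).
rng = np.random.default_rng(5)
N, R = 200_000, 2.0                        # R bounds the offset p of sampled lines
theta = rng.uniform(0.0, np.pi, N)         # line direction
p = rng.uniform(-R, R, N)                  # signed distance from the origin
n_hits = np.where(np.abs(p) < 1.0, 2, 0)   # a line at distance |p| < 1 cuts the circle twice
length_est = 0.5 * (np.pi * 2 * R) * n_hits.mean()
```

The estimate lands close to 2π ≈ 6.283; the paper's estimators replace the exact intersection count `n_hits` with counts against the Devroye-Wise estimator or the α-convex hull built from the sample.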
arXiv: Statistics Theory, 2018
We address one of the important problems in Big Data, namely how to combine estimators from different subsamples by robust fusion procedures when we are unable to deal with the whole sample. We propose a general framework based on the classic idea of 'divide and conquer'. In particular, we address in some detail the case of a multivariate location and scatter matrix, the covariance operator for functional data, and clustering problems.
The analysis of animal movement has gained attention recently, and new continuous-time models and statistical methods have been developed. All of them are based on the assumption that the movement can be recorded over a long period of time, which is sometimes infeasible, for instance when the battery life of the GPS is short. We prove that the estimation of the home range improves if periods when the GPS is on are alternated with periods when the GPS is turned off. This is illustrated through a simulation study and real-life data. We also provide estimators of the stationary distribution, level sets (which provide estimators of the core area), and the drift function.
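The level-set idea behind the core-area estimator can be sketched as a histogram-based highest-density region on simulated positions; the on/off GPS scheme and the diffusion model of the paper are not reproduced here, and the 95% level, bin count, and data-generating model are assumptions of this illustration.

```python
import numpy as np

rng = np.random.default_rng(9)
pos = rng.normal(0.0, 1.0, (5000, 2))        # stand-in for recorded GPS fixes

# Histogram estimate of the stationary density, then keep the
# highest-density cells until they cover 95% of the mass.
H, xedges, yedges = np.histogram2d(pos[:, 0], pos[:, 1], bins=30,
                                   range=[[-4, 4], [-4, 4]])
p = H / H.sum()
order = np.argsort(p.ravel())[::-1]          # cells sorted by density
cum = np.cumsum(p.ravel()[order])
keep = order[cum <= 0.95]                     # cells forming the 95% region
cell_area = (xedges[1] - xedges[0]) * (yedges[1] - yedges[0])
home_range_area = len(keep) * cell_area
```

For a standard bivariate normal the true 95% highest-density region is a disk of area about 18.8, and the histogram estimate lands nearby; with real GPS fixes the same construction gives a crude home-range and core-area estimate.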

arXiv: Statistics Theory, 2020
We study the problem of estimating the surface area of the boundary of a sufficiently smooth set when the available information is only a set of points (random or not) that becomes dense (with respect to the Hausdorff distance) in the set, or the trajectory of a reflected diffusion. We obtain consistency results in this general setup, and we derive rates of convergence for the iid case and for the case when the data correspond to the trajectory of a reflected Brownian motion. We propose an algorithm based on Crofton's formula, which estimates the number of intersections of random lines with the boundary of the set by counting, in a suitable way (given by the proposed algorithm), the number of intersections with the boundaries of two different estimators: the Devroye-Wise estimator and the α-convex hull of the data. As a by-product, our results also cover the convex case, for any dimension.

Journal of Nonparametric Statistics, 2018
Non-linear aggregation strategies have recently been proposed in response to the problem of how to combine, in a non-linear way, estimators of the regression function (see, for instance, Biau, G., Fischer, A., Guedj, B., and Malley, J. (2016), 'COBRA: A Combined Regression Strategy', Journal of Multivariate Analysis, 146, 18–28), classification rules (see Cholaquidis, A., Fraiman, R., Kalemkerian, J., and Llop, P. (2016), 'A Nonlinear Aggregation Type Classifier', Journal of Multivariate Analysis, 146, 269–281), among others. Although there are several linear strategies to aggregate density estimators, most of them are hard to compute (even in moderate dimensions). Our approach aims to overcome this problem by estimating the density at a point x using not just sample points close to x but points in a neighbourhood of the (estimated) level set. We show that the mean squared error of our proposal is at most equal to the mean squared error of the best density estimator used for the aggre…
In the context of nonparametric regression, we study conditions under which the consistency (and rates of convergence) of estimators built from discretely sampled curves can be derived from the consistency of estimators based on the unobserved whole trajectories. As a consequence, we derive asymptotic results for most of the regularization techniques used in functional data analysis, including smoothing and basis representation.
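A minimal sketch of regression from discretely sampled curves: Nadaraya-Watson with a Gaussian kernel on the discretized L2 distance. The kernel, bandwidth, and toy data-generating model are assumptions of this illustration, not the paper's estimators.

```python
import numpy as np

rng = np.random.default_rng(6)
grid = np.linspace(0.0, 1.0, 50)            # common sampling grid for all curves
dx = grid[1] - grid[0]

def l2_dist(a, b):
    # Riemann-sum approximation of the L2 distance between two curves,
    # computed from their values on the discrete grid.
    return np.sqrt(np.sum((a - b) ** 2) * dx)

def nw_functional(x_new, X_curves, y, h):
    # Nadaraya-Watson with curves as covariates and a Gaussian kernel.
    d = np.array([l2_dist(x_new, c) for c in X_curves])
    w = np.exp(-0.5 * (d / h) ** 2)
    return np.sum(w * y) / np.sum(w)

# Toy model: each curve is a * sin(2πt); the response is its amplitude.
a = rng.uniform(-1.0, 1.0, 200)
X_curves = a[:, None] * np.sin(2 * np.pi * grid)[None, :]
y = a + rng.normal(0.0, 0.05, 200)
pred = nw_functional(0.5 * np.sin(2 * np.pi * grid), X_curves, y, h=0.2)
```

The prediction for the amplitude-0.5 curve lands near 0.5, illustrating how the estimator built from the discretized curves tracks the one based on whole trajectories as the grid refines.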

ArXiv, 2018
Semi-supervised learning deals with the problem of how, if possible, to take advantage of a huge amount of unclassified data to perform classification in situations when, typically, there is little labelled data. Even though this is not always possible (it depends on how useful it would be, for inferring the labels, to know the distribution of the unlabelled data), several algorithms have been proposed recently. A new algorithm is proposed that, under almost necessary conditions, attains asymptotically the performance of the best theoretical rule as the amount of unlabelled data tends to infinity. The set of necessary assumptions, although reasonable, shows that semi-supervised classification only works for very well conditioned problems. The performance of the algorithm is assessed on the well-known "Isolet" real dataset of phonemes, where a strong dependence on the choice of the initial training sample is shown.
arXiv: Statistics Theory, 2016
In the context of nonparametric regression, we study conditions under which the consistency (and rates of convergence) of estimators built from discretely sampled curves can be derived from the consistency of estimators based on the unobserved whole trajectories. As a consequence, we derive asymptotic results for most of the regularization techniques used in functional data analysis, including smoothing and basis representation.

TEST, 2019
Semi-supervised learning deals with the problem of how, if possible, to take advantage of a huge amount of unclassified data to perform classification in situations when, typically, there is little labeled data. Even though this is not always possible (it depends on how useful it would be, for inferring the labels, to know the distribution of the unlabeled data), several algorithms have been proposed recently. A new algorithm is proposed that, under almost necessary conditions, attains asymptotically the performance of the best theoretical rule as the amount of unlabeled data tends to infinity. The set of necessary assumptions, although reasonable, shows that semi-supervised classification only works for very well conditioned problems. The focus is on understanding when and why semi-supervised learning works when the size of the initial training sample remains fixed and the asymptotics are on the size of the unlabeled data. The performance of the algorithm is assessed on the well-known "Isolet" real dataset of phonemes, where a strong dependence on the choice of the initial training sample is shown. Keywords: Semi-supervised learning; Small training sample; Consistency.
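The flavour of such algorithms can be conveyed by a plain self-training sketch (a stand-in, not the paper's algorithm): a small labeled sample seeds a k-NN rule, which then repeatedly absorbs the unlabeled points it is most confident about. The class geometry, k, and confidence measure are assumptions of this illustration.

```python
import numpy as np

rng = np.random.default_rng(7)

def knn_predict(Xtr, ytr, X, k=5):
    # Plain k-NN: majority vote among the k nearest labelled points, with
    # the fraction of agreeing neighbours as a crude confidence score.
    labels, conf = [], []
    for x in X:
        idx = np.argsort(np.linalg.norm(Xtr - x, axis=1))[:k]
        votes = np.bincount(ytr[idx], minlength=2)
        labels.append(int(votes.argmax()))
        conf.append(votes.max() / k)
    return np.array(labels), np.array(conf)

# Two well-separated classes, a tiny labelled sample, a large unlabelled one.
X0 = rng.normal(-2.0, 1.0, (250, 2))
X1 = rng.normal(2.0, 1.0, (250, 2))
Xlab = np.vstack([X0[:5], X1[:5]]); ylab = np.repeat([0, 1], 5)
Xunl = np.vstack([X0[5:], X1[5:]]); y_unl_true = np.repeat([0, 1], 245)

# Self-training: pseudo-label the most confident unlabelled points,
# absorb them into the training set, repeat until the pool is empty.
Xtr, ytr = Xlab.copy(), ylab.copy()
pool = np.arange(len(Xunl))
pseudo = -np.ones(len(Xunl), dtype=int)
while len(pool) > 0:
    pred, conf = knn_predict(Xtr, ytr, Xunl[pool])
    top = conf >= conf.max()                 # most confident batch this round
    pseudo[pool[top]] = pred[top]
    Xtr = np.vstack([Xtr, Xunl[pool[top]]])
    ytr = np.concatenate([ytr, pred[top]])
    pool = pool[~top]
accuracy = (pseudo == y_unl_true).mean()
```

On this well-conditioned problem the pseudo-labels are almost all correct; as the abstract stresses, the outcome depends strongly on the initial labeled sample and on how well conditioned the class structure is.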
Journal of Multivariate Analysis, 2019
We address one of the important problems in Big Data, namely how to combine estimators from different subsamples by robust fusion procedures when we are unable to deal with the whole sample. We propose a general framework based on the classic idea of 'divide and conquer'. In particular, we address in some detail the case of a multivariate location and scatter matrix, the covariance operator for functional data, and clustering problems.
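A minimal sketch of the divide-and-conquer fusion idea: compute an estimate on each subsample, then fuse them robustly. The geometric median (via Weiszfeld iterations) below is a classic robust fusion stand-in, not necessarily the paper's procedure, and the contamination scenario is invented for illustration.

```python
import numpy as np

def geometric_median(pts, iters=100):
    # Weiszfeld iterations: iteratively re-weighted mean converging to the
    # geometric median, which tolerates up to half the points being corrupted.
    z = pts.mean(axis=0)
    for _ in range(iters):
        d = np.maximum(np.linalg.norm(pts - z, axis=1), 1e-12)
        w = 1.0 / d
        z = (w[:, None] * pts).sum(axis=0) / w.sum()
    return z

rng = np.random.default_rng(8)
# 20 subsamples, each yielding a location estimate of the true value (0, 0);
# 4 subsamples are badly corrupted.
est = rng.normal(0.0, 0.1, (20, 2))
est[:4] += 10.0
fused_robust = geometric_median(est)
fused_naive = est.mean(axis=0)
```

The naive average of the subsample estimates is dragged far from the truth by the corrupted subsamples, while the geometric-median fusion stays near (0, 0) — the qualitative behaviour robust fusion procedures are designed for.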