Papers by David Heckerman
Content Personalization Based on User Information
Learning Mixtures of Smooth, Nonuniform Deformation Models for Probabilistic Image Matching
Journal of Artificial Intelligence Research

We examine the learning-curve sampling method, an approach for applying machine-learning algorithms to large data sets. The approach is based on the observation that the computational cost of learning a model increases as a function of the sample size of the training data, whereas the accuracy of a model has diminishing improvements as a function of sample size. Thus, the learning-curve sampling method monitors the increasing costs and performance as larger and larger amounts of data are used for training, and terminates learning when future costs outweigh future benefits. In this paper, we formalize the learning-curve sampling method and its associated cost-benefit tradeoff in terms of decision theory. In addition, we describe the application of the learning-curve sampling method to the task of model-based clustering via the expectation-maximization (EM) algorithm. In experiments on three real data sets, we show that the learning-curve sampling method produces models that are nearly as accurate as those trained on complete data sets, but with dramatically reduced learning times. Finally, we describe an extension of the basic learning-curve approach for model-based clustering that results in an additional speedup. This extension is based on the observation that the shape of the learning curve for a given model and data set is roughly independent of the number of EM iterations used during training. Thus, we run EM for only a few iterations to decide how many cases to use for training, and then run EM to full convergence once the number of cases is selected.
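As a rough illustration of the cost-benefit stopping rule, here is a minimal sketch assuming scikit-learn's Gaussian-mixture EM as the clustering step; the conversion rates `benefit_per_nat` and `cost_per_second`, and the geometric sampling schedule, are hypothetical placeholders, not the paper's decision-theoretic formalization.

```python
# Minimal sketch of learning-curve sampling for model-based clustering,
# assuming Gaussian-mixture EM via scikit-learn. The cost model (seconds of
# CPU) and benefit model (value per nat of held-out log-likelihood) are
# illustrative placeholders.
import time
import numpy as np
from sklearn.mixture import GaussianMixture

def learning_curve_sample(X_train, X_valid, k=5, benefit_per_nat=100.0,
                          cost_per_second=1.0, growth=2.0, n0=500):
    """Grow the training sample until marginal cost exceeds marginal benefit."""
    rng = np.random.default_rng(0)
    n, best, prev_score = n0, None, -np.inf
    while n <= len(X_train):
        idx = rng.choice(len(X_train), size=n, replace=False)
        start = time.perf_counter()
        model = GaussianMixture(n_components=k, random_state=0).fit(X_train[idx])
        cost = cost_per_second * (time.perf_counter() - start)
        score = model.score(X_valid)  # held-out average log-likelihood per case
        # Stop when the benefit of the last increment no longer covers its cost.
        if benefit_per_nat * (score - prev_score) < cost:
            break
        best, prev_score = model, score
        n = int(n * growth)
    return best
```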
We propose a new method for making inference about an unknown measure Γ(dλ) upon observing some values of the Fredholm integral g(ω) = ∫ k(ω, λ) Γ(dλ) of a known kernel k(ω, λ), using Lévy random fields as Bayesian prior distributions for modeling uncertainty about Γ(dλ). Inference is based on simulation-based MCMC methods. The method is illustrated with a problem in polymer chemistry.
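As a hedged illustration of how such a prior makes the inverse problem tractable, here is a minimal sketch assuming the common compound-Poisson (pure-jump) construction of a Lévy random field; the jump count J, locations λ_j, magnitudes β_j, and Gaussian noise are illustrative notation, not necessarily the paper's.

```latex
% Illustrative sketch, assuming a compound-Poisson (pure-jump) Levy prior
% with J jumps at locations lambda_j of sizes beta_j, and Gaussian noise:
\[
  \Gamma(d\lambda) = \sum_{j=1}^{J} \beta_j\,\delta_{\lambda_j}(d\lambda)
  \quad\Longrightarrow\quad
  g(\omega) = \int k(\omega,\lambda)\,\Gamma(d\lambda)
            = \sum_{j=1}^{J} \beta_j\,k(\omega,\lambda_j),
\]
\[
  y_i \mid \Gamma \;\sim\; \mathcal{N}\!\bigl(g(\omega_i),\,\sigma^2\bigr),
  \qquad i = 1,\dots,n,
\]
% so MCMC can move between finite-dimensional states (J, {lambda_j, beta_j}).
```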
Recently Nickle et al. introduced a new model of genetic diversity that summarizes a large input dataset into a short sequence containing overlapping subsequences from the dataset. This model has direct applications to rational vaccine design. In this paper we formally investigate the combinatorics of the vaccine optimization problem. Here the vaccine is constructed as a sequence S of amino acids such that as many as possible of the most frequently occurring epitopes found in mutated viruses appear as subsequences of S. We rigorously present the related design optimization problem, establish its complexity, and present a simple probabilistic algorithm to find an efficient solution. Our vaccine designs show an improvement of over 20% in the coverage score over the previously best designs and produce over 15% shorter vaccines that achieve equivalent epitope coverage.
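To make the coverage-versus-length objective concrete, here is a greedy overlap-merging baseline; it is not the paper's probabilistic algorithm, it treats epitope coverage as contiguous substring containment, and the epitopes and frequencies are hypothetical toy inputs.

```python
# Greedy baseline for the epitope-cover objective: repeatedly append the
# epitope with the best frequency gain per added letter, merging on the
# longest suffix/prefix overlap. NOT the paper's probabilistic algorithm;
# it only illustrates the coverage/length tradeoff.
def overlap(s, t):
    """Length of the longest suffix of s that is a prefix of t."""
    for k in range(min(len(s), len(t)), 0, -1):
        if s.endswith(t[:k]):
            return k
    return 0

def greedy_vaccine(epitope_freq):
    """epitope_freq: dict mapping epitope string -> observed frequency."""
    remaining = dict(epitope_freq)
    s = ""
    while remaining:
        def gain(e):
            added = len(e) - overlap(s, e)
            return remaining[e] / max(added, 1)
        best = max(remaining, key=gain)
        s += best[overlap(s, best):]  # append only the non-overlapping tail
        # drop every epitope now covered as a substring of s
        remaining = {e: f for e, f in remaining.items() if e not in s}
    return s

# Hypothetical toy input: epitopes with their frequencies in the dataset.
print(greedy_vaccine({"SLYNTVATL": 0.4, "YNTVATLYC": 0.3, "TLYCVHQRI": 0.2}))
```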
Statistics and Computing, 2000
$$p(m \mid D) \;=\; \frac{p(m)\,p(D \mid m)}{\sum_{m'} p(m')\,p(D \mid m')},
\qquad
p(\theta_m \mid D, m) \;=\; \frac{p(\theta_m \mid m)\,p(D \mid \theta_m, m)}{p(D \mid m)}$$
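As a quick numerical check of the model-posterior formula above, here is a minimal sketch; the three candidate models, their priors, and their log marginal likelihoods are made-up placeholders.

```python
# Normalize log p(m) + log p(D|m) across candidate models in log space to
# obtain p(m|D). The priors and log marginal likelihoods are placeholders.
import numpy as np
from scipy.special import logsumexp

log_prior = np.log(np.array([0.5, 0.3, 0.2]))        # p(m)
log_marglik = np.array([-1040.2, -1038.7, -1042.5])  # log p(D | m)

log_joint = log_prior + log_marglik
posterior = np.exp(log_joint - logsumexp(log_joint))  # p(m | D)
print(posterior)  # sums to 1
```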

Uncertainty in Artificial Intelligence, 2000
A simple advertising strategy that can be used to help increase sales of a product is to mail out special offers to selected potential customers. Because there is a cost associated with sending each offer, the optimal mailing strategy depends on both the benefit obtained from a purchase and how the offer affects the buying behavior of the customers. In this paper, we describe two methods for partitioning the potential customers into groups, and show how to perform a simple cost-benefit analysis to decide which, if any, of the groups should be targeted. In particular, we consider two decision-tree learning algorithms. The first is an "off the shelf" algorithm used to model the probability that groups of customers will buy the product. The second is a new algorithm that is similar to the first, except that for each group, it explicitly models the probability of purchase under the two mailing scenarios: (1) the mail is sent to members of that group and (2) the mail is not sent to members of that group. Using data from a real-world advertising experiment, we compare the algorithms to each other and to a naive mail-to-all strategy.
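A minimal sketch of the per-group cost-benefit rule the abstract describes, assuming purchase probabilities under both mailing scenarios are already estimated (by the second decision-tree method, for instance); the probabilities, sale value, and offer cost below are hypothetical placeholders.

```python
# Mail a group only if the expected incremental profit per customer of
# sending the offer is positive. All numbers are hypothetical placeholders.
def should_mail(p_buy_if_mailed, p_buy_if_not, value_per_sale, cost_per_offer):
    """Expected incremental profit per customer from sending the offer."""
    lift = p_buy_if_mailed - p_buy_if_not
    return lift * value_per_sale - cost_per_offer > 0

groups = [
    {"name": "A", "p_mail": 0.12, "p_no_mail": 0.05},
    {"name": "B", "p_mail": 0.08, "p_no_mail": 0.07},
]
for g in groups:
    mail = should_mail(g["p_mail"], g["p_no_mail"],
                       value_per_sale=40.0, cost_per_offer=1.5)
    print(g["name"], "mail" if mail else "skip")
```

Note that group B buys fairly often but the offer barely changes its behavior, so mailing it loses money; this is exactly why modeling both scenarios beats modeling purchase probability alone.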

Approaches for testing groups of variants for association with complex traits are becoming critical. Examples of groups typically include a set of rare or common variants within a gene, but could also be variants within a pathway or any other set. These tests are critical for aggregation of weak signal within a group, allow interplay among variants to be captured, and also reduce the problem of multiple hypothesis testing. Unfortunately, these approaches do not address confounding by, for example, family relatedness and population structure, a problem that is becoming more important as larger data sets are used to increase power. We introduce a new approach for group tests that can handle confounding, based on Bayesian linear regression, which is equivalent to the linear mixed model. The approach uses two sets of covariates (equivalently, two random effects), one to capture the group association signal and one to capture confounding. We also introduce a computational speedup for the...
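A minimal numerical sketch of a two-random-effects mixed model in the spirit described above, with one kernel built from the tested variant group and one from genome-wide similarity; the data, kernels, and fixed variance components are illustrative placeholders (in practice the variances would be estimated, e.g. by REML).

```python
# y ~ N(mean, sg2*K_group + sc2*K_conf + se2*I): one random effect for the
# group signal, one for confounding. All values below are made up.
import numpy as np
from scipy.stats import multivariate_normal

def lmm_loglik(y, mean, K_group, K_conf, sg2, sc2, se2):
    """Gaussian log-likelihood under V = sg2*K_group + sc2*K_conf + se2*I."""
    V = sg2 * K_group + sc2 * K_conf + se2 * np.eye(len(y))
    return multivariate_normal(mean=mean, cov=V).logpdf(y)

rng = np.random.default_rng(1)
n = 50
G = rng.standard_normal((n, 10))    # variants in the tested group
W = rng.standard_normal((n, 200))   # genome-wide variants (confounding)
K_group, K_conf = G @ G.T / 10, W @ W.T / 200
y = rng.standard_normal(n)

# Likelihood-ratio-style comparison: does adding the group kernel help?
null = lmm_loglik(y, np.zeros(n), K_group, K_conf, 0.0, 0.5, 0.5)
alt = lmm_loglik(y, np.zeros(n), K_group, K_conf, 0.3, 0.5, 0.5)
print(2 * (alt - null))  # test statistic at fixed, illustrative variances
```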
Founder effects contribute to perceived associations between HIV sequence polymorphisms and HLA class I alleles
Regulators such as the U.S. Food and Drug Administration have elaborate, multi-year processes for approving new drugs as safe and effective. Nonetheless, in recent years, several approved drugs have been withdrawn from the market because of serious and sometimes fatal side effects. We describe statistical methods for post-approval data analysis that attempt to detect drug safety problems as quickly as possible. Bayesian approaches are especially useful because of the high dimensionality of the data, and, in the future, for incorporating disparate sources of information.
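One widely used Bayesian approach in this area is Gamma-Poisson shrinkage of observed-versus-expected adverse-event counts, in the spirit of DuMouchel's GPS; this sketch is not necessarily the method the paper describes, and the prior parameters and counts are hypothetical.

```python
# Hedged sketch of Gamma-Poisson shrinkage for drug-event count data
# (in the spirit of DuMouchel's GPS; not necessarily this paper's method).
# n = observed reports of a drug-event pair, E = expected count under
# independence; with lambda ~ Gamma(a, b) and n ~ Poisson(lambda * E),
# the posterior is Gamma(a + n, b + E). All numbers are placeholders.
def shrunken_reporting_ratio(n, E, a=1.0, b=1.0):
    """Posterior mean of the relative reporting ratio lambda."""
    return (a + n) / (b + E)

print(shrunken_reporting_ratio(n=12, E=3.2))   # elevated signal, mildly shrunk
print(shrunken_reporting_ratio(n=1, E=0.05))   # raw ratio 20, shrunk hard
```

The shrinkage is the point: with thousands of drug-event pairs, raw ratios based on tiny expected counts would otherwise dominate the alarms.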

The aim of many microarray experiments is to discover genes that exhibit similar behaviour, that is, co-express. A common approach to analysis is to apply generic clustering algorithms that produce a single cluster allocation for each gene. Such a strategy does not account for the experimental context, and provides no measure of the uncertainty in this classification. The model introduced in this paper is specifically tailored to the application in hand, so allowing the incorporation of prior information, and quantifies cluster membership probabilistically. In this paper we are interested in the situation in which the experiments are indexed by a variable that has a natural ordering such as time, temperature or dose level, and propose a four-stage hierarchical model for the analysis of such data. This model assumes that each gene follows one of a number of underlying trajectories, where the number may be assumed unknown, and the specific form of the trajectory depends on the experim...
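A toy sketch of the probabilistic-membership idea: each gene's profile over ordered conditions is scored against a few shared mean trajectories, and a posterior over clusters is reported instead of a hard label. The curves, weights, and noise level are illustrative placeholders, not the paper's four-stage hierarchy.

```python
# Posterior cluster membership for a gene profile under a toy mixture of
# shared trajectories with Gaussian noise. All parameters are placeholders.
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

t = np.linspace(0, 1, 8)                              # ordered design points
curves = np.stack([2 * t, 1 - t, np.sin(np.pi * t)])  # candidate trajectories
log_w = np.log(np.ones(3) / 3)                        # uniform cluster weights
sigma = 0.3

def membership_probs(y):
    """Posterior p(cluster k | gene profile y) under the toy model."""
    ll = norm.logpdf(y[None, :], loc=curves, scale=sigma).sum(axis=1)
    post = log_w + ll
    return np.exp(post - logsumexp(post))

gene = 2 * t + np.random.default_rng(0).normal(0, 0.3, size=t.shape)
print(membership_probs(gene))  # soft allocation, not a single cluster label
```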
Submitted to Bayesian Statistics 7
This paper presents a new finite-dimensional Bayesian filter. The filter calculates the exact analytical expression for the posterior probability density function (pdf) of static systems with a particular kind of nonlinear measurement equation subject to Gaussian measurement uncertainty. The paper also extends this filter to a limited class of dynamic systems. The filter is applied to the estimation of the inaccurately known position and orientation of two mating parts during autonomous robotic assembly. The sufficient statistics of the posterior pdf are obtained by Kalman filter formulas, making online estimation possible.
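For reference, here is the standard Kalman measurement update whose formulas the abstract says yield the sufficient statistics of the posterior; the system matrices and measurement below are illustrative placeholders, and this sketch does not implement the paper's nonlinear measurement model.

```python
# Standard Kalman measurement update: combine prior N(x, P) with a linear
# measurement z = H x + noise, noise ~ N(0, R). Values are placeholders.
import numpy as np

def kalman_update(x, P, z, H, R):
    """Return posterior mean and covariance after observing z."""
    S = H @ P @ H.T + R                    # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)         # Kalman gain
    x_post = x + K @ (z - H @ x)           # posterior mean
    P_post = (np.eye(len(x)) - K @ H) @ P  # posterior covariance
    return x_post, P_post

x0, P0 = np.zeros(2), np.eye(2)
H, R = np.array([[1.0, 0.0]]), np.array([[0.1]])
print(kalman_update(x0, P0, z=np.array([0.4]), H=H, R=R))
```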

In cases where genetic sequence data are collected together with associated physical traits it is natural to want to link patterns observed in the trait values to the underlying genealogy of the individuals. If the traits correspond to specific phenotypes, we may wish to associate specific mutations with changes observed in phenotype distributions, whereas if the traits concern spatial information, we may use the genealogy to look at population movement over time. In this paper we discuss the standard approach to analyses of this sort and propose a new framework which overcomes a number of shortcomings in the standard approach. In particular, we allow for uncertainty associated with the underlying genealogy to fully propagate through the model to directly interact with the inferences of primary interest, namely the effects of genetic mutations on phenotype and/or the dispersal patterns of populations over time.

We consider the problem of estimating an unknown function based on noisy data using nonparametric regression. One approach to this estimation problem is to represent the function in a series expansion using a linear combination of basis functions. Overcomplete dictionaries provide a larger but redundant collection of generating elements than a basis; however, coefficients in the expansion are then no longer unique. Despite this non-uniqueness, overcomplete representations have the potential to be sparser, using fewer non-zero coefficients. Compound Poisson random fields and their generalization to Lévy random fields are ideally suited for construction of priors on functions using these overcomplete representations for the general nonparametric regression problem, and provide a natural limiting generalization of priors for the finite dimensional version of the regression problem. While expressions for posterior modes or posterior distributions of quantities of interest are not available...
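A small numerical illustration of the sparsity point: an overcomplete dictionary of localized atoms admits representations with very few non-zero coefficients. Lasso is used here only as a convenient sparsity-inducing stand-in, not the paper's Lévy random field prior, and all data are synthetic placeholders.

```python
# Overcomplete Gaussian-bump dictionary (200 atoms for 60 samples): a sparse
# fit recovers the signal with few non-zero coefficients. Lasso is a stand-in
# for sparsity, NOT the paper's Levy-prior method. Data are synthetic.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 60)
centers = np.linspace(0, 1, 200)                  # 200 atoms >> 60 samples
D = np.exp(-((x[:, None] - centers[None, :]) ** 2) / 0.005)  # bump atoms

f_true = D[:, 40] - 0.5 * D[:, 150]               # sparse ground truth
y = f_true + rng.normal(0, 0.05, size=x.shape)

fit = Lasso(alpha=0.01).fit(D, y)
print("non-zero coefficients:", np.count_nonzero(fit.coef_))  # few of 200
```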
Extensive Yet Ineffective Gag Epitope Variant Recognition Observed in HIV-Progressors
AIDS Research and Human Retroviruses

We consider causal models involving three binary variables: a randomized assignment Z, an exposure measure X, and a final response Y. We focus particular attention on the situation in which there may be confounding of X and Y, while at the same time measures of the effect of X on Y are of primary interest. In the case where Z has no effect on Y, other than through X, this is the instrumental variable model. Many causal quantities of interest are only partially identified. We first show via an example that the resulting posteriors may be highly sensitive to the specification of the prior distribution over compliance types. To address this, we present several novel "transparent" re-parametrizations of the likelihood that separate the identified and non-identified parts of the parameter. In addition, we develop parametrizations that are robust to model mis-specification under the "intent-to-treat" null hypothesis that Z and Y are independent.
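For orientation, here is the standard compliance-type setup the abstract references, in the common potential-outcome notation (which may differ from the paper's):

```latex
% Each unit has a compliance type determined by the pair (X(z=0), X(z=1)):
%   never-taker (0,0), complier (0,1), defier (1,0), always-taker (1,1).
% Only the joint law of (Z, X, Y) is observed, so quantities such as the
% average causal effect of X on Y are in general only partially identified.
% The intent-to-treat null referenced above is
\[
  Z \perp\!\!\!\perp Y
  \quad\text{i.e.}\quad
  p(y \mid z = 1) = p(y \mid z = 0) \ \ \text{for all } y .
\]
```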

A convenient way of modelling complex interactions is by employing graphs or networks which correspond to conditional independence structures in an underlying statistical model. One main class of models in this regard are Bayesian networks, which have the drawback of making parametric assumptions. Bayesian nonparametric mixture models offer a possibility to overcome this limitation, but have hardly been used in combination with networks. This manuscript bridges this gap by introducing nonparametric Bayesian network models. We review (parametric) Bayesian networks, in particular Gaussian Bayesian networks, from a Bayesian perspective as well as nonparametric Bayesian mixture models. Afterwards these two modelling approaches are combined into nonparametric Bayesian networks. The new models are compared both to Gaussian Bayesian networks and to mixture models in a simulation study, where it turns out that the nonparametric network models perform favorably in non-Gaussian situations...
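A toy generative sketch of why the combination helps: mixing two linear-Gaussian Bayesian networks over (X, Y) with structure X → Y yields a joint distribution that is a mixture of Gaussians, hence non-Gaussian overall, while each component retains the network's conditional structure. The structure, weights, and parameters are illustrative placeholders.

```python
# Two linear-Gaussian networks X -> Y, mixed 0.6/0.4: the resulting joint is
# bimodal (non-Gaussian). All parameters are illustrative placeholders.
import numpy as np

rng = np.random.default_rng(0)

def sample_component(n, x_mean, slope, intercept):
    """One linear-Gaussian network: X ~ N(x_mean, 1), Y|X ~ N(a*X + b, 1)."""
    x = rng.normal(x_mean, 1.0, n)
    y = rng.normal(slope * x + intercept, 1.0)
    return np.column_stack([x, y])

n = 1000
k = rng.random(n) < 0.6  # mixture indicator per sample
data = np.where(k[:, None],
                sample_component(n, x_mean=-2.0, slope=1.0, intercept=0.0),
                sample_component(n, x_mean=+2.0, slope=-1.0, intercept=1.0))
print(data.mean(axis=0), data.std(axis=0))  # bimodal in X: non-Gaussian joint
```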