Machine learning and data mining have found a multitude of successful applications in microarray analysis, with gene clustering and classification of tissue samples being widely cited examples. Low-level microarray analysis, often associated with the pre-processing stage within the microarray life-cycle, has increasingly become an area of active research, traditionally involving techniques from classical statistics. This paper explores opportunities for the application of machine learning and data mining methods to several important low-level microarray analysis problems: monitoring gene expression, transcript discovery, genotyping and resequencing. Relevant methods and ideas from the machine learning community include semi-supervised learning, learning from heterogeneous data, and incremental learning.
Plasmodium parasites, the causal agents of malaria, result in more than 1 million deaths annually. Plasmodium are unicellular eukaryotes with small ~23 Mb genomes encoding ~5,200 protein-coding genes. The protein-coding genes comprise about half of these genomes. Although evolutionary processes have a significant impact on malaria control, the selective pressures within Plasmodium genomes are poorly understood, particularly in the non-protein-coding portion of the genome. We use evolutionary methods to describe selective processes in both the coding and non-coding regions of these genomes. Based on genome alignments of seven Plasmodium species, we show that protein-coding, intergenic and intronic regions are all subject to purifying selection and we identify 670 conserved non-genic elements. We then use genome-wide polymorphism data from P. falciparum to describe short-term selective processes in this species and identify some candidate genes for balancing (diversifying) selection. Our analyses suggest that there are many functional elements in the non-genic regions of these genomes and that adaptive evolution has occurred more frequently in the protein-coding regions of the genome.
A spatially varying two-sample recombinant coalescent, with applications to HIV escape response
Statistical evolutionary models provide an important mechanism for describing and understanding the escape response of a viral population under a particular therapy. We present a new hierarchical model that incorporates spatially varying mutation and recombination rates at the nucleotide level. It also maintains separate parameters for treatment and control groups, which allows us to estimate treatment effects explicitly. We use the model to investigate the sequence evolution of HIV populations exposed to a recently developed antisense gene therapy, as well as a more conventional drug therapy. The detection of biologically relevant and plausible signals in both therapy studies demonstrates the effectiveness of the method.
Journal of the American Statistical Association, 2006
Many of the classification algorithms developed in the machine learning literature, including the support vector machine and boosting, can be viewed as minimum contrast methods that minimize a convex surrogate of the 0-1 loss function. The convexity makes these algorithms computationally efficient. The use of a surrogate, however, has statistical consequences that must be balanced against the computational virtues of convexity. To study these issues, we provide a general quantitative relationship between the risk as assessed using the 0-1 loss and the risk as assessed using any nonnegative surrogate loss function. We show that this relationship gives nontrivial upper bounds on excess risk under the weakest possible condition on the loss function: that it satisfy a pointwise form of Fisher consistency for classification. The relationship is based on a simple variational transformation of the loss function that is easy to compute in many applications. We also present a refined version of this result in the case of low noise, and we show that, in this case, strictly convex loss functions lead to faster rates of convergence of the risk than would be implied by standard uniform convergence arguments. Finally, we present applications of our results to the estimation of convergence rates in function classes that are scaled convex hulls of a finite-dimensional base class, with a variety of commonly used loss functions.
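As an illustration of what "convex surrogate of the 0-1 loss" means, the following is a minimal sketch, not code from the paper. It writes three commonly used surrogates (hinge, exponential, and a logistic loss rescaled by 1/ln 2) as functions of the margin y·f(x) and checks numerically that each one lies on or above the 0-1 loss on a grid of margins. The function names, the grid, and the convention that a margin of exactly zero counts as an error are assumptions made for this example.

```python
import math

def zero_one(margin):
    # 0-1 loss in terms of the margin y * f(x); a zero margin is counted as an error (assumed convention)
    return 1.0 if margin <= 0 else 0.0

def hinge(margin):
    # Surrogate minimized by the support vector machine
    return max(0.0, 1.0 - margin)

def exponential(margin):
    # Surrogate associated with boosting (AdaBoost)
    return math.exp(-margin)

def logistic(margin):
    # Logistic loss, rescaled by 1/ln 2 so that it equals 1 at margin 0
    return math.log1p(math.exp(-margin)) / math.log(2.0)

SURROGATES = {"hinge": hinge, "exponential": exponential, "logistic": logistic}

def dominates_zero_one(loss, margins, tol=1e-12):
    # True if the surrogate is at least the 0-1 loss at every margin (up to round-off)
    return all(loss(m) >= zero_one(m) - tol for m in margins)

# Hypothetical evaluation grid: margins from -5.0 to 5.0 in steps of 0.1
margins = [i / 10.0 for i in range(-50, 51)]
results = {name: dominates_zero_one(fn, margins) for name, fn in SURROGATES.items()}
print(results)
```

Each surrogate is convex in the margin, which is what makes the resulting minimization tractable. The check above only illustrates pointwise domination of the 0-1 loss and says nothing about the excess-risk bounds that the abstract describes.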
We introduce supervised latent Dirichlet allocation (sLDA), a statistical model of labelled documents. The model accommodates a variety of response types. We derive a maximum-likelihood procedure for parameter estimation, which relies on variational approximations to handle intractable posterior expectations. Prediction problems motivate this research: we use the fitted model to predict response values for new documents. We test sLDA on two real-world problems: movie ratings predicted from reviews, and web page popularity predicted from text descriptions. We illustrate the benefits of sLDA versus modern regularized regression, as well as versus an unsupervised LDA analysis followed by a separate regression.
We introduce supervised latent Dirichlet allocation (sLDA), a statistical model of labelled documents. The model accommodates a variety of response types. We derive an approximate maximum-likelihood procedure for parameter estimation, which relies on variational methods to handle intractable posterior expectations. Prediction problems motivate this research: we use the fitted model to predict response values for new documents. We test sLDA on two real-world problems: movie ratings predicted from reviews, and the political tone of amendments in the U.S. Senate based on the amendment text. We illustrate the benefits of sLDA versus modern regularized regression, as well as versus an unsupervised LDA analysis followed by a separate regression.
Bioinformatics/Computer Applications in the Biosciences, 2004
Phylogenetic shadowing is a comparative genomics principle which allows for the discovery of conserved regions in sequences from multiple closely-related organisms. We develop a formal probabilistic framework for combining phylogenetic shadowing with feature-based functional annotation methods. The resulting model, a generalized hidden Markov phylogeny (GHMP), applies to a variety of situations where functional regions are to be inferred from evolutionary constraints. Results: We show how GHMPs can be used to predict complete shared gene structures in multiple primate sequences. We also describe SHADOWER, our implementation of such a prediction system. We find that SHADOWER outperforms previously reported ab initio gene finders, including comparative human-mouse approaches, on a small sample of diverse exonic regions. Finally, we report on an empirical analysis of SHADOWER's performance which reveals that as few as five well-chosen species may suffice to attain maximal sensitivity and specificity in exon demarcation.
Proceedings of the National Academy of Sciences, 2003
High-pressure liquid chromatography-tandem mass spectrometry was used to obtain a protein profile of Escherichia coli strain MG1655 grown in minimal medium with glycerol as the carbon source. By using cell lysate from only 3 × 10^8 cells, at least four different tryptic peptides were detected for each of 404 proteins in a short 4-h experiment. At least one peptide with a high reliability score was detected for 986 proteins. Because membrane proteins were underrepresented, a second experiment was performed with a preparation enriched in membranes. An additional 161 proteins were detected, of which from half to two-thirds were membrane proteins. Overall, 1,147 different E. coli proteins were identified, almost 4 times as many as had been identified previously by using other tools. The protein list was compared with the transcription profile obtained on Affymetrix GeneChips. Expression of 1,113 (97%) of the genes whose protein products were found was detected at the mRNA level. The arithmetic mean mRNA signal intensity for these genes was 3-fold higher than that for all 4,300 protein-coding genes of E. coli. Thus, GeneChip data confirmed the high reliability of the protein list, which contains about one-fourth of the proteins of E. coli. Detection of even those membrane proteins and proteins of undefined function that are encoded by the same operons (transcriptional units) encoding proteins on the list remained low.
Many classification algorithms, including the support vector machine, boosting and logistic regression, can be viewed as minimum contrast methods that minimize a convex surrogate of the 0-1 loss function. We characterize the statistical consequences of using such a surrogate by providing a general quantitative relationship between the risk as assessed using the 0-1 loss and the risk as assessed using any nonnegative surrogate loss function. We show that this relationship gives nontrivial bounds under the weakest possible condition on the loss function: that it satisfy a pointwise form of Fisher consistency for classification. The relationship is based on a variational transformation of the loss function that is easy to compute in many applications. We also present a refined version of this result in the case of low noise. Finally, we present applications of our results to the estimation of convergence rates in the general setting of function classes that are scaled hulls of a finite-dimensional base class.
Proceedings of the National Academy of Sciences, 2005
We previously characterized nutrient-specific transcriptional changes in Escherichia coli upon limitation of nitrogen (N) or sulfur (S). These global homeostatic responses presumably minimize the slowing of growth under a particular condition. Here, we characterize responses to slow growth per se that are not nutrient-specific. The latter help to coordinate the slowing of growth, and in the case of down-regulated genes, to conserve scarce N or S for other purposes. Three effects were particularly striking. First, although many genes under control of the stationary phase sigma factor RpoS were induced and were apparently required under S-limiting conditions, one or more was inhibitory under N-limiting conditions, or RpoS itself was inhibitory. RpoS was, however, universally required during nutrient downshifts. Second, limitation for N and S greatly decreased expression of genes required for synthesis of flagella and chemotaxis, and the motility of E. coli was decreased. Finally, unlike the response of all other met genes, transcription of metE was decreased under S- and N-limiting conditions. The metE product, a methionine synthase, is one of the most abundant proteins in E. coli grown aerobically in minimal medium. Responses of metE to S and N limitation pointed to an interesting physiological rationale for the regulatory subcircuit controlled by the methionine activator MetR.
The Dirichlet process prior allows flexible nonparametric mixture modeling. The number of mixture components is not specified in advance and can grow as new data arrive. However, analyses based on the Dirichlet process prior are sensitive to the choice of the parameters, including an infinite-dimensional distributional parameter G_0. Most previous applications have either fixed G_0 as a member of a parametric family or treated G_0 in a Bayesian fashion, using parametric prior specifications. In contrast, we have developed an adaptive nonparametric method for constructing smooth estimates of G_0. We combine this method with a technique for estimating α, the other Dirichlet process parameter, that is inspired by an existing characterization of its maximum-likelihood estimator. Together, these estimation procedures yield a flexible empirical Bayes treatment of Dirichlet process mixtures. Such a treatment is useful in situations where smooth point estimates of G_0 are of intrinsic interest, or where the structure of G_0 cannot be conveniently modeled with the usual parametric prior families. Analysis of simulated and real-world datasets illustrates the robustness of this approach.
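The statement that the number of mixture components can grow as new data arrive can be illustrated with the Chinese restaurant process, a standard sequential description of the clustering induced by a Dirichlet process with concentration parameter α. The sketch below is an assumption-laden illustration, not the estimation procedure from the paper: it ignores G_0 entirely and only simulates cluster assignments. The function name, the default seed, and the parameter values in the usage lines are hypothetical.

```python
import random

def crp_cluster_counts(n_points, alpha, seed=0):
    """Simulate cluster assignments under a Chinese restaurant process.

    Returns the final cluster sizes and the number of clusters after each point.
    """
    rng = random.Random(seed)
    sizes = []    # sizes[k] = number of points currently in cluster k
    history = []  # history[n] = number of clusters after n + 1 points
    for n in range(n_points):
        # A new cluster opens with probability alpha / (n + alpha);
        # existing cluster k is chosen with probability sizes[k] / (n + alpha).
        u = rng.random() * (n + alpha)
        if u < alpha:
            sizes.append(1)
        else:
            u -= alpha
            for k, size in enumerate(sizes):
                if u < size:
                    sizes[k] += 1
                    break
                u -= size
            else:
                # Guard against floating-point round-off at the upper boundary
                sizes[-1] += 1
        history.append(len(sizes))
    return sizes, history

# Hypothetical settings: 200 points, alpha = 1.0
sizes, history = crp_cluster_counts(200, alpha=1.0)
print(len(sizes), sizes)
```

Larger values of α make new clusters more likely, which is one reason the choice of α matters in such analyses.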
Papers by Jon McAuliffe