Papers by C. Nevill-Manning

Proceedings. International Conference on Intelligent Systems for Molecular Biology, 1997
Discrete motifs that discriminate functional classes of proteins are useful for classifying new sequences, capturing structural constraints, and identifying protein subclasses. Despite the fact that the space of such motifs can grow exponentially with sequence length and number, we show that in practice it usually does not, and we describe a technique that infers motifs from aligned protein sequences by exhaustively searching this space. Our method generates sequence motifs over a wide range of recall and precision, and chooses a representative motif based on a score that we derive from both statistical and information-theoretic frameworks. Finally, we show that the selected motifs perform well in practice, classifying unseen sequences with extremely high precision, and infer protein subclasses that correspond to known biochemical classes.
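As an illustration of the recall and precision trade-off mentioned in this abstract, the following minimal sketch scores one candidate motif against labeled, aligned sequences. The motif representation (one set of allowed residues per aligned column, with `None` as a wildcard) and the function names are assumptions made for this example; the abstract does not specify the authors' representation or their scoring function.

```python
def motif_matches(motif, seq):
    """Return True if seq satisfies the motif at every aligned column.

    motif: list with one entry per column, either a set of allowed
    residues or None for "any residue" (hypothetical representation).
    """
    if len(seq) < len(motif):
        return False
    return all(allowed is None or seq[i] in allowed
               for i, allowed in enumerate(motif))


def precision_recall(motif, family, others):
    """Precision and recall of a motif for one protein family.

    family: aligned sequences that belong to the class.
    others: aligned sequences that do not.
    """
    true_pos = sum(motif_matches(motif, s) for s in family)
    false_pos = sum(motif_matches(motif, s) for s in others)
    matched = true_pos + false_pos
    precision = true_pos / matched if matched else 0.0
    recall = true_pos / len(family) if family else 0.0
    return precision, recall
```

A more specific motif (smaller residue sets, fewer wildcards) will typically raise precision and lower recall, which is why a single representative motif has to be chosen by some score.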

Improving browsing in digital libraries with keyphrase indexes
Browsing accounts for much of people’s interaction with digital libraries, but it is poorly supported by standard search engines. Conventional systems often operate at the wrong level, indexing words when people think in terms of topics, and returning documents when people want a broader view. As a result, users cannot easily determine what is in a collection, how well a particular topic is covered, or what kinds of queries will provide useful results. We have built a new kind of search engine, Keyphind, that is explicitly designed to support browsing. Automatically extracted keyphrases form the basic unit of both indexing and presentation, allowing users to interact with the collection at the level of topics and subjects rather than words and documents. The keyphrase index also provides a simple mechanism for clustering documents, refining queries, and previewing results. We compared Keyphind to a traditional query engine in a small usability study. Users reported that certain kind...
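The idea of using keyphrases as the unit of indexing can be sketched as two small lookup tables: one from each keyphrase to the documents it describes, and one from each word to the keyphrases containing it. This is only an illustrative sketch, not Keyphind's implementation; it assumes the keyphrases have already been extracted, and all names are hypothetical.

```python
from collections import defaultdict


def build_keyphrase_index(doc_phrases):
    """Build a phrase-level index.

    doc_phrases: dict mapping a document id to its keyphrases
    (assumed to be extracted beforehand by some other tool).
    """
    phrase_to_docs = defaultdict(set)
    word_to_phrases = defaultdict(set)
    for doc_id, phrases in doc_phrases.items():
        for phrase in phrases:
            key = phrase.lower()
            phrase_to_docs[key].add(doc_id)
            for word in key.split():
                word_to_phrases[word].add(key)
    return phrase_to_docs, word_to_phrases


def browse(word, phrase_to_docs, word_to_phrases):
    """List the keyphrases containing a word with their document counts.

    Phrases are ordered by how many documents they cover, so the user
    gets a preview of topic coverage before opening any document.
    """
    phrases = word_to_phrases.get(word.lower(), set())
    return sorted(((p, len(phrase_to_docs[p])) for p in phrases),
                  key=lambda item: (-item[1], item[0]))
```

Documents that share a keyphrase form a natural cluster, and picking one of the listed phrases is a simple form of query refinement.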
Position-specific scoring matrices have been used extensively to recognize highly conserved protein regions. We present a method for accelerating these searches using a suffix tree data structure computed from the sequences to be searched. Building on earlier work that allows evaluation of a scoring matrix to be stopped early, the suffix tree-based method excludes many protein segments from consideration at once by pruning entire subtrees. Although suffix trees are usually expensive in space, the fact that scoring matrix evaluation requires an in-order traversal allows nodes to be stored more compactly without loss of speed, and our implementation requires only 17 bytes of primary memory per input symbol. Searches are accelerated by up to a factor of ten.
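The pruning idea in this abstract can be illustrated with a much simpler structure than a suffix tree. The sketch below, which is an assumption-laden stand-in and not the authors' implementation, stores every length-m window of the text in a plain trie and abandons a whole subtree as soon as the partial score plus the best achievable remaining score falls below the threshold. The matrix format (one dict of residue scores per column) and all names are hypothetical.

```python
def max_remaining(pssm):
    """best[i] = highest total score obtainable from columns i..end."""
    best = [0.0] * (len(pssm) + 1)
    for i in range(len(pssm) - 1, -1, -1):
        best[i] = best[i + 1] + max(pssm[i].values())
    return best


def build_window_trie(text, m):
    """Trie of all length-m windows; leaves record start positions."""
    root = {}
    for start in range(len(text) - m + 1):
        node = root
        for ch in text[start:start + m]:
            node = node.setdefault(ch, {})
        node.setdefault("$hits", []).append(start)
    return root


def search(text, pssm, threshold):
    """Start positions of windows scoring at least threshold."""
    m = len(pssm)
    best = max_remaining(pssm)
    hits = []
    stack = [(build_window_trie(text, m), 0, 0.0)]
    while stack:
        node, depth, score = stack.pop()
        if depth == m:
            # Every window sharing this path is reported at once.
            hits.extend(node["$hits"])
            continue
        for ch, child in node.items():
            # Residues missing from a column can never match.
            s = score + pssm[depth].get(ch, float("-inf"))
            # Prune the subtree if even the best continuation fails.
            if s + best[depth + 1] >= threshold:
                stack.append((child, depth + 1, s))
    return sorted(hits)
```

Because identical window prefixes share a path, each prefix is scored once, and one failed bound check discards every window below it. A real suffix tree gives the same sharing with far less memory than this trie.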
It has been our experience that in order to obtain useful results using supervised learning of real-world datasets it is necessary to perform feature subset selection and to perform many experiments using computed aggregates from the most relevant features. It is, therefore, important to look for selection algorithms that work quickly and accurately so that these experiments can be performed in a reasonable length of time, preferably interactively. This paper suggests a method to achieve this using a very simple algorithm that gives good performance across different supervised learning schemes and when compared to one of the most common methods for feature subset selection.

Feature Selection via the Discovery of Simple Classification Rules
It has been our experience that in order to obtain useful results using supervised learning of real-world datasets it is necessary to perform feature subset selection and to perform many experiments using computed aggregates from the most relevant features. It is, therefore, important to look for selection algorithms that work quickly and accurately so that these experiments can be performed in a reasonable length of time, preferably interactively. This paper suggests a method to achieve this using a very simple algorithm that gives good performance across different supervised learning schemes and when compared to one of the most common methods for feature subset selection.
KEYWORDS: Feature subset selection; supervised learning; 1R; filter model; wrapper model.
INTRODUCTION: There is growing evidence that feature subset selection can substantially improve the task of performing supervised learning. The algorithms that perform feature subset selection have been studied in a variety o...
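Since the keywords name 1R, the following sketch shows one plausible way a one-rule classifier could be used to rank features: each feature is scored by the accuracy of the rule that maps every feature value to its majority class. This assumes nominal features and training-set accuracy, neither of which is confirmed by the abstract, and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def one_r_accuracy(values, labels):
    """Training accuracy of the one-rule built on a single feature.

    The rule predicts, for each feature value, the class seen most
    often with that value.
    """
    by_value = defaultdict(Counter)
    for value, label in zip(values, labels):
        by_value[value][label] += 1
    correct = sum(max(counts.values()) for counts in by_value.values())
    return correct / len(labels)


def rank_features(rows, labels):
    """Feature indices ordered from most to least predictive."""
    n_features = len(rows[0])
    scored = [(one_r_accuracy([row[j] for row in rows], labels), j)
              for j in range(n_features)]
    return [j for _, j in sorted(scored, key=lambda t: (-t[0], t[1]))]
```

Each feature is scored in a single pass over the data, which is what makes this kind of ranking fast enough for interactive use; a subset can then be formed by keeping the top-ranked features.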
Important properties of motifs such as conservation strength and solvent accessible surface area at each position are visually represented on the structure using a variety of color shading schemes. Users can manipulate the displayed motifs using the freely available Chime plugin.

The growing need to manage and exploit the proliferation of online data sources is opening up new opportunities for bringing people closer to the resources they need. For instance, consider a recommendation service through which researchers can receive daily pointers to journal papers in their fields of interest. We survey some of the known approaches to the problem of technical paper recommendation and ask how they can be extended to deal with multiple information sources. More specifically, we focus on a variant of this problem (recommending conference paper submissions to reviewing committee members), which offers us a testbed to try different approaches. Using WHIRL, an information integration system, we are able to implement different recommendation algorithms derived from information retrieval principles. We also use a novel autonomous procedure for gathering reviewer interest information from the Web. We evaluate our approach and compare it to other methods using preference data provided by members of the AAAI-98 conference reviewing committee along with data about the actual submissions.
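One information-retrieval approach to this kind of recommendation is to rank submissions by TF-IDF cosine similarity to a text description of a reviewer's interests. The sketch below is a plain-Python stand-in for illustration only; it does not reproduce WHIRL or the algorithms evaluated in the paper, and the whitespace tokenization, weighting, and function names are all assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Sparse TF-IDF vectors (term -> weight), one per document."""
    tokenized = [doc.lower().split() for doc in docs]
    doc_freq = Counter(t for tokens in tokenized for t in set(tokens))
    n = len(docs)
    return [{t: count * math.log(n / doc_freq[t])
             for t, count in Counter(tokens).items()}
            for tokens in tokenized]


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_papers(reviewer_profile, papers):
    """Paper indices ordered from most to least similar to the profile."""
    vectors = tfidf_vectors([reviewer_profile] + list(papers))
    profile, paper_vecs = vectors[0], vectors[1:]
    scored = [(cosine(profile, vec), i) for i, vec in enumerate(paper_vecs)]
    return [i for _, i in sorted(scored, key=lambda t: (-t[0], t[1]))]
```

In this sketch the reviewer profile would be text gathered about the reviewer, for example titles from a home page, and each paper would be represented by its abstract.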
Because digital libraries are expensive to create and maintain, Internet analogs of public libraries (reliable, quality, community services) have only recently begun to appear. A serious obstacle to their creation is the provision of appropriate cataloguing information. Without a database of titles, authors and subjects, it is hard to offer the searching and browsing facilities normally available in physical libraries. Full-text retrieval provides a way of approximating these services without a concomitant investment of human resources. This presentation will discuss the indexing, collection and maintenance processes, and the retrieval interface, to public digital libraries.
Systems and methods for information extraction
The complementary paradigms of text compression and image compression suggest that there may be potential for applying methods developed for one domain to the other. In image coding, lossy techniques yield compression factors that are vastly superior to those of the best lossless schemes and we show that this is also the case for text. This paper investigates the resulting trade-off between subjective quality of the transmission and its compression factor. Two different methods are described, which can be combined into an extremely effective technique that provides far better compression than the present state of the art and yet preserves a reasonable degree of perceived match between the original and received text. The major challenge for lossy text compression is the quantitative evaluation of the quality of this match.

This paper reviews our experience with the application of machine learning techniques to agricultural databases. We have designed and implemented a machine learning workbench, WEKA, which permits rapid experimentation on a given dataset using a variety of machine learning schemes, and has several facilities for interactive investigation of the data: preprocessing attributes, evaluating and comparing the results of different schemes, and designing comparative experiments to be run off-line. We discuss the partnership between agricultural scientist and machine learning researcher that our experience has shown to be vital to success. We review in some detail a particular agricultural application concerned with the culling of dairy herds. In this paper we present the process model that underpins our work over the past two years for the development of applications in agriculture, the software we developed around our workbench of machine learning schemes to support this model, and the outcomes and problems we have encountered in developing applications.
[Figure: process model. Labels: data provider, raw data, preprocessing, clean data, derived attributes, attribute analysis, experiments with machine learning schemes, results, analysis of results, anomalies, clarification, research goals, useful data.]

People identify powerfully with music: someone might say "that's my song!" but they are unlikely to say "that's my book!" or "that's my picture!" A digital library of popular music therefore has the potential to be a compelling application of information retrieval technology. Such a library requires a retrieval method that is appropriate for a non-technical audience. Experiments on query by humming, which attempt to retrieve a tune based on a sampled recording of a user singing an excerpt, have heretofore concentrated on relatively small, well-curated collections. Scaling up introduces three problems: availability of source material, an increase in false positive hits, and slower retrieval. We describe our experiments with MIDI files, propose a new, more accurate distance metric between queries and songs, and discuss possibilities for efficient indexing. 1. INTRODUCTION Digital libraries until now could hardly be described as popular: they tend to be based on esoteric, scholarly sources close...
Proceedings of Data Compression Conference - DCC '96, 1996
Proceedings DCC '97. Data Compression Conference, 1997

Proceedings DCC'99 Data Compression Conference (Cat. No. PR00096), 1999
Life is based on two polymers, DNA and protein, whose properties can be described in a simple text file. It is natural to expect that standard text compression techniques would work on biological sequences as they do on English text. But biological sequences have a fundamentally different structure from linguistic ones, and standard compression schemes exhibit disappointing performance on them. We describe a new approach to compression that takes account of the underlying biochemical principles. This gives rise to a generalization of blending for statistical compressors where every context is used, weighted by its similarity to the current context. Results support what research in bioinformatics has shown: that there is little Markov dependency in protein. This cripples data compression schemes and reduces them to order zero models.
Proceedings of IEEE Data Compression Conference (DCC'94), 1994
Adaptive compression methods build models of symbol sequences. In many areas of computer science, models of sequences are constructed for their explanatory value. In contrast, data compression schemes use models that are opaque in that they do not provide descriptions of the sequence that can be understood or applied in other domains. Statistical methods that compress text well invariably generate large models that are not so much a structural description of the sequence as a record of frequencies of short substrings. Macro models replace repeated text by references to earlier occurrences and generally work within a small moving window of symbols so that any implicit model is transient. In both cases the model is flat and does not build up abstractions by combining references into higher level phrases.