AutoClass: A Bayesian Classification System
1988, Elsevier eBooks
Abstract
This paper describes AutoClass II, a program for automatically discovering (inducing) classes from a database, based on a Bayesian statistical technique which automatically determines the most probable number of classes, their probabilistic descriptions, and the probability that each object is a member of each class. AutoClass has been tested on several large, real databases and has discovered previously unsuspected classes. There is no doubt that these classes represent new phenomena.
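The abstract describes the approach only at a high level. As a hedged illustration of the underlying idea, the sketch below fits a finite mixture model with EM and reads off the probability that each object belongs to each class; it is a minimal stand-in, not AutoClass itself, and it omits the Bayesian determination of the number of classes that the abstract highlights. All names and data in it are illustrative.

```python
# A minimal sketch (not the AutoClass implementation) of the core idea:
# model the data as a finite mixture, fit it with EM, and read off the
# probability that each object belongs to each class. Class-count selection
# and the full Bayesian priors used by AutoClass are omitted here.
import numpy as np

def fit_gaussian_mixture(x, n_classes, n_iter=100, seed=0):
    """Fit a 1-D Gaussian mixture with EM; return weights, means, variances,
    and the per-object class-membership probabilities (responsibilities)."""
    rng = np.random.default_rng(seed)
    weights = np.full(n_classes, 1.0 / n_classes)
    means = rng.choice(x, size=n_classes, replace=False)
    variances = np.full(n_classes, np.var(x))

    for _ in range(n_iter):
        # E-step: P(class k | x_i) up to normalization
        resp = np.zeros((len(x), n_classes))
        for k in range(n_classes):
            norm = 1.0 / np.sqrt(2.0 * np.pi * variances[k])
            resp[:, k] = weights[k] * norm * np.exp(-(x - means[k]) ** 2 / (2.0 * variances[k]))
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: re-estimate mixture parameters from the responsibilities
        nk = resp.sum(axis=0)
        weights = nk / len(x)
        means = (resp * x[:, None]).sum(axis=0) / nk
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk + 1e-9

    return weights, means, variances, resp

# Toy usage: two well-separated clusters
data = np.concatenate([np.random.normal(0, 1, 200), np.random.normal(5, 1, 200)])
w, mu, var, membership = fit_gaussian_mixture(data, n_classes=2)
print(mu)              # approximate class centers
print(membership[:3])  # probability of each of the first 3 objects per class
```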
Related papers
In this paper, we demonstrate how semantic categories of images can be learnt from their color distributions using an effective probabilistic approach. Many previous probabilistic approaches are based on the naïve Bayes classifier, which assumes independence among attributes and models each attribute with a single Gaussian distribution. We use a derivative of the naïve Bayesian classifier, called the Flexible Bayesian classifier, which abandons the normality assumption to better represent the image data. This approach is shown to yield high accuracy on image database classification compared to its counterpart, the naïve Bayesian classifier, and the widely used K-Nearest Neighbor classifier.
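As a hedged sketch of the idea described above (assuming the usual kernel-density reading of a "flexible" Bayes classifier), the snippet below keeps the naive-Bayes independence assumption but replaces the single Gaussian per attribute with a kernel density estimate; the features and data are illustrative, not the paper's image descriptors.

```python
# A hedged sketch: keep the naive-Bayes independence assumption, but replace
# the single-Gaussian class-conditional density of each attribute with a
# kernel density estimate (one Gaussian kernel per training point).
import numpy as np

def gaussian_kernel_density(value, samples, bandwidth):
    """Average of Gaussian kernels centered at the training samples."""
    z = (value - samples) / bandwidth
    return np.mean(np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2.0 * np.pi)))

def flexible_bayes_predict(x, train_x, train_y, bandwidth=0.5):
    """Pick the class maximizing prior * product of per-attribute KDEs."""
    classes = np.unique(train_y)
    best_class, best_score = None, -np.inf
    for c in classes:
        members = train_x[train_y == c]
        score = np.log(len(members) / len(train_x))        # log prior
        for j in range(train_x.shape[1]):                   # independence over attributes
            score += np.log(gaussian_kernel_density(x[j], members[:, j], bandwidth) + 1e-12)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Toy usage with two 2-D color-feature clusters (illustrative data only)
rng = np.random.default_rng(1)
train_x = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
train_y = np.array([0] * 50 + [1] * 50)
print(flexible_bayes_predict(np.array([3.8, 4.2]), train_x, train_y))  # expect class 1
```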
Proceedings of the 12th International …, 1991
The task of inferring a set of classes and class descriptions most likely to explain a given data set can be placed on a firm theoretical foundation using Bayesian statistics. Within this framework, and using various mathematical and algorithmic approximations, the AutoClass system searches for the most probable classifications, automatically choosing the number of classes and the complexity of the class descriptions. Simpler versions of AutoClass have been applied to many large real data sets, have discovered new, independently verified phenomena, and have been released as a robust software package. Recent extensions allow attributes to be selectively correlated within particular classes, and allow classes to inherit, or share, model parameters through a class hierarchy.
IEEE Transactions on Knowledge and Data Engineering, 2000
A promising approach to Bayesian classification is based on exploiting frequent patterns, i.e., patterns that frequently occur in the training dataset, to estimate the Bayesian probability. Pattern-based Bayesian classification focuses on building and evaluating reliable probability approximations by exploiting a subset of frequent patterns tailored to a given test case. This paper proposes a novel and effective approach to estimating the Bayesian probability. Differently from previous approaches, the Entropy-based Bayesian classifier (EnBay) focuses on selecting, by means of an entropy-based evaluator, the minimal set of long, non-overlapping patterns that best complies with a conditional-independence model. Furthermore, the probability approximation is tailored separately to each class. An extensive experimental evaluation, performed on both real and synthetic datasets, shows that EnBay is significantly more accurate than most state-of-the-art classifiers, Bayesian and otherwise.
Pattern Recognition Letters, 1999
In this paper, an approach to study the nature of the classification models induced by Machine Learning algorithms is proposed. Instead of the predictive accuracy, the values of the predicted class labels are used to characterize the classification models. Over these predicted class labels, Bayesian networks are induced. Using these Bayesian networks, several assertions are extracted about the nature of the classification models induced by Machine Learning algorithms.
1995
In this paper we present a novel induction algorithm for Bayesian networks. This selective Bayesian network classifier selects a subset of attributes that maximizes predictive accuracy prior to the network learning phase, thereby learning Bayesian networks with a bias for small, high-predictive-accuracy networks. We compare the performance of this classifier with selective and non-selective naive Bayesian classifiers. We show that the selective Bayesian network classifier performs significantly better than both versions of the naive Bayesian classifier on almost all databases analyzed, and hence is an enhancement of the naive Bayesian classifier. Relative to the non-selective Bayesian network classifier, our selective Bayesian network classifier generates networks that are computationally simpler to evaluate and that display predictive accuracy comparable to that of Bayesian networks which model all features.
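The abstract does not spell out the selection procedure; the sketch below illustrates only a generic wrapper-style attribute-selection step (greedy forward selection on holdout accuracy), with a Gaussian naive Bayes scorer standing in for the paper's Bayesian-network learner. Function names and data are illustrative.

```python
# A hedged sketch of the attribute-selection idea only: greedily add the
# attribute that most improves holdout accuracy, then learn the final model
# on the selected subset. The scorer below is a stand-in (Gaussian naive
# Bayes), not the Bayesian-network learner used in the paper.
import numpy as np

def nb_accuracy(train_x, train_y, test_x, test_y):
    """Holdout accuracy of a Gaussian naive Bayes model on the given columns."""
    classes = np.unique(train_y)
    correct = 0
    for x, y in zip(test_x, test_y):
        best_c, best_s = None, -np.inf
        for c in classes:
            m = train_x[train_y == c]
            mu, var = m.mean(axis=0), m.var(axis=0) + 1e-6
            s = np.log(len(m) / len(train_x)) - 0.5 * np.sum(
                np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
            if s > best_s:
                best_c, best_s = c, s
        correct += int(best_c == y)
    return correct / len(test_y)

def select_attributes(train_x, train_y, test_x, test_y):
    """Greedy forward selection of the attribute subset maximizing accuracy."""
    selected, best_acc = [], 0.0
    remaining = list(range(train_x.shape[1]))
    improved = True
    while improved and remaining:
        improved = False
        for j in list(remaining):
            cols = selected + [j]
            acc = nb_accuracy(train_x[:, cols], train_y, test_x[:, cols], test_y)
            if acc > best_acc:
                best_acc, best_j, improved = acc, j, True
        if improved:
            selected.append(best_j)
            remaining.remove(best_j)
    return selected, best_acc

# Toy usage: attribute 0 is informative, attribute 1 is noise (illustrative data)
rng = np.random.default_rng(2)
x = np.hstack([np.concatenate([rng.normal(0, 1, 60), rng.normal(3, 1, 60)])[:, None],
               rng.normal(0, 1, (120, 1))])
y = np.array([0] * 60 + [1] * 60)
print(select_attributes(x[::2], y[::2], x[1::2], y[1::2]))  # expect attribute 0 selected
```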
Classification is an important data mining technique that is used by many applications. Several types of classifiers have been described in the research literature; examples include decision tree classifiers, rule-based classifiers, and neural network classifiers. Another popular classification technique is naïve Bayesian classification, a probabilistic approach that uses Bayes' theorem to predict the classes of unclassified records. A drawback of naïve Bayesian classification is that every time a new data record is to be classified, the entire dataset needs to be scanned in order to apply the set of equations that perform the classification. Scanning the dataset is normally a very costly step, especially if the dataset is very large. To alleviate this problem, a new approach to naïve Bayesian classification is introduced in this study. In this approach, a set of classification rules is constructed on top of the naïve Bayesian classifier; hence we call this approach the Rule-based Naïve Bayesian Classifier (RNBC). In RNBC, the dataset is scanned only once, off-line, at the time of building the classification rule set; subsequent scanning of the dataset is avoided. Furthermore, this study introduces a simple three-step methodology for constructing the classification rule set.
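The three-step rule-construction methodology itself is not detailed in the abstract; the sketch below only illustrates the single-pass idea RNBC builds on: summarize the dataset once into per-class count tables and classify later records from those tables alone. All identifiers and data are illustrative.

```python
# A hedged sketch of the single-pass idea (not the paper's rule-construction
# methodology): scan the dataset once to build per-class count tables, then
# classify any later record from those tables, never touching the dataset again.
from collections import Counter, defaultdict

def summarize_once(records, labels):
    """One off-line pass: class counts and per-(attribute, value, class) counts."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # (attr_index, value) -> Counter over classes
    for record, label in zip(records, labels):
        for j, value in enumerate(record):
            value_counts[(j, value)][label] += 1
    return class_counts, value_counts

def classify(record, class_counts, value_counts):
    """Naive-Bayes decision using only the precomputed tables (Laplace smoothing)."""
    total = sum(class_counts.values())
    best_class, best_score = None, float("-inf")
    for c, n_c in class_counts.items():
        score = n_c / total
        for j, value in enumerate(record):
            score *= (value_counts[(j, value)][c] + 1) / (n_c + 2)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Toy usage: categorical records (illustrative data only)
records = [("sunny", "hot"), ("sunny", "mild"), ("rain", "mild"), ("rain", "cool")]
labels = ["no", "no", "yes", "yes"]
tables = summarize_once(records, labels)
print(classify(("rain", "hot"), *tables))
```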
Systems, Man, and Cybernetics, Part B: …, 2002
In this paper, we address the problem of how to classify a set of query vectors that belong to the same unknown class. Sets of data known to be sampled from the same class are naturally available in many application domains, such as speaker recognition. We refer to these sets as homologous sets. We show how to take advantage of homologous sets in classification to obtain improved accuracy over classifying each query vector individually. Our method, called "homologous naive Bayes" (HNB), is based on the naive Bayes classifier, a simple algorithm shown to be effective in many application domains. HNB uses a modified classification procedure that classifies multiple instances as a single unit. Compared with a voting method and several other variants of naive Bayes classification, HNB significantly outperforms these methods on a variety of test data sets, even when the number of query vectors in the homologous sets is small. We also report a successful application of HNB to speaker recognition. Experimental results show that HNB can achieve classification accuracy comparable to the Gaussian mixture model, the most widely used speaker recognition approach, while using less time for both training and classification.
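As a hedged sketch of the "classify the set as a single unit" idea (not the paper's exact estimator), the snippet below sums per-vector naive-Bayes log-likelihoods over the homologous set and assigns one class to the whole set, rather than voting over per-vector decisions. The class parameters and data are illustrative.

```python
# A hedged sketch: sum the per-vector naive-Bayes log-likelihoods over the
# homologous set and pick the class once for the whole set, instead of
# voting over per-vector decisions. Parameter estimation is not shown.
import numpy as np

def class_log_likelihood(x, mu, var):
    """Log-likelihood of one vector under an independent-Gaussian class model."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

def classify_homologous_set(query_set, class_params, log_priors):
    """Jointly assign the whole set of query vectors to a single class."""
    best_class, best_score = None, -np.inf
    for c, (mu, var) in class_params.items():
        score = log_priors[c] + sum(class_log_likelihood(x, mu, var) for x in query_set)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Toy usage: two classes with illustrative Gaussian parameters
params = {"A": (np.array([0.0, 0.0]), np.array([1.0, 1.0])),
          "B": (np.array([3.0, 3.0]), np.array([1.0, 1.0]))}
priors = {"A": np.log(0.5), "B": np.log(0.5)}
queries = [np.array([2.5, 3.2]), np.array([3.4, 2.8]), np.array([2.9, 3.1])]
print(classify_homologous_set(queries, params, priors))  # expect "B"
```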
2011
Knowledge available through Semantic Web standards can easily be missing, generally because of the adoption of the Open World Assumption (i.e., the truth value of an assertion is not necessarily known). However, the rich relational structure that characterizes ontologies can be exploited to handle such missing knowledge in an explicit way. We present a Statistical Relational Learning system designed for learning terminological naïve Bayesian classifiers, which estimate the probability that a generic individual belongs to the target concept given its membership in a set of Description Logic concepts. During the learning process, we consistently handle the lack of knowledge that may be introduced by the adoption of the Open World Assumption, depending on the varying nature of the missing knowledge itself.
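The learning procedure itself is not given in the abstract; the sketch below illustrates only the open-world handling it describes, treating each concept membership as a three-valued feature (member, non-member, unknown) and letting unknown values contribute no evidence. The concept names and probabilities are illustrative, not the paper's.

```python
# A hedged sketch of the open-world idea only: unknown concept memberships
# (None) are skipped in the naive-Bayes product, so missing knowledge neither
# supports nor penalizes any class.
def classify_owa(individual, class_priors, cond_probs):
    """individual: dict concept -> True/False/None (None = unknown under OWA)."""
    best_class, best_score = None, float("-inf")
    for c, prior in class_priors.items():
        score = prior
        for concept, value in individual.items():
            if value is None:
                continue                     # unknown: contributes no evidence
            score *= cond_probs[c][(concept, value)]
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Illustrative probabilities for a target concept vs. its complement
priors = {"Target": 0.4, "NotTarget": 0.6}
cond = {
    "Target":    {("Person", True): 0.9, ("Person", False): 0.1,
                  ("WorksFor", True): 0.7, ("WorksFor", False): 0.3},
    "NotTarget": {("Person", True): 0.5, ("Person", False): 0.5,
                  ("WorksFor", True): 0.2, ("WorksFor", False): 0.8},
}
# Membership in "WorksFor" is unknown for this individual (open world)
print(classify_owa({"Person": True, "WorksFor": None}, priors, cond))
```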
2009
Data miners have access to a significant number of classifiers and use them on a variety of different types of dataset. This large selection makes it difficult to know which classifier will perform most effectively in any given case. Usually an understanding of learning algorithms is combined with detailed domain knowledge of the dataset at hand to lead to the choice of a classifier. We propose an empirical framework that quantitatively assesses the accuracy of a selection of classifiers on different datasets, resulting in a set of classification rules generated by the J48 decision tree algorithm. Data miners can follow these rules to select the most effective classifier for their work. By optimising the parameters used for learning and the sampling techniques applied, a set of rules were learned that select, with 78% accuracy, the most effective classifier.
The Naïve Bayesian Classifier and an Augmented Naïve Bayesian Classifier are applied to human classification tasks. The Naïve Bayesian Classifier is augmented with feature construction using a Galois lattice. The best features, measured on their within- and between-category overlap, are added to the category's concept description. The results show that space-efficient concept descriptions can predict much of the variance in the classification phenomena.
