With the increasing amount and complexity of data being collected, there is an urgent need to create automated techniques for mining the data. In particular, data being generated and stored by telecom companies overwhelms scientists' ability to manually discover patterns in the data. Because much of this data is structural in nature, or composed of parts and relations between the parts, linear attribute-value-based algorithms will not capture all of the intricacies of the data. Hence, there exists a need to develop scalable tools to analyze and discover concepts in structural databases.
Most approaches to knowledge discovery concentrate on either an attribute-value representation or a structural data representation. The discovery systems for these two representations are typically different, and their integration is non-trivial. We investigate a simpler integration of the two systems by coupling the two approaches. Our method first executes the structural discovery system on the data, and then uses these results to augment or compress the data before being input to the attribute-value-based system. We demonstrate this strategy using the AutoClass attribute-value-based clustering system and the Subdue structural discovery system. The results of the demonstration show that coupling the two systems allows the discovery of knowledge imperceptible to either system alone.
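The coupling strategy described above can be illustrated with a minimal sketch. It does not use Subdue or AutoClass: the function `most_common_edge` is a hypothetical stand-in for structural discovery (it simply picks the labeled edge present in the most examples), and `augment` is a hypothetical helper that adds the discovered pattern as a new attribute, so that any attribute-value clustering system could consume the result. The example data is invented for illustration.

```python
from collections import Counter

def most_common_edge(graphs):
    """Hypothetical stand-in for structural discovery:
    return the labeled edge that appears in the most graphs."""
    counts = Counter()
    for edges in graphs:
        counts.update(set(edges))  # count each edge once per graph
    edge, _ = counts.most_common(1)[0]
    return edge

def augment(records, graphs, pattern):
    """Add a boolean attribute recording whether each example
    contains the discovered structural pattern."""
    return [dict(rec, has_pattern=(pattern in edges))
            for rec, edges in zip(records, graphs)]

# Invented data: one attribute-value record and one edge set per example.
records = [{"weight": 1.0}, {"weight": 2.5}, {"weight": 0.7}]
graphs = [
    {("A", "bond", "B"), ("B", "bond", "C")},
    {("A", "bond", "B")},
    {("C", "bond", "D")},
]

pattern = most_common_edge(graphs)
augmented = augment(records, graphs, pattern)
```

The augmented records carry one extra column, which is the form of input an attribute-value system expects; compressing the data instead of augmenting it would replace the matched edges rather than add a column.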
The goal of the MavHome smart home project is to build an intelligent home environment that is aware of its inhabitants and their activities. Such a home is designed to provide maximum comfort to inhabitants at minimum cost. This can be done by learning the activities of the inhabitants and automating those activities. For this it is necessary to identify which of multiple inhabitants is currently present in the home. Subdue is a graph-based data mining algorithm that discovers patterns in structural data. By representing the activity patterns for each inhabitant as graphs, Subdue can be used for inhabitant identification. We introduce a multiple-class learning version of Subdue and show some preliminary results on synthetic smart home activity data for multiple inhabitants.
The purpose of this paper is to develop an evolutionary programming based system that performs data mining on databases represented as graphs. The importance of such an endeavor can hardly be overemphasized, given that much of the data collected nowadays is structural in nature, or is composed of parts and relations between the parts, which can be naturally represented as graphs. The searching capability of evolutionary programming is utilized for discovering concepts or substructures that are often repeating in such structural data. The superiority of the proposed technique over the previously developed SUBDUE system, which uses a computationally constrained beam search in the space of substructures, is demonstrated for a number of data sets in the Web domain.
The large amount of data collected today is quickly overwhelming researchers' abilities to interpret the data and discover interesting patterns. Knowledge discovery and data mining approaches hold the potential to automate the interpretation process, but these approaches frequently utilize computationally expensive algorithms. This research outlines a general approach for scaling KDD systems using parallel and distributed resources and applies the suggested strategies to the SUBDUE knowledge discovery system. SUBDUE has been used to discover interesting and repetitive concepts in graph-based databases from a variety of domains, but requires a substantial amount of processing time. Experiments that demonstrate scalability of parallel versions of the SUBDUE system are performed using CAD circuit databases and artificially-generated databases, and potential achievements and obstacles are discussed.
This paper describes a graph-based relational, unsupervised learning algorithm to infer node replacement graph grammars and its application to metabolic pathways. We search for frequent subgraphs and then check for overlap among the instances of the subgraphs in the input graph. If subgraphs overlap by one node, we propose a node replacement graph grammar production. We can also infer a hierarchy of productions by compressing portions of a graph described by a production and then inferring new productions on the compressed graph. We show learning curves and how the learning process changes when we increase the size of a sample set. We examine how computation time changes with an increased number of nodes in the input graphs. The graph grammars we inferred from metabolic pathways do not change further as the number of graphs in the input set increases, which indicates that the grammars found represent the input sets well.
Graphs are a natural way to represent multi-relational data and are extensively used to model a variety of application domains in diverse fields ranging from bioinformatics to homeland security. Often, in such graphs, certain subgraphs are known to possess some distinct properties, and graph patterns in the proximity of these subgraphs can be an indicator of these properties. In this work we focus on the task of mining in the proximity of subgraphs known to possess certain distinct properties, and we identify patterns which distinguish these subgraphs from other subgraphs without these properties. This task is novel and of considerable interest, as it can facilitate the prediction of previously unknown subgraphs possessing the properties under consideration in the graph and can lead to a better understanding of the application domain. We characterize the task of mining in the proximity of subgraphs as a supervised learning problem and present a heuristic algorithm for it. Experimental comparison with the ILP system CProgol on real world and artificial datasets provides a strong indication of the ability and viability of the approach in uncovering interesting patterns.
International Journal of Applied Mathematics and Computer Science, Jun 1, 2008
In this paper we study the inference of node and edge replacement graph grammars. We search for frequent subgraphs and then check for an overlap among the instances of the subgraphs in the input graph. If the subgraphs overlap by one node, we propose a node replacement graph grammar production. If the subgraphs overlap by two nodes or two nodes and an edge, we propose an edge replacement graph grammar production. We can also infer a hierarchy of productions by compressing portions of a graph described by a production and then inferring new productions on the compressed graph. We validate the approach in experiments where we generate graphs from known grammars and measure how well the approach infers the original grammar from the generated graph. We show graph grammars found in biological molecules and biological networks, and analyze learning curves of the algorithm.
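The overlap test described above can be sketched as follows. This is an illustrative simplification, not the paper's implementation: it assumes each instance of a frequent subgraph is already given as a pair of a node set and an undirected edge set, and the names `propose_production` and `triangle` are hypothetical.

```python
def propose_production(instance_a, instance_b):
    """Classify the overlap between two instances of the same subgraph.

    Each instance is a (nodes, edges) pair: a set of node ids and a set
    of frozenset({u, v}) undirected edges taken from the input graph.
    """
    nodes_a, edges_a = instance_a
    nodes_b, edges_b = instance_b
    shared_nodes = nodes_a & nodes_b
    shared_edges = edges_a & edges_b
    if len(shared_nodes) == 1 and not shared_edges:
        return "node replacement"
    if len(shared_nodes) == 2 and len(shared_edges) <= 1:
        return "edge replacement"
    return None  # no overlap, or an overlap the rules above do not cover

def triangle(x, y, z):
    """Hypothetical helper: build a triangle instance on three node ids."""
    return {x, y, z}, {frozenset(p) for p in [(x, y), (y, z), (x, z)]}

a = triangle(1, 2, 3)
b = triangle(3, 4, 5)   # shares node 3 with a
c = triangle(2, 3, 6)   # shares nodes 2, 3 and edge {2, 3} with a
d = triangle(7, 8, 9)   # disjoint from a

results = [propose_production(a, other) for other in (b, c, d)]
```

Here the three comparisons yield a node replacement candidate, an edge replacement candidate, and no production, in that order.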
The ever-increasing number of chemical compounds added every year has not been accompanied by a similar growth in our ability to analyze and classify these compounds. Preventing cancer caused by many of these chemicals is a problem of great scientific and humanitarian importance. The use of AI discovery tools for predicting chemical toxicity is being investigated. The basic idea behind the work is to obtain structure-activity representations (SARs) [Srinivasan et al.], which relate molecular structures to cancerous activity. The data is obtained from the U.S. National Toxicology Program conducted by the National Institute of Environmental Health Sciences (NIEHS). This research outlines a general approach for automatically discovering repetitive substructures in the datasets. Relevant SARs are identified using the Subdue substructure discovery system, which discovers commonly occurring substructures in a given set of compounds. The best substructure given by Subdue is used as a pattern indicative of cancerous activity.
Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Our dynamic graph-based relational mining approach has been developed to learn structural patterns in biological networks as they change over time. The analysis of dynamic networks is important not only to understand life at the system level, but also to discover novel patterns in other structural data. Most current graph-based data mining approaches overlook dynamic features of biological networks, because they are focused on only static graphs. Our approach analyzes a sequence of graphs and discovers rules that capture the changes that occur between pairs of graphs in the sequence. These rules represent the graph rewrite rules that the first graph must go through to be isomorphic to the second graph. Then, our approach feeds the graph rewrite rules into a machine learning system that learns general transformation rules describing the types of changes that occur for a class of dynamic biological networks. The discovered graph-rewriting rules show how biological networks change over time, and the transformation rules show the repeated patterns in the structural changes. In this paper, we apply our approach to biological networks to evaluate it and to understand how the biosystems change over time. We evaluate our results using coverage and prediction metrics, and compare them to the biological literature.
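The idea of a rewrite rule between two consecutive graphs can be sketched under a strong simplifying assumption: nodes keep stable, unique labels across time steps, so each rule reduces to the sets of edges removed and added, and graph equality stands in for isomorphism. The paper's method does not rely on this assumption; the names `rewrite_rules` and `apply_rule` and the example sequence are hypothetical.

```python
def rewrite_rules(graph_sequence):
    """For each consecutive pair of graphs (sets of (u, v) edges),
    record which edges must be removed and added to turn the first
    graph into the second."""
    rules = []
    for before, after in zip(graph_sequence, graph_sequence[1:]):
        rules.append({"removed": before - after, "added": after - before})
    return rules

def apply_rule(graph, rule):
    """Apply one rewrite rule to a graph and return the new edge set."""
    return (graph - rule["removed"]) | rule["added"]

# Invented sequence of three snapshots of a small network.
sequence = [
    {("A", "B"), ("B", "C")},
    {("A", "B"), ("B", "C"), ("C", "D")},
    {("A", "B"), ("C", "D")},
]

rules = rewrite_rules(sequence)
replayed = [apply_rule(g, r) for g, r in zip(sequence, rules)]
```

Replaying each rule on its source graph reproduces the next snapshot, which is the property the rewrite rules are meant to capture. A learner that generalizes over the collected `removed` and `added` sets would correspond to the second stage described above.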
We perform an experimental comparison of the graph-based multi-relational data mining system, Subdue, and the inductive logic programming system, CProgol, on the Mutagenesis dataset and various artificially generated Bongard problems. Experimental results indicate that Subdue can significantly outperform CProgol while discovering structurally large multi-relational concepts. It is also observed that CProgol is better at learning semantically complicated concepts and it tends to use background knowledge more effectively than Subdue. An analysis of the results indicates that the differences in the performance of the systems are a result of the difference in the expressiveness of the logic-based and the graph-based representations. The ability of graph-based systems to learn structurally large concepts comes from the use of a weaker representation whose expressiveness is intermediate between propositional and first-order logic. The use of this weaker representation is advantageous whil...
2008 IEEE International Conference on Data Mining Workshops, 2008
We propose a dynamic graph-based relational mining approach using graph-rewriting rules to learn patterns in networks that structurally change over time. A dynamic graph containing a sequence of graphs over time represents dynamic properties as well as structural properties of the network. Our approach discovers graph-rewriting rules, which describe the structural transformations between two sequential graphs over time, and also learns description rules that generalize over the discovered graph-rewriting rules. The discovered graph-rewriting rules show how networks change over time, and the description rules in the graph-rewriting rules show temporal patterns in the structural changes. We apply our approach to biological networks to understand how the biosystems change over time. Our compression-based discovery of the description rules is compared with the frequent subgraph mining approach using several evaluation metrics.
2008 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology, 2008
Our project introduces a graph-based relational learning approach using graph-rewriting rules for temporal and structural analysis of biological networks changing over time. The analysis of dynamic biological networks is necessary to understand life at the system level, because biological networks continuously change their structures and properties while an organism performs various biological activities to promote reproduction and sustain life. Most current graph-based data mining approaches overlook dynamic features of biological networks, because they are focused on only static graphs. Most approaches for analysis of microarray data disregard structural properties of biological systems. In contrast, our dynamic graph-based relational learning approach describes how the graphs change temporally and structurally in the dynamic graph representing biological networks in combination with microarray data.
Accessing information of interest on the Internet presents a challenge to scientists and analysts, particularly if the desired information is structural in nature. Our goal is to design a structural search engine which uses the hyperlink structure of the Web, in addition to textual information, to search for sites of interest. To design a structural search engine, we use the SUBDUE graph-based discovery tool. The tool, called WebSUBDUE, is enhanced by WordNet features that allow the engine to search for synonym terms. Our search engine retrieves sites corresponding to structures formed by graph-based user queries. We demonstrate the approach on a number of structural web queries.
International Journal of Applied Mathematics and Computer Science, 2008
Inferring Graph Grammars by Detecting Overlap in Frequent Subgraphs
International Journal on Artificial Intelligence Tools, 2006
Historically, data mining research has been focused on discovering sets of attributes that discriminate data entities into classes or association rules between attributes. In contrast, we are working to develop data mining techniques to discover patterns consisting of complex relationships between entities. Our research is particularly applicable to domains in which the data is event driven, such as counter-terrorism intelligence analysis. In this paper we describe an algorithm designed to operate over relational data received from a continuous stream. Our approach includes a mechanism for summarizing discoveries from previous data increments so that the globally best patterns can be computed by examining only the new data increment. We then describe a method by which relational dependencies that span across temporal increment boundaries can be efficiently resolved so that additional pattern instances, which do not reside entirely in a single data increment, can be discovered. We al...
Papers by Diane Cook