The 3w model and algebra for unified data mining
2000
Abstract
Real data mining and analysis applications call for a framework that adequately supports knowledge discovery as a multi-step process, where the input of one mining operation can be the output of another. Previous studies, which focus primarily on fast computation of one specific mining task at a time, ignore this vital issue. Motivated by this observation, we develop a unified model supporting all major mining and analysis tasks. Our model consists of three distinct worlds, corresponding to intensional and extensional dimensions, and to data sets. The notion of dimension is a centerpiece of the model. Equipped with hierarchies, dimensions integrate the output of seemingly dissimilar mining and analysis operations in a clean manner. We propose an algebra, called the dimension algebra, for manipulating intensional dimensions, as well as operators that serve as "bridges" between the worlds. We demonstrate by examples that several real data mining processes can be captured using our model and algebra. We demonstrate the naturality of the algebra by establishing several identities. Finally, we discuss efficient implementation of the proposed framework.