DATA MINING: A CONCEPTUAL OVERVIEW
Abstract
This tutorial provides an overview of the data mining process and a basic understanding of how to plan, evaluate, and successfully refine a data mining project, particularly in terms of model building and model evaluation. Methodological considerations are discussed and illustrated. After explaining the nature of data mining and its importance in business, the tutorial describes the underlying machine learning and statistical techniques. It then describes CRISP-DM, now used in industry as the standard technology-neutral data mining process model. The paper concludes with a major illustration of the data mining process methodology and a discussion of the unsolved problems that offer opportunities for research. The approach is intended to be both practical and conceptually sound, so that it is useful to academics and practitioners alike.
Data Mining: A Conceptual Overview by J. Jackson