Enhancing Data Analysis with Noise Removal

2006, IEEE Transactions on Knowledge and Data Engineering

Abstract

Removing noise objects is an important goal of data cleaning, because noise hinders most types of data analysis. Most existing data cleaning methods focus on noise caused by low-level data errors arising from an imperfect data collection process, but data objects that are irrelevant or only weakly relevant can also significantly hinder data analysis. Thus, if the goal is to enhance the data analysis as much as possible, these objects should also be considered noise, at least with respect to the underlying analysis. Consequently, there is a need for data cleaning techniques that remove both types of noise. Because data sets can contain large amounts of noise, these techniques also need to be able to discard a potentially large fraction of the data. This paper explores four techniques intended for noise removal to enhance data analysis in the presence of high noise levels.

Hui Xiong is an assistant professor in the Management Science and Information Systems department at Rutgers, the State University of New Jersey. He received the B.E. degree in Automation from the University of Science and Technology of China, China, the M.S. degree in Computer Science from the National University of Singapore, Singapore, and the Ph.D. degree in Computer Science from the University of Minnesota, MN, USA. His research interests include data mining, statistical computing, Geographic Information Systems (GIS), Biomedical informatics, and information security. He has published over 20 technical papers in peer-reviewed journals and conference proceedings and is the co-editor of the book entitled "Clustering and Information Retrieval". He has also served on the program committees for a number of conferences and workshops. Dr. Xiong is a member of the IEEE Computer Society and the ACM.