Managing data quality by identifying the noisiest data samples

2012, Proceedings of 2012 IEEE International Conference on Service Operations and Logistics, and Informatics

https://doi.org/10.1109/SOLI.2012.6273510

Abstract

Enterprise datasets are often noisy: several columns can contain non-standard, erroneous, or missing information. Poor-quality data can lead to incorrect reporting and wrong conclusions. Data cleansing standardizes such data to improve its quality. Cleansing tasks often involve writing rules manually, which requires first understanding the data quality issues and then writing data transformation rules to correct them. This is a human-intensive task. In this study we propose a method to identify the noisy subsets of huge unlabelled textual datasets. It is a two-step process; in the first step we develop an estimation tool to predict the data quality on an unlabelled text dataset as produced by a segmentation model.
