A Role based Address Cleaner
2010, International Journal of Information Sciences and Computer Engineering
Abstract
About 80% to 90% of governmental and business data collections contain address information. In many cases, address records are captured and/or stored in a free-form or inconsistent manner. Dirty data has many causes: misuse of abbreviations, data entry mistakes, control information hiding, missing fields, spelling errors, outdated codes, etc. By the 'garbage in, garbage out' principle, dirty data distorts any information derived from it. The purpose of address cleaning is to maximize the value of address data by ensuring that every address is spelt correctly and properly structured. This improves accuracy and standardization in mailing, boosts company image, reduces mailing costs, and, through geocoding, opens up opportunities to support strategic decisions with accurate spatial analysis. We report the implementation of a role based address cleaner. In this address cleaner we define indicators and grammars for address cleaning, which makes the cleaning process configurable.
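As a rough illustration of how configurable indicators and grammars might drive address cleaning, the sketch below tags each token using an indicator table and accepts an address only if its tag sequence matches a grammar pattern. The indicator entries, the tag names (`NUM`, `WORD`, `STREET_TYPE`), the grammar, and the function names are all hypothetical; they are not taken from the paper and only show the general idea.

```python
import re

# Hypothetical indicator table: maps a raw token to (tag, canonical form).
# Extending this table is how the cleaner would be configured.
INDICATORS = {
    "st": ("STREET_TYPE", "Street"),
    "str": ("STREET_TYPE", "Street"),
    "street": ("STREET_TYPE", "Street"),
    "rd": ("STREET_TYPE", "Road"),
    "road": ("STREET_TYPE", "Road"),
    "ave": ("STREET_TYPE", "Avenue"),
    "avenue": ("STREET_TYPE", "Avenue"),
}

# Hypothetical grammar: a regular expression over the sequence of tags.
# NUM = house number, WORD = free text, STREET_TYPE = indicator hit.
GRAMMAR = re.compile(r"^NUM( WORD)+ STREET_TYPE( WORD)*$")


def tag_token(token):
    """Return (tag, cleaned text) for one raw token."""
    key = token.lower().rstrip(".")
    if key in INDICATORS:
        return INDICATORS[key]
    if key.isdigit():
        return ("NUM", key)
    return ("WORD", key.capitalize())


def clean_address(raw):
    """Standardise an address, or return None if the grammar does not match."""
    tokens = raw.replace(",", " ").split()
    tagged = [tag_token(t) for t in tokens]
    tag_sequence = " ".join(tag for tag, _ in tagged)
    if not GRAMMAR.match(tag_sequence):
        return None
    return " ".join(text for _, text in tagged)
```

Under these assumptions, `clean_address("12 main st., pretoria")` yields `"12 Main Street Pretoria"`, while an input such as `"main st"` has no house number, fails the grammar, and returns `None` so it can be flagged for review. Changing the behaviour means editing the indicator table or the grammar pattern rather than the code.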
- M. A. Bashar received the B.Sc. (Eng) degree in Computer Science and Engineering from Shah Jalal University of Science and Technology, Sylhet, Bangladesh, in 2008. He worked as a Software Engineer at SDSL, the local development branch of AfriGIS, South Africa (http://www.afrigis.co.za), from August 2008 to April 2010. He has recently joined the Department of Computer Science and Engineering at Comilla University as a Lecturer. His research interests are robotics, data analysis and pattern recognition, data mining, artificial intelligence, information retrieval, web intelligence, e-health, searching, web map, location based service, ubiquitous advertisement, and quantum computation.
- M. Mashiur Rahman received the B.Sc. (Eng) degree in Computer Science and Engineering from Shah Jalal University of Science and Technology, Sylhet, Bangladesh. He worked as a Software Engineer at SDSL, the local development branch of AfriGIS, South Africa (http://www.afrigis.co.za). He has recently joined Asenic (a software firm). His research interests are data analysis and pattern recognition, data mining, information retrieval, and ubiquitous advertisement.
- Abdullah Al Rahed received the B.Sc. (Eng) degree in Computer Science and Engineering from Shah Jalal University of Science and Technology, Sylhet, Bangladesh. He is working as a Software Engineer at SDSL, the local development branch of AfriGIS, South Africa (http://www.afrigis.co.za). His research interests are data analysis and pattern recognition, data mining, information retrieval, and ubiquitous advertisement.
- M. A. Chowdhury received the B.Sc. (Eng.) degree in Computer Science and Engineering from Shah Jalal University of Science and Technology, Sylhet, Bangladesh, in 2008. He worked as a Software Engineer at SDSL, the local development branch of AfriGIS, South Africa (http://www.afrigis.co.za). He has recently joined Therap BD Ltd., the technical branch of Therap Services LLC, USA (http://www.therapservices.net), as a software developer. His research interests are robotics, data analysis and pattern recognition, data mining, artificial intelligence, information retrieval, web intelligence, e-health, searching, web map, location based service, ubiquitous advertisement, and quantum computation.
- M. P. Sajjad received the B.Sc. degree in Computer Science and Engineering from Shahjalal University of Science and Technology, Sylhet, Bangladesh, in 2007. He is currently working as a Software Engineer at SDSL, the local development branch of AfriGIS, South Africa (http://www.afrigis.co.za). His research interests include information retrieval, data analysis, pattern recognition, and building and managing distributed systems.