Near Duplicate Detection In Relational Database

2013

Abstract

Near-duplicate detection is an important process for many database-based applications. Accurately identifying duplicate entities across multiple data sources is a major challenge for organizations and researchers. Detecting approximately duplicate records that refer to the same real-world entity is essential for keeping a database consistent and achieving higher data quality. Ideally, each record would be compared with every other record in the dataset to find duplicates. The search space for record comparisons can be reduced by exploiting the mutual exclusion property of tuples. In this paper we analyze two blocking algorithms, the adaptive sorted neighborhood method (ASNM) and iterative blocking, each combined with the Jaro-Winkler distance for string matching. Experimental evaluation on a real dataset shows that ASNM with the Jaro-Winkler distance outperforms iterative blocking in precision and recall and requires far fewer comparisons. The experiments also show that a string-matching threshold in the range of 85% to 90% gives the best results.
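The string-matching step can be illustrated with a minimal Python sketch of the standard Jaro-Winkler measure. This is not the authors' implementation. The function names, the lowercasing step, and the default prefix scale of 0.1 are assumptions made for illustration. The code returns a similarity in [0, 1], so the 85% threshold from the abstract is written as 0.85.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means the strings are identical."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and at most `window` positions apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1,
                 max_prefix: int = 4) -> float:
    """Jaro similarity boosted for a shared prefix of up to `max_prefix` chars."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return j + prefix * prefix_scale * (1.0 - j)


def is_near_duplicate(a: str, b: str, threshold: float = 0.85) -> bool:
    """Hypothetical helper: flag a pair whose similarity reaches the threshold."""
    # Lowercasing is an assumed normalization step, not taken from the paper.
    return jaro_winkler(a.lower(), b.lower()) >= threshold
```

Raising `threshold` toward 0.90 flags fewer pairs and so trades recall for precision. The abstract reports that values in that range worked best on the authors' data.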

Key takeaways

  1. The adaptive sorted neighborhood method (ASNM) with Jaro-Winkler outperforms iterative blocking in precision and recall.
  2. The optimal similarity threshold for the restaurant dataset is 85%, while 90% generally yields the best results across datasets.
  3. Blocking reduces the search space, replacing the O(n^2) all-pairs comparison with a much smaller set of candidate pairs.
  4. The weights assigned to attributes significantly influence duplicate detection outcomes.
  5. Roughly 1% to 5% of the records in integrated databases are duplicates, which degrades data quality.
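The effect of blocking on the number of comparisons can be sketched in Python. The sketch below uses the basic fixed-window sorted neighborhood method, not the adaptive variant (ASNM) evaluated in the paper. The sample restaurant names, the sorting key, and the window size are all hypothetical.

```python
from itertools import combinations


def all_pairs(records):
    """Every unordered pair of record indices: n * (n - 1) / 2 comparisons."""
    return list(combinations(range(len(records)), 2))


def sorted_neighborhood_pairs(records, key, window=3):
    """Candidate pairs from a fixed-size window slid over the sorted records.

    Each record is paired only with the next `window - 1` records in sort
    order, so the number of candidates grows linearly with the dataset size.
    """
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        for j in order[pos + 1: pos + window]:
            pairs.add((min(i, j), max(i, j)))
    return pairs


# Hypothetical sample data; not taken from the paper's dataset.
records = [
    "arnie morton's of chicago",
    "arnie mortons of chicago",
    "art's delicatessen",
    "arts deli",
    "hotel bel-air",
    "bel-air hotel",
]

candidates = sorted_neighborhood_pairs(records, key=str.lower, window=3)
print(len(all_pairs(records)), len(candidates))
```

Only the candidate pairs would then be passed to the string matcher. A fixed window can miss duplicates whose sort keys are far apart, which is one motivation for adaptive window sizing.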

References

  1. Xiaochun Yang, Bin Wang, Guoren Wang, and Ge Yu. "RSEARCH: Enhancing Keyword Search in Relational Databases Using Nearly Duplicate Records." Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 2010.
  2. Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, and Vassilios S. Verykios. "Duplicate Record Detection: A Survey." IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 1, January 2007.
  3. Su Yan, Dongwon Lee, Min-Yen Kan, and Lee C. Giles. "Adaptive sorted neighborhood methods for efficient record linkage." In Proceedings of the 7th ACM/IEEE-CS Joint Conference on Digital Libraries, 2007.
  4. Steven Euijong Whang, David Menestrina, Georgia Koutrika, Martin Theobald, and Hector Garcia-Molina. "Entity resolution with iterative blocking." In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data (SIGMOD '09), 2009, pp. 219-232.
  5. A. E. Monge and C. Elkan. "An efficient domain-independent algorithm for detecting approximately ..."
  6. Bilenko, R. Mooney, W. Cohen, P. Ravikumar, and S. Fienberg. "Adaptive name matching in information integration." IEEE Intelligent Systems, 18(5):16-23, Sep/Oct 2003.

Authors

Bhagyashri A. Kelkar received the B.E. in Computer Science & Engineering from Walchand College of Engineering, Sangli, in 1995. She is currently pursuing the M.E. degree in Computer Science & Engineering.

Prof. K. B. Manwade received the M.Tech. degree in computer science from Shivaji University, Kolhapur. He is Head of the Computer Science Department at Ashokrao Mane Group of Institutes, Wathar.

Prof. G. A. Patil received the M.E. degree in Computer Science & Engineering from Walchand College, Sangli. He is Head of Department at D. Y. Patil College of Engineering & Technology, Kolhapur.