Near Duplicate Detection In Relational Database

Bhagyashri Kelkar

2013

Abstract

Near-duplicate detection is an important process for many database-driven applications. Accurately identifying duplicate entities across multiple data sources is a major challenge for organizations and researchers. Detecting approximately duplicate records that refer to the same real-world entity is important for achieving higher data quality. Ideally, each record must be compared with every other record in the dataset to find duplicates. The search space for record comparisons can be reduced by using the mutual exclusion property of tuples. In this paper we analyze two blocking algorithms, the adaptive sorted neighborhood method (ASNM) and iterative blocking, each combined with Jaro-Winkler distance for string matching. Experimental evaluation on a real dataset shows that ASNM with Jaro-Winkler distance outperforms iterative blocking in precision and recall and requires far fewer comparisons. The experiments also show that the string-matching threshold gives the best results when its value is in the range of 85% to 90%.
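As an illustration of the string-matching step, the following is a minimal sketch of the standard Jaro-Winkler similarity with a match threshold. It is not the authors' implementation: the function names, the prefix scale of 0.1, the prefix cap of 4 characters, and the default threshold of 0.9 are assumptions chosen to match the common textbook definition and the 85% to 90% range reported above.

```python
def jaro(s1: str, s2: str) -> float:
    """Standard Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3.0


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1,
                 max_prefix: int = 4) -> float:
    """Jaro similarity boosted for a shared prefix (assumed defaults)."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1.0 - base)


def is_near_duplicate(s1: str, s2: str, threshold: float = 0.9) -> bool:
    """Hypothetical helper: treat two strings as duplicates above a threshold."""
    return jaro_winkler(s1, s2) >= threshold
```

With these definitions, `jaro_winkler("MARTHA", "MARHTA")` evaluates to roughly 0.961, so the pair would count as a match at either an 85% or a 90% threshold.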

Key takeaways

The adaptive sorted neighborhood method with Jaro-Winkler similarity outperforms iterative blocking in precision and recall.
The optimal similarity threshold for the restaurant dataset is 85%, while 90% generally yields the best results across datasets.
Blocking reduces the search space, so far fewer than the O(n^2) comparisons of an all-pairs scan are needed.
The allocation of attribute weights significantly influences duplicate detection results.
Duplicate rates of roughly 1% to 5% are common in integrated databases and reduce data quality.
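To make the search-space reduction concrete, the sketch below compares an all-pairs scan with a basic fixed-window sorted neighborhood pass. Note the substitution: this is the plain sorted neighborhood method, not the adaptive variant (ASNM) evaluated in the paper, which adjusts the window size as it scans. The sample restaurant names, the lowercase sort key, and the window size of 3 are illustrative assumptions.

```python
from itertools import combinations


def all_pairs(records):
    """Every unordered pair of record indices: n * (n - 1) / 2 comparisons."""
    return set(combinations(range(len(records)), 2))


def sorted_neighborhood_pairs(records, key, window=3):
    """Candidate pairs from a fixed-size window slid over the sorted records.

    Each record is paired only with the next (window - 1) records in sort
    order, so the number of comparisons grows linearly with n.
    """
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        for j in order[pos + 1:pos + window]:
            pairs.add((min(i, j), max(i, j)))
    return pairs


# Hypothetical sample data; indices 0/2 and 1/4 are intended duplicates.
records = [
    "Arnie Morton's of Chicago",
    "Cafe Bizou",
    "Arnie Mortons",
    "Spago",
    "Cafe Bizou Restaurant",
    "Campanile",
]

full = all_pairs(records)
candidates = sorted_neighborhood_pairs(records, key=str.lower, window=3)
print(len(full), len(candidates))
```

On this sample the all-pairs scan needs 15 comparisons while the windowed pass needs 9, and both intended duplicate pairs still appear among the candidates because sorting places them next to each other. Only the candidate pairs would then be passed to the string-similarity step.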

Figures (9)

Figure 3. RECORD COMPARISONS REQUIRED FOR ITERATIVE BLOCKING (IB) AND ADAPTIVE SORTED NEIGHBORHOOD (ASNM) IN COMBINATION WITH JARO-WINKLER (JW) AND SOUNDEX (SND) SIMILARITY

TABLE I. RESULTS OF ADAPTIVE SORTED NEIGHBORHOOD METHOD WITH JARO-WINKLER SIMILARITY AND EQUAL FIELD WEIGHTS

Implementation of adaptive sorted neighborhood with Jaro-Winkler string similarity:

Figure 2. COMPARATIVE CHART OF ITERATIVE BLOCKING (IB) AND ADAPTIVE SORTED NEIGHBORHOOD (ASNM) IN COMBINATION WITH JARO-WINKLER (JW) AND SOUNDEX (SND) SIMILARITY

TABLE V. RESULTS OF ADAPTIVE SORTED NEIGHBORHOOD METHOD USING JARO-WINKLER SIMILARITY AND UNEQUAL FIELD WEIGHTS. Dbgen Dataset:

TABLE III. RESULTS OF ITERATIVE BLOCKING USING JARO-WINKLER SIMILARITY AND UNEQUAL FIELD WEIGHTS. Comparing Table II with Table III indicates that assigning unequal weights when forming the similarity index improves precision, recall, and F-measure but increases the number of record comparisons.

TABLE IV. RESULTS OF ADAPTIVE SORTED NEIGHBORHOOD METHOD WITH JARO-WINKLER SIMILARITY AND EQUAL FIELD WEIGHTS. If the highest weight is given to the name field, the following results are obtained.

TABLE II. RESULTS OF ITERATIVE BLOCKING WITH JARO-WINKLER SIMILARITY AND EQUAL FIELD WEIGHTS
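The tables above report precision, recall, and F-measure under equal and unequal field weights. The sketch below shows one common way such quantities are computed; it is an assumption about the general approach, not a reproduction of the paper's procedure. The field names, per-field similarity values, and weights are hypothetical, and the record score is taken to be a normalized weighted sum of per-field similarities.

```python
def weighted_similarity(field_sims, weights):
    """Combine per-field similarities into one score using field weights.

    Weights are normalized, so only their relative sizes matter.
    """
    total = sum(weights.values())
    return sum(field_sims[f] * w for f, w in weights.items()) / total


def precision_recall_f(predicted, actual):
    """Precision, recall, and F-measure over sets of duplicate pairs."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical per-field similarities for one candidate record pair.
field_sims = {"name": 0.95, "address": 0.60, "city": 1.00}

equal = weighted_similarity(field_sims, {"name": 1, "address": 1, "city": 1})
name_heavy = weighted_similarity(
    field_sims, {"name": 0.6, "address": 0.2, "city": 0.2}
)

# Hypothetical detected pairs versus known duplicate pairs.
predicted = {(0, 2), (1, 4), (3, 5)}
actual = {(0, 2), (1, 4), (1, 5)}
precision, recall, f_measure = precision_recall_f(predicted, actual)
```

In this example the equal-weight score is 0.85 and the name-heavy score is 0.89, so at a 0.9 threshold neither weighting flags the pair while at 0.85 both do; shifting weight toward a reliable field moves borderline pairs across the threshold, which is how weighting can change the reported precision and recall.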

Authors
Bhagyashri A. Kelkar received the B.E. in Computer Science & Engineering from Walchand College of Engineering, Sangli, in 1995. She is currently pursuing the M.E. degree in Computer Science & Engineering.
Prof. K. B. Manwade received the M.Tech. degree in computer science from Shivaji University, Kolhapur. He is working as Head of the Computer Science Department at Ashokrao Mane Group of Institutes, Wathar.
Prof. G. A. Patil received the M.E. degree in Computer Science & Engineering from Walchand College, Sangli. He is working as head of department at D. Y. Patil College of Engineering & Technology, Kolhapur.
