Meta-Blocking: Taking Entity Resolution to the Next Level

2014, IEEE Transactions on Knowledge and Data Engineering

https://doi.org/10.1109/TKDE.2013.54

Abstract

Entity Resolution is an inherently quadratic task that typically scales to large data collections through blocking. In the context of highly heterogeneous information spaces, blocking methods rely on redundancy in order to ensure high effectiveness at the cost of lower efficiency (i.e., more comparisons). This effect is partially ameliorated by coarse-grained block processing techniques that discard entire blocks either a priori or during the resolution process. In this paper, we introduce meta-blocking as a generic procedure that intervenes between the creation and the processing of blocks, transforming an initial set of blocks into a new one with substantially fewer comparisons and equally high effectiveness. In essence, meta-blocking aims at extracting the most similar pairs of entities by leveraging the information that is encapsulated in the block-to-entity relationships. To this end, it first builds an abstract graph representation of the original set of blocks, with the nodes corresponding to entity profiles and the edges connecting the co-occurring ones. During the creation of this structure, all redundant comparisons are discarded, while the superfluous ones can be removed by pruning the edges with the lowest weights. We analytically examine both procedures, proposing a multitude of edge weighting schemes, graph pruning algorithms, and pruning criteria. Our approaches are schema-agnostic, thus accommodating any type of blocks. We evaluate their performance through a thorough experimental study over three large-scale, real-world datasets, with the outcomes verifying significant efficiency enhancements at a negligible cost in effectiveness.
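
The graph construction and pruning steps outlined in the abstract can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not fix a weighting scheme or pruning criterion, so the sketch assumes that an edge's weight is the number of blocks its two entities share and that edges below the mean weight are pruned. The function names `build_blocking_graph` and `prune_edges` and the toy block collection are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_blocking_graph(blocks):
    """Return {(a, b): weight} with one edge per co-occurring entity pair.

    Assumed weighting scheme (not taken from the abstract): the weight of
    an edge is the number of blocks the two entities have in common.
    """
    weights = defaultdict(int)
    for entities in blocks.values():
        # Sorting gives every pair a canonical order, so a pair that
        # co-occurs in several blocks maps to a single edge instead of
        # several redundant comparisons.
        for a, b in combinations(sorted(set(entities)), 2):
            weights[(a, b)] += 1
    return dict(weights)


def prune_edges(weights):
    """Keep only edges whose weight reaches the mean edge weight.

    The mean-weight threshold is an assumed pruning criterion chosen
    for illustration.
    """
    if not weights:
        return {}
    threshold = sum(weights.values()) / len(weights)
    return {pair: w for pair, w in weights.items() if w >= threshold}


# Hypothetical toy input: block id -> entity profile ids.
blocks = {
    "b1": ["e1", "e2", "e3"],
    "b2": ["e1", "e2"],
    "b3": ["e2", "e3", "e4"],
    "b4": ["e1", "e2", "e4"],
}

graph = build_blocking_graph(blocks)  # one edge per distinct pair
kept = prune_edges(graph)             # low-weight edges removed
```

In this toy input the four blocks entail 10 pairwise comparisons. The graph collapses them into 6 distinct edges, and the assumed mean-weight threshold keeps the 3 edges with the highest weights, which form the restructured set of comparisons.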
