Scaling entity resolution: A loosely schema-aware approach

2019, Information Systems

https://doi.org/10.1016/J.IS.2019.03.006

Abstract

In big data sources, real-world entities are typically represented with a variety of schemata and formats (e.g., relational records, JSON objects). Different profiles (i.e., representations) of an entity often contain redundant and/or inconsistent information. Thus, identifying which profiles refer to the same entity is a fundamental task (called Entity Resolution) to unleash the value of big data. The naïve all-pairs comparison solution is impractical on large data; hence, blocking methods are employed to partition a profile collection into (possibly overlapping) blocks and limit the comparisons to profiles that appear together in the same block. Meta-blocking is the task of restructuring a block collection to remove superfluous comparisons. Existing meta-blocking approaches rely exclusively on schema-agnostic features, under the assumption that handling the schema variety of big data does not pay off for such a task. In this paper, we demonstrate how "loose" schema information (i.e., statistics collected directly from the data) can be exploited to enhance the quality of the blocks in a holistic loosely schema-aware (meta-)blocking approach that can be used to speed up any Entity Resolution algorithm. We call it Blast (Blocking with Loosely-Aware Schema Techniques). We show how Blast can automatically extract the loose schema information by adopting an LSH-based step that efficiently handles the volume and schema heterogeneity of the data. Furthermore, we introduce a novel meta-blocking algorithm that can be employed to efficiently execute Blast on MapReduce-like systems (such as Apache Spark). Finally, we experimentally demonstrate, on real-world datasets, that Blast outperforms state-of-the-art (meta-)blocking approaches.
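To make the blocking and meta-blocking terminology concrete, the following is a minimal sketch of the general idea, not the Blast algorithm itself. It assumes schema-agnostic token blocking (every token in any attribute value defines a block) and a simple pruning rule that keeps only candidate pairs sharing at least the average number of blocks. The function names, the toy profiles, and the pruning threshold are all illustrative assumptions, not taken from the paper.

```python
from collections import defaultdict
from itertools import combinations


def token_blocking(profiles):
    """Schema-agnostic blocking: one block per token, ignoring attribute names."""
    blocks = defaultdict(set)
    for pid, profile in profiles.items():
        for value in profile.values():
            for token in str(value).lower().split():
                blocks[token].add(pid)
    # A block with a single profile yields no comparison, so drop it.
    return {token: ids for token, ids in blocks.items() if len(ids) > 1}


def pair_weights(blocks):
    """Weight each candidate pair by the number of blocks it co-occurs in."""
    weights = defaultdict(int)
    for ids in blocks.values():
        for a, b in combinations(sorted(ids), 2):
            weights[(a, b)] += 1
    return dict(weights)


def prune_below_mean(weights):
    """Illustrative meta-blocking step: keep pairs at or above the mean weight."""
    if not weights:
        return {}
    mean = sum(weights.values()) / len(weights)
    return {pair: w for pair, w in weights.items() if w >= mean}


# Hypothetical profiles with heterogeneous schemata.
profiles = {
    "p1": {"name": "John Smith", "city": "Boston"},
    "p2": {"full_name": "Smith John", "location": "Boston MA"},
    "p3": {"name": "Mary Jones", "city": "Boston"},
    "p4": {"title": "Jones Mary", "addr": "Austin"},
}

blocks = token_blocking(profiles)
weights = pair_weights(blocks)
kept = prune_below_mean(weights)
print(kept)
```

In this toy collection, the naïve approach would compare all 6 pairs, token blocking proposes 4 of them, and the pruning step retains 2: `("p1", "p2")` and `("p3", "p4")`. The pairs that are dropped share only the frequent token `boston`, which is the kind of superfluous comparison that meta-blocking aims to remove.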
