Generalized supervised meta-blocking
Proceedings of the VLDB Endowment
https://doi.org/10.14778/3538598.3538611
Abstract
Entity Resolution is a core data integration task that relies on Blocking to scale to large datasets. Schema-agnostic blocking achieves very high recall, requires no domain knowledge and applies to data of any structuredness and schema heterogeneity. This comes at the cost of many irrelevant candidate pairs (i.e., comparisons), which can be significantly reduced by Meta-blocking techniques that leverage the entity co-occurrence patterns inside blocks: first, pairs of candidate entities are weighted in proportion to their matching likelihood, and then, pruning discards the pairs with the lowest scores. Supervised Meta-blocking goes beyond this approach by combining multiple scores per comparison into a feature vector that is fed to a binary classifier. By using probabilistic classifiers, Generalized Supervised Meta-blocking associates every pair of candidates with a score that can be used by any pruning algorithm. For higher effectiveness, new weighting schemes are examined as featur...
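The pipeline outlined in the abstract (schema-agnostic blocking, per-pair feature vectors, a probabilistic classifier, and score-based pruning) can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration and is not taken from the paper: the toy profiles, the two features (number of common blocks and Jaccard similarity of block sets), the hand-written logistic regression, the four labeled training pairs, and the top-K pruning rule. The paper's actual feature sets, classifiers, and pruning algorithms may differ.

```python
import math
from collections import defaultdict
from itertools import combinations

# Toy entity profiles (hypothetical data, not from the paper).
profiles = {
    "e1": "Apple iPhone 12 black 64GB",
    "e2": "iPhone 12 Apple 64GB black smartphone",
    "e3": "Samsung Galaxy S21 black",
    "e4": "Galaxy S21 Samsung smartphone 128GB",
    "e5": "Apple MacBook Pro 13",
    "e6": "MacBook Pro 13 Apple laptop",
}

def token_blocking(profiles):
    """Schema-agnostic blocking: one block per token, ignoring attribute names."""
    blocks = defaultdict(set)
    for eid, text in profiles.items():
        for token in set(text.lower().split()):
            blocks[token].add(eid)
    # A block with a single entity yields no comparisons, so drop it.
    return {t: ids for t, ids in blocks.items() if len(ids) > 1}

def pair_features(blocks):
    """Map every candidate pair to [common blocks, Jaccard of block sets]."""
    entity_blocks = defaultdict(set)
    for token, ids in blocks.items():
        for eid in ids:
            entity_blocks[eid].add(token)
    features = {}
    for ids in blocks.values():
        for a, b in combinations(sorted(ids), 2):
            if (a, b) not in features:
                common = len(entity_blocks[a] & entity_blocks[b])
                union = len(entity_blocks[a] | entity_blocks[b])
                features[(a, b)] = [float(common), common / union]
    return features

def predict(w, x):
    """Matching probability from a logistic model; w[-1] is the bias."""
    z = w[-1] + sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, epochs=2000, lr=0.1):
    """Plain stochastic gradient descent on the logistic loss."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        for x, label in zip(X, y):
            err = predict(w, x) - label
            for i, xi in enumerate(x):
                w[i] -= lr * err * xi
            w[-1] -= lr * err
    return w

def top_k_pruning(scores, k):
    """Keep the k highest-scoring pairs (one of many possible pruning rules)."""
    return set(sorted(scores, key=scores.get, reverse=True)[:k])

blocks = token_blocking(profiles)
features = pair_features(blocks)

# Small labeled sample (assumed): 1 = match, 0 = non-match.
labeled = {("e1", "e2"): 1, ("e1", "e5"): 0, ("e5", "e6"): 1, ("e2", "e3"): 0}
w = train_logistic([features[p] for p in labeled], list(labeled.values()))

# Every candidate pair receives a probability that any pruning rule can consume.
scores = {pair: predict(w, x) for pair, x in features.items()}
retained = top_k_pruning(scores, k=3)
print(sorted(retained))
```

Because the classifier outputs a probability rather than a hard label, the pruning step is interchangeable: the top-K cut used here could be replaced by a global threshold or a per-entity rule without retraining the model.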