Performance Bounds for Pairwise Entity Resolution
2015, arXiv (Cornell University)
https://doi.org/10.48550/ARXIV.1509.03302
Abstract
One significant challenge in scaling entity resolution algorithms to massive datasets is understanding how performance changes after moving beyond the realm of small, manually labeled reference datasets. Unlike in traditional machine learning tasks, when an entity resolution algorithm performs well on a small hold-out dataset, there is no guarantee that this performance holds on larger hold-out datasets. We prove simple bounding properties between the performance of a match function on a small validation set and the performance of a pairwise entity resolution algorithm on arbitrarily sized datasets. Thus, our approach enables optimization of pairwise entity resolution algorithms for large datasets using a small set of labeled data.
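One way to see why pairwise performance need not carry over to larger datasets is to count pairs. The sketch below is an illustration only, not the bound proved in the paper. It assumes, hypothetically, that every entity has exactly `cluster_size` records and that the match function's per-pair recall and false positive rate stay fixed as the dataset grows. The function name and all parameters are invented for this example.

```python
from math import comb


def expected_pairwise_precision(n_records, cluster_size, recall, false_positive_rate):
    """Expected precision of a pairwise matcher applied to all record pairs.

    Illustrative assumptions (not from the paper): every entity has exactly
    `cluster_size` records, and the per-pair error rates do not change with
    dataset size.
    """
    if n_records % cluster_size != 0:
        raise ValueError("n_records must be a multiple of cluster_size")

    n_entities = n_records // cluster_size
    # Matching pairs grow linearly with the number of records.
    matching_pairs = n_entities * comb(cluster_size, 2)
    # Non-matching pairs grow quadratically with the number of records.
    non_matching_pairs = comb(n_records, 2) - matching_pairs

    true_positives = recall * matching_pairs
    false_positives = false_positive_rate * non_matching_pairs
    return true_positives / (true_positives + false_positives)


if __name__ == "__main__":
    for n in (100, 10_000, 1_000_000):
        p = expected_pairwise_precision(
            n, cluster_size=2, recall=0.95, false_positive_rate=0.001
        )
        print(f"{n:>9} records: expected precision = {p:.4f}")
```

Under these assumptions, the same match function yields lower expected precision as the number of records increases, because false positives scale with the quadratically growing pool of non-matching pairs.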