Abstract
Entity Resolution (ER) lies at the core of data integration, with the bulk of research focusing on its effectiveness and time efficiency. Most earlier works were designed to address Veracity over structured (relational) data, and they typically rely on schema, expert, and external knowledge to maximize accuracy. Some of these methods have recently been extended to process large volumes of data through massive parallelization techniques, such as the MapReduce paradigm. With the advent of Big Web Data, the focus has moved towards Variety, aiming to handle semi-structured data collections with noisy and highly heterogeneous information. Relevant works adopt a novel, loosely schema-aware functionality that emphasizes scalability and robustness to noise. Another line of current research focuses on Velocity, i.e., processing data collections of continuously increasing volume. In this tutorial, we present the ER generations by discussing past, present, and yet-to-come mechanis...