End-to-End Entity Resolution for Big Data: A Survey
2019, ArXiv
Abstract
One of the most important tasks for improving data quality and the reliability of data analytics results is Entity Resolution (ER). ER aims to identify different descriptions that refer to the same real-world entity, and remains a challenging problem. While previous works have studied specific aspects of ER (and mostly in traditional settings), in this survey, we provide for the first time an end-to-end view of modern ER workflows, and of the novel aspects of entity indexing and matching methods in order to cope with more than one of the Big Data characteristics simultaneously. We present the basic concepts, processing steps and execution strategies that have been proposed by different communities, i.e., database, semantic Web and machine learning, in order to cope with the loose structuredness, extreme diversity, high speed and large scale of entity descriptions used by real-world applications. Finally, we provide a synthetic discussion of the existing approaches, and conclude with...