Collective entity resolution in relational data

2007

https://doi.org/10.1145/1217299.1217304

Abstract

Many databases contain uncertain and imprecise references to real-world entities. The absence of identifiers for the underlying entities often results in a database which contains multiple references to the same entity. This can lead not only to data redundancy, but also to inaccuracies in query processing and knowledge extraction. These problems can be alleviated through the use of entity resolution. Entity resolution involves discovering the underlying entities and mapping each database reference to these entities.

Entity Resolution Queries: Formulation

Let us revisit the four example papers from Section 2.1:

  1. W. Wang, C. Chen, A. Ansari, "A mouse immunity model"
  2. W. Wang, A. Ansari, "A better mouse immunity model"
  3. L. Li, C. Chen, W. Wang, "Measuring protein-bound fluxetine"
  4. W. W. Wang, A. Ansari, "Autoimmunity in biliary cirrhosis"

Representing them in the notation introduced for the entity resolution problem in Section 2.2, we have 10 references {r_1, ..., r_10} in R, where r_1.Name = 'W Wang', etc. There are 4 hyper-edges {h_1, ..., h_4} in H for the four papers. According to the ground truth, we have six underlying entities. This is illustrated in Figure 2.1 using a different shading for each entity. For example, the 'Wang's of papers 1, 2 and 4 are the same individual, but the one from paper 3 is a different person. Also, the 'Chen's from papers 1 and 3 are different individuals. Then, the correct resolution for the example database with 10 references returns 6 entity clusters: {{r_1, r_4, r_9}, {r_8}, {r_2}, {r_7}, {r_3, r_5, r_10}, {r_6}}. The first two clusters correspond to 'Wang', the next two to 'Chen', the fifth to 'Ansari' and the last to 'Li'.

Instead of clustering all database references, in many applications, users are interested in just a few of the clusters. For example, we may want to retrieve all papers written by some person named 'W Wang'. I refer to this as an entity resolution query on 'W Wang', since answering it involves knowing the underlying entities. I will assume that queries are specified using R.Name, which is a noisy
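The four-paper example can be written out concretely. The sketch below is illustrative only: it hard-codes the ground-truth clusters stated in the text instead of computing them, and the `normalize` and `resolve_query` helpers are hypothetical names introduced here, not part of the original work. It shows what an entity resolution query on 'W Wang' is expected to return once the underlying entities are known: the papers grouped per entity, not one flat list of name matches.

```python
from collections import defaultdict

# The ten author references from the four example papers, in reading order.
references = {
    "r1": "W. Wang", "r2": "C. Chen", "r3": "A. Ansari",   # paper 1
    "r4": "W. Wang", "r5": "A. Ansari",                    # paper 2
    "r6": "L. Li", "r7": "C. Chen", "r8": "W. Wang",       # paper 3
    "r9": "W. W. Wang", "r10": "A. Ansari",                # paper 4
}

# One hyper-edge per paper, linking the references that co-occur on it.
hyper_edges = {
    "h1": ["r1", "r2", "r3"],
    "h2": ["r4", "r5"],
    "h3": ["r6", "r7", "r8"],
    "h4": ["r9", "r10"],
}

# Ground-truth entity clusters, copied from the text (not computed here).
clusters = [
    {"r1", "r4", "r9"}, {"r8"},   # two distinct 'Wang's
    {"r2"}, {"r7"},               # two distinct 'Chen's
    {"r3", "r5", "r10"},          # 'Ansari'
    {"r6"},                       # 'Li'
]


def normalize(name):
    """Hypothetical name matcher: first initial plus surname, e.g. 'W Wang'."""
    parts = name.replace(".", "").split()
    return f"{parts[0][0]} {parts[-1]}"


def resolve_query(query):
    """Return the papers of every reference matching `query`, grouped by entity."""
    paper_of = {r: h for h, refs in hyper_edges.items() for r in refs}
    answer = defaultdict(set)
    for idx, cluster in enumerate(clusters):
        for r in cluster:
            if normalize(references[r]) == query:
                answer[idx].add(paper_of[r])
    return [sorted(papers) for papers in answer.values()]


result = resolve_query("W Wang")
print(result)
```

A plain name lookup would return all four papers as one set; grouping by cluster separates the 'Wang' of papers 1, 2 and 4 from the 'Wang' of paper 3.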
Figure A.1: High-level description of synthetic data generation algorithm. Legible fragments of the listing: "Creation Stage"; "Repeat R times"; "Generate reference r using N(e.x, 1)"; "Generate reference r_j using N(e_j.x, 1)"; "Add r_j to hyper-edge h".
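The legible steps of Figure A.1 suggest a sampling loop along the following lines. This is a minimal sketch under stated assumptions, not the original algorithm: only the repeat-R-times loop, the N(e.x, 1) noise model for references, and the step of adding r_j to hyper-edge h come from the figure. How entities and their relationships are created, how e and e_j are chosen, and the `max_coauthors` cap are all assumptions made here so the example runs.

```python
import random


def generate_references(entities, neighbors, num_edges, max_coauthors=2, seed=0):
    """Sample hyper-edges of noisy references.

    entities:  dict of entity id -> numeric attribute x (assumed already created)
    neighbors: dict of entity id -> list of related entity ids (assumed)
    Each reference is a pair (true entity id, observed attribute value).
    """
    rng = random.Random(seed)
    hyper_edges = []
    for _ in range(num_edges):                    # "Repeat R times"
        e = rng.choice(sorted(entities))          # assumption: uniform choice of e
        # "Generate reference r using N(e.x, 1)"
        h = [(e, rng.gauss(entities[e], 1.0))]
        # Assumption: up to max_coauthors related entities join the hyper-edge.
        k = min(max_coauthors, len(neighbors[e]))
        for e_j in rng.sample(neighbors[e], k):
            # "Generate reference r_j using N(e_j.x, 1)", then add r_j to h
            h.append((e_j, rng.gauss(entities[e_j], 1.0)))
        hyper_edges.append(h)
    return hyper_edges


# Toy input: three entities with well-separated attribute values.
entities = {"e1": 0.0, "e2": 5.0, "e3": 10.0}
neighbors = {"e1": ["e2", "e3"], "e2": ["e1"], "e3": ["e1"]}
edges = generate_references(entities, neighbors, num_edges=4)
for h in edges:
    print(h)
```

Keeping the true entity id next to each noisy value is a convenience for scoring a resolver afterwards; a resolver under test would see only the observed values and the hyper-edge structure.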