A generic Web-based entity resolution framework
2011, Journal of the American Society for Information Science and Technology
https://doi.org/10.1002/ASI.21518
Abstract
Web data repositories usually contain references to thousands of real-world entities from multiple sources. It is not uncommon for multiple entities to share the same label (polysemes) or for distinct label variations to be associated with the same entity (synonyms), which frequently leads to ambiguous interpretations. Spelling variants, acronyms, abbreviated forms, and misspellings compound the problem. Solving it requires identifying which labels correspond to the same real-world entity, a process known as entity resolution. One approach to solving the entity resolution problem is to associate an authority identifier and a list of variant forms with each entity, a data structure known as an authority file. In this work, we propose a generic framework for implementing a method for generating authority files. Our method uses information from the Web to improve the quality of the authority file and, for that reason, is referred to as WER (Web-based Entity Resolution). Our contribution here is threefold: (a) we discuss how to implement the WER framework, which is flexible and easy to adapt to new domains; (b) we run extended experiments with our WER framework to show that it outperforms selected baselines; and (c) we compare the results of a specialized solution for author name resolution with those produced by the generic WER framework, and show that the WER results remain competitive.
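To make the authority file idea concrete, the following is a minimal Python sketch of the data structure as the abstract describes it: one authority identifier per entity, each with a list of variant forms. It is not the WER method itself, which the abstract does not detail. The class names, the identifier strings, and the simple normalization step (lowercasing and collapsing whitespace) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class AuthorityRecord:
    """One entity: an authority identifier plus its known variant forms."""
    authority_id: str
    variants: Set[str] = field(default_factory=set)


class AuthorityFile:
    """Hypothetical container that maps label variants to authority records."""

    def __init__(self) -> None:
        self._records: Dict[str, AuthorityRecord] = {}
        # Normalized label -> ids of every entity known under that label.
        self._index: Dict[str, Set[str]] = {}

    @staticmethod
    def _normalize(label: str) -> str:
        # Assumption: case folding and whitespace collapsing only.
        return " ".join(label.lower().split())

    def add_variant(self, authority_id: str, label: str) -> None:
        """Register a label as a variant form of the given entity."""
        record = self._records.setdefault(
            authority_id, AuthorityRecord(authority_id)
        )
        record.variants.add(label)
        self._index.setdefault(self._normalize(label), set()).add(authority_id)

    def resolve(self, label: str) -> Set[str]:
        """Return the ids of all entities that carry this label."""
        return set(self._index.get(self._normalize(label), set()))

    def variants_of(self, authority_id: str) -> Set[str]:
        """Return the variant forms recorded for one entity."""
        record = self._records.get(authority_id)
        return set(record.variants) if record else set()


# Synonyms: two labels, one entity (identifiers are made up).
authority_file = AuthorityFile()
authority_file.add_variant("venue:1", "JASIST")
authority_file.add_variant(
    "venue:1",
    "Journal of the American Society for Information Science and Technology",
)

# Polysemes: one label shared by two distinct entities.
authority_file.add_variant("author:1", "J. Smith")
authority_file.add_variant("author:2", "J. Smith")
```

In this sketch, a synonym resolves to a single identifier, while a polyseme resolves to several, which is exactly the ambiguity that an entity resolution method has to settle using additional evidence.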