Cost-effective on-demand associative author name disambiguation

Anderson Ferreira

doi:10.1016/J.IPM.2011.08.005

2012

https://doi.org/10.1016/J.IPM.2011.08.005

18 pages

Abstract

Authorship disambiguation is an urgent issue that affects the quality of digital library services, and supervised solutions proposed for it have delivered state-of-the-art effectiveness. However, three challenges may prevent such techniques from reaching their full potential: the prohibitive cost of labeling vast amounts of examples (there are many ambiguous authors), the huge hypothesis space (there are several features and authors from which many different disambiguation functions may be derived), and the skewed author popularity distribution (a few authors are very prolific, while most appear in only a few citations). In this article, we introduce an associative author name disambiguation approach that identifies authorship by extracting, from training examples, rules associating citation features (e.g., coauthor names, work title, publication venue) with specific authors. As our main contribution we propose three associative author name disambiguators: (1) EAND (Eager Associative Name Disambiguation), our basic method, which explores association rules for name disambiguation; (2) LAND (Lazy Associative Name Disambiguation), which extracts rules on a demand-driven basis at disambiguation time, reducing the hypothesis space by focusing on the examples that are most suitable for the task; and (3) SLAND (Self-Training LAND), which extends LAND with self-training capabilities, thus drastically reducing the number of examples required for building effective disambiguation functions, and which is also able to detect novel/unseen authors in the test set. Experiments demonstrate that all our disambiguators are effective and that, in particular, SLAND is able to outperform state-of-the-art supervised disambiguators, providing gains that range from 12% to more than 400%, making it both extremely effective and practical.
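The demand-driven idea described for LAND can be illustrated with a short sketch. This is not the authors' implementation: the function name `lazy_disambiguate`, the `max_rule_size` and `min_confidence` parameters, the feature strings, and the confidence-sum voting scheme are all assumptions made for illustration. The sketch projects the training citations onto the features of the one citation being disambiguated, derives rules of the form "feature set -> author" from that projection only, and lets each rule vote with its confidence. Returning no author when no rule fires loosely mirrors the idea of flagging a possibly unseen author; the self-training step of SLAND is not shown.

```python
from collections import defaultdict
from itertools import combinations


def lazy_disambiguate(training, features, max_rule_size=2, min_confidence=0.5):
    """Predict the author of one citation from on-demand association rules.

    training: list of (feature_set, author_label) pairs (hypothetical format).
    features: set of feature strings of the citation to disambiguate.
    Returns (author_label, normalized_score), or (None, 0.0) if no rule fires.
    """
    features = set(features)

    # Demand-driven projection: keep only the training features that also
    # occur in the citation being disambiguated, and drop empty examples.
    projected = [(set(f) & features, author) for f, author in training]
    projected = [(f, a) for f, a in projected if f]

    antecedent_count = defaultdict(int)  # occurrences of each feature set
    rule_count = defaultdict(int)        # occurrences together with an author
    for feats, author in projected:
        for size in range(1, max_rule_size + 1):
            for antecedent in combinations(sorted(feats), size):
                antecedent_count[antecedent] += 1
                rule_count[(antecedent, author)] += 1

    # Each rule "antecedent -> author" votes with its confidence, i.e. the
    # fraction of projected examples containing the antecedent that belong
    # to that author. (The voting scheme is an assumption of this sketch.)
    votes = defaultdict(float)
    for (antecedent, author), count in rule_count.items():
        confidence = count / antecedent_count[antecedent]
        if confidence >= min_confidence:
            votes[author] += confidence

    if not votes:
        return None, 0.0  # no applicable rule: possibly an unseen author

    best = max(votes, key=votes.get)
    return best, votes[best] / sum(votes.values())


# Hypothetical toy data: feature strings and author labels are made up.
training_examples = [
    ({"coauthor=c1", "venue=v1", "title=t1"}, "author_1"),
    ({"coauthor=c1", "venue=v1", "title=t2"}, "author_1"),
    ({"coauthor=c2", "venue=v2", "title=t3"}, "author_2"),
]

print(lazy_disambiguate(training_examples, {"coauthor=c1", "venue=v1", "title=t9"}))
print(lazy_disambiguate(training_examples, {"coauthor=c9"}))
```

Because rules are generated only from features present in the test citation, the number of candidate rules depends on the size of that citation rather than on the full feature vocabulary, which is the sense in which a lazy strategy narrows the hypothesis space.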
