Author name disambiguation for PubMed

Lana Yeganova

doi:10.1002/ASI.23063

2013, Journal of the Association for Information Science and Technology

https://doi.org/10.1002/ASI.23063

Abstract

Log analysis shows that PubMed users frequently use author names in queries for retrieving scientific literature. However, author name ambiguity may lead to irrelevant retrieval results. To improve the PubMed user experience with author name queries, we designed an author name disambiguation system consisting of similarity estimation and agglomerative clustering. A machine-learning method was employed to score the features for disambiguating a pair of papers with ambiguous names. These features enable the computation of pairwise similarity scores that estimate the probability that a pair of papers belongs to the same author. This probability drives an agglomerative clustering algorithm regulated by two factors: name compatibility and probability level. With transitivity violation correction, high-precision author clustering is achieved by focusing on minimizing false-positive pairing. Disambiguation performance is evaluated by manually verifying random samples of pairs from the clustering results. When compared with a state-of-the-art system, our evaluation shows that among all the pairs the lumping error rate drops from 10.1% to 2.2% for our system, while the splitting error rate rises from 1.8% to 7.7%. This results in an overall error rate of 9.9% (2.2% + 7.7%), compared with 11.9% (10.1% + 1.8%) for the state-of-the-art method. Other evaluations based on gold standard data also show an increase in the accuracy of our clustering. We attribute the performance improvement to the machine-learning method driven by a large-scale training set and to the clustering algorithm regulated by a name compatibility scheme that prefers precision. With integration of the author name disambiguation system into the PubMed search engine, the overall click-through rate of PubMed users on author name query results improved from 34.9% to 36.9%.
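The clustering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the name compatibility rule (same last name, one first-name string a prefix of the other), the average-linkage merge score, the 0.9 probability threshold, and all pairwise probabilities below are assumptions chosen for illustration. The learned similarity model and the transitivity violation correction are not shown.

```python
from itertools import combinations


def names_compatible(a, b):
    """Hypothetical rule: same last name, and one first-name string is a
    prefix of the other (so 'J' matches 'John', but 'John' does not match 'Jane')."""
    (last_a, first_a), (last_b, first_b) = a, b
    if last_a.lower() != last_b.lower():
        return False
    fa, fb = first_a.lower(), first_b.lower()
    return fa.startswith(fb) or fb.startswith(fa)


def cluster(papers, pair_prob, threshold=0.9):
    """Greedy agglomerative clustering.

    papers: dict paper_id -> (last_name, first_name)
    pair_prob: dict frozenset({i, j}) -> assumed same-author probability
    Two clusters may merge only if every cross pair of names is compatible
    and their average pairwise probability is at least `threshold`.
    """
    clusters = [{pid} for pid in papers]
    while True:
        best, best_score = None, threshold
        for x, y in combinations(range(len(clusters)), 2):
            cx, cy = clusters[x], clusters[y]
            # Factor 1: name compatibility across the two clusters.
            if not all(names_compatible(papers[i], papers[j])
                       for i in cx for j in cy):
                continue
            # Factor 2: probability level (average linkage, an assumption).
            total = sum(pair_prob.get(frozenset((i, j)), 0.0)
                        for i in cx for j in cy)
            score = total / (len(cx) * len(cy))
            if score >= best_score:
                best, best_score = (x, y), score
        if best is None:
            return clusters
        x, y = best  # x < y, so deleting index y leaves x valid
        clusters[x] |= clusters[y]
        del clusters[y]


# Toy data: all values are made up for illustration.
papers = {1: ("Smith", "J"), 2: ("Smith", "John"),
          3: ("Smith", "Jane"), 4: ("Smith", "J")}
pair_prob = {
    frozenset((1, 2)): 0.97, frozenset((1, 4)): 0.95,
    frozenset((2, 4)): 0.92, frozenset((1, 3)): 0.20,
    frozenset((3, 4)): 0.30, frozenset((2, 3)): 0.99,
}
result = sorted(sorted(c) for c in cluster(papers, pair_prob))
print(result)
```

In this toy run, papers 2 and 3 are never merged despite their high assumed probability, because "John" and "Jane" fail the compatibility check. This mirrors the precision-first design the abstract describes, in which avoiding false-positive pairing takes priority over avoiding splits.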

This book deals with a hard problem that is inherent to human language: ambiguity. In particular, we focus on author name ambiguity, a type of ambiguity found in digital bibliographic repositories, which occurs when an author publishes works under distinct names or when distinct authors publish works under similar names. The problem has a number of causes, including the lack of standards and common practices and the decentralized generation of bibliographic content. As a consequence, the quality of the main services of digital bibliographic repositories, such as search, browsing, and recommendation, may be severely affected by author name ambiguity. The book focuses on automatic methods, since manual solutions do not scale to the size of current repositories or the speed at which they are updated. Accordingly, we provide a broad view of the problem of automatic disambiguation of author names, summarizing the results of more than a decade of research on this topic conducted by our group, reported in more than a dozen publications that have received over 900 citations so far, according to Google Scholar.

We start by discussing the motivation for the problem (Chapter 1). Next, we formally define the author name disambiguation task (Chapter 2) and use this formalization to provide a brief, taxonomically organized overview of the literature on the topic (Chapter 3). We then organize, summarize, and integrate our own group's efforts to develop solutions that produced state-of-the-art disambiguation quality at the time they were proposed. Chapter 4 covers HHC (Heuristic-based Clustering), an author name disambiguation method based on two specific real-world assumptions regarding scientific authorship. Chapter 5 describes SAND (Self-training Author Name Disambiguator), and Chapter 6 presents two incremental author name disambiguation methods, namely INDi (Incremental Unsupervised Name Disambiguation) and INC (Incremental Nearest Cluster). Finally, Chapter 7 provides an overview of recent author name disambiguation methods that take new approaches, such as graph-based representations, alternative predefined similarity functions, visualization facilities, and artificial neural networks. The chapters are followed by three appendices that cover, respectively: (i) a pattern matching function for comparing proper names, used by some of the methods addressed in this book; (ii) a tool for generating synthetic collections of citation records for distinct experimental tasks; and (iii) a number of datasets commonly used to evaluate author name disambiguation methods. In summary, the book organizes a large body of knowledge and work from the last decade in the area of author name disambiguation, hoping to consolidate a solid basis for future developments in the field.
