Author name disambiguation for PubMed

2013, Journal of the Association for Information Science and Technology

https://doi.org/10.1002/ASI.23063

Abstract

Log analysis shows that PubMed users frequently use author names in queries for retrieving scientific literature. However, author name ambiguity may lead to irrelevant retrieval results. To improve the PubMed user experience with author name queries, we designed an author name disambiguation system consisting of similarity estimation and agglomerative clustering. A machine-learning method was employed to score the features for disambiguating a pair of papers with ambiguous names. These features enable the computation of pairwise similarity scores to estimate the probability of a pair of papers belonging to the same author, which drives an agglomerative clustering algorithm regulated by two factors: name compatibility and probability level. With transitivity violation correction, high-precision author clustering is achieved by focusing on minimizing false-positive pairing. Disambiguation performance is evaluated with manual verification of random samples of pairs from clustering results. When compared with a state-of-the-art system, our evaluation shows that among all the pairs the lumping error rate drops from 10.1% to 2.2% for our system, while the splitting error rate rises from 1.8% to 7.7%. This results in an overall error rate of 9.9%, compared with 11.9% for the state-of-the-art method. Other evaluations based on gold-standard data also show an increase in the accuracy of our clustering. We attribute the performance improvement to the machine-learning method driven by a large-scale training set and the clustering algorithm regulated by a name compatibility scheme preferring precision. With integration of the author name disambiguation system into the PubMed search engine, the overall click-through rate of PubMed users on author name query results improved from 34.9% to 36.9%.
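The clustering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the name compatibility rule (same last name, one first-name string a prefix of the other), the complete-linkage merge score (minimum pairwise probability across two clusters), the 0.9 threshold, and all function and variable names are assumptions chosen for illustration. The pairwise probabilities are made-up inputs standing in for the output of the similarity estimation step, and transitivity violation correction is not modeled.

```python
from itertools import combinations


def names_compatible(a, b):
    """Hypothetical rule: same last name, and one first-name string is a prefix of the other."""
    last_a, first_a = a
    last_b, first_b = b
    if last_a != last_b:
        return False
    return first_a.startswith(first_b) or first_b.startswith(first_a)


def cluster(papers, prob, threshold=0.9):
    """Greedy agglomerative clustering gated by name compatibility and a probability level.

    papers: dict mapping paper id -> (last_name, first_name_or_initial)
    prob:   dict mapping frozenset({i, j}) -> estimated same-author probability
    """
    clusters = [{pid} for pid in papers]
    while True:
        best_score, best_pair = threshold, None
        for c1, c2 in combinations(clusters, 2):
            # Gate 1 (assumed): every cross-cluster pair of names must be compatible.
            if not all(names_compatible(papers[i], papers[j]) for i in c1 for j in c2):
                continue
            # Gate 2 (assumed): complete linkage, i.e. the weakest cross-cluster pair
            # must still reach the probability level. This favors precision.
            score = min(prob.get(frozenset((i, j)), 0.0) for i in c1 for j in c2)
            if score >= best_score:
                best_score, best_pair = score, (c1, c2)
        if best_pair is None:
            # No remaining merge passes both gates.
            return sorted(sorted(c) for c in clusters)
        c1, c2 = best_pair
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 | c2)


# Toy input: four papers signed by "Smith" with varying first-name information.
papers = {
    1: ("smith", "j"),
    2: ("smith", "john"),
    3: ("smith", "jane"),
    4: ("smith", "j"),
}
prob = {
    frozenset((1, 2)): 0.95,
    frozenset((1, 3)): 0.40,
    frozenset((1, 4)): 0.92,
    frozenset((2, 3)): 0.10,
    frozenset((2, 4)): 0.93,
    frozenset((3, 4)): 0.30,
}
result = cluster(papers, prob)
print(result)
```

In this toy run, paper 3 stays in its own cluster: its probabilities fall below the threshold, and once papers 1 and 2 are merged, "jane" is incompatible with "john". Requiring both gates before any merge is one way to bias the result toward fewer lumping errors at the cost of more splitting errors, which is the trade-off the abstract reports.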
