Search engine driven author disambiguation

Yee Fan Tan; Min-Yen Kan; Dongwon Lee

doi:10.1145/1141753.1141826

2006, Proceedings of the 6th ACM/IEEE-CS joint …

Abstract

In scholarly digital libraries, author disambiguation is an important task that attributes a scholarly work to specific authors. This is critical when individuals share the same name. We present an approach to this task that analyzes the results of automatically crafted web searches. A key observation is that pages from rare web sites are a stronger source of evidence than pages from common web sites, which we model as Inverse Host Frequency (IHF). Our system achieves an average accuracy of 0.836.
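
The abstract names Inverse Host Frequency but does not give its formula. The sketch below is one plausible reading, assuming IHF is defined by analogy with inverse document frequency: a host that appears in few search results gets a high weight, and a host that appears in many gets a low one. The function names, the logarithmic form, and the example URLs are all illustrative assumptions, not the authors' implementation.

```python
import math
from collections import Counter
from urllib.parse import urlparse


def host_of(url: str) -> str:
    """Return the lowercase host of a URL, or an empty string if it has none."""
    return (urlparse(url).hostname or "").lower()


def inverse_host_frequency(result_urls):
    """Map each host to log(N / count), an IDF-style weight (assumed form).

    N is the number of result URLs that have a host; count is how many
    of those URLs come from the given host.
    """
    counts = Counter(h for h in map(host_of, result_urls) if h)
    total = sum(counts.values())
    return {host: math.log(total / n) for host, n in counts.items()}


# Hypothetical search results: three pages from one large host, one from a rare host.
urls = [
    "http://bigindex.example/paper/1",
    "http://bigindex.example/paper/2",
    "http://bigindex.example/paper/3",
    "http://lab.example/~author/pubs.html",
]
ihf = inverse_host_frequency(urls)
```

Under this assumed definition, the rare host `lab.example` receives a larger weight than `bigindex.example`, which matches the abstract's observation that pages from rare web sites are stronger evidence.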

Seok-Hyoung Lee

International Journal of Software Engineering and Its Applications, 2016

When using search engine services to search for scholarly articles, obtaining quick and accurate search results from a huge set of scholarly information is always important. However, most domestic and foreign search engine services for scholarly articles return a broad range of results for a query on a researcher's name. Such results lower search precision and require users to spend time and effort verifying the results and finding the necessary information. This problem is called "author ambiguity", and solving it is called "author disambiguation." An author disambiguation method assigns records that share an author name to the actual persons they refer to. By resolving author ambiguity, better search results can be obtained; this increases the recall rate and accuracy when searching for scholarly articles. To resolve author ambiguity in this paper, we expand the co-author network and identify the author using the co-author network information and basic bibliographic information as features for a Support Vector Machine classifier. To examine the effectiveness of the proposed method, we test it on 92,100 IT-related scholarly records generated in Korea. Author disambiguation through the expansion of the co-author network achieves an F-1 measure of 94.79%. The result confirms that author disambiguation through the co-author network is effective.

A brief survey of automatic methods for author name disambiguation

Marcos Goncalves

ACM SIGMOD Record, 2012

Name ambiguity in the context of bibliographic citation records is a hard problem that affects the quality of services and content in digital libraries and similar systems. The challenges of dealing with author name ambiguity have led to a myriad of disambiguation methods. Generally speaking, the proposed methods usually attempt to group citation records of a same author by finding some similarity among them or try to directly assign them to their respective authors. Both approaches may either exploit supervised or unsupervised techniques. In this article, we propose a taxonomy for characterizing the current author name disambiguation methods described in the literature, present a brief survey of the most representative ones and discuss several open challenges.

Towards a New Paradigm for Author Name Disambiguation

Tehmina Amjad

IEEE Access

Author Name Disambiguation (AND) has emerged as a significant challenge in the bibliometric context with the growing volume of scientific literature. When citations written by different authors have the same names (polysemy or homonym names), and when an author has different names, there is ambiguity (synonyms or name variants). It is difficult to associate a citation with the correct author. Polysemy and synonyms cause merging and splitting anomalies in the citations. These anomalies affect the quantification of an author's productivity (bibliometric analysis) and the reliability and quality of the information retrieved. Many techniques for AND have been proposed in the literature; most of them do not go beyond string matching or text matching. Most do not consider the context or semantics of the terms used in the citations. The AND problem is resolved semantically in this paper using the deep learning technique on the PubMed dataset. The experimental results show that the proposed method achieves overall (11.72%, 12.5%, and 12.1%) higher precision, recall, and f-measure than the pairwise class classification.

Disambiguation of People in Web Search Using a Knowledge Base

Tomonari Masada

Research, Innovation and …, 2007

Results of queries by personal names often contain documents related to several people because of the namesake problem. In order to differentiate documents related to different people, an effective method is needed to measure document similarities and to find documents related to the same person. Some previous researchers have used the vector space model or have tried to extract common named entities for measuring similarities. We propose a new method that uses Web directories as a knowledge base to find shared contexts in document pairs and uses the measurement of shared contexts to determine similarities between document pairs. Experimental results show that our proposed method outperforms the vector space model method and the named entity recognition method.

Author Name Disambiguation in Bibliographic Databases: A Survey

Tehmina Amjad

2020

Entity resolution has been a challenging and active research area in the field of Information Systems over the last decade. Author Name Disambiguation (AND) in Bibliographic Databases (BD) like DBLP, Citeseer, and Scopus is a specialized field of entity resolution. Given many citations of underlying authors, the AND task is to find which citations belong to the same author. In this survey, we start with three basic AND problems, followed by the need for a solution and the challenges involved. A generic, five-step framework is provided for handling AND issues. These steps are: (1) preparation of the dataset, (2) selection of publication attributes, (3) selection of similarity metrics, (4) selection of models, and (5) clustering performance evaluation. Categorization and elaboration of similarity metrics and methods are also provided. Finally, future directions and recommendations are given for this dynamic area of research.

Improving Author Name Disambiguation with User Relevance Feedback

Marcos Goncalves

2012

Author name ambiguity in the context of bibliographic citations is a very hard problem. It occurs when there are citation records of the same author under distinct names or when there exist citation records belonging to distinct authors with very similar names. Among the several methods proposed in the literature, the most effective ones are those that perform a direct assignment of the records to their respective authors by applying supervised machine learning techniques. However, those methods usually need large amounts of labeled training examples to properly disambiguate the author names. To deal with this issue, in previous work, we have proposed a method that automatically obtains and labels the training examples, showing competitive performance compared to representative author name disambiguation methods. In this work, we propose to improve our previous method by exploiting user relevance feedback. In more detail, we select a very small portion of the citation records for which our method was most unsure about the correct authorship and ask the administrators to label them. This feedback is then used to improve the effectiveness of the whole process. In our experimental evaluation, we observed that with a very small labeling effort (usually around 5% of the records), the overall disambiguation effectiveness improves by almost 10% on average, with gains of up to 61% in some of the largest ambiguous groups.

Exploring Graph Based Approaches for Author Name Disambiguation

Chetanya Rastogi

2019

In many applications, such as scientific literature management, researcher search, and social network analysis, name disambiguation (aiming at disambiguating WhoIsWho) has been a challenging problem. In addition, the growth of scientific literature makes the problem more difficult and urgent. Although name disambiguation has been extensively studied in academia and industry, the problem has not been solved well due to the clutter of data and the complexity of the same-name scenario. In this work, we aim to explore models that can perform the task of name disambiguation using the network structure that is intrinsic to the problem and present an analysis of the models.

Who's who in the world wide web: Approaches to name disambiguation

Vanessa Klaas

INSTITUT FUR INFORMATIK. der Ludwig-Maximilians- …, 2007

Whois? Deep Author Name Disambiguation Using Bibliographic Data

Nagaraj Asundi

Lecture Notes in Computer Science, 2022

As the number of authors is increasing exponentially over years, the number of authors sharing the same names is increasing proportionally. This makes it challenging to assign newly published papers to their adequate authors. Therefore, Author Name Ambiguity (ANA) is considered a critical open problem in digital libraries. This paper proposes an Author Name Disambiguation (AND) approach that links author names to their real-world entities by leveraging their co-authors and domain of research. To this end, we use a collection from the DBLP repository that contains more than 5 million bibliographic records authored by around 2.6 million co-authors. Our approach first groups authors who share the same last names and same first name initials. The author within each group is identified by capturing the relation with his/her co-authors and area of research, which is represented by the titles of the validated publications of the corresponding author. To this end, we train a neural network model that learns from the representations of the co-authors and titles. We validated the effectiveness of our approach by conducting extensive experiments on a large dataset.

Detecting Ambiguous Author Names in Crowdsourced Scholarly Data

Jasleen Kaur

2011 IEEE Third Int'l Conference on Privacy, Security, Risk and Trust and 2011 IEEE Third Int'l Conference on Social Computing, 2011

The name ambiguity problem is a challenge in many areas, especially in the field of bibliographic digital libraries. For example, in services that use citation data to compute the impact of authors, ambiguous names lead to biased measures. The problem is amplified where names are collected from heterogeneous sources, including crowdsourced annotations. This is the case in the Scholarometer system, which cross-correlates author names in user queries with those retrieved from bibliographic data. The uncontrolled nature of user-generated annotations is very valuable, but creates the need to detect ambiguous names. In this paper, we propose an approach to detect ambiguous names at query time, which makes it applicable in the context of a social computing application. We explore two kinds of heuristic features based on citations and crowdsourced topics. Our approach can detect ambiguous author names in crowdsourced scholarly data with an accuracy of 75%.

Jan-Ming Ho

2008

Today, bibliographic digital libraries play an important role in helping members of the academic community search for novel research. In particular, author disambiguation for citations is a major problem during the data integration and cleaning process, since author names are usually very ambiguous. To solve this problem, we propose two kinds of correlations between citations, namely Topic Correlation and Web Correlation, to exploit relationships between citations in order to identify whether two citations with the same author name refer to the same individual. The topic correlation measures the similarity between the research topics of two citations, while the Web correlation measures the number of co-occurrences in web pages. We employ a pair-wise grouping algorithm to group citations into clusters. The experimental results show that disambiguation accuracy improves greatly when using topic correlation and Web correlation, and that Web correlation provides stronger evidence about the authors of citations.

S2AND: A Benchmark and Evaluation System for Author Name Disambiguation

rangga restu prayogo

2021 ACM/IEEE Joint Conference on Digital Libraries (JCDL), 2021

Author Name Disambiguation (AND) is the task of resolving which author mentions in a bibliographic database refer to the same real-world person, and is a critical ingredient of digital library applications such as search and citation analysis. While many AND algorithms have been proposed, comparing them is difficult because they often employ distinct features and are evaluated on different datasets. In response to this challenge, we present S2AND, a unified benchmark dataset for AND on scholarly papers, as well as an open-source reference model implementation. Our dataset harmonizes eight disparate AND datasets into a uniform format, with a single rich feature set drawn from the Semantic Scholar (S2) database. Our evaluation suite for S2AND reports performance split by facets like publication year and number of papers, allowing researchers to track both global performance and measures of fairness across facet values. Our experiments show that because previous datasets tend to cover idiosyncratic and biased slices of the literature, algorithms trained to perform well on one of them may generalize poorly to others. By contrast, we show how training on a union of datasets in S2AND results in more robust models that perform well even on datasets unseen in training. The resulting AND model also substantially improves over the production algorithm in S2, reducing error by over 50% in terms of B³ F1. We release our unified dataset, model code, trained models, and evaluation suite to the research community. Index Terms: digital libraries, author name disambiguation, out-of-domain evaluation.

Effective self-training author name disambiguation in scholarly digital libraries

Adriano Veloso

2010

Name ambiguity in the context of bibliographic citation records is a hard problem that affects the quality of services and content in digital libraries and similar systems. Supervised methods that exploit training examples in order to distinguish ambiguous author names are among the most effective solutions for the problem, but they require skilled human annotators in a laborious and continuous process of manually labeling citations in order to provide enough training examples. Thus, addressing the issues of (i) automatic acquisition of examples and (ii) highly effective disambiguation even when only few examples are available, are the need of the hour for such systems. In this paper, we propose a novel two-step disambiguation method, SAND (Self-training Associative Name Disambiguator), that deals with these two issues. The first step eliminates the need of any manual labeling effort by automatically acquiring examples using a clustering method that groups citation records based on the similarity among coauthor names. The second step uses a supervised disambiguation method that is able to detect unseen authors not included in any of the given training examples. Experiments conducted with standard public collections, using the minimum set of attributes present in a citation (i.e., author names, work title and publication venue), demonstrated that our proposed method outperforms representative unsupervised disambiguation methods that exploit similarities between citation records and is as effective as, and in some cases superior to, supervised ones, without manually labeling any training example.

Using web information for author name disambiguation

Denilson A Pereira

ACM/IEEE Joint Conference on Digital Libraries, 2009

In digital libraries, ambiguous author names may occur due to the existence of multiple authors with the same name (polysemes) or different name variations for the same author (synonyms). We proposed here a new method that uses information available on the Web to deal with both problems at the same time. Our idea consists of gathering information from input citations

Automatic Disambiguation of Author Names in Bibliographic Repositories

Alberto Laender

Synthesis lectures on information concepts, retrieval, and services, 2020

This book deals with a hard problem that is inherent to human language: ambiguity. In particular, we focus on author name ambiguity, a type of ambiguity that exists in digital bibliographic repositories, which occurs when an author publishes works under distinct names or distinct authors publish works under similar names. This problem may be caused by a number of reasons, including the lack of standards and common practices, and the decentralized generation of bibliographic content. As a consequence, the quality of the main services of digital bibliographic repositories such as search, browsing, and recommendation may be severely affected by author name ambiguity. The focal point of the book is on automatic methods, since manual solutions do not scale to the size of the current repositories or the speed in which they are updated. Accordingly, we provide an ample view on the problem of automatic disambiguation of author names, summarizing the results of more than a decade of research on this topic conducted by our group, which were reported in more than a dozen publications that received over 900 citations so far, according to Google Scholar. We start by discussing its motivational issues (Chapter 1). Next, we formally define the author name disambiguation task (Chapter 2) and use this formalization to provide a brief, taxonomically organized, overview of the literature on the topic (Chapter 3). We then organize, summarize and integrate the efforts of our own group on developing solutions for the problem that have historically produced state-of-the-art (by the time of their proposals) results in terms of the quality of the disambiguation results. Thus, Chapter 4 covers HHC (Heuristic-based Clustering), an author name disambiguation method that is based on two specific real-world assumptions regarding scientific authorship. Then, Chapter 5 describes SAND (Self-training Author Name Disambiguator) and Chapter 6 presents two incremental author name disambiguation methods, namely INDi (Incremental Unsupervised Name Disambiguation) and INC (Incremental Nearest Cluster). Finally, Chapter 7 provides an overview of recent author name disambiguation methods that address new specific approaches such as graph-based representations, alternative predefined similarity functions, visualization facilities and approaches based on artificial neural networks. The chapters are followed by three appendices that cover, respectively: (i) a pattern matching function for comparing proper names and used by some of the methods addressed in this book; (ii) a tool for generating synthetic collections of citation records for distinct experimental tasks; and (iii) a number of datasets commonly used to evaluate author name disambiguation methods. In summary, the book organizes a large body of knowledge and work in the area of author name disambiguation in the last decade, hoping to consolidate a solid basis for future developments in the field.

A Multi-match Approach to the Author Uncertainty Problem

Alan Porter

Journal of Data and Information Science

Purpose: The ability to identify the scholarship of individual authors is essential for performance evaluation. A number of factors hinder this endeavor. Common and similarly spelled surnames make it difficult to isolate the scholarship of individual authors indexed on large databases. Variations in name spelling of individual scholars further complicate matters. Common family names in scientific powerhouses like China make it problematic to distinguish between authors possessing ubiquitous and/or anglicized surnames (as well as the same or similar first names). The assignment of unique author identifiers provides a major step toward resolving these difficulties. We maintain, however, that in and of themselves, author identifiers are not sufficient to fully address the author uncertainty problem. In this study we build on the author identifier approach by considering commonalities in fielded data between authors containing the same surname and first initial of their first name. We i...

Author disambiguation using multi-aspect similarity indicators

Edwin Horlings

Scientometrics

Key to accurate bibliometric analyses is the ability to correctly link individuals to their corpus of work, with an optimal balance between precision and recall. We have developed an algorithm that does this disambiguation task with a very high recall and precision. The method addresses the issues of discarded records due to null data fields and their resultant effect on recall, precision and F-measure results. We have implemented a dynamic approach to similarity calculations based on all available data fields. We have also included differences in author contribution and age difference between publications, both of which have meaningful effects on overall similarity measurements, resulting in significantly higher recall and precision of returned records. The results are presented from a test dataset of heterogeneous catalysis publications. Results demonstrate significantly high average F-measure scores and substantial improvements on previous and stand-alone techniques.

A tool for generating synthetic authorship records for evaluating author name disambiguation methods

Alberto Laender, Jussara Almeida

Information Sciences, 2012

The author name disambiguation task has to deal with uncertainties related to the possible many-to-many correspondences between ambiguous names and unique authors. Despite the variety of name disambiguation methods available in the literature to solve the problem, most of them are rarely compared against each other. Moreover, they are often evaluated without considering a time evolving digital library, susceptible to dynamic (and therefore challenging) patterns such as the introduction of new authors and the change of researchers' interests over time. In order to facilitate the evaluation of name disambiguation methods in various realistic scenarios and under controlled conditions, in this article we propose SyGAR, a new Synthetic Generator of Authorship Records that generates citation records based on author profiles. SyGAR can be used to generate successive loads of citation records simulating a living digital library that evolves according to various publication patterns. We validate SyGAR by comparing the results produced by three representative name disambiguation methods on real as well as synthetically generated collections of citation records. We also demonstrate its applicability by evaluating those methods on a time evolving digital library collection generated with the tool, considering several dynamic and realistic scenarios.

Author name disambiguation for PubMed

Lana Yeganova

Journal of the Association for Information Science and Technology, 2013

Log analysis shows that PubMed users frequently use author names in queries for retrieving scientific literature. However, author name ambiguity may lead to irrelevant retrieval results. To improve the PubMed user experience with author name queries, we designed an author name disambiguation system consisting of similarity estimation and agglomerative clustering. A machine-learning method was employed to score the features for disambiguating a pair of papers with ambiguous names. These features enable the computation of pairwise similarity scores to estimate the probability of a pair of papers belonging to the same author, which drives an agglomerative clustering algorithm regulated by 2 factors: name compatibility and probability level. With transitivity violation correction, high precision author clustering is achieved by focusing on minimizing false-positive pairing. Disambiguation performance is evaluated with manual verification of random samples of pairs from clustering results. When compared with a state-of-the-art system, our evaluation shows that among all the pairs the lumping error rate drops from 10.1% to 2.2% for our system, while the splitting error rises from 1.8% to 7.7%. This results in an overall error rate of 9.9%, compared with 11.9% for the state-of-the-art method. Other evaluations based on gold standard data also show the increase in accuracy of our clustering. We attribute the performance improvement to the machine-learning method driven by a large-scale training set and the clustering algorithm regulated by a name compatibility scheme preferring precision. With integration of the author name disambiguation system into the PubMed search engine, the overall click-through-rate of PubMed users on author name query results improved from 34.9% to 36.9%.

Self-training author name disambiguation for information scarce scenarios

Alberto Laender

Journal of the Association for Information Science and Technology, 2014

We present a novel 3-step self-training method for author name disambiguation, SAND (self-training associative name disambiguator), which requires no manual labeling, no parameterization (in real-world scenarios) and is particularly suitable for the common situation in which only the most basic information about a citation record is available (i.e., author names, and work and venue titles). During the first step, real-world heuristics on coauthors are able to produce highly pure (although fragmented) clusters. The most representative of these clusters are then selected to serve as training data for the third supervised author assignment step. The third step exploits a state-of-the-art transductive disambiguation method capable of detecting unseen authors not included in any training example and incorporating reliable predictions to the training data. Experiments conducted with standard public collections, using the minimum set of attributes present in a citation, demonstrate that our proposed method outperforms all representative unsupervised author grouping disambiguation methods and is very competitive with fully supervised author assignment methods. Thus, different from other bootstrapping methods that explore privileged, hard to obtain information such as self-citations and personal information, our proposed method produces top-notch performance with no (manual) training data or parameterization and in the presence of scarce information.

Cited by

A generic Web-based entity resolution framework

Denilson A Pereira

Journal of the American Society for Information Science and Technology, 2011

Web data repositories usually contain references to thousands of real-world entities from multiple sources. It is not uncommon that multiple entities share the same label (polysemes) and that distinct label variations are associated with the same entity (synonyms), which frequently leads to ambiguous interpretations. Further, spelling variants, acronyms, abbreviated forms, and misspellings compound to worsen the problem. Solving this problem requires identifying which labels correspond to the same real-world entity, a process known as entity resolution. One approach to solve the entity resolution problem is to associate an authority identifier and a list of variant forms with each entity, a data structure known as an authority file. In this work, we propose a generic framework for implementing a method for generating authority files. Our method uses information from the Web to improve the quality of the authority file and, because of that, is referred to as WER (Web-based Entity Resolution). Our contribution here is threefold: (a) we discuss how to implement the WER framework, which is flexible and easy to adapt to new domains; (b) we run extended experimentation with our WER framework to show that it outperforms selected baselines; and (c) we compare the results of a specialized solution for author name resolution with those produced by the generic WER framework, and show that the WER results remain competitive.

Name disambiguation from link data in a collaboration graph using temporal and topological features

Mohammad Al Hasan

Social Network Analysis and Mining, 2015

In a social community, multiple persons may share the same name, phone number or some other identifying attributes. This, along with other phenomena such as name abbreviation, name misspelling, and human error, leads to erroneous aggregation of records of multiple persons under a single reference. Such mistakes affect the performance of document retrieval, web search, database integration, and more importantly, lead to improper attribution of credit (or blame). The task of entity disambiguation partitions the records belonging to multiple persons with the objective that each resulting partition is composed of records of a unique person. Existing solutions to this task use either biographical attributes, or auxiliary features that are collected from external sources, such as Wikipedia. However, for many scenarios, such auxiliary features are not available, or they are costly to obtain. Besides, collecting biographical or external data carries the risk of privacy violation. In this work, we propose a method for solving the entity disambiguation task from link information obtained from a collaboration network. Our method is non-intrusive of privacy as it uses only the time-stamped graph topology of an anonymized network. Experimental results on two real-life academic collaboration networks show that the proposed method has satisfactory performance.
