KDD Cup 2013

Benjamin Solecki; Lucas Silva; Dmitry Efimov

doi:10.1145/2517288.2517297



2013, Proceedings of the 2013 KDD Cup 2013 Workshop

https://doi.org/10.1145/2517288.2517297


3 pages


Abstract

This paper describes our team's (BS Man & Dmitry & Leustagos) approach to the KDD Cup 2013 track 2 challenge: Author Disambiguation in the Microsoft Academic Search database.

Anderson Ferreira

2012

Authorship disambiguation is an urgent issue that affects the quality of digital library services and for which supervised solutions have been proposed, delivering state-of-the-art effectiveness. However, particular challenges such as the prohibitive cost of labeling vast amounts of examples (there are many ambiguous authors), the huge hypothesis space (there are several features and authors from which many different disambiguation functions may be derived), and the skewed author popularity distribution (few authors are very prolific, while most appear in only a few citations), may prevent such techniques from reaching their full potential. In this article, we introduce an associative author name disambiguation approach that identifies authorship by extracting, from training examples, rules associating citation features (e.g., coauthor names, work title, publication venue) to specific authors. As our main contribution we propose three associative author name disambiguators: (1) EAND (Eager Associative Name Disambiguation), our basic method that explores association rules for name disambiguation; (2) LAND (Lazy Associative Name Disambiguation), that extracts rules on a demand-driven basis at disambiguation time, reducing the hypothesis space by focusing on examples that are most suitable for the task; and (3) SLAND (Self-Training LAND), that extends LAND with self-training capabilities, thus drastically reducing the amount of examples required for building effective disambiguation functions, besides being able to detect novel/unseen authors in the test set. Experiments demonstrate that all our disambiguators are effective and that, in particular, SLAND is able to outperform state-of-the-art supervised disambiguators, providing gains that range from 12% to more than 400%, being extremely effective and practical.


A relevance feedback approach for the author name disambiguation problem

Ariadne Carvalho, Anderson Ferreira

Proceedings of the 13th ACM/IEEE-CS joint conference on Digital libraries - JCDL '13, 2013

This paper presents a new name disambiguation method that exploits user feedback on ambiguous references across iterations. An unsupervised step is used to define pure training samples, and a hybrid supervised step is employed to learn a classification model for assigning references to authors. Our classification scheme combines the Optimum-Path Forest (OPF) classifier with complex reference similarity functions generated by a Genetic Programming framework. Experiments demonstrate that the proposed method yields better results than state-of-the-art disambiguation methods on two traditional datasets.


Self-training author name disambiguation for information scarce scenarios

Alberto Laender

Journal of the Association for Information Science and Technology, 2014

We present a novel 3-step self-training method for author name disambiguation, SAND (self-training associative name disambiguator), which requires no manual labeling, no parameterization (in real-world scenarios) and is particularly suitable for the common situation in which only the most basic information about a citation record is available (i.e., author names, and work and venue titles). During the first step, real-world heuristics on coauthors are able to produce highly pure (although fragmented) clusters. The most representative of these clusters are then selected to serve as training data for the third supervised author assignment step. The third step exploits a state-of-the-art transductive disambiguation method capable of detecting unseen authors not included in any training example and incorporating reliable predictions to the training data. Experiments conducted with standard public collections, using the minimum set of attributes present in a citation, demonstrate that our proposed method outperforms all representative unsupervised author grouping disambiguation methods and is very competitive with fully supervised author assignment methods. Thus, different from other bootstrapping methods that explore privileged, hard to obtain information such as self-citations and personal information, our proposed method produces top-notch performance with no (manual) training data or parameterization and in the presence of scarce information.


A brief survey of automatic methods for author name disambiguation

Marcos Goncalves

ACM SIGMOD Record, 2012

Name ambiguity in the context of bibliographic citation records is a hard problem that affects the quality of services and content in digital libraries and similar systems. The challenges of dealing with author name ambiguity have led to a myriad of disambiguation methods. Generally speaking, the proposed methods usually attempt to group citation records of a same author by finding some similarity among them or try to directly assign them to their respective authors. Both approaches may either exploit supervised or unsupervised techniques. In this article, we propose a taxonomy for characterizing the current author name disambiguation methods described in the literature, present a brief survey of the most representative ones and discuss several open challenges.


Improving Author Name Disambiguation with User Relevance Feedback

Marcos Goncalves

2012

Author name ambiguity in the context of bibliographic citations is a very hard problem. It occurs when there are citation records of the same author under distinct names or when there exist citation records belonging to distinct authors with very similar names. Among the several methods proposed in the literature, the most effective ones are those that perform a direct assignment of the records to their respective authors by means of the application of supervised machine learning techniques. However, those methods usually need large amounts of labeled training examples to properly disambiguate the author names. To deal with this issue, in previous work, we have proposed a method that automatically obtains and labels the training examples, showing competitive performance compared to representative author name disambiguation methods. In this work, we propose to improve our previous method by exploiting user relevance feedback. In more detail, we select a very small portion of the citation records for which our method was most unsure about the correct authorship and ask the administrators to label them. This feedback is then used to improve the effectiveness of the whole process. In our experimental evaluation, we observed that with a very small labeling effort (usually around 5% of the records), the overall disambiguation effectiveness improves by almost 10% on average, with gains of up to 61% in some of the largest ambiguous groups.


Whois? Deep Author Name Disambiguation Using Bibliographic Data

Nagaraj Asundi

Lecture Notes in Computer Science, 2022

As the number of authors is increasing exponentially over years, the number of authors sharing the same names is increasing proportionally. This makes it challenging to assign newly published papers to their adequate authors. Therefore, Author Name Ambiguity (ANA) is considered a critical open problem in digital libraries. This paper proposes an Author Name Disambiguation (AND) approach that links author names to their real-world entities by leveraging their co-authors and domain of research. To this end, we use a collection from the DBLP repository that contains more than 5 million bibliographic records authored by around 2.6 million co-authors. Our approach first groups authors who share the same last names and same first name initials. The author within each group is identified by capturing the relation with his/her co-authors and area of research, which is represented by the titles of the validated publications of the corresponding author. To this end, we train a neural network model that learns from the representations of the co-authors and titles. We validated the effectiveness of our approach by conducting extensive experiments on a large dataset.


Overview of the M-WePNaD task : multilingual web person name disambiguation at IberEval 2017

Victor Fresno

2017

Multilingual Web Person Name Disambiguation is a new shared task proposed for the first time at the IberEval 2017 evaluation campaign. For a set of web search results associated with a person name, the task deals with the grouping of the results based on the particular individual they refer to. Different from previous works dealing with monolingual search results, this task has further considered the challenge posed by search results written in different languages. This task makes it possible to evaluate the performance of participating systems in a multilingual scenario. This overview summarizes a total of 18 runs received from four participating teams. We present the datasets utilized and the methodology defined for the task and the evaluation, along with an analysis of the results and the submitted systems.


Author Name Disambiguation in Bibliographic Databases: A Survey

Tehmina Amjad

2020

Entity resolution has been a challenging and active research area in the field of Information Systems over the last decade. Author Name Disambiguation (AND) in Bibliographic Databases (BD) like DBLP, Citeseer, and Scopus is a specialized field of entity resolution. Given many citations of underlying authors, the AND task is to find which citations belong to the same author. In this survey, we start with three basic AND problems, followed by the need for a solution and the challenges. A generic, five-step framework is provided for handling AND issues. These steps are: (1) preparation of the dataset, (2) selection of publication attributes, (3) selection of similarity metrics, (4) selection of models, and (5) clustering performance evaluation. Categorization and elaboration of similarity metrics and methods are also provided. Finally, future directions and recommendations are given for this dynamic area of research.


Simulated Entity Resolution by Diverse Means: DIMACS Work on the KDD Challenge of 2005

Alexander Genkin, Dmitriy Fradkin

2006


Who's who in the world wide web: Approaches to name disambiguation

Vanessa Klaas

INSTITUT FUR INFORMATIK. der Ludwig-Maximilians- …, 2007




Rangga Restu Prayogo

2021 ACM/IEEE Joint Conference on Digital Libraries (JCDL), 2021

Author Name Disambiguation (AND) is the task of resolving which author mentions in a bibliographic database refer to the same real-world person, and is a critical ingredient of digital library applications such as search and citation analysis. While many AND algorithms have been proposed, comparing them is difficult because they often employ distinct features and are evaluated on different datasets. In response to this challenge, we present S2AND, a unified benchmark dataset for AND on scholarly papers, as well as an open-source reference model implementation. Our dataset harmonizes eight disparate AND datasets into a uniform format, with a single rich feature set drawn from the Semantic Scholar (S2) database. Our evaluation suite for S2AND reports performance split by facets like publication year and number of papers, allowing researchers to track both global performance and measures of fairness across facet values. Our experiments show that because previous datasets tend to cover idiosyncratic and biased slices of the literature, algorithms trained to perform well on one of them may generalize poorly to others. By contrast, we show how training on a union of datasets in S2AND results in more robust models that perform well even on datasets unseen in training. The resulting AND model also substantially improves over the production algorithm in S2, reducing error by over 50% in terms of B³ F1. We release our unified dataset, model code, trained models, and evaluation suite to the research community. Index Terms: Digital libraries, Author name disambiguation, Out-of-domain evaluation.


Towards a New Paradigm for Author Name Disambiguation

Tehmina Amjad

IEEE Access

Author Name Disambiguation (AND) has emerged as a significant challenge in the bibliometric context with the growing volume of scientific literature. When citations written by different authors have the same names (polysemy or homonym names), and when an author has different names, there is ambiguity (synonyms or name variants). It is difficult to associate a citation with the correct author. Polysemy and synonyms cause merging and splitting anomalies in the citations. These anomalies affect the quantification of an author's productivity (bibliometric analysis) and the reliability and quality of the information retrieved. Many techniques for AND have been proposed in the literature; most of them do not go beyond string matching or text matching. Most do not consider the context or semantics of the terms used in the citations. The AND problem is resolved semantically in this paper using the deep learning technique on the PubMed dataset. The experimental results show that the proposed method achieves overall (11.72%, 12.5%, and 12.1%) higher precision, recall, and f-measure than the pairwise class classification.


Automatic Disambiguation of Author Names in Bibliographic Repositories

Alberto Laender

Synthesis lectures on information concepts, retrieval, and services, 2020

This book deals with a hard problem that is inherent to human language: ambiguity. In particular, we focus on author name ambiguity, a type of ambiguity that exists in digital bibliographic repositories, which occurs when an author publishes works under distinct names or distinct authors publish works under similar names. This problem may be caused by a number of reasons, including the lack of standards and common practices, and the decentralized generation of bibliographic content. As a consequence, the quality of the main services of digital bibliographic repositories such as search, browsing, and recommendation may be severely affected by author name ambiguity. The focal point of the book is on automatic methods, since manual solutions do not scale to the size of the current repositories or the speed in which they are updated. Accordingly, we provide an ample view on the problem of automatic disambiguation of author names, summarizing the results of more than a decade of research on this topic conducted by our group, which were reported in more than a dozen publications that received over 900 citations so far, according to Google Scholar. We start by discussing its motivational issues (Chapter 1). Next, we formally define the author name disambiguation task (Chapter 2) and use this formalization to provide a brief, taxonomically organized, overview of the literature on the topic (Chapter 3). We then organize, summarize and integrate the efforts of our own group on developing solutions for the problem that have historically produced state-of-the-art (by the time of their proposals) results in terms of the quality of the disambiguation results. Thus, Chapter 4 covers HHC (Heuristic-based Clustering), an author name disambiguation method that is based on two specific real-world assumptions regarding scientific authorship. Then, Chapter 5 describes SAND (Self-training Author Name Disambiguator) and Chapter 6 presents two incremental author name disambiguation methods, namely INDi (Incremental Unsupervised Name Disambiguation) and INC (Incremental Nearest Cluster). Finally, Chapter 7 provides an overview of recent author name disambiguation methods that address new specific approaches such as graph-based representations, alternative predefined similarity functions, visualization facilities and approaches based on artificial neural networks. The chapters are followed by three appendices that cover, respectively: (i) a pattern matching function for comparing proper names and used by some of the methods addressed in this book; (ii) a tool for generating synthetic collections of citation records for distinct experimental tasks; and (iii) a number of datasets commonly used to evaluate author name disambiguation methods. In summary, the book organizes a large body of knowledge and work in the area of author name disambiguation in the last decade, hoping to consolidate a solid basis for future developments in the field.


Author Disambiguation using Error-Driven Machine Learning With a Ranking Loss Function

Andrew McCallum

Author disambiguation is the problem of determining whether records in a publications database refer to the same person. A common supervised machine learning approach is to build a classifier to predict whether a pair of records is coreferent, followed by a clustering step to enforce transitivity. However, this approach ignores powerful evidence obtainable by examining sets (rather than pairs) of records, such as the number of publications or co-authors an author has. In this paper we propose a representation that enables these first-order features over sets of records. We then propose a training algorithm well-suited to this representation that is (1) error-driven in that training examples are generated from incorrect predictions on the training data, and (2) rank-based in that the classifier induces a ranking over candidate predictions. We evaluate our algorithms on three author disambiguation datasets and demonstrate error reductions of up to 60% over the standard binary classification approach.


A Graph Combination With Edge Pruning‐Based Approach for Author Name Disambiguation

Joydeep Chandra

Journal of the Association for Information Science and Technology, 2019

Author name disambiguation (AND) is a challenging problem due to several issues such as missing key identifiers, the same name corresponding to multiple authors, and inconsistent representation. Several techniques have been proposed but maintaining consistent accuracy levels over all data sets is still a major challenge. We identify two major issues associated with the AND problem. First, the namesake problem in which two or more authors with the same name publish in a similar domain. Second, the diverse topic problem in which one author publishes in diverse topical domains with a different set of coauthors. In this work, we initially propose a method named ATGEP for AND that addresses the namesake issue. We evaluate the performance of ATGEP using various ambiguous name references collected from the Arnetminer Citation (AC) and Web of Science (WoS) data set. We empirically show that the two aforementioned problems are crucial to address the AND problem and are difficult to handle using state-of-the-art techniques. To handle the diverse topic issue, we extend ATGEP to a new variant named ATGEP-web that considers external web information of the authors. Experiments show that with enough information available from external web sources ATGEP-web can significantly improve the results further compared with ATGEP.


A knowledge graph embeddings based approach for author name disambiguation using literals

Aldo Gangemi

Scientometrics

Scholarly data is growing continuously containing information about the articles from a plethora of venues including conferences, journals, etc. Many initiatives have been taken to make scholarly data available in the form of Knowledge Graphs (KGs). These efforts to standardize these data and make them accessible have also led to many challenges such as exploration of scholarly articles, ambiguous authors, etc. This study more specifically targets the problem of Author Name Disambiguation (AND) on Scholarly KGs and presents a novel framework, Literally Author Name Disambiguation (LAND), which utilizes Knowledge Graph Embeddings (KGEs) using multimodal literal information generated from these KGs. This framework is based on three components: (1) multimodal KGEs, (2) a blocking procedure, and finally, (3) hierarchical Agglomerative Clustering. Extensive experiments have been conducted on two newly created KGs: (i) KG containing information from Scientometrics Journal from 1978 onwards...


Effective self-training author name disambiguation in scholarly digital libraries

Adriano Veloso

2010

Name ambiguity in the context of bibliographic citation records is a hard problem that affects the quality of services and content in digital libraries and similar systems. Supervised methods that exploit training examples in order to distinguish ambiguous author names are among the most effective solutions for the problem, but they require skilled human annotators in a laborious and continuous process of manually labeling citations in order to provide enough training examples. Thus, addressing the issues of (i) automatic acquisition of examples and (ii) highly effective disambiguation even when only few examples are available, are the need of the hour for such systems. In this paper, we propose a novel two-step disambiguation method, SAND (Self-training Associative Name Disambiguator), that deals with these two issues. The first step eliminates the need of any manual labeling effort by automatically acquiring examples using a clustering method that groups citation records based on the similarity among coauthor names. The second step uses a supervised disambiguation method that is able to detect unseen authors not included in any of the given training examples. Experiments conducted with standard public collections, using the minimum set of attributes present in a citation (i.e., author names, work title and publication venue), demonstrated that our proposed method outperforms representative unsupervised disambiguation methods that exploit similarities between citation records and is as effective as, and in some cases superior to, supervised ones, without manually labeling any training example.


Cited by

Entity Deduplication on ScholarlyData

Andrea Giovanni Nuzzolese

The Semantic Web, 2017
