An approach to web-scale named-entity disambiguation
2009
https://doi.org/10.1007/978-3-642-03070-3_52…
22 pages
Abstract
We present a multi-pass clustering approach to large-scale, wide-scope named-entity disambiguation (NED) on collections of web pages. Our approach uses name co-occurrence information to cluster and hence disambiguate entities, and is designed to handle NED on the entire web. We show that on web collections, NED becomes increasingly difficult as the corpus size increases, not only because of the challenge of scaling the NED algorithm, but also because new and surprising facets of entities become visible in the data.
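As a rough illustration of the co-occurrence idea (not the paper's multi-pass algorithm), the sketch below groups pages mentioning an ambiguous name by the overlap of the other names appearing on them; the Jaccard measure, greedy single-pass strategy, threshold, and toy data are all assumptions.

```python
# Illustrative sketch of name-co-occurrence clustering for NED.
# Not the paper's multi-pass algorithm; the greedy strategy, threshold,
# and toy data are assumptions.

def jaccard(a, b):
    """Overlap between two sets of co-occurring names."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def cluster_mentions(mention_contexts, threshold=0.2):
    """Greedy single-pass clustering: each mention joins the most similar
    existing cluster, or starts a new one (a new facet of the name)."""
    clusters = []  # each cluster: {"names": co-occurring names, "members": page indices}
    for i, names in enumerate(mention_contexts):
        best, best_sim = None, 0.0
        for c in clusters:
            sim = jaccard(names, c["names"])
            if sim > best_sim:
                best, best_sim = c, sim
        if best is not None and best_sim >= threshold:
            best["members"].append(i)
            best["names"] |= names
        else:
            clusters.append({"names": set(names), "members": [i]})
    return clusters

# Toy example: four pages mentioning the same ambiguous name.
pages = [
    {"Pocahontas", "Jamestown"},        # one individual
    {"Jamestown", "Virginia Company"},  # same individual
    {"The Fall", "guitar"},             # a different individual
    {"guitar", "Manchester"},           # the second individual again
]
for cluster in cluster_mentions(pages):
    print(cluster["members"], sorted(cluster["names"]))
```

A multi-pass variant could, for instance, re-run the grouping with progressively relaxed thresholds or richer context features; the point here is only that shared co-occurring names are the clustering signal.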
Related papers
2011
Disambiguating named entities in natural language text maps mentions of ambiguous names onto canonical entities like people or places, registered in a knowledge base such as DBpedia or YAGO. This paper presents a robust method for collective disambiguation, by harnessing context from knowledge bases and using a new form of coherence graph. It unifies prior approaches into a comprehensive framework that combines three measures: the prior probability of an entity being mentioned, the similarity between the contexts of a mention and a candidate entity, as well as the coherence among candidate entities for all mentions together. The method builds a weighted graph of mentions and candidate entities, and computes a dense subgraph that approximates the best joint mention-entity mapping. Experiments show that the new method significantly outperforms prior methods in terms of accuracy, with robust behavior across a variety of inputs.
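A minimal sketch of how the three measures might be combined into a joint score over a mention-to-entity assignment; the weights, data structures, exhaustive search, and toy data are assumptions, and the paper's actual method approximates the best assignment via a dense subgraph rather than enumeration.

```python
# Sketch: scoring a joint mention-to-entity assignment with the three measures
# (mention-entity prior, context similarity, entity-entity coherence).
# Weights and the brute-force search are assumptions, not the paper's algorithm.
from itertools import combinations, product

def score_assignment(assignment, prior, ctx_sim, coherence,
                     alpha=1.0, beta=1.0, gamma=1.0):
    local = sum(alpha * prior[m][e] + beta * ctx_sim[m][e]
                for m, e in assignment.items())
    joint = sum(gamma * coherence.get(frozenset((e1, e2)), 0.0)
                for e1, e2 in combinations(assignment.values(), 2))
    return local + joint

def best_assignment(candidates, prior, ctx_sim, coherence):
    """Exhaustive search over small candidate sets (fine for a toy example only)."""
    mentions = list(candidates)
    best, best_score = None, float("-inf")
    for combo in product(*(candidates[m] for m in mentions)):
        assignment = dict(zip(mentions, combo))
        s = score_assignment(assignment, prior, ctx_sim, coherence)
        if s > best_score:
            best, best_score = assignment, s
    return best, best_score

# Toy example: coherence pulls the two mentions toward a consistent pair.
candidates = {"Page": ["Jimmy_Page", "Larry_Page"],
              "Kashmir": ["Kashmir_(song)", "Kashmir_(region)"]}
prior = {"Page": {"Jimmy_Page": 0.3, "Larry_Page": 0.7},
         "Kashmir": {"Kashmir_(song)": 0.2, "Kashmir_(region)": 0.8}}
ctx_sim = {"Page": {"Jimmy_Page": 0.6, "Larry_Page": 0.1},
           "Kashmir": {"Kashmir_(song)": 0.5, "Kashmir_(region)": 0.2}}
coherence = {frozenset(("Jimmy_Page", "Kashmir_(song)")): 0.9}
print(best_assignment(candidates, prior, ctx_sim, coherence))
```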
2007
Named entity disambiguation has been one of the main challenges to research in Information Extraction and the development of the Semantic Web. Therefore, it has attracted much research effort, with various methods introduced for different domains, scopes, and purposes. In this paper, we propose a new approach that is not limited to particular entity classes and does not require well-structured texts. The novelty is that it exploits relations between co-occurring entities in a text, as defined in a knowledge base, for disambiguation. Combined with class weighting and coreference resolution, our knowledge-based method outperforms the KIM system on this problem. The implemented algorithms and conducted experiments are presented and discussed.
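A minimal sketch of that co-occurrence signal: a candidate entity is favored when the knowledge base records relations between it and the candidates of the other entities mentioned in the same text. The relation store, class lookup, and weighting below are hypothetical placeholders, not the paper's implementation.

```python
# Sketch: prefer candidates with KB relations to co-occurring entities.
# kb_related, entity_class, and class_weight are hypothetical placeholders.

def disambiguate(candidates_by_mention, kb_related, entity_class, class_weight):
    """candidates_by_mention: mention -> list of candidate entity IDs.
    kb_related: set of frozenset({e1, e2}) pairs related in the KB."""
    resolved = {}
    for mention, candidates in candidates_by_mention.items():
        # Candidates of all other mentions in the same text.
        others = [c for m, cs in candidates_by_mention.items()
                  if m != mention for c in cs]

        def support(cand):
            relations = sum(1 for o in others
                            if frozenset((cand, o)) in kb_related)
            return relations * class_weight.get(entity_class.get(cand), 1.0)

        resolved[mention] = max(candidates, key=support)
    return resolved
```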
Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, 2013
Named entity disambiguation is the task of disambiguating named entity mentions in natural language text and linking them to their corresponding entries in a knowledge base such as Wikipedia. Such disambiguation can help enhance readability and add semantics to plain text. It is also a central step in constructing a high-quality information network or knowledge graph from unstructured text. Previous research has tackled this problem by making use of various textual and structural features from a knowledge base. Most of the proposed algorithms assume that a knowledge base can provide enough explicit and useful information to help disambiguate a mention to the right entity. However, existing knowledge bases are rarely complete (and likely never will be), leading to poor performance on short queries whose contexts are not well known. In such cases, we need to collect additional evidence scattered across internal and external corpora to augment the knowledge bases and enhance their disambiguation power. In this work, we propose a generative model and an incremental algorithm to automatically mine useful evidence across documents. With specific modeling of the "background topic" and "unknown entities", our model is able to harvest useful evidence out of noisy information. Experimental results show that our proposed method outperforms the state-of-the-art approaches significantly, boosting the disambiguation accuracy from 43% (baseline) to 86% on short queries derived from tweets.
2015
When RDF instances represent the same entity they are said to corefer; for example, two nodes from different RDF graphs may both refer to the same individual, the musical artist James Brown. Disambiguating entities is essential for knowledge base population and other tasks that result in integration or linking of data. Often, however, entity instance data originates from different sources and can be represented using different schemas or ontologies. In the age of Big Data, data can have other characteristics, such as originating from sources that are schema-less or without ontological structure. Our work involves researching new ways to process this type of data in order to perform entity disambiguation. Our approach uses multi-level clustering and includes fine-grained entity type recognition, contextualization of entities, and online processing, which can be supported by a parallel architecture.
Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017)
We address the task of Named Entity Disambiguation (NED) for noisy text. We present WikilinksNED, a large-scale NED dataset of text fragments from the web, which is significantly noisier and more challenging than existing news-based datasets. To capture the limited and noisy local context surrounding each mention, we design a neural model and train it with a novel method for sampling informative negative examples. We also describe a new way of initializing word and entity embeddings that significantly improves performance. Our model significantly outperforms existing state-of-the-art methods on WikilinksNED while achieving comparable performance on a smaller newswire dataset.
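As a compact, hedged sketch of training a mention-entity scorer with sampled negatives: the architecture, margin loss, and uniform random negatives below are assumptions, and the paper's model and its informative negative sampling are considerably more involved.

```python
# Sketch: a tiny mention-vs-entity scorer trained with negative sampling.
# Dimensions, the margin loss, and random negatives are assumptions.
import torch
import torch.nn as nn

class TinyNED(nn.Module):
    def __init__(self, vocab_size, num_entities, dim=64):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, dim)
        self.ent_emb = nn.Embedding(num_entities, dim)

    def forward(self, context_word_ids, entity_ids):
        # Mention representation: mean of the (limited, noisy) context words.
        ctx = self.word_emb(context_word_ids).mean(dim=1)   # (B, dim)
        ent = self.ent_emb(entity_ids)                       # (B, dim)
        return (ctx * ent).sum(dim=-1)                       # (B,) dot-product scores

def train_step(model, opt, ctx_ids, gold_ids, neg_ids, margin=0.5):
    """One margin-ranking step: the gold entity should outscore a sampled negative."""
    pos = model(ctx_ids, gold_ids)
    neg = model(ctx_ids, neg_ids)
    loss = torch.relu(margin - pos + neg).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

model = TinyNED(vocab_size=10_000, num_entities=5_000)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
ctx = torch.randint(0, 10_000, (32, 20))   # 32 mentions, 20 context words each
gold = torch.randint(0, 5_000, (32,))
neg = torch.randint(0, 5_000, (32,))       # here: uniform random negatives
print(train_step(model, opt, ctx, gold, neg))
```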
Detecting entity mentions in a text and then mapping them to the right entities in a given knowledge source is significant to the realization of the Semantic Web, as well as to the advanced development of natural language processing applications. The knowledge sources used are often closed ontologies, built by small groups of experts, and Wikipedia. To date, state-of-the-art methods proposed for named entity disambiguation mainly use Wikipedia as such a knowledge source. This paper proposes a method that enriches a closed ontology with Wikipedia and then disambiguates named entities in a text based on the enriched ontology. The method disambiguates named entities iteratively and incrementally: the entities identified in each iteration are used to disambiguate the remaining ones in subsequent iterations. The experimental results show that enriching a closed ontology noticeably improves disambiguation performance.
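A schematic of that iterative, incremental strategy: entities resolved with high confidence in one pass become context for resolving the remaining mentions in the next. The scoring function, confidence cutoff, and loop structure are placeholders, not the paper's ontology-enriched method.

```python
# Sketch of iterative, incremental disambiguation: entities resolved in one
# pass become context for the next. score() and the confidence threshold
# are hypothetical placeholders.

def iterative_disambiguation(mentions, candidates, score,
                             confidence=0.8, max_iters=10):
    """candidates: mention -> list of candidate entities.
    score(mention, candidate, resolved_so_far) -> float in [0, 1]."""
    resolved = {}                      # mention -> chosen entity
    unresolved = set(mentions)
    for _ in range(max_iters):
        newly_resolved = {}
        for m in unresolved:
            best = max(candidates[m], key=lambda c: score(m, c, resolved))
            if score(m, best, resolved) >= confidence:
                newly_resolved[m] = best
        if not newly_resolved:         # no progress in this pass: stop
            break
        resolved.update(newly_resolved)
        unresolved -= set(newly_resolved)
    return resolved, unresolved        # anything left stays ambiguous
```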
The rapidly increasing use of large-scale data on the Web has made named entity disambiguation a key research challenge in Information Extraction (IE) and the development of the Semantic Web. In this paper we propose a novel disambiguation framework that utilizes background semantic information, typically in the form of Linked Data, to accurately determine the intended meaning of detected semantic entity references within texts.
The Semantic Web-ISWC 2006, 2006
Precisely identifying entities in web documents is essential for document indexing, web search and data integration. Entity disambiguation is the challenge of determining the correct entity out of various candidate entities. Our novel method utilizes background knowledge in the form of a populated ontology. Additionally, it does not rely on the existence of any structure in a document or the appearance of data items that can provide strong evidence, such as e-mail addresses for disambiguating authors. The originality of our method lies in the way it uses different relationships in a document, as well as in the ontology, to provide clues for determining the correct entity. We demonstrate the applicability of our method by disambiguating authors in a collection of DBWorld posts using a large-scale, real-world ontology extracted from DBLP. The precision and recall measurements provide encouraging results.
2018
Named entity disambiguation (NED) is a central problem in information extraction. The goal is to link entities in a knowledge graph (KG) to their mention spans in unstructured text. Each distinct mention span (like John Smith, Jordan or Apache) represents a multi-class classification task. NED can therefore be modeled as a multitask problem with tens of millions of tasks for realistic KGs. We initiate an investigation into neural representations, network architectures, and training protocols for multitask NED. Specifically, we propose a task-sensitive representation learning framework that learns mention-dependent representations, followed by a common classifier. Parameter learning in our framework can be decomposed into solving multiple smaller problems involving overlapping groups of tasks. We prove bounds for excess risk, which provide additional insight into the problem of multi-task representation learning. While remaining practical in terms of training memory and time requirements...
In this paper we present an unsupervised system for grouping the results returned by a search engine when the query is a person name shared by different individuals. Web pages are represented by n-grams of different kinds of information and sizes. In addition, we propose a clustering algorithm capable of computing the number of clusters and returning groups of web pages corresponding to each individual, without requiring training or predefined thresholds, unlike the best state-of-the-art systems for this task. We evaluated our proposal on three test collections from different evaluation campaigns for the Web People Search task. The results obtained are competitive and comparable to those obtained by the best state-of-the-art systems that use some form of supervision. Keywords: unsupervised learning, clustering, n-grams, web people search
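As a generic illustration of that setup (not the paper's threshold-free algorithm), the sketch below represents pages as word n-gram vectors and picks the number of clusters automatically with a silhouette criterion; the vectorizer settings, linkage, and model-selection criterion are assumptions.

```python
# Sketch: cluster search-result pages for an ambiguous person name using
# word n-grams, choosing the number of clusters via silhouette analysis.
# A generic illustration only; assumes at least four pages.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

def cluster_pages(page_texts, max_k=10):
    # Word n-grams of sizes 1-3 as page features.
    X = TfidfVectorizer(ngram_range=(1, 3), min_df=1).fit_transform(page_texts)
    X = X.toarray()
    best_labels, best_score = None, -1.0
    for k in range(2, min(max_k, len(page_texts) - 1) + 1):
        labels = AgglomerativeClustering(n_clusters=k).fit_predict(X)
        s = silhouette_score(X, labels)
        if s > best_score:
            best_labels, best_score = labels, s
    return best_labels   # cluster id per page; one cluster per individual
```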
