Academia.edu

Entity Resolution

257 papers
65 followers
About this topic
Entity Resolution is the process of identifying and merging records that refer to the same real-world entity across different data sources, ensuring data consistency and accuracy. It involves techniques from data cleaning, deduplication, and record linkage to resolve ambiguities and discrepancies in data representation.

Key research themes

1. How can schema-agnostic and scalable blocking techniques improve entity resolution on heterogeneous and noisy big data?

This research theme focuses on developing blocking methods that do not require prior schema knowledge and can efficiently handle large, heterogeneous, and noisy datasets. Blocking is a critical step in entity resolution (ER) that partitions datasets into smaller blocks to reduce the quadratic comparison cost. Addressing schema heterogeneity and noise while maintaining blocking effectiveness and scalability is essential for processing Big Data ER tasks.
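As a rough illustration of why blocking cuts the quadratic comparison cost, here is a minimal sketch of schema-agnostic token blocking in Python. It is not taken from any of the papers listed here; the record ids, attribute names, and helper functions (`token_blocking`, `candidate_pairs`) are hypothetical. Every token of every attribute value becomes a blocking key, so attribute names never need to be aligned.

```python
from collections import defaultdict
from itertools import combinations


def token_blocking(records):
    """Group record ids by every token that appears in any attribute value."""
    blocks = defaultdict(set)
    for rec_id, attributes in records.items():
        for value in attributes.values():
            for token in str(value).lower().split():
                blocks[token].add(rec_id)
    # A block holding a single record yields no comparisons, so drop it.
    return {tok: ids for tok, ids in blocks.items() if len(ids) > 1}


def candidate_pairs(blocks):
    """Return the distinct record pairs that share at least one block."""
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Hypothetical records from two sources whose attribute names differ.
records = {
    "a1": {"name": "Jane Smith", "city": "Boston"},
    "b1": {"full_name": "smith jane", "location": "Boston MA"},
    "b2": {"full_name": "Carlos Ruiz", "location": "Madrid"},
    "a2": {"name": "Wei Chen", "city": "Shanghai"},
}

blocks = token_blocking(records)
pairs = candidate_pairs(blocks)
all_pairs = len(records) * (len(records) - 1) // 2

print(f"{len(pairs)} candidate pair(s) instead of {all_pairs}: {sorted(pairs)}")
```

Only records that share a token are compared, so the number of candidate pairs depends on token overlap rather than on the square of the dataset size. Very frequent tokens would create oversized blocks in practice, which is one reason the literature adds steps such as block pruning.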

Key finding: Proposes a novel schema-agnostic ER approach that treats entity attributes as bags of words and uses n-grams combined with Apache Spark for scalable processing. The method avoids complex schema alignment and meta-blocking...
Key finding: Introduces a schema-agnostic blocking technique that incrementally processes streaming, noisy, and heterogeneous data using distributed infrastructure. The approach applies attribute selection and top-n neighborhood...
Key finding: Presents NA-BLOCKER, a novel noise-tolerant, schema-agnostic blocking technique using Locality Sensitive Hashing (LSH) to hash attribute values. NA-BLOCKER enhances block quality and effectiveness over the state-of-the-art...
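The LSH idea mentioned in the last finding can be sketched with MinHash signatures and banding. This is a generic, single-machine illustration and should not be read as NA-BLOCKER's actual algorithm, whose details are not given in this listing. The function names, the character 3-gram shingling, and the parameters (`num_hashes=32`, `bands=16`) are all assumptions chosen for the example.

```python
import hashlib
from collections import defaultdict


def shingles(text, n=3):
    """Character n-grams of a lowercased, whitespace-normalized string."""
    s = " ".join(text.lower().split())
    return {s[i:i + n] for i in range(max(1, len(s) - n + 1))}


def stable_hash(seed, item):
    """Deterministic 64-bit hash of (seed, item); each seed acts as one hash function."""
    digest = hashlib.blake2b(f"{seed}:{item}".encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(items, num_hashes):
    """For each seeded hash function, keep the minimum hash over the set."""
    return [min(stable_hash(seed, it) for it in items) for seed in range(num_hashes)]


def lsh_blocks(values, num_hashes=32, bands=16):
    """Split each signature into bands; records that agree on a whole band share a block."""
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for rec_id, text in values.items():
        sig = minhash_signature(shingles(text), num_hashes)
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].add(rec_id)
    return [ids for ids in buckets.values() if len(ids) > 1]


# Hypothetical noisy attribute values; which pairs collide depends on the hashes.
values = {
    "r1": "Jonathan Smith, 12 Main Street",
    "r2": "Jonathon Smith, 12 Main Street",
    "r3": "Maria Garcia, 98 Ocean Avenue",
}
for block in lsh_blocks(values):
    print(sorted(block))
```

More bands with fewer rows per band make collisions more likely for moderately similar values, so blocking becomes more tolerant of noise at the cost of larger blocks.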

2. What are effective distributed and clustering-based methods for scalable multi-source entity resolution?

This theme investigates methods that use distributed computing frameworks and clustering algorithms to tackle ER involving multiple heterogeneous data sources. By focusing on clustering to group matching entities across many datasets and exploiting parallel processing platforms like Apache Flink and Apache Spark, these approaches aim to improve scalability and integration quality in multi-source ER scenarios.

Key finding: Implements distributed versions of six clustering algorithms on Apache Flink for multi-source ER, demonstrating that clustering-based approaches improve match quality and scalability by grouping related entities across...
Key finding: The JedAI system facilitates building end-to-end ER pipelines combining schema-awareness, budget-awareness, and execution mode dimensions, supporting schema-agnostic and schema-based blocking/matching. It offers both serial and...
Key finding: Proposes RulER, a method to efficiently execute complex record-level matching rules combining multiple similarity predicates on distributed MapReduce-like systems. It enables parallel and distributed processing of similarity...
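To make the clustering step of multi-source ER concrete, the sketch below groups records from several sources using connected components over pairwise match decisions. This is a simple single-machine baseline chosen for illustration; the listing does not name the six algorithms implemented on Apache Flink, and the record ids and the `connected_components` helper are hypothetical.

```python
def connected_components(record_ids, matches):
    """Cluster records by the transitive closure of pairwise matches (union-find)."""
    parent = {r: r for r in record_ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
            x = parent[x]
        return x

    for a, b in matches:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for r in record_ids:
        clusters.setdefault(find(r), set()).add(r)
    # Sort for a deterministic output order.
    return sorted(clusters.values(), key=lambda c: sorted(c))


# Hypothetical record ids prefixed with their source, plus matcher output.
record_ids = ["s1:1", "s1:2", "s2:7", "s2:9", "s3:4"]
matches = [("s1:1", "s2:7"), ("s2:7", "s3:4")]

clusters = connected_components(record_ids, matches)
for cluster in clusters:
    print(sorted(cluster))
```

Connected components merge everything reachable through a chain of matches, so one false-positive link can fuse two entities. That weakness is a common motivation for the more selective clustering schemes studied in this line of work.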

3. How can multi-type or graph-based entity representations improve unsupervised entity resolution and disambiguation?

This research area explores leveraging graph structures and multi-type entity models for unsupervised entity resolution and named entity disambiguation. Techniques focus on jointly resolving entities of different types by summarizing multi-typed RDF graphs, exploiting relational context, and applying graph-based semantic relatedness for disambiguating entities without relying on supervised learning.

Key finding: Formulates ER as a multi-type graph summarization problem, jointly clustering nodes of different types that represent the same entity, and inferring the importance of relations between entity types. The approach outperforms...
Key finding: Proposes a graph-based random walk semantic relatedness method over Wikipedia that models only named entities and their contextual links for Named Entity Disambiguation (NED). The approach achieves state-of-the-art accuracy...
Key finding: Introduces a Bayesian network-based algorithm for probabilistic entity linkage that models evidences and their interdependencies from heterogeneous information spaces. The method supports incremental update of matching...

All papers in Entity Resolution

Often, in the real world, entities have two or more representations in databases. Duplicate records do not share a common key and/or they contain errors that make duplicate matching a difficult task. Errors are introduced as the result of...
Entity matching is a crucial and difficult task for data integration. Entity matching frameworks provide several methods and their combination to effectively solve different match tasks. In this paper, we comparatively analyze 11 proposed...
Despite the huge amount of recent research efforts on entity resolution (matching) there has not yet been a comparative evaluation on the relative effectiveness and efficiency of alternate approaches. We therefore present such an...
Entity Resolution (ER) is the problem of identifying which records in a database refer to the same real-world entity. An exhaustive ER process involves computing the similarities between pairs of records, which can be very expensive for...
D. Vatsalan, et al., A taxonomy of privacy-preserving record linkage techniques, Information Systems (2013), http://dx. Contact details in the extract mention P. Christen and V.S. Verykios (verykios@eap.gr).
Matching dependencies were recently introduced as declarative rules for data cleaning and entity resolution. Enforcing a matching dependency on a database instance identifies the values of some attributes for two tuples, provided that the...
In scholarly digital libraries, author disambiguation is an important task that attributes a scholarly work with specific authors. This is critical when individuals share the same name. We present an approach to this task that analyzes...
Cloud infrastructures enable the efficient parallel execution of data-intensive tasks such as entity resolution on large datasets. We investigate challenges and possible solutions of using the MapReduce programming model for parallel...
We have recently witnessed an enormous growth in the volume of structured and semi-structured data sets available on the Web. An important prerequisite for using and combining such data sets is the detection and merging of information that...
Object matching or object consolidation is a crucial task for data integration and data cleaning. It addresses the problem of identifying object instances in data sources referring to the same real-world entity. We propose a flexible...
Intelligent agents must be able to handle the complexity and uncertainty of the real world. Logical AI has focused mainly on the former, and statistical AI on the latter. Markov logic combines the two by attaching weights to first-order...
In recent years, Online Social Networks (OSNs) have essentially become an integral part of our daily lives. There are hundreds of OSNs, each with its own focus and offers for particular services and functionalities. To take...
Damia is a lightweight enterprise data integration service where line of business users can create and catalog high value data feeds for consumption by situational applications. Damia is inspired by the Web 2.0 mashup phenomenon. It...
Web data repositories usually contain references to thousands of real-world entities from multiple sources. It is not uncommon that multiple entities share the same label (polysemes) and that distinct label variations are associated with...
In recent years, it has become increasingly clear that the vision of the Semantic Web requires uncertain reasoning over rich, first-order representations. Markov logic brings the power of probabilistic modeling to first-order logic by...
In many telecom and web applications, there is a need to identify whether data objects in the same source or different sources represent the same entity in the real-world. This problem arises for subscribers in multiple services,...
Entity resolution, also known as data matching or record linkage, is the task of identifying and matching records from several databases that refer to the same entities. Traditionally, entity resolution has been applied in batch-mode and...
We introduce HIL, a high-level scripting language for entity resolution and integration. HIL aims at providing the core logic for complex data processing flows that aggregate facts from large collections of structured or unstructured data...
We present FEVER, a new evaluation platform for entity resolution approaches. The modular structure of the FEVER framework supports the incorporation or reconstruction of many previously proposed approaches for entity resolution. A...
A major emerging problem among consumer finance institutions is that customers who are not well recognized might be riskier than customers who are fully recognized. Fortunately, financial institutions can count on external vendors...
The continuous growth of the e-commerce industry has rendered the problem of product retrieval particularly important. As more enterprises move their activities on the Web, the volume and the diversity of the product-related information...
Addressing the research opportunities we've identified could substantially broaden the spectrum of multilingual text-mining and its practicality for supporting global S&T knowledge management. These opportunities also share a common set...
Most research into entity resolution (also known as record linkage or data matching) has concentrated on the quality of the matching results. In this paper, we focus on matching time and scalability, with the aim to achieve large-scale...
The Semantic Web is constantly gaining momentum, as more and more Web sites and content providers adopt its principles. At the core of these principles lies the Linked Data movement, which demands that data on the Web shall be annotated...
The problem of matching product titles is of particular interest for both users and marketers. The former frequently search the Web with the aim of comparing prices and characteristics, or obtaining and aggregating information provided...
Entity resolution (ER) is an important and common problem in data cleaning. It is about identifying and merging records in a database that represent the same real-world entity. Recently, matching dependencies (MDs) have been introduced...
There is a growing interest in methods for analyzing data describing networks of all types, including information, biological, physical, and social networks. Typically the data describing these networks is observational, and thus noisy...
The detection of product duplicates is one of the challenges that Web shop aggregators are currently facing. In this paper, we focus on solving the problem of product duplicate detection on the Web. Our proposed method extends a...
Blocking methods are crucial for making the inherently quadratic task of Entity Resolution more efficient. The blocking methods proposed in the literature rely on the homogeneity of data and the availability of binding schema information;...
User facing topical web applications such as events or shopping sites rely on large collections of data records about real world entities that are updated at varying latencies ranging from days to seconds. For example, event venue details...
by Lars Kolb and 1 more
Entity resolution is a crucial step for data quality and data integration. Learning-based approaches show high effectiveness at the expense of poor efficiency. To reduce the typically high execution times, we investigate how learning-based...
We consider a serious, previously-unexplored challenge facing almost all approaches to scaling up entity resolution (ER) to multiple data sources: the prohibitive cost of labeling training data for supervised learning of similarity scores...
Twitter has emerged as a great source to provide insights about upcoming planned and unplanned events of social, economic and political relevance. Big events are publicized and known in advance, but smaller, unplanned sub-events around...
Increasingly, smartphones are being used to access all manner of information: email messages, Facebook status updates, tweets, RSS feeds, photographs and more. Approaches to dealing with this multi-faceted information stream developed on...
Authors of scholarly publications state their affiliation in various forms. This kind of heterogeneity makes bibliographic analysis tasks on institutions impossible unless a comprehensive cleaning and consolidation of affiliation data is...
Chinese discourse coherence modeling remains a challenging task in the Natural Language Processing field. Existing approaches mostly rely on feature engineering, adopting sophisticated features to capture the logic or syntactic...
Social networks were initially places for people to contact each other and find friends or new acquaintances. As such, they have always proved interesting for machine-aided analysis. Recent developments, however, pivoted social networks to being...
This paper addresses the problem of entity identification in documents in which key identity attributes are missing. The most common approach is to take a single entity reference and determine the "best match" of its attributes to a set...
Matching Dependencies (MDs) are a relatively recent proposal for declarative entity resolution. They are rules that specify, on the basis of similarities satisfied by values in a database, what values should be considered duplicates, and...
Data quality is crucial in all information systems. As a key step in obtaining clean data, record linkage or entity resolution (ER) groups database records by the underlying real world entities. In this paper we give practical...
In intelligence analysis environments, content such as entities, events and relationships appear in different source documents and contexts, and relating them is a challenging and intensive task. This paper presents an approach to...
Every day, millions of people cross international borders by air or sea. A nation's ability to identify and neutralize threats posed by travelers depends heavily on an accurate and proactive methodology for establishing traveler identity....
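Several abstracts in this list describe matching dependencies (MDs) as declarative rules that identify attribute values of two tuples when other attribute values are similar. The sketch below shows one way such a rule could be enforced on a pair of tuples. It is an illustration under stated assumptions, not the formal semantics from those papers: the attribute names, the 0.8 similarity threshold, the use of `difflib.SequenceMatcher` as the similarity measure, and the "keep the longer value" merge choice are all hypothetical.

```python
from difflib import SequenceMatcher


def similar(a, b, threshold=0.8):
    """Assumed similarity predicate: case-insensitive ratio at or above a threshold."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def enforce_md(t1, t2):
    """MD sketch: if names are similar and phones are equal, identify the addresses."""
    if similar(t1["name"], t2["name"]) and t1["phone"] == t2["phone"]:
        # Hypothetical merge choice: keep the longer (more informative) address.
        merged = max(t1["address"], t2["address"], key=len)
        return {**t1, "address": merged}, {**t2, "address": merged}
    return t1, t2


# Hypothetical tuples that probably describe the same person.
t1 = {"name": "John Doe", "phone": "555-0100", "address": "12 Main St"}
t2 = {"name": "Jon Doe", "phone": "555-0100", "address": "12 Main Street, Springfield"}

u1, u2 = enforce_md(t1, t2)
print(u1["address"])
print(u2["address"])
```

The rule leaves the tuples unchanged when its premise does not hold, and it returns new dictionaries instead of mutating the inputs, which makes it easier to apply a set of such rules repeatedly until nothing changes.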