Academia.edu

Entity Resolution

257 papers
65 followers
About this topic
Entity Resolution is the process of identifying and merging records that refer to the same real-world entity across different data sources, ensuring data consistency and accuracy. It involves techniques from data cleaning, deduplication, and record linkage to resolve ambiguities and discrepancies in data representation.

Key research themes

1. How can schema-agnostic and scalable blocking techniques improve entity resolution on heterogeneous and noisy big data?

This research theme focuses on developing blocking methods that do not require prior schema knowledge and can efficiently handle large, heterogeneous, and noisy datasets. Blocking is a critical step in entity resolution (ER) that partitions datasets into smaller blocks to reduce the quadratic comparison cost. Addressing schema heterogeneity and noise while maintaining blocking effectiveness and scalability is essential for processing Big Data ER tasks.

Key finding: Proposes a novel schema-agnostic ER approach that treats entity attributes as bags of words and uses n-grams combined with Apache Spark for scalable processing. The method avoids complex schema alignment and meta-blocking...
Key finding: Introduces a schema-agnostic blocking technique that incrementally processes streaming, noisy, and heterogeneous data using distributed infrastructure. The approach applies attribute selection and top-n neighborhood...
Key finding: Presents NA-BLOCKER, a novel noise-tolerant, schema-agnostic blocking technique using Locality Sensitive Hashing (LSH) to hash attribute values. NA-BLOCKER enhances block quality and effectiveness over the state-of-the-art...
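The blocking idea behind this theme can be illustrated with a minimal, schema-agnostic token-blocking sketch: every record is reduced to a bag of character n-grams over all its attribute values (ignoring attribute names entirely), and only records sharing an n-gram are ever compared. The records and the trigram size below are toy assumptions, not taken from any of the cited systems.

```python
# Minimal sketch of schema-agnostic token blocking (toy data, assumed n=3).
# Records sharing a character n-gram land in the same block, so only
# within-block pairs are compared, avoiding the quadratic all-pairs cost.
from collections import defaultdict
from itertools import combinations

def ngrams(text, n=3):
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def token_blocking(records, n=3):
    """Map each n-gram to the set of record ids whose values contain it."""
    blocks = defaultdict(set)
    for rid, record in records.items():
        # Schema-agnostic: concatenate all attribute VALUES, ignore keys.
        text = " ".join(str(v) for v in record.values())
        for gram in ngrams(text, n):
            blocks[gram].add(rid)
    return blocks

def candidate_pairs(blocks):
    """Union of within-block pairs; far fewer than all-pairs comparisons."""
    pairs = set()
    for rids in blocks.values():
        if len(rids) > 1:
            pairs.update(combinations(sorted(rids), 2))
    return pairs

records = {
    1: {"name": "Apple Inc.", "hq": "Cupertino"},
    2: {"company": "Apple Incorporated", "city": "Cupertino"},
    3: {"name": "Orange S.A.", "hq": "Paris"},
}
pairs = candidate_pairs(token_blocking(records))
```

Note that blocking only produces candidate pairs; a later matching step still decides which candidates are true duplicates, which is why noisy blocks (extra candidates) hurt efficiency but not correctness.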

2. What are effective distributed and clustering-based methods for scalable multi-source entity resolution?

This theme investigates methods that use distributed computing frameworks and clustering algorithms to tackle ER involving multiple heterogeneous data sources. By focusing on clustering to group matching entities across many datasets and exploiting parallel processing platforms like Apache Flink and Apache Spark, these approaches aim to improve scalability and integration quality in multi-source ER scenarios.

Key finding: Implements distributed versions of six clustering algorithms on Apache Flink for multi-source ER, demonstrating that clustering-based approaches improve match quality and scalability by grouping related entities across...
Key finding: The JedAI system facilitates building end-to-end ER pipelines combining schema-awareness, budget-awareness, and execution mode dimensions, supporting schema-agnostic and schema-based blocking/matching. It offers both serial and...
Key finding: Proposes RulER, a method to efficiently execute complex record-level matching rules combining multiple similarity predicates on distributed MapReduce-like systems. It enables parallel and distributed processing of similarity...
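A record-level matching rule of the kind this theme describes is a boolean combination of per-attribute similarity predicates. The sketch below shows one such rule as a disjunction of conjunctions; the attribute names, thresholds, and similarity functions are illustrative assumptions, not taken from RulER itself.

```python
# Hedged sketch of a record-level matching rule: a disjunction of
# conjunctions of per-attribute similarity predicates (toy thresholds).
import difflib

def jaccard(a, b):
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def edit_sim(a, b):
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

def rule_match(r, s):
    """Match if (title similar AND year equal) OR authors nearly identical."""
    clause1 = jaccard(r["title"], s["title"]) >= 0.6 and r["year"] == s["year"]
    clause2 = edit_sim(r["authors"], s["authors"]) >= 0.9
    return clause1 or clause2

r = {"title": "Scalable Entity Resolution with Blocking",
     "authors": "A. Rossi, B. Chen", "year": 2020}
s = {"title": "Entity Resolution with Scalable Blocking",
     "authors": "A. Rossi and B. Chen", "year": 2020}
t = {"title": "Graph Summarization", "authors": "C. Diaz", "year": 2018}
```

Because each predicate touches a different attribute with a different similarity join, distributed engines can evaluate the clauses as separate similarity joins and union or intersect their results in parallel, which is the optimization opportunity the rule structure exposes.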

3. How can multi-type or graph-based entity representations improve unsupervised entity resolution and disambiguation?

This research area explores leveraging graph structures and multi-type entity models for unsupervised entity resolution and named entity disambiguation. Techniques focus on jointly resolving entities of different types by summarizing multi-typed RDF graphs, exploiting relational context, and applying graph-based semantic relatedness for disambiguating entities without relying on supervised learning.

Key finding: Formulates ER as a multi-type graph summarization problem, jointly clustering nodes of different types that represent the same entity, and inferring the importance of relations between entity types. The approach outperforms...
Key finding: Proposes a graph-based random walk semantic relatedness method over Wikipedia that models only named entities and their contextual links for Named Entity Disambiguation (NED). The approach achieves state-of-the-art accuracy...
Key finding: Introduces a Bayesian network-based algorithm for probabilistic entity linkage that models evidences and their interdependencies from heterogeneous information spaces. The method supports incremental update of matching...
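The graph-based disambiguation idea can be sketched with a toy link graph: each candidate entity is scored by the overlap between its graph neighborhood and the neighborhoods of the unambiguous context entities, and the best-connected candidate wins. The graph, the mention, and the Jaccard relatedness below are made-up simplifications; the cited work uses the full Wikipedia link graph and random-walk relatedness rather than plain set overlap.

```python
# Toy sketch of graph-based named entity disambiguation (assumed data).
# Each candidate sense is scored by link-overlap with the context entities.
links = {  # entity -> set of linked entities in a tiny toy graph
    "Paris_(France)": {"France", "Seine", "Eiffel_Tower"},
    "Paris_(Texas)": {"Texas", "United_States"},
    "France": {"Paris_(France)", "Seine"},
    "Eiffel_Tower": {"Paris_(France)", "France"},
}

def relatedness(a, b):
    """Jaccard overlap of the two entities' link neighborhoods."""
    la, lb = links.get(a, set()), links.get(b, set())
    return len(la & lb) / len(la | lb) if la | lb else 0.0

def disambiguate(candidates, context):
    """Pick the candidate most related, in total, to the context entities."""
    return max(candidates,
               key=lambda c: sum(relatedness(c, e) for e in context))

best = disambiguate(["Paris_(France)", "Paris_(Texas)"],
                    context=["France", "Eiffel_Tower"])
```

The unsupervised character of this theme comes from the scoring itself: no labeled mention-entity pairs are needed, only the structure of the graph.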

All papers in Entity Resolution

In the Web of data, entities are described by interlinked data rather than documents on the Web. In this work, we focus on entity resolution in the Web of data, i.e., identifying descriptions that refer to the same real-world entity. To... more
Abstract. The Entity Resolution problem has been widely addressed in the literature. In its simplest version, the problem takes as input a knowledge base composed of records describing real world entities and outputs the sets of records... more
Record linkage, the process of identifying and linking records that refer to the same entity across different data sources, is a critical challenge in data management and integration. Exact matching often fails due to data entry errors,... more
We have encountered several practical issues in performing data mining on a database that has been normalized using entity resolution. We describe here four specific lessons learned in such mining and the meta-level lesson learned through... more
The problem of identifying duplicated entities in a dataset has gained increasing importance during the last decades. Due to the large size of the datasets, this problem can be very costly to be solved due to its intrinsic quadratic... more
In many scenarios, it is necessary to identify records referring to the same real-world object across different data sources (Record Linkage). Yet, such a need is often in contrast with privacy requirements concerning (e.g., identify... more
The increasing use of Web systems has become a valuable source of semi-structured data. In this context, the Entity Resolution (ER) task emerges as a fundamental step to integrate multiple knowledge bases or identify similarities between... more
Entity Matching (EM), i.e., the task of identifying records that refer to the same entity, is a fundamental problem in every information integration and data cleansing system, e.g., to find similar product descriptions in databases. The... more
Matching dependencies were recently introduced as declarative rules for data cleaning and entity resolution. Enforcing a matching dependency on a database instance identifies the values of some attributes for two tuples, provided that the... more
Record-level matching rules are chains of similarity join predicates on multiple attributes employed to join records that refer to the same real-world object when an explicit foreign key is not available on the data sets at hand. They are... more
In big data sources, real-world entities are typically represented with a variety of schemata and formats (e.g., relational records, JSON objects, etc.). Different profiles (i.e., representations) of an entity often contain redundant... more
In Papadakis et al., we presented the latest release of JedAI, an open-source Entity Resolution (ER) system that allows for building a large variety of end-to-end ER pipelines. Through a thorough experimental evaluation, we compared a... more
Entity Resolution (ER) is the task of detecting different entity profiles that describe the same real-world objects. To facilitate its execution, we have developed JedAI, an open-source system that puts together a series of... more
Revolutionizing Business Intelligence with AI-Driven Big Data Analytics. In an era where data-driven decision-making defines competitive advantage, traditional business intelligence systems often struggle with scalability, accuracy, and... more
Web data repositories usually contain references to thousands of real-world entities from multiple sources. It is not uncommon that multiple entities share the same label (polysemes) and that distinct label variations are associated with... more
Similarity Join is an important operation in data integration and cleansing, record linkage, data deduplication and pattern matching. It finds similar string pairs from two collections of strings. A number of approaches have been proposed as... more
The widespread use of information systems has become a valuable source of semi-structured data. In this context, Entity Resolution (ER) emerges as a fundamental task to integrate multiple knowledge bases or identify similarities between... more
For unsupervised clustering, traditional accuracy metrics based on the constituent records do not often reflect the accuracy at the cluster level. For a specific example, consider entity resolution where the goal is to cluster records... more
Intelligent agents must be able to handle the complexity and uncertainty of the real world. Logical AI has focused mainly on the former, and statistical AI on the latter. Markov logic combines the two by attaching weights to first-order... more
Unifying first-order logic and probability is a long-standing goal of AI, and in recent years many representations combining aspects of the two have been proposed. However, inference in them is generally still at the level of... more
Entity resolution is the task of identifying duplicate entities in a dataset or multiple datasets. In the era of Big Data, this task has gained considerable attention due to the intrinsic quadratic complexity of the problem in relation to... more
Entity resolution (ER) emerges as a fundamental step to integrate multiple knowledge bases or identify similarities between entities. Blocking is widely applied as an initial step of ER to avoid computing similarities between all pairs of... more
To combine information from heterogeneous sources, equivalent data in the multiple sources must be identified. This task is the field matching problem. Specifically, the task is to determine whether or not two syntactic values are... more
Data quality in companies is decisive and critical to the benefits their products and services can provide. However, in heterogeneous IT infrastructures where, e.g., different applications for Enterprise Resource Planning (ERP), Customer... more
Centered around the data cleaning and integration research area, in this paper we propose SjClust, a framework to integrate similarity join and clustering into a single operation. The basic idea of our proposal consists in introducing a... more
Objective Record linkage to integrate uncoordinated databases is critical in biomedical research using Big Data. Balancing privacy protection against the need for high quality record linkage requires a human-machine hybrid system to... more
In this report, we present our research results for the fourth half-year phase of the project Corporate Smart Content under the work package "Knowledge-based Mining of Complex Event Patterns". We present SpaceROAM, our new... more
The increasing application of social and human-enabled systems in people's daily life, on the one hand, and the fast growth of mobile and smartphone technologies, on the other, have resulted in generating a tremendous amount of data,... more
The constant growth of the e-commerce industry has rendered the problem of product retrieval particularly important. As more enterprises move their activities on the Web, the volume and the diversity of the product-related information... more
Bibliographic databases are used to measure the performance of researchers, universities and research institutions. Thus, high data quality is required and data duplication must be avoided. One of the weaknesses of the threshold-based approach... more
Matching-related methods, i.e., entity resolution, search and evolution, are essential parts in a variety of applications. The specific research area contains a plethora of methods focusing on efficiently and effectively detecting whether... more
Entity resolution is the process of determining if, in a specific context, two or more references correspond to the same entity. In this work, we address this problem in the context of references to persons as they are found in... more
Subject headings systems are tools for organization of knowledge that have been developed over the years by libraries. The SKOS Simple Knowledge Organization System provides a practical way to represent subject headings systems, and... more
Soft cardinality (SC) is a softened version of the classical cardinality of set theory. However, given its prohibitive cost of computing (exponential order), an approximation that is quadratic in the number of terms in the text has been... more
Jo, whose support during this project has been boundless and whose nagging about finishing this thesis wasn't annoying at all. My Mum, Heather, and Dad, Jeremy. Who have loved, supported and encouraged me over the years and will now... more
Citation Segmentation in a Digital Humanities Context. Bibliographies are an important resource for scientific research. Their storage in (online) bibliographic databases offers efficient search functionalities for wide-spread and timely... more
A neural network ensemble is a learning paradigm that unites several neural networks to solve a problem. This paper explores the relationship between the ensemble and its component neural networks, both from the viewpoint of... more
We present Midas, a system that uses complex data processing to extract and aggregate facts from a large collection of structured and unstructured documents into a set of unified, clean entities and relationships. Midas focuses on data... more
This paper looks at the problem of privacy in the context of Online Social Networks (OSNs). In particular, it examines the predictability of different types of personal information based on OSN data and compares it to the perceptions of... more
Background Record linkage is the process of identifying and combining records about the same individual from two or more different datasets. While there are many open source and commercial data linkage tools, the volume and complexity of... more
This paper describes our team's (BS Man & Dmitry & Leustagos) approach to the KDD Cup 2013 track 2 challenge: Author Disambiguation in the Microsoft Academic Search database.
Often, in the real world, entities have two or more representations in databases. Duplicate records do not share a common key and/or they contain errors that make duplicate matching a difficult task. Errors are introduced as the result of... more
In intelligence analysis environments, content such as entities, events and relationships appear in different source documents and contexts, and relating them is a challenging and intensive task. This paper presents an approach to... more
Record linkage is the task of quickly and accurately identifying records corresponding to the same entity from one or more data sources. Record linkage is also known as data cleaning, entity reconciliation or identification and the... more