Academia.edu

Entity Resolution

257 papers
65 followers
About this topic
Entity Resolution is the process of identifying and merging records that refer to the same real-world entity across different data sources, ensuring data consistency and accuracy. It involves techniques from data cleaning, deduplication, and record linkage to resolve ambiguities and discrepancies in data representation.

Key research themes

1. How can schema-agnostic and scalable blocking techniques improve entity resolution on heterogeneous and noisy big data?

This research theme focuses on developing blocking methods that do not require prior schema knowledge and can efficiently handle large, heterogeneous, and noisy datasets. Blocking is a critical step in entity resolution (ER) that partitions datasets into smaller blocks to reduce the quadratic comparison cost. Addressing schema heterogeneity and noise while maintaining blocking effectiveness and scalability is essential for processing Big Data ER tasks.

Key finding: Proposes a novel schema-agnostic ER approach that treats entity attributes as bags of words and uses n-grams combined with Apache Spark for scalable processing. The method avoids complex schema alignment and meta-blocking...
Key finding: Introduces a schema-agnostic blocking technique that incrementally processes streaming, noisy, and heterogeneous data using distributed infrastructure. The approach applies attribute selection and top-n neighborhood...
Key finding: Presents NA-BLOCKER, a novel noise-tolerant, schema-agnostic blocking technique using Locality Sensitive Hashing (LSH) to hash attribute values. NA-BLOCKER enhances block quality and effectiveness over the state-of-the-art...
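The blocking idea behind this theme can be illustrated with a minimal, schema-agnostic token-blocking sketch: every record is reduced to a bag of character n-grams over all its attribute values (ignoring attribute names entirely), and only records sharing an n-gram are ever compared. The records and the trigram size below are toy assumptions, not taken from any of the cited systems.

```python
# Minimal sketch of schema-agnostic token blocking (toy data, assumed n=3).
# Records sharing a character n-gram land in the same block, so only
# within-block pairs are compared, avoiding the quadratic all-pairs cost.
from collections import defaultdict
from itertools import combinations

def ngrams(text, n=3):
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def token_blocking(records, n=3):
    """Map each n-gram to the set of record ids whose values contain it."""
    blocks = defaultdict(set)
    for rid, record in records.items():
        # Schema-agnostic: concatenate all attribute VALUES, ignore keys.
        text = " ".join(str(v) for v in record.values())
        for gram in ngrams(text, n):
            blocks[gram].add(rid)
    return blocks

def candidate_pairs(blocks):
    """Union of within-block pairs; far fewer than all-pairs comparisons."""
    pairs = set()
    for rids in blocks.values():
        if len(rids) > 1:
            pairs.update(combinations(sorted(rids), 2))
    return pairs

records = {
    1: {"name": "Apple Inc.", "hq": "Cupertino"},
    2: {"company": "Apple Incorporated", "city": "Cupertino"},
    3: {"name": "Orange S.A.", "hq": "Paris"},
}
pairs = candidate_pairs(token_blocking(records))
```

Note that blocking only produces candidate pairs; a later matching step still decides which candidates are true duplicates, which is why noisy blocks (extra candidates) hurt efficiency but not correctness.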

2. What are effective distributed and clustering-based methods for scalable multi-source entity resolution?

This theme investigates methods that use distributed computing frameworks and clustering algorithms to tackle ER involving multiple heterogeneous data sources. By focusing on clustering to group matching entities across many datasets and exploiting parallel processing platforms like Apache Flink and Apache Spark, these approaches aim to improve scalability and integration quality in multi-source ER scenarios.

Key finding: Implements distributed versions of six clustering algorithms on Apache Flink for multi-source ER, demonstrating that clustering-based approaches improve match quality and scalability by grouping related entities across...
Key finding: The JedAI system facilitates building end-to-end ER pipelines combining schema-awareness, budget-awareness, and execution mode dimensions, supporting schema-agnostic and schema-based blocking/matching. It offers both serial and...
Key finding: Proposes RulER, a method to efficiently execute complex record-level matching rules combining multiple similarity predicates on distributed MapReduce-like systems. It enables parallel and distributed processing of similarity...
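A record-level matching rule of the kind this theme describes is a boolean combination of per-attribute similarity predicates. The sketch below shows one such rule as a disjunction of conjunctions; the attribute names, thresholds, and similarity functions are illustrative assumptions, not taken from RulER itself.

```python
# Hedged sketch of a record-level matching rule: a disjunction of
# conjunctions of per-attribute similarity predicates (toy thresholds).
import difflib

def jaccard(a, b):
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def edit_sim(a, b):
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

def rule_match(r, s):
    """Match if (title similar AND year equal) OR authors nearly identical."""
    clause1 = jaccard(r["title"], s["title"]) >= 0.6 and r["year"] == s["year"]
    clause2 = edit_sim(r["authors"], s["authors"]) >= 0.9
    return clause1 or clause2

r = {"title": "Scalable Entity Resolution with Blocking",
     "authors": "A. Rossi, B. Chen", "year": 2020}
s = {"title": "Entity Resolution with Scalable Blocking",
     "authors": "A. Rossi and B. Chen", "year": 2020}
t = {"title": "Graph Summarization", "authors": "C. Diaz", "year": 2018}
```

Because each predicate touches a different attribute with a different similarity join, distributed engines can evaluate the clauses as separate similarity joins and union or intersect their results in parallel, which is the optimization opportunity the rule structure exposes.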

3. How can multi-type or graph-based entity representations improve unsupervised entity resolution and disambiguation?

This research area explores leveraging graph structures and multi-type entity models for unsupervised entity resolution and named entity disambiguation. Techniques focus on jointly resolving entities of different types by summarizing multi-typed RDF graphs, exploiting relational context, and applying graph-based semantic relatedness for disambiguating entities without relying on supervised learning.

Key finding: Formulates ER as a multi-type graph summarization problem, jointly clustering nodes of different types that represent the same entity, and inferring the importance of relations between entity types. The approach outperforms...
Key finding: Proposes a graph-based random walk semantic relatedness method over Wikipedia that models only named entities and their contextual links for Named Entity Disambiguation (NED). The approach achieves state-of-the-art accuracy...
Key finding: Introduces a Bayesian network-based algorithm for probabilistic entity linkage that models evidences and their interdependencies from heterogeneous information spaces. The method supports incremental update of matching...
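The graph-based disambiguation idea can be sketched with a toy link graph: each candidate entity is scored by the overlap between its graph neighborhood and the neighborhoods of the unambiguous context entities, and the best-connected candidate wins. The graph, the mention, and the Jaccard relatedness below are made-up simplifications; the cited work uses the full Wikipedia link graph and random-walk relatedness rather than plain set overlap.

```python
# Toy sketch of graph-based named entity disambiguation (assumed data).
# Each candidate sense is scored by link-overlap with the context entities.
links = {  # entity -> set of linked entities in a tiny toy graph
    "Paris_(France)": {"France", "Seine", "Eiffel_Tower"},
    "Paris_(Texas)": {"Texas", "United_States"},
    "France": {"Paris_(France)", "Seine"},
    "Eiffel_Tower": {"Paris_(France)", "France"},
}

def relatedness(a, b):
    """Jaccard overlap of the two entities' link neighborhoods."""
    la, lb = links.get(a, set()), links.get(b, set())
    return len(la & lb) / len(la | lb) if la | lb else 0.0

def disambiguate(candidates, context):
    """Pick the candidate most related, in total, to the context entities."""
    return max(candidates,
               key=lambda c: sum(relatedness(c, e) for e in context))

best = disambiguate(["Paris_(France)", "Paris_(Texas)"],
                    context=["France", "Eiffel_Tower"])
```

The unsupervised character of this theme comes from the scoring itself: no labeled mention-entity pairs are needed, only the structure of the graph.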

All papers in Entity Resolution

In the Web of data, entities are described by interlinked data rather than documents on the Web. In this work, we focus on entity resolution in the Web of data, i.e., identifying descriptions that refer to the same real-world entity. To... more
Abstract. The Entity Resolution problem has been widely addressed in the literature. In its simplest version, the problem takes as input a knowledge base composed of records describing real world entities and outputs the sets of records... more
Record linkage, the process of identifying and linking records that refer to the same entity across different data sources, is a critical challenge in data management and integration. Exact matching often fails due to data entry errors,... more
We have encountered several practical issues in performing data mining on a database that has been normalized using entity resolution. We describe here four specific lessons learned in such mining and the meta-level lesson learned through... more
The problem of identifying duplicated entities in a dataset has gained increasing importance during the last decades. Due to the large size of the datasets, this problem can be very costly to be solved due to its intrinsic quadratic... more
In many scenarios, it is necessary to identify records referring to the same real-world object across different data sources (Record Linkage). Yet, such a need is often in contrast with privacy requirements concerning (e.g., identify... more
The increasing use of Web systems has become a valuable source of semi-structured data. In this context, the Entity Resolution (ER) task emerges as a fundamental step to integrate multiple knowledge bases or identify similarities between... more
Entity Matching (EM), i.e., the task of identifying records that refer to the same entity, is a fundamental problem in every information integration and data cleansing system, e.g., to find similar product descriptions in databases. The... more
Matching dependencies were recently introduced as declarative rules for data cleaning and entity resolution. Enforcing a matching dependency on a database instance identifies the values of some attributes for two tuples, provided that the... more
Record-level matching rules are chains of similarity join predicates on multiple attributes employed to join records that refer to the same real-world object when an explicit foreign key is not available on the data sets at hand. They are... more
In big data sources, real-world entities are typically represented with a variety of schemata and formats (e.g., relational records, JSON objects, etc.). Different profiles (i.e., representations) of an entity often contain redundant... more
In Papadakis et al., we presented the latest release of JedAI, an open-source Entity Resolution (ER) system that allows for building a large variety of end-to-end ER pipelines. Through a thorough experimental evaluation, we compared a... more
Entity Resolution (ER) is the task of detecting different entity profiles that describe the same real-world objects. To facilitate its execution, we have developed JedAI, an open-source system that puts together a series of... more
Revolutionizing Business Intelligence with AI-Driven Big Data Analytics. In an era where data-driven decision-making defines competitive advantage, traditional business intelligence systems often struggle with scalability, accuracy, and... more
Web data repositories usually contain references to thousands of real-world entities from multiple sources. It is not uncommon that multiple entities share the same label (polysemes) and that distinct label variations are associated with... more
Similarity Join is an important operation in data integration and cleansing, record linkage, data deduplication and pattern matching. It finds similar string pairs from two collections of strings. A number of approaches have been proposed as... more
The widespread use of information systems has become a valuable source of semi-structured data. In this context, Entity Resolution (ER) emerges as a fundamental task to integrate multiple knowledge bases or identify similarities between... more
For unsupervised clustering, traditional accuracy metrics based on the constituent records do not often reflect the accuracy at the cluster level. For a specific example, consider entity resolution where the goal is to cluster records... more
Intelligent agents must be able to handle the complexity and uncertainty of the real world. Logical AI has focused mainly on the former, and statistical AI on the latter. Markov logic combines the two by attaching weights to first-order... more
Unifying first-order logic and probability is a long-standing goal of AI, and in recent years many representations combining aspects of the two have been proposed. However, inference in them is generally still at the level of... more
Entity resolution is the task of identifying duplicate entities in a dataset or multiple datasets. In the era of Big Data, this task has gained considerable attention due to the intrinsic quadratic complexity of the problem in relation to... more
Entity resolution (ER) emerges as a fundamental step to integrate multiple knowledge bases or identify similarities between entities. Blocking is widely applied as an initial step of ER to avoid computing similarities between all pairs of... more
To combine information from heterogeneous sources, equivalent data in the multiple sources must be identified. This task is the field matching problem. Specifically, the task is to determine whether or not two syntactic values are... more
Data quality in companies is decisive and critical to the benefits their products and services can provide. However, in heterogeneous IT infrastructures where, e.g., different applications for Enterprise Resource Planning (ERP), Customer... more
Centered around the data cleaning and integration research area, in this paper we propose SjClust, a framework to integrate similarity join and clustering into a single operation. The basic idea of our proposal consists in introducing a... more
Objective Record linkage to integrate uncoordinated databases is critical in biomedical research using Big Data. Balancing privacy protection against the need for high quality record linkage requires a human-machine hybrid system to... more
In this report, we present our research results for the fourth half-year phase of the project Corporate Smart Content under the work package "Knowledge-based Mining of Complex Event Patterns". We present SpaceROAM, our new... more
The increasing application of social and human-enabled systems in people's daily life, on the one hand, and the fast growth of mobile and smartphone technologies, on the other, have resulted in generating a tremendous amount of data,... more
The constant growth of the e-commerce industry has rendered the problem of product retrieval particularly important. As more enterprises move their activities on the Web, the volume and the diversity of the product-related information... more
Bibliographic databases are used to measure the performance of researchers, universities and research institutions. Thus, high data quality is required and data duplication must be avoided. One of the weaknesses of the threshold-based approach... more
Matching-related methods, i.e., entity resolution, search and evolution, are essential parts in a variety of applications. The specific research area contains a plethora of methods focusing on efficiently and effectively detecting whether... more
Entity resolution is the process of determining if, in a specific context, two or more references correspond to the same entity. In this work, we address this problem in the context of references to persons as they are found in... more
Subject headings systems are tools for organization of knowledge that have been developed over the years by libraries. The SKOS Simple Knowledge Organization System provides a practical way to represent subject headings systems, and... more
Soft cardinality (SC) is a softened version of the classical cardinality of set theory. However, given its prohibitive cost of computing (exponential order), an approximation that is quadratic in the number of terms in the text has been... more
Jo, whose support during this project has been boundless and whose nagging about finishing this thesis wasn't annoying at all. My Mum, Heather, and Dad, Jeremy. Who have loved, supported and encouraged me over the years and will now... more
Citation Segmentation in a Digital Humanities Context. Bibliographies are an important resource for scientific research. Their storage in (online) bibliographic databases offers efficient search functionalities for wide-spread and timely... more
A neural network ensemble is a learning paradigm that unites several neural networks to solve a problem. This paper explores the relationship between the ensemble and its component neural networks, both from the viewpoint of... more
We present Midas, a system that uses complex data processing to extract and aggregate facts from a large collection of structured and unstructured documents into a set of unified, clean entities and relationships. Midas focuses on data... more
This paper looks at the problem of privacy in the context of Online Social Networks (OSNs). In particular, it examines the predictability of different types of personal information based on OSN data and compares it to the perceptions of... more
Background Record linkage is the process of identifying and combining records about the same individual from two or more different datasets. While there are many open source and commercial data linkage tools, the volume and complexity of... more
This paper describes our team's (BS Man & Dmitry & Leustagos) approach to the KDD Cup 2013 track 2 challenge: Author Disambiguation in the Microsoft Academic Search database.
Often, in the real world, entities have two or more representations in databases. Duplicate records do not share a common key and/or they contain errors that make duplicate matching a difficult task. Errors are introduced as the result of... more
In intelligence analysis environments, content such as entities, events and relationships appear in different source documents and contexts, and relating them is a challenging and intensive task. This paper presents an approach to... more
Record linkage is the task of quickly and accurately identifying records corresponding to the same entity from one or more data sources. Record linkage is also known as data cleaning, entity reconciliation or identification and the... more