
Data Matching

55 papers
33 followers
About this topic
Data matching is the process of identifying and linking records from different datasets that refer to the same entity, ensuring data consistency and accuracy. It involves techniques for comparing, merging, and reconciling data to eliminate duplicates and enhance data quality for analysis and decision-making.
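
To make the definition concrete, here is a minimal Python sketch of pairwise record matching on a name field, using only the standard library. The datasets, field names, and the 0.85 threshold are illustrative assumptions, not taken from any paper listed below.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Normalised string similarity in [0, 1], via stdlib SequenceMatcher."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

# Hypothetical records from two sources; field names are illustrative.
source_a = [{"id": 1, "name": "Jon Smith"}, {"id": 2, "name": "Ana Diaz"}]
source_b = [{"id": "x", "name": "John Smith"}, {"id": "y", "name": "Bob Lee"}]

THRESHOLD = 0.85  # illustrative cut-off for declaring a match
matches = [
    (ra["id"], rb["id"], round(similarity(ra["name"], rb["name"]), 2))
    for ra in source_a
    for rb in source_b
    if similarity(ra["name"], rb["name"]) >= THRESHOLD
]
print(matches)  # [(1, 'x', 0.95)]
```

Comparing every record of one source against every record of the other scales quadratically, which is exactly the bottleneck the indexing and blocking work under the first research theme below addresses.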

Key research themes

1. How can scalable and efficient algorithms address large-scale multilingual record linkage and load balancing?

This research area investigates methods to improve the scalability and efficiency of record linkage processes, especially in contexts involving large datasets with records in multiple languages. It focuses on algorithmic solutions that balance computational loads while maintaining high accuracy in matching records with language and script variations.

Key finding: Introduced a scalable, cost-aware load balancing technique over MapReduce for linking multilingual data sources without transliteration, outperforming state-of-the-art blocking-based load balancing methods in execution time... Read more
Key finding: Provided empirical comparative analysis of 17 linkage algorithms over large real-world datasets (100,000-200,000 records), highlighting the computational and memory trade-offs between exact and inexact string matching methods... Read more
Key finding: Surveyed twelve indexing techniques aimed at reducing record pair comparisons to boost scalability while retaining linkage quality. The work analyzed their computational complexity and performance on real and synthetic data,... Read more
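
To illustrate the indexing idea from the survey above: standard blocking assigns each record a cheap key and compares only records that share a key, avoiding the full quadratic comparison space. A minimal sketch, in which the blocking key (surname initial plus postcode) is an illustrative assumption rather than a technique from a specific surveyed paper:

```python
from collections import defaultdict
from itertools import combinations

def blocking_key(record: dict) -> str:
    # Illustrative key: first letter of surname + postcode.
    return record["surname"][:1].upper() + record["postcode"]

records = [
    {"id": 1, "surname": "Smith", "postcode": "2000"},
    {"id": 2, "surname": "Smyth", "postcode": "2000"},
    {"id": 3, "surname": "Jones", "postcode": "3000"},
    {"id": 4, "surname": "Smith", "postcode": "2000"},
]

# Index records into blocks, then generate candidate pairs per block only.
blocks = defaultdict(list)
for rec in records:
    blocks[blocking_key(rec)].append(rec)

candidate_pairs = [
    (a["id"], b["id"])
    for block in blocks.values()
    for a, b in combinations(block, 2)
]
print(candidate_pairs)  # [(1, 2), (1, 4), (2, 4)] rather than all 6 pairs
```

Here the index cuts six possible pairs down to three; at the 100,000-record scale of the comparative study above, reductions of this kind are what keep linkage tractable.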

2. What frameworks and methodologies improve data deduplication and entity integration across heterogeneous data sources?

This theme investigates conceptual frameworks, practical tools, and methodologies for deduplication and entity resolution across multiple heterogeneous data sources. It emphasizes methods combining blocking, record linkage, and human-in-the-loop strategies to improve data quality in domains with complex, diverse inputs and the integration challenges associated with large-scale or domain-specific datasets.

Key finding: Proposed a six-step deduplication framework integrated into the DataCleaner tool, combining record linkage methods with blocking and sorted neighborhood techniques to efficiently identify duplicates in large, heterogeneous... Read more
Key finding: Presented a novel data integration approach utilizing graph-based techniques to reduce search space and group similar records for deduplication across heterogeneous sources including databases, CSV files, and web services.... Read more
Key finding: Outlined a conceptual framework automating entity integration from multiple data sources, addressing key issues such as ordering of resolution for datasets with different schemas, performance optimization via feature... Read more
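
One of the techniques the deduplication framework above combines with blocking is the sorted neighbourhood method: sort the records by a key, slide a fixed-size window over the sorted list, and compare only records that fall inside the same window. A minimal sketch, where the sorting key and window size are illustrative assumptions rather than the framework's actual defaults:

```python
def sorted_neighbourhood_pairs(records, key, window=3):
    """Candidate pairs from a sliding window over the key-sorted records."""
    ordered = sorted(records, key=key)
    pairs = set()
    for i in range(len(ordered)):
        for j in range(i + 1, min(i + window, len(ordered))):
            pairs.add((ordered[i]["id"], ordered[j]["id"]))
    return pairs

records = [
    {"id": 1, "name": "smith, john"},
    {"id": 2, "name": "smyth, jon"},
    {"id": 3, "name": "adams, amy"},
    {"id": 4, "name": "smith, jon"},
]
# Near-duplicates sort close together, so a small window catches them.
print(sorted_neighbourhood_pairs(records, key=lambda r: r["name"]))
```

Because the window bounds each record's comparisons, the candidate set grows roughly as window × n instead of n².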

3. How can privacy-preserving methods enable secure and ethical record linkage of sensitive genomic and clinical datasets?

This theme focuses on the ethical, legal, and technological challenges in linking sensitive datasets such as genomic and clinical records, with the dual goals of enabling data-driven health research and preserving participants' privacy. It explores privacy-preserving record linkage (PPRL) approaches that allow record matching without direct identity disclosure, and policy frameworks for responsible data sharing.

Key finding: Provided a comprehensive overview of ethical, legal, and technological challenges in PPRL, emphasizing the need for linkage methods that enable data integration without disclosing individual identities except under legally... Read more
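
A common PPRL building block (one option among those discussed in such overviews, not necessarily the surveyed approach) encodes quasi-identifiers as Bloom filters of character bigrams, letting two parties estimate name similarity from the encoded bit positions without exchanging plaintext values. A minimal sketch; the filter length, hash count, and padding scheme are illustrative assumptions:

```python
import hashlib

BITS, HASHES = 64, 2  # toy parameters; production filters are much larger

def bigrams(value: str) -> set:
    padded = f"_{value.lower()}_"  # pad so edge characters appear in bigrams
    return {padded[i:i + 2] for i in range(len(padded) - 1)}

def bloom_positions(value: str) -> set:
    """Bit positions set by double-hashing each bigram of the value."""
    positions = set()
    for gram in bigrams(value):
        digest = hashlib.sha256(gram.encode()).digest()
        h1, h2 = digest[0], digest[1]
        positions.update((h1 + k * h2) % BITS for k in range(HASHES))
    return positions

def dice(a: set, b: set) -> float:
    """Dice coefficient of two encodings; 1.0 means identical bit patterns."""
    return 2 * len(a & b) / (len(a) + len(b)) if (a or b) else 0.0

# Each party encodes locally and shares only the encodings.
print(dice(bloom_positions("smith"), bloom_positions("smyth")))  # high
print(dice(bloom_positions("smith"), bloom_positions("jones")))  # low
```

In a real deployment the parties would agree on keyed hashes (e.g. HMAC with a shared secret) so a third party cannot mount a dictionary attack on the encodings.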

All papers in Data Matching

Continuously updating authoritative spatial databases is a highly demanding task, both technically and financially. At the same time, alternative modalities to collect content, in particular spatial content, have achieved a certain... more
This paper seeks to identify and address issues that may arise with the use and analysis of the linked Longitudinal Surveys of Australian Youth (LSAY) and the National Assessment Program — Literacy and Numeracy (NAPLAN) data. The study... more
This publication presents research highlights from the past 25 years of the LSAY program, with a focus on schooling, VET in schools programs, the influences of socioeconomic status and demographics on later opportunities, and pathways... more
NCVER is an independent body responsible for collecting, managing and analysing, evaluating and communicating research and statistics about vocational education and training (VET). NCVER's in-house research and evaluation program... more
In our country, electoral information, that is, information relating to voters, the procedures that modify their data, electoral events, political parties and their authorities, etc., is administered by the Ju... more
The gathering and processing of large amounts of data increases every day. Record linkage is one of the most complex data-intensive tasks; it is used to accurately match records from different data sources that contain information about... more
The very fast development of web and data collection technologies has enabled non-experts to collect and disseminate geospatial datasets through web applications. This new type of spatial data is usually known as collaborative mapping... more
Importing open spatial data into the OpenStreetMap (OSM) project is a practice that has existed since the beginning of the project. The rapid development and multiplication of collaborative mapping tools and open data have led to the growth of... more
OpenStreetMap (OSM) is the most successful example of Volunteered Geographic Information (VGI). It is also the most frequently used case study in research that focuses on VGI quality, as it is usually considered a proxy for other VGI... more
Recommender systems (RS), as supportive tools, filter information from a massive amount of data based on specified preferences. Most RS require information about the users' context, such as their location. In such cases,... more
While the benefits of big data are numerous, most of the collected data is of poor quality and therefore cannot be effectively used as it is. One of the leading big data quality challenges in pre-processing is data duplication. Indeed, the gathered... more
Record de-duplication is an important part of the data cleaning process of a data warehouse. The identification of multiple duplicate entries of a single entity in a data warehouse is known as de-duplication. A lot of research has been carried out... more
Merging databases from different data sources is one of the important tasks in the data integration process. This study will integrate lecturer data from data sources in the application of academic information systems and research... more
Volunteered Geographic Information (VGI) phenomena offer an alternative or supplement to the authoritative mechanism of geospatial data acquisition. It aims to allow people without professional geospatial skills or knowledge to... more
Record linkage is an important process in data quality work, used for combining, matching, and removing duplicates across two or more databases that refer to the same entities. Deduplication is the process of removing duplicate records... more
Deduplication is the process of identifying all items of information within a data set that refer to the same real-world entity. The data gathered from various sources may have data quality issues in it. The concept to... more
Record linkage refers to uniting information from multiple computerized files for a common purpose. The basic methods compare name and address information across pairs of files to determine those pairs of records that are... more
Traditionally, national mapping agencies produced datasets and map products for a low number of specified and internally consistent scales, i.e. at a common level of detail (LoD). With the advent of projects like OpenStreetMap, data users... more
Whereas it was possible to define the level of detail (LoD) of authoritative datasets, it is not possible for Volunteered Geographic Information (VGI), often characterised by heterogeneous levels of details. This heterogeneity is a curb... more
The concept of Volunteered Geographic Information (VGI) has recently emerged from the new Web 2.0 technologies. The OpenStreetMap project is currently the most significant example of a system based on VGI. It aims at producing free vector... more
The study of day-of-the-week anomalies in the cryptocurrency market complements a vast literature that... more
Tax administrations currently hold a great volume of data. These data implicitly contain knowledge that can be extracted; this knowledge will depend on the quality of the data, and within that quantity of data not... more
Microscopic traffic flow simulations, as tools enabling detailed insights into traffic efficiency and safety, gained considerable popularity among transportation researchers, planners and engineers in the first two decades of the 21st century.... more
Record linkage is the problem of identifying similar records across different data sources. The similarity between two records is defined based on domain-specific similarity functions over several attributes. De-duplicating one data set... more
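
As that abstract notes, pair similarity is typically a combination of domain-specific similarity functions over several attributes. A minimal sketch of one such weighted combination; the attributes, functions, and weights are illustrative assumptions, not the paper's actual configuration:

```python
from difflib import SequenceMatcher

def string_sim(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def year_sim(a: int, b: int) -> float:
    return max(0.0, 1.0 - abs(a - b) / 10)  # linear decay over a decade

# Illustrative weights; a real system tunes these per domain.
WEIGHTS = {"name": 0.6, "city": 0.2, "birth_year": 0.2}

def record_similarity(r1: dict, r2: dict) -> float:
    """Weighted sum of per-attribute similarities, in [0, 1]."""
    return (WEIGHTS["name"] * string_sim(r1["name"], r2["name"])
            + WEIGHTS["city"] * string_sim(r1["city"], r2["city"])
            + WEIGHTS["birth_year"] * year_sim(r1["birth_year"], r2["birth_year"]))

a = {"name": "Maria Silva", "city": "Lisbon", "birth_year": 1980}
b = {"name": "Maria S. Silva", "city": "Lisboa", "birth_year": 1981}
print(round(record_similarity(a, b), 2))  # a high score suggests the same entity
```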
The assessment of data quality from different sources can be considered as a key challenge in supporting effective geospatial data integration and promoting collaboration in mapping projects. This paper presents a methodology for... more
Points of interest (POIs) describe geographic entities that users focus on, such as a school. The different types of POIs are represented by cartographic symbols. A POI's positional accuracy on a map is usually considered good if... more
The purpose of data integration is to integrate multiple sources of heterogeneous data available on the internet, such as text, images, and video. After this stage, the data becomes large. Therefore, it is necessary to analyze the data... more
Where the streets have no name is probably the preferred place for a volunteer OpenStreetMapper. Launched in 2004, the OpenStreetMap project aimed to share geographical data based on volunteer mapping and led to the collection of... more
The modern planning and management of urban spaces is an essential topic for smart cities and depends on up-to-date and reliable information on land use and the functional roles of the places that integrate urban areas. In the last few... more
Duplicate records are a known problem within datasets, especially within databases of huge volume. The accuracy of duplicate detection determines the efficiency of the duplicate removal process. Unfortunately, the effort to detect... more
Entity resolution, the process of determining if two or more references correspond to the same entity, is an emerging area of study in computer science. While entity resolution models leverage artificial intelligence, machine learning,... more
With volunteered geographic information (VGI) platforms such as OpenStreetMap (OSM) becoming increasingly popular, we are faced with the challenge of assessing the quality of their content, in order to better understand its place relative... more
The assessment of the quality and accuracy of Volunteered Geographic Information (VGI) contributions, and by extension the ultimate utility of VGI data has fostered much debate within the geographic community. The limited research to date... more
This article describes the stages defined at the Agencia de Recaudación de la Provincia de Buenos Aires to achieve Data Governance over the information of taxpayers of the Province of Buenos Aires. The project... more
The data management process is characterised by a set of tasks where data quality management (DQM) is one of the core components. Data quality, however, is a multidimensional concept, where the nature of the data quality issues is very... more
In many countries, geospatial data are typically provided by public institutions. Cities have been mapped using such public data. On the other hand, the demand for geospatial data has been diversifying, given the requirements for mapping... more
In the past decade, Volunteered Geographic Information (VGI) has emerged as a new source of geographic information, making it a cheap and universal competitor to existing authoritative data sources. The growing popularity of VGI... more
Accurate and efficient entity resolution (ER) is a significant challenge in many data mining and analysis projects that require integrating and processing massive data collections. It is becoming increasingly important in real-world... more