
Data Matching

55 papers
33 followers
About this topic
Data matching is the process of identifying and linking records from different datasets that refer to the same entity, ensuring data consistency and accuracy. It involves techniques for comparing, merging, and reconciling data to eliminate duplicates and enhance data quality for analysis and decision-making.
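
To make the definition concrete, here is a minimal Python sketch of pairwise record matching on a name field, using only the standard library. The datasets, field names, and the 0.85 threshold are illustrative assumptions, not taken from any paper listed below.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Normalised string similarity in [0, 1], via stdlib SequenceMatcher."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

# Hypothetical records from two sources; field names are illustrative.
source_a = [{"id": 1, "name": "Jon Smith"}, {"id": 2, "name": "Ana Diaz"}]
source_b = [{"id": "x", "name": "John Smith"}, {"id": "y", "name": "Bob Lee"}]

THRESHOLD = 0.85  # illustrative cut-off for declaring a match
matches = [
    (ra["id"], rb["id"], round(similarity(ra["name"], rb["name"]), 2))
    for ra in source_a
    for rb in source_b
    if similarity(ra["name"], rb["name"]) >= THRESHOLD
]
print(matches)  # [(1, 'x', 0.95)]
```

Comparing every record of one source against every record of the other scales quadratically, which is exactly the bottleneck the indexing and blocking work under the first research theme below addresses.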

Key research themes

1. How can scalable and efficient algorithms address large-scale multilingual record linkage and load balancing?

This research area investigates methods to improve the scalability and efficiency of record linkage processes, especially in contexts involving large datasets with records in multiple languages. It focuses on algorithmic solutions that balance computational loads while maintaining high accuracy in matching records with language and script variations.

Key finding: Introduced a scalable, cost-aware load balancing technique over MapReduce for linking multilingual data sources without transliteration, outperforming state-of-the-art blocking-based load balancing methods in execution time... Read more
Key finding: Provided empirical comparative analysis of 17 linkage algorithms over large real-world datasets (100,000-200,000 records), highlighting the computational and memory trade-offs between exact and inexact string matching methods... Read more
Key finding: Surveyed twelve indexing techniques aimed at reducing record pair comparisons to boost scalability while retaining linkage quality. The work analyzed their computational complexity and performance on real and synthetic data,... Read more
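
To illustrate the indexing idea from the survey above: standard blocking assigns each record a cheap key and compares only records that share a key, avoiding the full quadratic comparison space. A minimal sketch, in which the blocking key (surname initial plus postcode) is an illustrative assumption rather than a technique from a specific surveyed paper:

```python
from collections import defaultdict
from itertools import combinations

def blocking_key(record: dict) -> str:
    # Illustrative key: first letter of surname + postcode.
    return record["surname"][:1].upper() + record["postcode"]

records = [
    {"id": 1, "surname": "Smith", "postcode": "2000"},
    {"id": 2, "surname": "Smyth", "postcode": "2000"},
    {"id": 3, "surname": "Jones", "postcode": "3000"},
    {"id": 4, "surname": "Smith", "postcode": "2000"},
]

# Index records into blocks, then generate candidate pairs per block only.
blocks = defaultdict(list)
for rec in records:
    blocks[blocking_key(rec)].append(rec)

candidate_pairs = [
    (a["id"], b["id"])
    for block in blocks.values()
    for a, b in combinations(block, 2)
]
print(candidate_pairs)  # [(1, 2), (1, 4), (2, 4)] rather than all 6 pairs
```

Here the index cuts six possible pairs down to three; at the 100,000-record scale of the comparative study above, reductions of this kind are what keep linkage tractable.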

2. What frameworks and methodologies improve data deduplication and entity integration across heterogeneous data sources?

This theme investigates conceptual frameworks, practical tools, and methodologies for deduplication and entity resolution across multiple heterogeneous data sources. It emphasizes methods combining blocking, record linkage, and human-in-the-loop strategies to improve data quality in domains with complex, diverse inputs and the integration challenges associated with large-scale or domain-specific datasets.

Key finding: Proposed a six-step deduplication framework integrated into the DataCleaner tool, combining record linkage methods with blocking and sorted neighborhood techniques to efficiently identify duplicates in large, heterogeneous... Read more
Key finding: Presented a novel data integration approach utilizing graph-based techniques to reduce search space and group similar records for deduplication across heterogeneous sources including databases, CSV files, and web services.... Read more
Key finding: Outlined a conceptual framework automating entity integration from multiple data sources, addressing key issues such as ordering of resolution for datasets with different schemas, performance optimization via feature... Read more
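
One of the techniques the deduplication framework above combines with blocking is the sorted neighbourhood method: sort the records by a key, slide a fixed-size window over the sorted list, and compare only records that fall inside the same window. A minimal sketch, where the sorting key and window size are illustrative assumptions rather than the framework's actual defaults:

```python
def sorted_neighbourhood_pairs(records, key, window=3):
    """Candidate pairs from a sliding window over the key-sorted records."""
    ordered = sorted(records, key=key)
    pairs = set()
    for i in range(len(ordered)):
        for j in range(i + 1, min(i + window, len(ordered))):
            pairs.add((ordered[i]["id"], ordered[j]["id"]))
    return pairs

records = [
    {"id": 1, "name": "smith, john"},
    {"id": 2, "name": "smyth, jon"},
    {"id": 3, "name": "adams, amy"},
    {"id": 4, "name": "smith, jon"},
]
# Near-duplicates sort close together, so a small window catches them.
print(sorted_neighbourhood_pairs(records, key=lambda r: r["name"]))
```

Because the window bounds each record's comparisons, the candidate set grows roughly as window × n instead of n².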

3. How can privacy-preserving methods enable secure and ethical record linkage of sensitive genomic and clinical datasets?

This theme focuses on the ethical, legal, and technological challenges in linking sensitive datasets such as genomic and clinical records, with the dual goals of enabling data-driven health research and preserving participants' privacy. It explores privacy-preserving record linkage (PPRL) approaches that allow record matching without direct identity disclosure, and policy frameworks for responsible data sharing.

Key finding: Provided a comprehensive overview of ethical, legal, and technological challenges in PPRL, emphasizing the need for linkage methods that enable data integration without disclosing individual identities except under legally... Read more
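
A common PPRL building block (one option among those discussed in such overviews, not necessarily the surveyed approach) encodes quasi-identifiers as Bloom filters of character bigrams, letting two parties estimate name similarity from the encoded bit positions without exchanging plaintext values. A minimal sketch; the filter length, hash count, and padding scheme are illustrative assumptions:

```python
import hashlib

BITS, HASHES = 64, 2  # toy parameters; production filters are much larger

def bigrams(value: str) -> set:
    padded = f"_{value.lower()}_"  # pad so edge characters appear in bigrams
    return {padded[i:i + 2] for i in range(len(padded) - 1)}

def bloom_positions(value: str) -> set:
    """Bit positions set by double-hashing each bigram of the value."""
    positions = set()
    for gram in bigrams(value):
        digest = hashlib.sha256(gram.encode()).digest()
        h1, h2 = digest[0], digest[1]
        positions.update((h1 + k * h2) % BITS for k in range(HASHES))
    return positions

def dice(a: set, b: set) -> float:
    """Dice coefficient of two encodings; 1.0 means identical bit patterns."""
    return 2 * len(a & b) / (len(a) + len(b)) if (a or b) else 0.0

# Each party encodes locally and shares only the encodings.
print(dice(bloom_positions("smith"), bloom_positions("smyth")))  # high
print(dice(bloom_positions("smith"), bloom_positions("jones")))  # low
```

In a real deployment the parties would agree on keyed hashes (e.g. HMAC with a shared secret) so a third party cannot mount a dictionary attack on the encodings.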

All papers in Data Matching

Continuously updating authoritative spatial databases is a highly demanding task, both technically and financially. At the same time, alternative modalities to collect content, in particular spatial content, have achieved a certain... more
This paper seeks to identify and address issues that may arise with the use and analysis of the linked Longitudinal Surveys of Australian Youth (LSAY) and the National Assessment Program — Literacy and Numeracy (NAPLAN) data. The study... more
This publication presents research highlights from the past 25 years of the LSAY program, with a focus on schooling, VET in schools programs, the influences of socioeconomic status and demographics on later opportunities, and pathways... more
NCVER is an independent body responsible for collecting, managing and analysing, evaluating and communicating research and statistics about vocational education and training (VET). NCVER's in-house research and evaluation program... more
In our country, electoral information, that is, information relating to voters, the procedures that modify their data, electoral events, political parties and their authorities, etc., is administered by the Ju... more
The gathering and processing of large amounts of data increases every day. Record linkage is one of the most complex data-intensive tasks; it is used to accurately match records from different data sources that contain information about... more
The very fast development of web and data collection technologies has enabled non-experts to collect and disseminate geospatial datasets through web applications. This new type of spatial data is usually known as collaborative mapping... more
Importing open spatial data into the OpenStreetMap (OSM) project is a practice that has existed since the beginning of the project. The rapid development and multiplication of collaborative mapping tools and open data have led to the growth of... more
OpenStreetMap (OSM) is the most successful example of Volunteered Geographic Information (VGI). It is also the most frequently used case study in research that focuses on VGI quality, as it is usually considered a proxy for other VGI... more
Recommender systems (RS), as supportive tools, filter information from a massive amount of data based on specified preferences. Most RS require information about the users' context, such as their location. In such cases,... more
While the benefits of big data are numerous, most of the collected data is of poor quality and therefore cannot be effectively used as it is. One of the leading big data quality challenges in pre-processing is data duplication. Indeed, the gathered... more
Record de-duplication is an important part of the data cleaning process of a data warehouse. The identification of multiple duplicate entries of a single entity in a data warehouse is known as de-duplication. A lot of research has been carried out... more
Merging databases from different data sources is one of the important tasks in the data integration process. This study will integrate lecturer data from data sources in the application of academic information systems and research... more
Volunteered Geographic Information (VGI) phenomena offer an alternative or supplement to the authoritative mechanism of geospatial data acquisition. It aims to allow people without professional geospatial skills or knowledge to... more
Record linkage is an important process in data quality work, used for combining, matching, and removing duplicates across two or more databases that refer to the same entities. Deduplication is the process of removing duplicate records... more
Deduplication is the process of identifying all items of information within a data set that refer to the same real-world entity. The data gathered from various sources may have data quality issues in it. The concept to... more
Record linkage refers to uniting information from multiple computerized files for a common purpose. The basic methods compare name and address information across pairs of files to determine those pairs of records that are... more
Traditionally, national mapping agencies produced datasets and map products for a low number of specified and internally consistent scales, i.e. at a common level of detail (LoD). With the advent of projects like OpenStreetMap, data users... more
Whereas it was possible to define the level of detail (LoD) of authoritative datasets, it is not possible for Volunteered Geographic Information (VGI), often characterised by heterogeneous levels of details. This heterogeneity is a curb... more
The concept of Volunteered Geographic Information (VGI) has recently emerged from the new Web 2.0 technologies. The OpenStreetMap project is currently the most significant example of a system based on VGI. It aims at producing free vector... more
The study of day-of-the-week anomalies in the cryptocurrency market complements a vast literature that... more
Tax administrations currently hold a great volume of data. These data implicitly contain knowledge that can be extracted; this knowledge will depend on the quality of the data, and within that quantity of data not... more
Microscopic traffic flow simulations, as tools enabling detailed insights into traffic efficiency and safety, gained considerable popularity among transportation researchers, planners and engineers in the first two decades of the 21st century.... more
Record linkage is the problem of identifying similar records across different data sources. The similarity between two records is defined based on domain-specific similarity functions over several attributes. De-duplicating one data set... more
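
As that abstract notes, pair similarity is typically a combination of domain-specific similarity functions over several attributes. A minimal sketch of one such weighted combination; the attributes, functions, and weights are illustrative assumptions, not the paper's actual configuration:

```python
from difflib import SequenceMatcher

def string_sim(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def year_sim(a: int, b: int) -> float:
    return max(0.0, 1.0 - abs(a - b) / 10)  # linear decay over a decade

# Illustrative weights; a real system tunes these per domain.
WEIGHTS = {"name": 0.6, "city": 0.2, "birth_year": 0.2}

def record_similarity(r1: dict, r2: dict) -> float:
    """Weighted sum of per-attribute similarities, in [0, 1]."""
    return (WEIGHTS["name"] * string_sim(r1["name"], r2["name"])
            + WEIGHTS["city"] * string_sim(r1["city"], r2["city"])
            + WEIGHTS["birth_year"] * year_sim(r1["birth_year"], r2["birth_year"]))

a = {"name": "Maria Silva", "city": "Lisbon", "birth_year": 1980}
b = {"name": "Maria S. Silva", "city": "Lisboa", "birth_year": 1981}
print(round(record_similarity(a, b), 2))  # a high score suggests the same entity
```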
The assessment of data quality from different sources can be considered as a key challenge in supporting effective geospatial data integration and promoting collaboration in mapping projects. This paper presents a methodology for... more
Points of interest (POIs) describe geographic entities that users focus on, such as a school. The different types of POIs are represented by cartographic symbols. A POI's positional accuracy on a map is usually considered good if... more
The purpose of data integration is to integrate multiple sources of heterogeneous data available on the internet, such as text, images, and video. After this stage, the data becomes large. Therefore, it is necessary to analyze the data... more
Where the streets have no name is probably the preferred place for a volunteer OpenStreetMapper. Launched in 2004, the OpenStreetMap project aimed to share geographical data based on volunteer mapping and led to the collection of... more
The modern planning and management of urban spaces is an essential topic for smart cities and depends on up-to-date and reliable information on land use and the functional roles of the places that integrate urban areas. In the last few... more
Duplicate records are a known problem within datasets, especially within databases of huge volume. The accuracy of duplicate detection determines the efficiency of the duplicate removal process. Unfortunately, the effort to detect... more
Entity resolution, the process of determining if two or more references correspond to the same entity, is an emerging area of study in computer science. While entity resolution models leverage artificial intelligence, machine learning,... more
With volunteered geographic information (VGI) platforms such as OpenStreetMap (OSM) becoming increasingly popular, we are faced with the challenge of assessing the quality of their content, in order to better understand its place relative... more
The assessment of the quality and accuracy of Volunteered Geographic Information (VGI) contributions, and by extension the ultimate utility of VGI data has fostered much debate within the geographic community. The limited research to date... more
This article describes the stages defined at the Agencia de Recaudación de la Provincia de Buenos Aires to achieve Data Governance over the information of taxpayers of the Province of Buenos Aires. The project... more
The data management process is characterised by a set of tasks where data quality management (DQM) is one of the core components. Data quality, however, is a multidimensional concept, where the nature of the data quality issues is very... more
In many countries, geospatial data are typically provided by public institutions. Cities have been mapped using such public data. On the other hand, the demand for geospatial data has been diversifying, given the requirements for mapping... more
In the past decade, Volunteered Geographic Information (VGI) has emerged as a new source of geographic information, making it a cheap and universal competitor to existing authoritative data sources. The growing popularity of VGI... more
Accurate and efficient entity resolution (ER) is a significant challenge in many data mining and analysis projects that require integrating and processing massive data collections. It is becoming increasingly important in real-world... more