Synthetic Data Generation

6,448 papers

434 followers

About this topic

Synthetic data generation is the process of creating artificial data that mimics real-world data characteristics, often using algorithms and statistical models. This technique is employed to enhance data privacy, facilitate machine learning model training, and enable testing in scenarios where real data is scarce, sensitive, or difficult to obtain.
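As a rough illustration of the statistical-model route mentioned above, the sketch below fits an independent Gaussian to each numeric column of a small table and samples new rows from it. The table values, function names, and the independence assumption are all hypothetical choices made for this example; real generators typically also model correlations between columns and handle categorical features.

```python
import random
import statistics

def fit_gaussian_columns(rows):
    """Estimate (mean, standard deviation) for each numeric column."""
    columns = list(zip(*rows))
    return [(statistics.mean(c), statistics.stdev(c)) for c in columns]

def sample_synthetic(params, n, seed=0):
    """Draw n synthetic rows, sampling each column independently."""
    rng = random.Random(seed)
    return [[rng.gauss(mu, sigma) for mu, sigma in params] for _ in range(n)]

# Hypothetical "real" table: [age, income]. Values are made up.
real_rows = [
    [34.0, 52000.0],
    [45.0, 61000.0],
    [29.0, 43000.0],
    [52.0, 75000.0],
    [38.0, 58000.0],
]

params = fit_gaussian_columns(real_rows)
synthetic_rows = sample_synthetic(params, n=1000)
```

Because each column is sampled on its own, the synthetic rows reproduce per-column means and spreads but not the relationship between age and income; that gap is one reason more expressive models are studied.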

Key research themes

1. How can Generative Adversarial Networks (GANs) ensure privacy-preserving synthetic tabular data generation with utility retention?

This theme investigates GAN architectures tailored to tabular data synthesis that balance privacy preservation (preventing re-identification, attribute disclosure, and membership inference attacks) with data utility, particularly model compatibility for downstream machine learning tasks. It matters because tabular data is ubiquitous and often sensitive, requiring synthetic versions that do not compromise privacy yet allow model training comparable to training on the original data.

Data Synthesis based on Generative Adversarial Networks

by Mahmoud Mohammadi and

2018

Key finding: Introduces table-GAN, a GAN-based approach that combines generator, discriminator, and additional classifier networks to generate synthetic tables with categorical, discrete, and continuous features. Demonstrates... Read more

Survey on Synthetic Data Generation, Evaluation Methods and GANs

by Álvaro Figueira, PhD

2023, Mathematics

Key finding: Comprehensively reviews GAN-based synthetic data generation methods with particular focus on tabular data synthesis. Highlights the evolution and adaptations of GAN architectures to handle tabular data challenges such as... Read more

On the Quality of Synthetic Generated Tabular Data

by Álvaro Figueira, PhD

2023, Mathematics

Key finding: Evaluates utility measures for synthetic tabular data, focusing on statistical similarity and classification performance impact. Demonstrates that synthetic data generated by GANs and other generative models can augment... Read more

2. What are effective strategies to generate high-quality synthetic data for domain-specific applications with limited real data?

This theme explores synthetic data generation techniques tailored to particular domains like healthcare, social networks, fraud detection, and specialized imaging systems where real data are scarce, confidential, or costly to obtain. The focus is on methodologies that capture domain-specific characteristics and structure, providing realistic, privacy-preserving datasets suitable for system training, testing, and validation. This is crucial for advancing AI applications in domains constrained by data availability.

Generating high-fidelity synthetic patient data for assessing machine learning healthcare software

by Zhenchen Wang

2021, npj Digital Medicine

Key finding: Presents a multi-method approach integrating resampling, probabilistic graphical models, latent variable identification, and outlier analysis to generate synthetic primary care patient data from CPRD, preserving complex... Read more

A Synthetic Data Generator for Online Social Network Graphs

by David Nettleton

2016

Key finding: Develops a configurable stochastic modeling system to populate social network graph topologies with synthetic data reflecting realistic distributions of user attributes and community structures. Empirically confirms that... Read more

Synthesizing test data for fraud detection systems

by Håkan Kvarnström

2025, 19th Annual Computer Security Applications Conference, 2003. Proceedings.

Key finding: Proposes a synthetic data generation approach that preserves important statistical properties by using authentic normal and fraud data as seeds to create realistic user and attacker behavior profiles. Demonstrates synthetic... Read more

S3Simulator: A Benchmarking Side Scan Sonar Simulator Dataset for Underwater Image Analysis

by Kamal Basha S and

2025, Springer, Cham

Key finding: Introduces S3Simulator, a novel multi-stage pipeline combining AI-based segmentation (SAM), selfCAD 3D modeling, Gazebo simulation, and computational imaging techniques to generate photorealistic synthetic side-scan sonar... Read more

3. How can synthetic data generation techniques address small data and class imbalance problems in machine learning tasks?

This theme focuses on algorithmic strategies for producing synthetic data that augment scarce or imbalanced datasets to enhance machine learning model performance. It includes oversampling methods based on geometric approaches, GAN variants, and AI-driven text generation for minority class augmentation. The significance lies in enabling reliable model training where data is limited or skewed, such as small sample sizes, infrequent classes, or underrepresented emotional states.
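To make the oversampling idea concrete, here is a minimal sketch of interpolation-based minority-class oversampling, loosely in the spirit of SMOTE. It is not GSDOT or any method from the papers listed here; the function name, the random pairing of samples (rather than nearest-neighbor selection), and the toy data are assumptions for illustration only.

```python
import random

def oversample_by_interpolation(minority, n_new, seed=0):
    """Create n_new points, each on the segment between two random minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)   # two distinct existing samples
        t = rng.random()                 # interpolation weight in [0, 1)
        synthetic.append([x + t * (y - x) for x, y in zip(a, b)])
    return synthetic

# Hypothetical minority-class feature vectors (made-up values).
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.5], [1.2, 2.2]]
new_points = oversample_by_interpolation(minority, n_new=6)
```

Each generated point lies inside the bounding box of the existing minority samples, which keeps the augmentation conservative; geometric methods differ mainly in how they define the region around each sample from which new points are drawn.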

Improving the quality of predictive models in small data GSDOT: A new algorithm for generating synthetic data

by Fernando Bacao

2022, PLoS ONE

Key finding: Proposes the Geometric Small Data Oversampling Technique (GSDOT), which generates synthetic data by constructing geometric regions around existing samples to create high-quality artificial instances. Demonstrates that... Read more

Evaluating the Impact of Synthetic Data on Emotion Classification: A Linguistic and Structural Analysis

by Üveges István

2025, Information

Key finding: Evaluates the use of a GPT-4-based few-shot prompting approach to generate synthetic German-language text data for augmenting minority emotion classes in imbalanced datasets. Shows significant improvements in classification... Read more

Synthetic Data Generation of Body Motion Data by Neural Gas Network for Emotion Recognition

by Seyed Muhammad Hossein Mousavi

2025

Key finding: Introduces a novel neural gas network-based algorithm to synthesize body motion data for emotion recognition, leveraging topological learning of skeletal joint structures. Compares favorably to GAN and VAE-based methods by... Read more

All papers in Synthetic Data Generation

Application of mobile-macroscale scanning X-ray fluorescence (mobile-MA-XRF) imaging in paleontology: analyses of vertebrate fossil specimens from Messel conserved in different solid and liquid media

by Marco Colombo

2025, Journal of Analytical Atomic Spectrometry

Cutting-edge analytical instrumentation is increasingly being developed and applied to the analysis of fossils. X-ray fluorescence (XRF) imaging spectroscopy is a powerful tool to resolve the elemental chemistry of fossil specimens. Most... more

An automatic methodology for estimating eddy diffusivities from experimental data

by Fernando Carvalho Ramos

2025

A technique for estimating eddy diffusivities in a turbulent atmospheric layer is presented; the scheme adopted is based on an inverse-problem methodology. The inverse problem is formulated as a nonlinear constrained optimization problem,... more

XML Tree Pattern Matching Algorithms

by DNSB Kavitha

2025

In the present day digital world, it is imperative that all organizations and enterprises facilitate efficient processing of queries on XML data. XML queries typically specify patterns of selection predicates on multiple elements that... more

Using context-free grammars to constrain apriori-based algorithms for mining temporal association rules

by Claudia Antunes

2025

Algorithms for the inference of association with sequential information have been proposed and used but are ineffective, in some cases, because too many candidate rules are extracted. Filtering the relevant ones is usually difficult and... more

Geostatistical data integration in complex reservoirs

by Morteza Naraghi

2025

Completion of this thesis would not have been possible without the support and contribution of many people. It is a great honor for me to thank some of those many, to whom I owe my deepest gratitude. I would like to express my deepest... more

Radiative Transfer and Error Analysis Methods for the MSL REMS Ground Temperature Sensors

by Alejandro Soto

2025, … held 2-7 May, 2010 in …

The Rover Environmental Monitoring Station (REMS) on the Mars Science Laboratory (MSL) offers the opportunity to explore the near surface atmospheric boundary layer (ABL) over an extended region of the Martian surface. The atmospheric... more

Targeted Delivery Model for RSS Feeds

by Ajay Kumar

2025

We present an alternative, more flexible approach that maximizes client utility by satisfying all clients while minimizing the use of system resources. We examine the benefits of this approach and create a... more

SOM-based topology visualization for interactive analysis of high-dimensional large datasets

by Erzsébet Merényi

2025

Low-dimensional (2- or 3-dimensional) visual representations of large, high-dimensional datasets with complicated cluster structures play a fundamental role in the discovery and identification of such structures. Visualization exploits the... more

Nowcasting thunderstorm hazards for flight operations: the Cb-WIMS approach in FLYSAFE

by Thomas Hauf

2025, Proc. 26th Int. Congress …

An important piece of the ACARE (Advisory Council for Aeronautics in Europe) plan has been put in place early in 2005: the FLYSAFE Project (http://www.eu-flysafe.org/). FLYSAFE aims at defining and testing new tools and systems... more

Providing Secure Access to Sensitive Data

by John M Abowd

2025, IASSIST

The prediction of lithospheric magnetic anomalies using the inversion of magnetisation data for vector spherical harmonics

by Kumar Hemant

2025

Proceedings of the Conference on Artificial Intelligence for Defence 2020

by Teddy Furon

2025, HAL (Le Centre pour la Communication Scientifique Directe)

Cubesats platforms expansion increases the need to simplify payloads and to optimize downlink data capabilities. A promising solution is to enhance on-board software, in order to take early decisions automatically. However, the most... more

Iterative Cellular Image Processing Algorithm

by Osman N. Ucan

2025

In this paper, a new iterative image processing algorithm is introduced and denoted as "iterative cellular image processing algorithm" (ICIPA). The new unsupervised iterative algorithm uses the advantage of stochastic properties and... more

Constructing a perfect matching is in random NC

by Avi Wigderson

2025, Combinatorica

We show that the problem of constructing a perfect matching in a graph is in the complexity class Random NC; i.e., the problem is solvable in polylog time by a randomized parallel algorithm using a polynomial-bounded number of... more

Estimation conjointe disparité-mouvement pour le codage de séquences vidéo multi-vues

by Wided MILED SOUID

2025

This paper addresses the problem of coding multi-view video sequences and presents a new method for estimating the disparity and motion fields involved in the sequence. In order to reduce the complexity and computational cost and... more

A new approach for variable influence on projection (VIP) in O2PLS models

by Beatriz Galindo-Prieto

2025

A novel variable influence on projection approach for O2PLS models, named VIP O2PLS, is presented in this paper. VIP O2PLS is a model-based method for judging the importance of variables. Its cornerstone is the 2-way formalism of the... more

Making Euler deconvolution applicable to small ground magnetic surveys

by Valeria Cristina Ferreira Barbosa

2025, Journal of Applied Geophysics

Euler deconvolution is a commonly employed magnetic interpretation method because it requires only a little a priori knowledge about the magnetic source geometry and, more importantly, because it requires no information about the magnetization vector. As a result, it may be successfully applied in areas where the geology is poorly known. However, it requires a priori knowledge about the nature of an equivalent source producing a magnetic anomaly with the same falloff rate as the observed anomaly. This is a crucial limitation of the method, requiring that a parameter known as the structural index (η) be determined. The customary application of the Euler method and the process of estimating η benefit from a large number of data and solutions, which prevents its application to ground magnetic surveys that may consist of a limited number of observations. We show that if the structural index is estimated by a new criterion, Euler deconvolution becomes a feasible technique to interpret anomalies defined by just a few observations. This new criterion is based on the correlation between the total-field anomaly h^o and the estimates of the base level b. These estimates are obtained for each position of the moving data window along the observed profile and for several tentative values for the structural index. However, unlike the customary method, instead of estimating the structural index as the tentative value producing the smallest solution dispersion, the best estimate of η is taken as the tentative value leading to the smallest correlation between h^o and the estimates of b. This criterion is deduced from Euler's equation, so it does not depend on the inclination and declination of the geomagnetic field.
The good results obtained with this new criterion in determining the correct structural index are illustrated in tests using synthetic data from different latitudes, and the feasibility of using this criterion in applying Euler deconvolution to ground surveys is illustrated with a real magnetic anomaly defined by just 12 observations and produced by a basic/ultrabasic body at the emerald deposit of Socoto, Bahia, Brazil. The results of Euler deconvolution, combined with the geological information that the basic/ultrabasic body outcrops, show that the intrusive body may be approximated by an outcropping horizontal cylinder with a diameter of 68 m and center at a depth of 34 m, which is consistent with the geologic knowledge of the deposit.

Interactive gravity inversion

by Valeria Cristina Ferreira Barbosa

2025, GEOPHYSICS

We have developed a new approach for estimating the location and geometry of several density anomalies that give rise to a complex, interfering gravity field. The user interactively defines the assumed outline of the true gravity sources... more

Building consistent sample databases to support information system evolution and migration

by DEIRDRE LAWLESS

2025

Prototype databases are needed in any information system development process to support data-intensive applications development. It is common practice to populate these databases using synthetic data. This data usually bears little... more

Automated Feature Engineering and Hidden Bias: A Framework for Fair Feature Transformation in Machine Learning Pipelines

by Rajani Kumari Vaddepalli

2025, ISCSITR- International Journal of Scientific Research in Artificial Intelligence and Machine Learning (ISCSITR-IJSRAIML)

Automated feature engineering (AutoFE) has become a cornerstone of efficient machine learning (ML), yet its potential to perpetuate or amplify bias remains underexplored. This paper proposes a fairness-aware framework for feature... more

Automated Feature Engineering and Hidden Bias: A Framework for Fair Feature Transformation in Machine Learning Pipelines

by IAEME AI

2025, International Journal of Scientific Research in Artificial Intelligence and Machine Learning (ISCSITR-IJSRAIML)

Recovering Compressively Sampled Signals Using Partial Support Information

by ÖZGÜR YILMAZ

2025, IEEE Transactions on Information Theory

We study recovery conditions of weighted ℓ1 minimization for signal reconstruction from compressed sensing measurements when partial support information is available. We show that if at least 50% of the (partial) support information is... more

Least Squares Migration of Synthetic Aperture Data for Towed Streamer Electromagnetic Survey

by Michael Zhdanov

2025, IEEE Geoscience and Remote Sensing Letters

Towed streamer electromagnetic (TSEM) survey is an efficient data acquisition technique capable of collecting a large volume of electromagnetic (EM) data over extensive areas rapidly and economically. The TSEM survey is capable of... more

Adaptive background estimation using an information theoretic cost for hidden state estimation

by Jose Principe

2025

MEDICI: A Simple to Use Synthetic Social Network Data Generator

by David Nettleton

2025, Modeling Decisions for Artificial Intelligence

This paper describes an application, called Medici, designed to produce synthetic data for social network graphs, which can be used for analysis, hypothesis testing and application development by researchers and practitioners in the... more

A synthetic data generator for online social network graphs

by David Nettleton

2025, Social Network Analysis and Mining

Two of the difficulties for data analysts of online social networks are (i) the public availability of data and (ii) respecting the privacy of the users. One possible solution to both of these problems is to use synthetically generated... more

Gaussian Beam Processes: A Nonparametric Bayesian Measurement Model for Range Finders

by Wolfram Burgard

2025, Robotics: Science and Systems III

In probabilistic mobile robotics, the development of measurement models plays a crucial role as it directly influences the efficiency and the robustness of the robot's performance in a great variety of tasks including localization,... more

Toward a quantitative near-bottom seismic profiler

by Robert Tyce

2025, The Journal of the Acoustical Society of America

This paper describes the development and application of a system for quantitative near-bottom seismic profiling at 4 kHz. Developed as part of the Deep-Tow Instrumentation System of the Marine Physical Laboratory, this system represents a... more

The EURONEAR Lightcurve Survey of Near Earth Asteroids

by Amadeo Aznar

2025, Earth, Moon, and Planets

This data paper presents lightcurves of 101 near Earth asteroids (NEAs) observed mostly between 2014 and 2017 as part of the EURONEAR photometric survey using 11 telescopes with diameters between 0.4 and 4.2 m located in Spain, Chile,... more

POTENTIAL DISTRIBUTION DUE TO A CYLINDRICAL ELECTRODE MOUNTED ON AN INSULATING PROBE

by Valery Fabrikant

2025

The potentials around a finite cylindrical electrode can be obtained by dividing the electrodes into rings of equal thickness and substituting an infinitely thin current ring for each of the slices. The field of an infinitely thin ring... more

Stereological Correction of Mineral Liberation Grade Distributions Estimated by Single Sectioning of Particles

by Steven Spencer

2025, Image Analysis & Stereology

The liberation distribution of ore samples is of considerable interest for process optimisation in the minerals industry. A scanning electron microscope-based automatic mineral analyser such as the LEO QEMSCAN system developed by CSIRO... more

Reproducible patterns of neural activity without attractors in cortical networks

by Domenico Guarino

2025, bioRxiv (Cold Spring Harbor Laboratory)

Cortical events of correlated neuronal firing are thought to underlie sensorimotor and associative functions. It is believed that events are attractors of cortical dynamics, pulled by strong mutual connections between recurrently active... more

Numerical simulation of meso-gamma scale features of föhn at ground level in the Rhine valley

by Mathieu Nuret

2025, Quarterly Journal of the Royal Meteorological Society

This paper examines the impact of a mesoscale analysis (2.5 km grid distance) on the simulation of the meso-gamma scale aspects of föhn in the Rhine Valley. The föhn event, documented during IOP 15 (5 November 1999) of the Mesoscale... more

On the determination of the earthquake slip distribution via linear programming techniques

by Vladimir Bushenkov

2025

The description that one can have of the seismic source is the manifestation of an imagined model, obviously outlined from physical theories and supported by mathematical methods. In that context, the modelling of earthquake rupture... more

Mesoscale inversion: first results from the CERES campaign with synthetic data

by F. Chevallier

2025, Atmospheric Chemistry and Physics

We investigate the ability of a mesoscale model to reconstruct CO2 fluxes at regional scale. Formally, we estimate the reduction of error for a CO2 flux inversion at 8 km resolution in the South West of France, during four days of the... more

An Experimental Study of Weighted k-Link Shortest Path Algorithms

by Simeon Ntafos

2025, Springer Tracts in Advanced Robotics

We consider the problem of computing a minimum-weight polygonal path between two points in a weighted polygonal subdivision, subject to the constraint that the path have few segments (links). We give an algorithm that generates paths of... more

Classification of Mushroom Fungi Using Machine Learning Techniques

by Mohammad Ottom

2025, International journal of advanced trends in computer science and engineering

Mushrooms are an edible type of fungus regarded as one of the most nutrient-rich foods on the planet. Mushrooms have major medical benefits, such as killing cancer cells. This study aims to find the most appropriate technique for mushroom... more

A Synthetic Dataset for 5G UAV Attacks Based on Observable Network Parameters

by Pedro Sebastião

2025, arXiv (Cornell University)

Synthetic datasets are beneficial for machine learning researchers due to the possibility of experimenting with new strategies and algorithms in the training and testing phases. These datasets can easily include more scenarios that might... more

Relative 3D positioning and 3D convex hull computation from a weakly calibrated stereo pair

by Olivier Faugeras

2025, Image and Vision Computing

Abstract: In this report, we consider a pair of stereo images for which only the epipolar geometry is known, represented by the fundamental matrix of the camera system. We show that it is possible to... more

Applying data synthesis for longitudinal business data across three countries

by Md. Jahangir Alam

2025, Statistics in Transition New Series

Data on businesses collected by statistical agencies are challenging to protect. Many businesses have unique characteristics, and distributions of employment, sales, and profits are highly skewed. Attackers wishing to conduct... more

Large Margin Learning of Bayesian Classifiers Based on Gaussian Mixture Models

by Franz Pernkopf

2025, Springer eBooks

We present a discriminative learning framework for Gaussian mixture models (GMMs) used for classification based on the extended Baum-Welch (EBW) algorithm. We suggest two criteria for discriminative optimization, namely the class... more

Maximum margin hidden Markov models for sequence classification

by Franz Pernkopf

2025, Pattern Recognition Letters

This supplement includes all derivations and the pseudocode for learning the maximum margin hidden Markov model for sequence classification. It uses the extended Baum-Welch framework for optimization.

FY05 Engineering Research and Technology Report

by James Stolken

2025

Engineering Systems for Knowledge and Inference, an emerging focus area for Engineering as well as for the country at large, encompasses a wide variety of technologies. The goal is to generate new understanding or knowledge of situations,... more

On the use of prosody in a speech-to-speech translator

by Guenther Goerz

2025, 5th European Conference on Speech Communication and Technology (Eurospeech 1997)

The universal scaling characteristics of tropical oceanic rain clusters

by Prof. Leslie Norford

2025, Journal of Geophysical Research: Atmospheres

Using multiyear satellite rainfall estimates, the distributions of the area and the total rain rate of rain clusters over the equatorial Indian, Pacific, and Atlantic Oceans were found to exhibit a power law $f_S(s) \sim s^{-\zeta_S}$, in which... more

Amplitude preserving AMO from true amplitude DMO and inverse DMO

by Nizar Chemingui

2025

Starting from the definition of Azimuth Moveout (AMO) as the cascade of DMO and inverse DMO at different offsets and azimuths, we derive an amplitude-preserving function for the AMO operator. This amplitude function is based on the... more

3D high resolution seismic model with depth: A relevant guide for Andra deep geological repository project

by Béatrice Yven

2025

In the context of deep geological disposal of high level radioactive wastes, the French National Radioactive Waste Management Agency (Andra) has conducted an extensive characterization of the Callovo-Oxfordian argillaceous rock and... more

S3Simulator: A Benchmarking Side Scan Sonar Simulator Dataset for Underwater Image Analysis

by Kamal Basha S and

2025, Springer, Cham

Acoustic sonar imaging systems are widely used for underwater surveillance in both civilian and military sectors. However, acquiring high-quality sonar datasets for training Artificial Intelligence (AI) models confronts challenges such as... more

A Novel Scalable and Data Efficient Feature Subset Selection Algorithm

by Sergio Rodrigues

2025, Lecture Notes in Computer Science

In this paper, we aim to identify the minimal subset of discrete random variables that is relevant for probabilistic classification in data sets with many variables but few instances. A principled solution to this problem is to determine... more
