Academia.edu

Synthetic Data Generation

6,448 papers
434 followers
About this topic
Synthetic data generation is the process of creating artificial data that mimics real-world data characteristics, often using algorithms and statistical models. This technique is employed to enhance data privacy, facilitate machine learning model training, and enable testing in scenarios where real data is scarce, sensitive, or difficult to obtain.

Key research themes

1. How can Generative Adversarial Networks (GANs) ensure privacy-preserving synthetic tabular data generation with utility retention?

This theme investigates the application of GAN architectures tailored for tabular data synthesis that balances privacy preservation—preventing re-identification, attribute disclosure, and membership inference attacks—with maintaining data utility, particularly model compatibility for downstream machine learning tasks. It matters because tabular data is ubiquitous and often sensitive, requiring synthetic versions that do not compromise privacy yet allow model training comparable to original data.

Key finding: Introduces table-GAN, a GAN-based approach incorporating a generator, a discriminator, and an additional classifier network to generate synthetic tables with categorical, discrete, and continuous features. Demonstrates... Read more
Key finding: Comprehensively reviews GAN-based synthetic data generation methods with particular focus on tabular data synthesis. Highlights the evolution and adaptations of GAN architectures to handle tabular data challenges such as... Read more
Key finding: Evaluates utility measures for synthetic tabular data, focusing on statistical similarity and classification performance impact. Demonstrates that synthetic data generated by GANs and other generative models can augment... Read more

2. What are effective strategies to generate high-quality synthetic data for domain-specific applications with limited real data?

This theme explores synthetic data generation techniques tailored to particular domains like healthcare, social networks, fraud detection, and specialized imaging systems where real data are scarce, confidential, or costly to obtain. The focus is on methodologies that capture domain-specific characteristics and structure, providing realistic, privacy-preserving datasets suitable for system training, testing, and validation. This is crucial for advancing AI applications in domains constrained by data availability.

Key finding: Presents a multi-method approach integrating resampling, probabilistic graphical models, latent variable identification, and outlier analysis to generate synthetic primary care patient data from CPRD, preserving complex... Read more
Key finding: Develops a configurable stochastic modeling system to populate social network graph topologies with synthetic data reflecting realistic distributions of user attributes and community structures. Empirically confirms that... Read more
Key finding: Proposes a synthetic data generation approach that preserves important statistical properties by using authentic normal and fraud data as seeds to create realistic user and attacker behavior profiles. Demonstrates synthetic... Read more
Key finding: Introduces S3Simulator, a novel multi-stage pipeline combining AI-based segmentation (SAM), selfCAD 3D modeling, Gazebo simulation, and computational imaging techniques to generate photorealistic synthetic side-scan sonar... Read more

3. How can synthetic data generation techniques address small data and class imbalance problems in machine learning tasks?

This theme focuses on algorithmic strategies for producing synthetic data that augment scarce or imbalanced datasets to enhance machine learning model performance. It includes oversampling methods based on geometric approaches, GAN variants, and AI-driven text generation for minority class augmentation. The significance lies in enabling reliable model training where data is limited or skewed, such as small sample sizes, infrequent classes, or underrepresented emotional states.

Key finding: Proposes the Geometric Small Data Oversampling Technique (GSDOT), which generates synthetic data by constructing geometric regions around existing samples to create high-quality artificial instances. Demonstrates that... Read more
Key finding: Evaluates the use of a GPT-4-based few-shot prompting approach to generate synthetic German-language text data for augmenting minority emotion classes in imbalanced datasets. Shows significant improvements in classification... Read more
Key finding: Introduces a novel neural gas network-based algorithm to synthesize body motion data for emotion recognition, leveraging topological learning of skeletal joint structures. Compares favorably to GAN and VAE-based methods by... Read more
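
As a rough illustration of the oversampling idea behind this theme, the sketch below creates synthetic minority-class points by interpolating between an existing sample and one of its nearest minority neighbours (the generic SMOTE-style scheme). It is not the GSDOT algorithm or any other method listed above; the function name `interpolate_oversample`, its parameters, and the toy data are all hypothetical and chosen only for illustration.

```python
import random


def interpolate_oversample(minority, n_new, k=3, seed=0):
    """Return n_new synthetic points built from a list of minority samples.

    Each synthetic point lies on the line segment between a randomly
    chosen sample and one of its k nearest minority-class neighbours.
    Assumes `minority` holds at least two points of equal dimension.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the remaining samples by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        # Pick a random position along the segment base -> neighbour.
        t = rng.random()
        synthetic.append(tuple(a + t * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


# Toy minority class with two features (made-up values).
minority = [(1.0, 2.0), (1.5, 1.8), (1.2, 2.4), (0.9, 2.1)]
new_points = interpolate_oversample(minority, n_new=5)
```

Because every new point is a convex combination of two real samples, the synthetic data stays inside the region already covered by the minority class. That is the main design trade-off of interpolation-based oversampling: it is simple and safe, but it cannot produce samples outside the observed range.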

All papers in Synthetic Data Generation

Cutting-edge analytical instrumentation is increasingly being developed and applied to the analysis of fossils. X-ray fluorescence (XRF) imaging spectroscopy is a powerful tool to resolve the elemental chemistry of fossil specimens. Most... more
A technique for estimating eddy diffusivities in a turbulent atmospheric layer is presented; the scheme adopted is based on an inverse-problem methodology. The inverse problem is formulated as a nonlinear constrained optimization problem,... more
In the present day digital world, it is imperative that all organizations and enterprises facilitate efficient processing of queries on XML data. XML queries typically specify patterns of selection predicates on multiple elements that... more
Algorithms for the inference of association with sequential information have been proposed and used but are ineffective, in some cases, because too many candidate rules are extracted. Filtering the relevant ones is usually difficult and... more
Completion of this thesis would not have been possible without the support and contribution of many people. It is a great honor for me to thank some of those many, to whom I owe my deepest gratitude. I would like to express my deepest... more
The Rover Environmental Monitoring Station (REMS) on the Mars Science Laboratory (MSL) offers the opportunity to explore the near surface atmospheric boundary layer (ABL) over an extended region of the Martian surface. The atmospheric... more
We present an alternative, more adaptable approach that increases client utility by satisfying all clients, while minimizing the use of system resources. We examine the benefits of this latter approach and create a... more
Low-dimensional (2- or 3-dimensional) visual representations of large, high-dimensional datasets with complicated cluster structures play a fundamental role in the discovery and identification of such structures. Visualization exploits the... more
An important piece of the ACARE (Advisory Council for Aeronautics in Europe) plan has been put in place early in 2005: the FLYSAFE Project (http://www.eu-flysafe.org/). FLYSAFE aims at defining and testing new tools and systems... more
The expansion of CubeSat platforms increases the need to simplify payloads and to optimize downlink data capabilities. A promising solution is to enhance on-board software, in order to take early decisions automatically. However, the most... more
In this paper, a new iterative image processing algorithm is introduced and denoted as "iterative cellular image processing algorithm" (ICIPA). The new unsupervised iterative algorithm uses the advantage of stochastic properties and... more
We show that the problem of constructing a perfect matching in a graph is in the complexity class Random NC; i.e., the problem is solvable in polylog time by a randomized parallel algorithm using a polynomial-bounded number of... more
This article addresses the problem of coding multi-view video sequences and presents a new method for estimating the disparity and motion fields involved in the sequence. In order to reduce the complexity and the computational cost and... more
A novel variable influence on projection approach for O2PLS models, named VIP_O2PLS, is presented in this paper. VIP_O2PLS is a model-based method for judging the importance of variables. Its cornerstone is the 2-way formalism of the... more
Euler deconvolution is a commonly employed magnetic interpretation method because it requires only a little a priori knowledge about the magnetic source geometry, and, more importantly, because it requires no information about the... more
We have developed a new approach for estimating the location and geometry of several density anomalies that give rise to a complex, interfering gravity field. The user interactively defines the assumed outline of the true gravity sources... more
Prototype databases are needed in any information system development process to support data-intensive applications development. It is common practice to populate these databases using synthetic data. This data usually bears little... more
Automated feature engineering (AutoFE) has become a cornerstone of efficient machine learning (ML), yet its potential to perpetuate or amplify bias remains underexplored. This paper proposes a fairness-aware framework for feature... more
We study recovery conditions of weighted $\ell_1$ minimization for signal reconstruction from compressed sensing measurements when partial support information is available. We show that if at least 50% of the (partial) support information is... more
Towed streamer electromagnetic (TSEM) survey is an efficient data acquisition technique capable of collecting a large volume of electromagnetic (EM) data over extensive areas rapidly and economically. The TSEM survey is capable of... more
This paper describes an application, called Medici, designed to produce synthetic data for social network graphs, which can be used for analysis, hypothesis testing and application development by researchers and practitioners in the... more
Two of the difficulties for data analysts of online social networks are (i) the public availability of data and (ii) respecting the privacy of the users. One possible solution to both of these problems is to use synthetically generated... more
In probabilistic mobile robotics, the development of measurement models plays a crucial role as it directly influences the efficiency and the robustness of the robot's performance in a great variety of tasks including localization,... more
This paper describes the development and application of a system for quantitative near-bottom seismic profiling at 4 kHz. Developed as part of the Deep-Tow Instrumentation System of the Marine Physical Laboratory, this system represents a... more
This data paper presents lightcurves of 101 near Earth asteroids (NEAs) observed mostly between 2014 and 2017 as part of the EURONEAR photometric survey using 11 telescopes with diameters between 0.4 and 4.2 m located in Spain, Chile,... more
The potentials around a finite cylindrical electrode can be obtained by dividing the electrodes into rings of equal thickness and substituting an infinitely thin current ring for each of the slices. The field of an infinitely thin ring... more
The liberation distribution of ore samples is of considerable interest for process optimisation in the minerals industry. A scanning electron microscope-based automatic mineral analyser such as the LEO QEMSCAN system developed by CSIRO... more
Cortical events of correlated neuronal firing are thought to underlie sensorimotor and associative functions. It is believed that events are attractors of cortical dynamics, pulled by strong mutual connections between recurrently active... more
This paper examines the impact of a mesoscale analysis (2.5 km grid distance) on the simulation of the meso-gamma scale aspects of föhn in the Rhine Valley. The föhn event, documented during IOP 15 (5 November 1999) of the Mesoscale... more
The description that one can have of the seismic source is the manifestation of an imagined model, obviously outlined from physical theories and supported by mathematical methods. In that context, the modelling of earthquake rupture... more
We investigate the ability of a mesoscale model to reconstruct CO2 fluxes at regional scale. Formally, we estimate the reduction of error for a CO2 flux inversion at 8 km resolution in the South West of France, during four days of the... more
We consider the problem of computing a minimum-weight polygonal path between two points in a weighted polygonal subdivision, subject to the constraint that the path have few segments (links). We give an algorithm that generates paths of... more
Mushrooms are edible fungi that are among the most nutrient-rich foods on the planet. They have major medical advantages, such as killing cancer cells. This study aims to find the most appropriate technique for mushroom... more
Synthetic datasets are beneficial for machine learning researchers due to the possibility of experimenting with new strategies and algorithms in the training and testing phases. These datasets can easily include more scenarios that might... more
Abstract: In this report, we consider a pair of stereo images for which only the epipolar geometry is known, represented by the fundamental matrix of the camera system. We show that it is possible to... more
Data on businesses collected by statistical agencies are challenging to protect. Many businesses have unique characteristics, and distributions of employment, sales, and profits are highly skewed. Attackers wishing to conduct... more
We present a discriminative learning framework for Gaussian mixture models (GMMs) used for classification based on the extended Baum-Welch (EBW) algorithm. We suggest two criteria for discriminative optimization, namely the class... more
This supplement includes all derivations and the pseudocode for learning the maximum margin hidden Markov model for sequence classification. It uses the extended Baum-Welch framework for optimization.
Engineering Systems for Knowledge and Inference, an emerging focus area for Engineering as well as for the country at large, encompasses a wide variety of technologies. The goal is to generate new understanding or knowledge of situations,... more
Using multiyear satellite rainfall estimates, the distributions of the area and the total rain rate of rain clusters over the equatorial Indian, Pacific, and Atlantic Oceans were found to exhibit a power law $f_S(s) \sim s^{-\zeta_S}$, in which... more
Starting from the definition of Azimuth Moveout (AMO) as the cascade of DMO and inverse DMO at different offsets and azimuths, we derive an amplitude-preserving function for the AMO operator. This amplitude function is based on the... more
In the context of deep geological disposal of high level radioactive wastes, the French National Radioactive Waste Management Agency (Andra) has conducted an extensive characterization of the Callovo-Oxfordian argillaceous rock and... more
Acoustic sonar imaging systems are widely used for underwater surveillance in both civilian and military sectors. However, acquiring high-quality sonar datasets for training Artificial Intelligence (AI) models confronts challenges such as... more
In this paper, we aim to identify the minimal subset of discrete random variables that is relevant for probabilistic classification in data sets with many variables but few instances. A principled solution to this problem is to determine... more