Key research themes
1. How can automated program synthesis techniques improve data completion and extraction tasks in tabular data?
This theme focuses on methods that use programming-by-example (PBE) and predictive synthesis to fill in missing data automatically and to transform raw tabular inputs into usable forms. Automating data completion reduces manual effort, makes complex data wrangling accessible to non-experts, and improves data usability across domains such as spreadsheets, databases, and web data extraction.
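To make the PBE idea concrete, the following is a minimal sketch of enumerative synthesis over a toy string-transformation language. It is not the method of any particular system covered by this theme: the primitive set `PRIMITIVES` and the helper names `run`, `synthesize`, and `complete_column` are hypothetical and chosen only for illustration. The search returns the shortest pipeline of primitives that reproduces every filled-in example, and that pipeline is then applied to the rows whose output cell is empty.

```python
from itertools import product

# Hypothetical mini-DSL: every primitive maps a string to a string.
PRIMITIVES = {
    "lower": str.lower,
    "upper": str.upper,
    "strip": str.strip,
    "first_word": lambda s: s.split()[0] if s.split() else "",
    "last_word": lambda s: s.split()[-1] if s.split() else "",
    "initial": lambda s: s[:1],
}


def run(program, value):
    """Apply a pipeline (sequence of primitive names) from left to right."""
    for name in program:
        value = PRIMITIVES[name](value)
    return value


def synthesize(examples, max_depth=2):
    """Return the first shortest pipeline consistent with all (input, output)
    examples, or None if no pipeline of length <= max_depth fits."""
    for depth in range(1, max_depth + 1):
        for program in product(PRIMITIVES, repeat=depth):
            if all(run(program, src) == out for src, out in examples):
                return program
    return None


def complete_column(rows, program):
    """Fill rows whose output is None; keep user-provided outputs as they are."""
    return [(src, out if out is not None else run(program, src))
            for src, out in rows]


rows = [
    ("Ada Lovelace", "LOVELACE"),
    ("Alan Turing", "TURING"),
    ("Grace Hopper", None),  # cell to be completed
]
examples = [(src, out) for src, out in rows if out is not None]
program = synthesize(examples)
if program is not None:
    print(program, complete_column(rows, program))
```

Exhaustive enumeration only works for very small languages like this one. Several pipelines can fit the same examples, so a practical synthesizer also needs a way to rank the candidates.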
2. What are effective data preprocessing strategies to address real-world data quality issues in automatic data analysis?
Addressing real-world data quality problems, such as missing data, out-of-range values, inconsistencies, and incomplete records, is crucial for downstream analysis. This theme covers systematic preprocessing methodologies that integrate domain knowledge and iterative refinement, aiming to preserve valuable information while improving data integrity. Automated or semi-automated frameworks that detect data issues and recommend suitable cleaning techniques help streamline the preprocessing pipeline and improve analytical outcomes.
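As an illustration of the detect-then-recommend workflow described above, here is a minimal sketch that audits records against domain-supplied valid ranges and maps the findings to cleaning suggestions. All names (`is_missing`, `audit`, `recommend`), the `drop_threshold` parameter, the suggestion strings, and the sample records are assumptions made for this example, not part of any specific framework.

```python
import math


def is_missing(value):
    """Treat None, empty strings, and NaN as missing."""
    return (value is None or value == ""
            or (isinstance(value, float) and math.isnan(value)))


def audit(records, valid_ranges):
    """Count missing and out-of-range values per column.

    valid_ranges encodes domain knowledge as column -> (low, high).
    """
    report = {}
    for column, (low, high) in valid_ranges.items():
        values = [row.get(column) for row in records]
        report[column] = {
            "missing": sum(is_missing(v) for v in values),
            "out_of_range": sum(
                not is_missing(v) and not (low <= v <= high) for v in values
            ),
            "total": len(values),
        }
    return report


def recommend(report, drop_threshold=0.5):
    """Map issue counts to a (hypothetical) cleaning suggestion per column."""
    actions = {}
    for column, stats in report.items():
        rate = stats["missing"] / stats["total"] if stats["total"] else 0.0
        if rate > drop_threshold:
            actions[column] = "review or drop column"
        elif stats["missing"] or stats["out_of_range"]:
            actions[column] = "impute missing, flag out-of-range for review"
        else:
            actions[column] = "no action"
    return actions


# Made-up sample records, for illustration only.
records = [
    {"age": 34, "temp_c": 36.6},
    {"age": -5, "temp_c": None},
    {"age": None, "temp_c": None},
    {"age": 210, "temp_c": 37.1},
]
report = audit(records, {"age": (0, 120), "temp_c": (30.0, 45.0)})
print(recommend(report, drop_threshold=0.4))
```

Out-of-range values are flagged rather than deleted, in keeping with the goal of preserving information. A reviewer can then adjust the ranges or the threshold and rerun the audit, which is one simple form of iterative refinement.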
3. How can machine learning enhance reconstruction and cleaning of complex, heterogeneous tabular datasets?
Messy datasets with missing columns, mixed delimiters, multi-valued attributes, and varying attribute orders pose substantial data structuring challenges. This theme investigates machine-learning-based algorithms that reconstruct the original table schema and allocate values to the correct columns, facilitating subsequent analysis. Domain-independent, modular ML approaches offer scalable, robust solutions for recovering structured information from diverse noisy datasets.
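The column-allocation problem can be sketched with a deliberately simple learner. The example below swaps in a nearest-centroid classifier over hand-picked character features. This is a stand-in chosen for brevity, not the algorithm of any work under this theme. The feature set, the delimiter list, and the names `features`, `fit_centroids`, `split_row`, and `assign_columns` are all assumptions. Each cell is classified independently, so attribute order does not matter, absent columns simply produce no key, and repeated matches are collected as multi-valued attributes.

```python
import re


def features(cell):
    """Hand-picked character features (an assumption of this sketch)."""
    n = max(len(cell), 1)
    return (
        sum(c.isdigit() for c in cell) / n,  # share of digits
        sum(c.isalpha() for c in cell) / n,  # share of letters
        float("@" in cell),                  # contains an at-sign
        min(len(cell), 40) / 40,             # capped, normalized length
    )


def fit_centroids(labeled_cells):
    """Learn one mean feature vector per column from (cell, column) pairs."""
    sums, counts = {}, {}
    for cell, column in labeled_cells:
        vec = features(cell)
        acc = sums.setdefault(column, [0.0] * len(vec))
        for i, x in enumerate(vec):
            acc[i] += x
        counts[column] = counts.get(column, 0) + 1
    return {col: [x / counts[col] for x in acc] for col, acc in sums.items()}


def split_row(line):
    """Split on any of several delimiters and drop empty fragments."""
    return [p.strip() for p in re.split(r"[;,|\t]", line) if p.strip()]


def assign_columns(line, centroids):
    """Place each cell in the column with the nearest centroid."""
    row = {}
    for cell in split_row(line):
        vec = features(cell)
        best = min(
            centroids,
            key=lambda col: sum((a - b) ** 2
                                for a, b in zip(vec, centroids[col])),
        )
        row.setdefault(best, []).append(cell)  # lists allow multiple values
    return row


# Made-up training cells, for illustration only.
training = [
    ("Ada Lovelace", "name"), ("Alan Turing", "name"),
    ("ada@example.org", "email"), ("alan@example.org", "email"),
    ("36", "age"), ("41", "age"),
]
centroids = fit_centroids(training)
print(assign_columns("grace@example.org | Grace Hopper ; 85", centroids))
print(assign_columns("52, Linus Torvalds", centroids))
```

A realistic system would need richer features and a stronger model, and it would have to handle values that contain delimiter characters, which this sketch splits incorrectly.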