Papers by Zhengzhang (Zach) Chen

Searching for solutions that optimize a continuous function can be difficult due to the infinite search space, and can be further complicated by high dimensionality in the number of variables and complexity in the structure of constraints. Both deterministic and stochastic methods have been presented in the literature with the purpose of exploring the search space and avoiding local optima as much as possible. In this research, we develop a machine learning framework that aims to 'prune' the search effort of both types of optimization techniques by developing meta-heuristics that knowledgeably reorder the search space and reduce the search region. Numerical examples demonstrate that, compared to Genetic Algorithms, this approach can effectively find the global optimal solutions and significantly reduce the computational time on seven benchmark problems with variable dimensions of 100, 500, and 1000.
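The abstract does not spell out the meta-heuristics themselves, so the following is only a minimal sketch of the general idea of reducing the search region from sampled evaluations: sample the current bounding box, keep an elite fraction of points, and shrink the box around them. The function name, the sampling scheme, and the sphere objective are illustrative assumptions, not the paper's framework.

```python
import numpy as np

def shrink_search_region(f, lower, upper, n_samples=2000, keep=0.1, iters=5, seed=0):
    """Illustrative search-region reduction: repeatedly sample the current box,
    keep the best-scoring fraction of points, and shrink the box around them."""
    rng = np.random.default_rng(seed)
    lo, hi = np.asarray(lower, float), np.asarray(upper, float)
    for _ in range(iters):
        X = rng.uniform(lo, hi, size=(n_samples, lo.size))
        scores = np.array([f(x) for x in X])
        elites = X[np.argsort(scores)[: max(1, int(keep * n_samples))]]
        lo, hi = elites.min(axis=0), elites.max(axis=0)   # tighter box for the next round
    return lo, hi

# Example on a 100-dimensional sphere function (one of the benchmark dimensions cited).
dim = 100
lo, hi = shrink_search_region(lambda x: float(np.sum(x * x)),
                              lower=-5.0 * np.ones(dim), upper=5.0 * np.ones(dim))
print("reduced box width:", float((hi - lo).mean()))
```

A Genetic Algorithm or any other optimizer would then be run inside the reduced box.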

Proceedings of the 2015 SIAM International Conference on Data Mining, 2015
Categorical data are ubiquitous in real-world databases. However, due to the lack of an intrinsic proximity measure, many powerful algorithms for numerical data analysis may not work well on their categorical counterparts, making this a bottleneck in practical applications. In this paper, we propose a novel method to transform categorical data to numerical representations, in order to open the possibility of exploiting the abundant numerical learning algorithms in a great variety of categorical data mining problems. Our key idea is to learn a pairwise dissimilarity among categorical symbols, and hence a continuous embedding, which can then be used for subsequent numerical treatment. There are two important criteria for learning the dissimilarities. First, they should capture the important “transitivity”, which has been shown to be particularly useful in measuring the proximity relation in categorical data. Second, the pairwise sample geometry arising from the learned symbol distances should be maximally consistent with prior knowledge (e.g., class labels) to obtain good generalization performance. We achieve these by designing a multiple transitive distance learning and embedding method. Encouraging results are observed on a number of benchmark classification tasks against state-of-the-art methods.
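As a hedged illustration of the two generic building blocks the abstract refers to, turning symbol dissimilarities into "transitive" distances and then into a continuous embedding, the sketch below computes minimax-path distances from a supplied base dissimilarity matrix and embeds the symbols with classical MDS. It is not the paper's multiple transitive distance learning method; the base matrix D and both function names are assumptions for the example.

```python
import numpy as np

def minimax_transitive_distance(D):
    """Given a base symbol-dissimilarity matrix D, return the minimax-path
    ('transitive') distance: the smallest possible largest edge over any path
    between two symbols, via Floyd-Warshall-style relaxation."""
    T = np.asarray(D, dtype=float).copy()
    for k in range(len(T)):
        T = np.minimum(T, np.maximum(T[:, k:k + 1], T[k:k + 1, :]))
    return T

def classical_mds(T, dim=2):
    """Embed symbols into a Euclidean space consistent with distance matrix T."""
    n = len(T)
    J = np.eye(n) - np.ones((n, n)) / n           # centering matrix
    B = -0.5 * J @ (T ** 2) @ J                   # double-centered Gram matrix
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:dim]               # keep the largest eigenvalues
    return V[:, idx] * np.sqrt(np.clip(w[idx], 0.0, None))

# Toy base dissimilarities among four categorical symbols.
D = np.array([[0.0, 0.2, 0.9, 1.0],
              [0.2, 0.0, 0.3, 0.9],
              [0.9, 0.3, 0.0, 0.4],
              [1.0, 0.9, 0.4, 0.0]])
print(classical_mds(minimax_transitive_distance(D)))
```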

Third International Conference on Natural Computation (ICNC 2007), 2007
In order to exploit the correlation between bits of wavelet coefficients as much as possible, this paper presents a new algorithm, Wavelet Image Compression Based on Context Predicting with Past Information in Rows (Columns), built on the idea of context-based coding. For the selection of a high-order context, the paper presents a Seven Points in Three Rows (Columns) context, using mutual information as the theoretical basis. For the prediction and quantization of the high-order context, we use mathematical tools such as Fisher discriminant analysis and dynamic programming to obtain a good conditional probability estimate of the significance and sign bits of the wavelet coefficients. Experiments show that the reasonable context modeling of the presented algorithm leads to satisfying results. This paper makes some valuable contributions toward establishing theoretical bases for wavelet image compression.
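As a hedged illustration of context-based conditional probability estimation for bit-plane coding (not the paper's Seven Points in Three Rows (Columns) context nor its Fisher-discriminant/dynamic-programming quantization), the sketch below estimates P(bit = 1 | context) by counting over a binary significance map, with the context formed from a few already-coded neighbor positions; the neighbor set chosen here is an assumption.

```python
import numpy as np
from collections import defaultdict

def context_probabilities(bitplane, neighbors=((0, -1), (-1, -1), (-1, 0), (-1, 1))):
    """Estimate P(bit = 1 | context) by counting, where the context is the tuple
    of previously coded neighbor bits (out-of-range neighbors are treated as 0)."""
    H, W = bitplane.shape
    counts = defaultdict(lambda: [0, 0])          # context -> [count of 0s, count of 1s]
    for r in range(H):
        for c in range(W):
            ctx = tuple(int(bitplane[r + dr, c + dc])
                        if 0 <= r + dr < H and 0 <= c + dc < W else 0
                        for dr, dc in neighbors)
            counts[ctx][int(bitplane[r, c])] += 1
    # Laplace smoothing so rare contexts do not get probability exactly 0 or 1.
    return {ctx: (ones + 1) / (zeros + ones + 2) for ctx, (zeros, ones) in counts.items()}

# Toy significance map for one bit plane.
rng = np.random.default_rng(0)
plane = (rng.random((64, 64)) < 0.2).astype(int)
print(len(context_probabilities(plane)), "contexts estimated")
```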
A Scalable Hierarchical Clustering Algorithm Using Spark
2015 IEEE First International Conference on Big Data Computing Service and Applications, 2015

Automatic Detection and Correction of Multi-class Classification Errors Using System Whole-part Relationships
Proceedings of the 2013 SIAM International Conference on Data Mining, 2013
Real-world dynamic systems such as physical and atmosphere-ocean systems often exhibit a hierarchical system-subsystem structure. However, the paradigm of making this hierarchical/modular structure, and the rich properties it encodes, a “first-class citizen” of machine learning algorithms is largely absent from the literature. Furthermore, traditional data mining approaches focus on designing new classifiers or ensembles of classifiers, while there is a lack of study on detecting and correcting prediction errors of existing forecasting (or classification) algorithms. In this paper, we propose DETECTOR, a hierarchical method for detecting and correcting forecast errors by employing the whole-part relationships between the target system and non-target systems. Experimental results show that DETECTOR can successfully detect and correct forecasting errors made by state-of-the-art classifier ensemble techniques and traditional single-classifier methods at an average rate of 22%, corresponding to an 11% average increase in forecasting accuracy, in seasonal forecasting of hurricanes and landfalling hurricanes in the North Atlantic and of North African rainfall.
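The abstract does not describe DETECTOR's construction in detail, so the sketch below is only a hedged illustration of a generic detect-then-correct scheme for a binary forecast: a secondary model trained on non-target (subsystem) features learns when the base classifier's prediction for the target system is wrong, and flagged predictions are flipped. The function names, the use of scikit-learn's RandomForestClassifier, and the binary 0/1-label assumption are all illustrative, not the paper's method.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def fit_error_detector(base_clf, X_target, X_subsystems, y):
    """Train a detector on subsystem features to predict when the (already
    fitted) base classifier errs; ideally use held-out data for this step."""
    base_pred = base_clf.predict(X_target)
    err = (base_pred != y).astype(int)                 # 1 where the base model was wrong
    return RandomForestClassifier(n_estimators=200, random_state=0).fit(X_subsystems, err)

def predict_with_correction(base_clf, detector, X_target, X_subsystems):
    """Flip base predictions (binary 0/1 labels assumed) wherever the detector flags an error."""
    pred = np.asarray(base_clf.predict(X_target)).copy()
    flagged = detector.predict(X_subsystems).astype(bool)
    pred[flagged] = 1 - pred[flagged]                  # binary correction step
    return pred
```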

Proceedings of SPIE - The International Society for Optical Engineering
The wavelet transform has become one of the most effective transforms in image processing, in particular the biorthogonal 9/7 wavelet filters proposed by Daubechies, which perform well in image compression. This paper studies in depth the implementation and optimization of the 9/7 wavelet lifting scheme on a DSP platform, including carrying out the wavelet lifting steps in fixed point instead of time-consuming floating-point arithmetic, adopting pipelining to improve the iteration procedure, reducing the number of multiplications by simplifying the normalization step of the two-dimensional wavelet transform, and improving the storage format and ordering of wavelet coefficients to reduce memory consumption. Experimental results show that these implementation and optimization techniques can improve the efficiency of the wavelet lifting algorithm by more than 30 times, which establishes a technical foundation for successfully...
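For reference, below is a minimal floating-point sketch of one 1-D pass of the 9/7 lifting scheme, using the commonly cited Daubechies-Sweldens lifting coefficients; the final scaling constant depends on the normalization convention, and boundary handling is simplified to clamping rather than full symmetric extension. The fixed-point conversion, pipelining, and DSP memory layout discussed in the paper are not reproduced here.

```python
import numpy as np

# Commonly cited CDF 9/7 lifting coefficients (Daubechies-Sweldens factorization);
# the final scaling constant varies with the chosen normalization convention.
ALPHA, BETA = -1.586134342, -0.05298011854
GAMMA, DELTA = 0.8829110762, 0.4435068522
ZETA = 1.149604398

def cdf97_forward_1d(x):
    """One floating-point 9/7 lifting pass over an even-length 1-D signal.
    Returns (approximation, detail) coefficients. Boundaries are handled by
    clamping to the edge sample; a real codec uses symmetric extension."""
    x = np.asarray(x, dtype=np.float64)
    s, d = x[0::2].copy(), x[1::2].copy()           # split into even / odd samples
    clamp = lambda a, i: a[min(max(i, 0), len(a) - 1)]
    n = len(d)
    for i in range(n):                              # predict 1
        d[i] += ALPHA * (s[i] + clamp(s, i + 1))
    for i in range(n):                              # update 1
        s[i] += BETA * (d[i] + clamp(d, i - 1))
    for i in range(n):                              # predict 2
        d[i] += GAMMA * (s[i] + clamp(s, i + 1))
    for i in range(n):                              # update 2
        s[i] += DELTA * (d[i] + clamp(d, i - 1))
    return ZETA * s, d / ZETA                       # subband scaling

approx, detail = cdf97_forward_1d(np.arange(16, dtype=float))
print(approx.shape, detail.shape)
```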
Background: Microbial communities in their natural environments exhibit phenotypes that can directly cause particular diseases, convert biomass or wastewater to energy, or degrade various environmental contaminants. Understanding how these communities realize specific phenotypic traits (e.g., carbon fixation, hydrogen production) is critical for addressing health, bioremediation, or bioenergy problems.
Optimization technology of 9/7 wavelet lifting scheme on DSP
MIPPR 2007: Medical Imaging, Parallel Processing of Images, and Optimization Techniques, 2007
Running MAP Inference on Million Node Graphical Models: A High Performance Computing Perspective
2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, 2015
Incremental, Distributed Single-Linkage Hierarchical Clustering Algorithm Using MapReduce
Predictive Modeling in Characterizing Localization Relationships
Data Mining Approach in Structure-Property Optimization
Batch Mode Active Learning with Hierarchical-Structured Embedded Variance
Proceedings of the 2014 SIAM International Conference on Data Mining, 2014

2012 IEEE 12th International Conference on Data Mining Workshops, 2012
A dynamic physical system often undergoes phase transitions in response to fluctuations induced on system parameters. For example, hurricane activity is the climate system's response initiated by a liquid-vapor phase transition associated with non-linearly coupled fluctuations in the ocean and the atmosphere. Because our quantitative knowledge about highly non-linear dynamic systems is very meager, scientists often resort to linear regression techniques such as Least Absolute Deviation (LAD) to learn the non-linear system's response (e.g., hurricane activity) from observed or simulated system parameters (e.g., temperature, precipitable water, pressure). While insightful, such models still offer limited predictability, and alternatives intended to capture non-linear behaviors, such as Stepwise Regression, are often controversial. In this paper, we hypothesize that one of the primary reasons for the lack of predictability is the treatment of an inherently multi-phase system as being phaseless. To bridge this gap, we propose a hybrid approach that first predicts the phase the system is in, and then estimates the magnitude of the system's response using the regression model optimized for that phase. Our approach is designed for systems that can be characterized by multi-variate spatio-temporal data from observations, simulations, or both.
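As a hedged sketch of the hybrid "predict the phase, then regress within the phase" idea, the class below fits one classifier for the phase and one regression model per phase. Ordinary least squares stands in for the LAD-style regression mentioned in the abstract, and the class name and scikit-learn model choices are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

class PhaseAwareRegressor:
    """Two-stage model: classify the phase of the system, then apply a
    regression model fit only on training samples from that phase."""

    def fit(self, X, phase_labels, y):
        self.phase_clf = LogisticRegression(max_iter=1000).fit(X, phase_labels)
        self.regressors = {
            p: LinearRegression().fit(X[phase_labels == p], y[phase_labels == p])
            for p in np.unique(phase_labels)
        }
        return self

    def predict(self, X):
        phases = self.phase_clf.predict(X)           # first stage: which phase?
        y_hat = np.empty(len(X))
        for p, reg in self.regressors.items():       # second stage: per-phase regression
            mask = phases == p
            if mask.any():
                y_hat[mask] = reg.predict(X[mask])
        return y_hat
```

In practice the second stage would use a least-absolute-deviation fit (or whichever regression is optimized for the phase), but the two-stage structure is the point of the sketch.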

2010 IEEE International Conference on Data Mining Workshops, 2010
Community structure or clustering is ubiquitous in many evolutionary networks, including social networks, biological networks, and financial market networks. Detecting and tracking community deviations in evolutionary networks can uncover important and interesting behaviors that remain latent if we ignore the dynamic information. In biological networks, for example, a small variation in a gene community may indicate an event such as gene fusion, gene fission, or gene decay. In contrast to previous work on detecting communities in static graphs or tracking conserved communities in time-varying graphs, this paper first introduces the concept of community dynamics, and then shows that the baseline approach of enumerating all communities in each graph and comparing all pairs of communities between consecutive graphs is infeasible and impractical. We propose an efficient method for detecting and tracking community dynamics in evolutionary networks by introducing graph representatives and community representatives to avoid generating redundant communities and to limit the search space. We measure the performance of the representative-based algorithm by comparison to the baseline algorithm on synthetic networks, and our experiments show that our algorithm achieves a runtime speedup of 11-46. The method has also been applied to two real-world evolutionary networks, the Food Web and Enron Email networks. Significant and informative community dynamics have been detected in both cases.
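As a hedged illustration of the kind of events being tracked (not the paper's representative-based algorithm, which is precisely designed to avoid this all-pairs comparison), the sketch below matches communities between two consecutive snapshots by Jaccard overlap and labels unmatched communities as emerged or dissolved. The function name and the 0.5 matching threshold are assumptions.

```python
def track_community_dynamics(comms_t, comms_t1, match_thresh=0.5):
    """Naive baseline for community dynamics between consecutive snapshots:
    match communities by Jaccard overlap; unmatched ones are emerged/dissolved."""
    def jaccard(a, b):
        return len(a & b) / len(a | b)

    events, matched = [], set()
    for i, c_old in enumerate(comms_t):
        best_j, best_sim = None, 0.0
        for j, c_new in enumerate(comms_t1):
            sim = jaccard(c_old, c_new)
            if sim > best_sim:
                best_j, best_sim = j, sim
        if best_sim >= match_thresh:
            matched.add(best_j)
            events.append(("continued" if best_sim == 1.0 else "changed", i, best_j))
        else:
            events.append(("dissolved", i, None))
    events += [("emerged", None, j) for j in range(len(comms_t1)) if j not in matched]
    return events

# Communities are sets of node ids.
snap_t = [{1, 2, 3}, {4, 5}]
snap_t1 = [{1, 2, 3, 6}, {7, 8}]
print(track_community_dynamics(snap_t, snap_t1))
```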
Data Compression for the Exascale Computing Era-Survey
Theoretical Computer Science, 2014
Detecting and tracking disease outbreaks by mining social media data
The emergence and ubiquity of online social networks have enriched web data with evolving interactions and communities, both at mega-scale and in real time. This data offers an unprecedented opportunity for studying the interaction between society and disease outbreaks. The challenge we describe in this data paper is how to extract and leverage epidemic outbreak insights from massive amounts of social media data, and how this exercise can benefit medical professionals, patients, and policymakers alike. We attempt to prepare the research community for this challenge with four datasets. Publishing the four datasets will commoditize the data infrastructure and provide a more efficient focal point for the research community.