ACM Transactions on Database Systems, Jun 30, 2018
As data becomes dynamic, large, and distributed, there is increasing demand for what have become known as distributed stream algorithms. Since continuously collecting the data to a central server and processing it there is infeasible, a common approach is to define local conditions at the distributed nodes, such that, as long as they are maintained, some desirable global condition holds. Previous methods derived local conditions focusing on communication efficiency. While proving very useful for reducing the communication volume, these local conditions often impose a heavy computational burden on the nodes. The computational complexity of the local conditions affects both the run-time and the energy consumption. These are especially critical for resource-limited devices like smartphones and sensor nodes. Such devices are becoming more ubiquitous due to the recent trend towards smart cities and the Internet of Things (IoT). To accommodate the high data rates and limited resources of these devices, it is crucial that the local conditions be evaluated quickly and efficiently. Here we propose a novel approach, designated CB (for Convex/Concave Bounds). CB defines local conditions using suitably chosen convex and concave functions. Lightweight and simple, these local conditions can be rapidly checked on the fly. CB's superiority over the state of the art is demonstrated by its reduced run-time and power consumption, by up to six orders of magnitude in some cases. As an added bonus, CB also reduced communication overhead in all the tested application scenarios.
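The core reasoning behind convex/concave local conditions follows from Jensen's inequality: if u is convex and upper-bounds the monitored function f, then u evaluated at the global average is at most the average of the local u values, so having every node check u(x_i) <= T guarantees f(mean of x_i) <= T. The sketch below is a minimal illustration of this reasoning only, not the CB algorithm itself; the monitored function, the bound, and the threshold are hypothetical.

```python
import numpy as np

def f(x):
    # Hypothetical monitored function of the global average vector (x1, x2);
    # it is neither convex nor concave.
    return x[0] * x[1]

def u(x):
    # Convex upper bound of f: x1*x2 <= (x1^2 + x2^2)/2 for all reals,
    # since (x1 - x2)^2 >= 0.
    return 0.5 * (x[0] ** 2 + x[1] ** 2)

def local_condition_ok(x_local, threshold):
    # The lightweight test each node evaluates on its own data vector.
    return u(x_local) <= threshold

# Toy check: local vectors at three nodes and a threshold T.
T = 4.0
local_vectors = [np.array([1.0, 2.0]), np.array([0.5, 1.5]), np.array([2.0, 0.5])]

if all(local_condition_ok(x, T) for x in local_vectors):
    # By convexity (Jensen), u(mean) <= mean of u(x_i) <= T, hence f(mean) <= T,
    # without communicating any raw vectors.
    global_avg = np.mean(local_vectors, axis=0)
    assert f(global_avg) <= T
```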
Journal of Parallel and Distributed Computing, 2019
An important problem in real systems for mining data streams is to detect changes in the dynamic model describing the temporal data. Such changes indicate that the underlying data has undergone a transition which may well require attention. A distributed setting poses one of the main challenges in this type of change detection. In a distributed setting, model training requires centralizing the data from all nodes (hereafter, synchronization), which is very costly in terms of communication. In order to minimize communication, a monitoring algorithm should be executed locally at each node, while preserving the validity of the global model (that is, the model that would be computed if a synchronization took place). To achieve this goal, we propose the first communication-efficient algorithm for monitoring a classification model over distributed, dynamic data streams. While the approach is general, here we concentrate on Linear Discriminant Analysis (LDA), a popular method for classification and dimensionality reduction in many fields. We mainly apply tools from the realms of linear algebra and multivariate analysis in order to solve the problem at hand, and the resulting implementation is quite straightforward. The emphasis of this work is not on solving the distributed optimization problem that corresponds to finding a classifier over the distributed data; instead, we continuously monitor the current classifier to check that it still fits the data.
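For reference, the two-class LDA classifier that such a monitoring scheme tracks is determined by the class means and the pooled within-class scatter. The sketch below computes the standard Fisher/LDA direction from centralized data; it illustrates only the monitored model, not the paper's distributed monitoring conditions, and the variable names are illustrative.

```python
import numpy as np

def lda_direction(X_pos, X_neg):
    """Two-class LDA: w = Sw^{-1} (mu_pos - mu_neg), with a midpoint threshold.

    X_pos, X_neg: arrays of shape (n_samples, n_features), one per class.
    """
    mu_p, mu_n = X_pos.mean(axis=0), X_neg.mean(axis=0)
    # Pooled within-class scatter matrix.
    Sw = np.cov(X_pos, rowvar=False) * (len(X_pos) - 1) \
       + np.cov(X_neg, rowvar=False) * (len(X_neg) - 1)
    w = np.linalg.solve(Sw, mu_p - mu_n)
    b = w @ (mu_p + mu_n) / 2.0   # decision threshold at the midpoint
    return w, b

# A new point x is classified as positive iff w @ x > b.
```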
Monitoring data streams in a distributed system has been the focus of much research in recent years. Most of the proposed schemes, however, deal with monitoring simple aggregated values, such as the frequency of appearance of items in the streams. More involved challenges, such as the important task of feature selection (e.g., by monitoring the information gain of various features), still require very high communication overhead using naive, centralized algorithms. We present a novel geometric approach which reduces monitoring the value of a function (vis-à-vis a threshold) to a set of constraints applied locally on each of the streams. The constraints are used to locally filter out data increments that do not affect the monitoring outcome, thus avoiding unnecessary communication. As a result, our approach enables monitoring of arbitrary threshold functions over distributed data streams in an efficient manner. We present experimental results on real-world data which demonstrate that our algorithms are highly scalable, and considerably reduce communication load in comparison to centralized algorithms.
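The geometric reduction can be summarized by one covering property. Let e be a reference (estimate) vector known to all n nodes, and let v_i be the current local statistics vector at node i, so the global vector is the average v = (1/n) sum of the v_i. Then v lies in the convex hull of the v_i, and that hull is covered by the balls whose diameters are the segments from e to each v_i. Consequently, if each node verifies that its own ball lies entirely on one side of the surface f(x) = T, the global value is guaranteed not to have crossed the threshold. The display below restates this key geometric lemma in our own notation, included here for clarity.

```latex
v \;=\; \frac{1}{n}\sum_{i=1}^{n} v_i
\;\in\; \mathrm{Conv}\{v_1,\dots,v_n\}
\;\subseteq\; \bigcup_{i=1}^{n} B_i,
\qquad
B_i \;:=\; B\!\left(\frac{e+v_i}{2},\; \frac{\|e-v_i\|}{2}\right),
```
```latex
B_i \subseteq \{x : f(x) \le T\}\ \ \forall i
\;\Longrightarrow\;
f(v) \le T .
```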
Given a sequence of images taken from a moving camera, the images are registered with sub-pixel accuracy with respect to translation and rotation. The sub-pixel registration enables image enhancement in the form of improved resolution and noise cleaning. Both the registration and the enhancement procedures are described. The methods are particularly useful for image sequences taken from an aircraft or satellite, where images in a sequence differ mostly by translation and rotation. In these cases the process results in images that are stable, clean, and sharp. This work has been supported by a grant from AFIRST and the Israeli NCRD.
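As a point of reference, translation between two frames can be estimated to sub-pixel accuracy by phase correlation followed by local interpolation of the correlation peak; the sketch below shows the integer-pixel part of that pipeline with NumPy. It is a generic illustration of frame-to-frame translation estimation, not necessarily the registration procedure described in the paper.

```python
import numpy as np

def estimate_translation(img_a, img_b):
    """Integer-pixel translation of img_b relative to img_a via phase
    correlation (sub-pixel refinement would interpolate around the peak)."""
    Fa, Fb = np.fft.fft2(img_a), np.fft.fft2(img_b)
    cross_power = Fa * np.conj(Fb)
    cross_power /= np.abs(cross_power) + 1e-12   # keep only the phase
    corr = np.fft.ifft2(cross_power).real
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    # Wrap shifts larger than half the image size to negative offsets.
    h, w = img_a.shape
    if dy > h // 2:
        dy -= h
    if dx > w // 2:
        dx -= w
    return dy, dx
```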
International Conference on Machine Learning, Jul 6, 2015
We explore a novel approach to upper bound the misclassification error for problems in which the data comprise a small number of positive samples and a large number of negative samples. We use the hinge loss to upper bound the misclassification error of the positive examples, and use the minimax risk to upper bound the misclassification error with respect to the worst-case distribution that generates the negative examples. This approach is computationally appealing, since the majority of training examples (those belonging to the negative class) are represented by the statistics of their distribution, in contrast to kernel SVM, which produces a very large number of support vectors in such settings. We derive empirical risk bounds for linear and non-linear classification and show that they are dimension-independent and decay as 1/√m for m samples. We propose an efficient algorithm for training an intersection of a finite number of hyperplanes and demonstrate its effectiveness on real data, including letter and scene recognition.
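To make the two ingredients concrete, one standard way to combine them is sketched below: the positive class contributes an empirical hinge-loss term, while the negative class enters only through its mean and covariance via a Chebyshev-type worst-case (minimax) bound. This display is offered as a plausible reading of the abstract rather than the paper's exact objective; the symbols w, b, mu and Sigma are ours.

```latex
% Hinge loss on the m positive samples x_1,\dots,x_m (label +1):
\frac{1}{m}\sum_{j=1}^{m} \max\bigl(0,\; 1 - (w^{\top}x_j - b)\bigr)
```
```latex
% Worst-case error over all negative-class distributions with mean \mu and
% covariance \Sigma (one-sided Chebyshev / minimax bound), for b \ge w^{\top}\mu:
\sup_{X \sim (\mu,\Sigma)} \Pr\bigl[\,w^{\top}X \ge b\,\bigr]
  \;=\; \frac{w^{\top}\Sigma w}{\,w^{\top}\Sigma w + (b - w^{\top}\mu)^{2}\,} .
```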
Complex Event Processing (CEP) is an emerging field with important applications in many areas. CEP systems collect events arriving from input data streams and use them to infer more complex events according to predefined patterns. The Non-deterministic Finite Automaton (NFA) is one of the most popular mechanisms on which such systems are based. During the event detection process, NFAs incrementally extend previously observed partial matches until a full match for the query is found. As a result, each arriving event needs to be processed to determine whether a new partial match is to be initiated or an existing one extended. This method may be highly inefficient when many of the events do not result in output matches. We propose a lazy evaluation mechanism that defers processing of frequent event types and stores them internally upon arrival. Events are then matched in ascending order of frequency, thus minimizing potentially redundant computations. We introduce a lazy Chain NFA, which utilizes the above principle, and does not depend on the underlying pattern structure. An algorithm for constructing a Chain NFA for common pattern types is presented, including conjunction, negation and iteration. In addition, we propose a Tree NFA that does not require the frequencies of the event types to be defined in advance. Finally, we experimentally evaluate our mechanism on real-world stock trading data. The results demonstrate a performance gain of two orders of magnitude over traditional NFA-based approaches, with significantly reduced memory resource requirements.
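The lazy-evaluation principle can be illustrated independently of the NFA construction: buffer arriving events of frequent types and defer matching work until an instance of the rare type arrives. The toy sketch below does this for a simple sequence pattern SEQ(A, B, C), where for simplicity the rarest type C is also the last in the sequence, so all earlier events are already buffered when it shows up. It demonstrates only the deferral idea, not the Chain NFA or Tree NFA constructions from the paper.

```python
from collections import defaultdict, namedtuple

Event = namedtuple("Event", ["ts", "type"])

class LazySeqMatcher:
    """Toy lazy matcher for SEQ(A, B, C): an A, then a B, then a C in
    timestamp order.  Frequent types (A, B) are only buffered on arrival;
    matching work is deferred until the rare type C arrives."""

    def __init__(self):
        self.buffers = defaultdict(list)      # event type -> buffered events

    def on_event(self, ev):
        if ev.type != "C":
            self.buffers[ev.type].append(ev)  # cheap: just store it
            return []
        # Rare event arrived: lazily join the buffered frequent events.
        matches = []
        for a in self.buffers["A"]:
            for b in self.buffers["B"]:
                if a.ts < b.ts < ev.ts:
                    matches.append((a, b, ev))
        return matches

# Usage: feed events in arrival order; matches are reported at the C event.
m = LazySeqMatcher()
for ev in [Event(1, "A"), Event(2, "B"), Event(3, "B"), Event(4, "C")]:
    print(m.on_event(ev))
```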
Distributed monitoring methods address the difficult problem of continuously approximating functions over distributed streams while minimizing the communication cost. However, existing methods are concerned with the approximation of a single function at a time. Employing these methods to track multiple functions will multiply the communication volume, thus eliminating their advantage in the first place. We introduce a novel approach that can be applied to multiple functions. Our method applies a communication reduction scheme to the set of functions, rather than to each function independently, keeping a low communication volume. Evaluation on several real-world datasets shows that our method can track many functions with reduced communication, in most cases incurring only a negligible increase in communication over distributed approximation of a single function.
One of the most extensively studied problems in three-dimensional Computer Vision is “Perspective-n-Point” (PnP), which concerns estimating the pose of a calibrated camera, given a set of 3D points in the world and their corresponding 2D projections in an image captured by the camera. One solution method that ranks as very accurate and robust proceeds by reducing PnP to the minimization of a fourth-degree polynomial over the three-dimensional sphere S³. Despite a great deal of effort, there is no known fast method to achieve this goal. A very common approach is to solve a convex relaxation of the problem using “Sum Of Squares” (SOS) techniques. We offer two contributions in this paper: a solution that is faster (by a factor of roughly 10) than the state of the art and relies on the polynomial’s homogeneity; and a fast, guaranteed, easily parallelizable approximation, which makes use of a famous result of Hilbert.
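The optimization at the heart of this formulation is the minimization of a homogeneous quartic over the unit quaternions; homogeneity is what allows the sphere constraint to be traded for a normalization. The display below spells out this standard reformulation (with f a degree-4 homogeneous polynomial in the quaternion q); it summarizes the problem structure referred to in the abstract, not the paper's specific algorithm.

```latex
\min_{q \in S^{3}} f(q),
\qquad f(\lambda q) = \lambda^{4} f(q)\ \ \forall \lambda \in \mathbb{R}
\;\;\Longleftrightarrow\;\;
\min_{q \in \mathbb{R}^{4}\setminus\{0\}} \frac{f(q)}{\|q\|^{4}} .
```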
In many emerging applications, the data which has to be monitored is of very high volume, dynamic, and distributed, making it infeasible to collect the distinct data streams to a central node and process them there. Often, the monitoring problem consists of determining whether the value of a global function, which depends on the union of all streams, crossed a certain threshold. A great deal of effort is directed at reducing communication overhead by transforming the monitoring of the global function to the testing of local constraints, checked independently at the nodes. Recently, geometric monitoring (GM) proved to be very useful for constructing such local constraints for general (non-linear, non-monotonic) functions. Alas, in all current variants of geometric monitoring, the constraints at all nodes share an identical structure and are, thus, unsuitable for handling heterogeneous streams, which obey different distributions at the distinct nodes. To remedy this, we propose a general approach for geometric monitoring of heterogeneous streams (HGM), which defines constraints tailored to fit the distinct data distributions at the nodes. While optimally selecting the constraints is an NP-hard problem, we provide a practical solution, which seeks to reduce running time by hierarchically clustering nodes with similar data distributions and then solving more, but simpler, optimization problems. Experiments are provided to support the validity of the proposed approach.
In this position paper, we apply tools from the realm of graph theory to matching and partitioning problems of the agent population in an agent-based model for traffic and transportation applications. We take the agent-based carpooling application as an example scenario. The first problem is matching, which concerns finding the optimal pairing among agents. The second problem is partitioning, which is crucial for achieving scalability, and for other problems that can be parallelized by separating the passenger population into sub-populations such that the interaction between different sub-populations is minimal. Since in real-life applications the agent population, as well as the agents' preferences, changes very often, we also discuss incremental solutions to these problems.
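For the matching problem, a natural graph-theoretic baseline is maximum-weight matching on a graph whose vertices are passengers and whose edge weights score how compatible two passengers' trips are. The sketch below uses networkx for this; the compatibility scores are invented for illustration, and this is a generic baseline rather than the incremental algorithms the paper discusses.

```python
import networkx as nx

# Hypothetical pairwise compatibility scores between passengers
# (e.g., derived from route overlap and schedule similarity).
compatibility = {
    ("alice", "bob"): 0.9,
    ("alice", "carol"): 0.4,
    ("bob", "dave"): 0.7,
    ("carol", "dave"): 0.8,
}

G = nx.Graph()
for (u, v), w in compatibility.items():
    G.add_edge(u, v, weight=w)

# Maximum-weight matching: an optimal set of disjoint passenger pairs.
pairs = nx.max_weight_matching(G)
print(pairs)   # e.g., {('alice', 'bob'), ('carol', 'dave')}
```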
The empirical entropy is a key statistical measure of data frequency vectors, enabling one to estimate how diverse the data are. From the computational point of view, it is important to quickly compute, approximate, or bound the entropy. In a distributed system, the representative (“global”) frequency vector is the average of the “local” frequency vectors, each residing in a distinct node. Typically, the trivial solution of aggregating the local vectors and computing their average incurs a huge communication overhead. Hence, the challenge is to approximate, or bound, the entropy of the global vector, while reducing communication overhead. In this paper, we develop algorithms which achieve this goal.
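One simple, communication-free bound already follows from concavity of the entropy: since the global frequency vector is the average of the local ones, Jensen's inequality gives H(average of the p_i) >= average of the H(p_i), so the mean of locally computed entropies lower-bounds the global entropy. The sketch below checks this numerically; it illustrates only this elementary bound, not the algorithms developed in the paper.

```python
import numpy as np

def entropy(p):
    # Empirical (Shannon) entropy of a frequency vector, in nats.
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

# Local frequency vectors at three nodes (each sums to 1).
local_vectors = [np.array([0.7, 0.2, 0.1]),
                 np.array([0.1, 0.8, 0.1]),
                 np.array([0.3, 0.3, 0.4])]

global_vec = np.mean(local_vectors, axis=0)
lower_bound = np.mean([entropy(p) for p in local_vectors])

# Concavity of H (Jensen): H(mean of p_i) >= mean of H(p_i).
assert entropy(global_vec) >= lower_bound - 1e-12
print(entropy(global_vec), lower_bound)
```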
An important problem in distributed, dynamic databases is to continuously monitor the value of a function defined on the nodes, and check that it satisfies some threshold constraint. We introduce a monitoring method, based on a geometric interpretation of the problem, which makes it possible to define local constraints at the nodes. It is guaranteed that as long as none of these constraints is violated, the value of the function has not crossed the threshold. We generalize previous work on geometric monitoring, and solve two problems which seriously hampered its performance: as opposed to the constraints used so far, which depend only on the current values of the local data, here we incorporate their temporal behavior. Also, the new constraints are tailored to the geometric properties of the specific monitored function. In addition, we extend the concept of safe zones for the monitoring problem, and show that previous work on geometric monitoring is a special case of the proposed extension. Experiments ...
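For readers unfamiliar with the term, a safe zone in this setting is, in its simplest form, a convex set S contained in the admissible region: if every node keeps its local vector v_i inside S, convexity guarantees that the global average also stays in S, and hence the threshold is not crossed. The display below records this basic property (with f the monitored function and T the threshold); the contribution described in the abstract is a more general construction that also accounts for temporal behavior and the geometry of the specific function.

```latex
S \text{ convex},\quad S \subseteq \{x : f(x) \le T\},\quad
v_i \in S \ \ \forall i
\;\Longrightarrow\;
\frac{1}{n}\sum_{i=1}^{n} v_i \in S
\;\Longrightarrow\;
f\!\left(\frac{1}{n}\sum_{i=1}^{n} v_i\right) \le T .
```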
Modern scale-out services are composed of thousands of individual machines, which must be continuously monitored for unexpected failures. One recent approach to monitoring is latent fault detection, an adaptive statistical framework for scale-out, load-balanced systems. By periodically measuring hundreds of performance metrics and looking for outlier machines, it attempts to detect subtle problems such as misconfigurations, bugs, and malfunctioning hardware before they manifest as machine failures. Previous work on a large, real-world Web service has shown that many failures are indeed preceded by such latent faults. Latent fault detection is an offline framework with large bandwidth and processing requirements: each machine must send all its measurements to a centralized location, which is prohibitive in some settings and requires data-parallel processing infrastructure. In this work we adapt the latent fault detector to provide an online, communication- and computation-reduced version.
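The "outlier machine" intuition behind latent fault detection can be illustrated with a simple peer-comparison score: because the service is load-balanced, healthy machines should report similar metric vectors, so a machine whose vector is consistently far from the population is suspicious. The sketch below computes such a score with NumPy; it is a simplified, hypothetical scoring rule, not the statistical test or the communication-reduced protocol from the paper.

```python
import numpy as np

def outlier_scores(metrics):
    """metrics: array of shape (n_machines, n_metrics), one row per machine.

    Returns a per-machine score: the Euclidean distance of each machine's
    standardized metric vector from the population mean (0 after
    standardization).  High scores suggest a machine behaving unlike its peers.
    """
    mu = metrics.mean(axis=0)
    sigma = metrics.std(axis=0) + 1e-9      # avoid division by zero
    z = (metrics - mu) / sigma              # standardize each metric
    return np.linalg.norm(z, axis=1)

# Toy example: 5 machines, 3 metrics; the last machine is misbehaving.
m = np.array([[1.0, 2.0, 0.5],
              [1.1, 2.1, 0.4],
              [0.9, 1.9, 0.6],
              [1.0, 2.0, 0.5],
              [3.0, 6.0, 2.0]])
print(outlier_scores(m))    # the last machine gets a much larger score
```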