ACM Transactions on Knowledge Discovery From Data, Jul 30, 2022
People's location data is continuously tracked from various devices and sensors, enabling an ongoing analysis of sensitive information that can violate people's privacy and reveal confidential information. Synthetic data has been used to generate representative location sequences while maintaining the users' privacy. Nonetheless, the privacy-accuracy tradeoff between these two measures has not been addressed systematically. In this paper, we analyze the use of different synthetic data generation models for long location sequences, including long short-term memory networks (LSTMs), Markov chains, and variable-order Markov models (VMMs). We employ different performance measures, such as data similarity and privacy, and discuss the inherent tradeoff. Furthermore, we introduce other measurements to quantify each of these measures. Based on the anonymous data of 300 thousand cellular-phone users, our work offers a road map for developing policies for synthetic data generation processes. We propose a framework for building data generation models and evaluating their effectiveness with respect to these accuracy and privacy measures.
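To illustrate the simplest of the generative models mentioned above, the following is a minimal sketch of a first-order Markov chain fitted to location sequences and then sampled to produce synthetic sequences. The function names and the toy "home/work/gym" sequences are illustrative assumptions, not the paper's actual pipeline or data.

```python
import random
from collections import defaultdict

def fit_markov_chain(sequences):
    """Estimate first-order transition probabilities from location sequences."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for cur, nxt in zip(seq, seq[1:]):
            counts[cur][nxt] += 1
    return {cur: {nxt: c / sum(nxts.values()) for nxt, c in nxts.items()}
            for cur, nxts in counts.items()}

def generate_sequence(transitions, start, length):
    """Sample a synthetic location sequence from the fitted chain."""
    seq, cur = [start], start
    for _ in range(length - 1):
        nxt_probs = transitions.get(cur)
        if not nxt_probs:            # dead end: no observed outgoing transition
            break
        locations, probs = zip(*nxt_probs.items())
        cur = random.choices(locations, weights=probs, k=1)[0]
        seq.append(cur)
    return seq

# Toy usage: real sequences would come from anonymized cellular traces.
real = [["home", "work", "gym", "home"], ["home", "work", "home"]]
model = fit_markov_chain(real)
print(generate_sequence(model, "home", 5))
```

Synthetic sequences produced this way can then be scored against the originals with the similarity and privacy measures discussed in the paper.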
International Journal of Information Security, Mar 24, 2023
Malicious websites pose a challenging cybersecurity threat. Traditional tools for detecting malicious websites rely heavily on industry-specific domain knowledge, are maintained by large-scale research operations, and result in a never-ending attacker-defender dynamic. Malicious websites need to balance two opposing requirements to function successfully: escaping malware detection tools while attracting visitors. This fundamental conflict can be leveraged to create a robust and sustainable detection approach based on the extraction, analysis, and learning of design attributes for malicious website identification. In this paper, we propose a next-generation algorithm for extended design attribute learning that learns and analyzes web page structures, contents, appearances, and reputations to detect malicious websites. A large-scale experiment conducted on more than 35,000 websites suggests that the proposed algorithm effectively detects more than 83% of all malicious websites while maintaining a low false-positive rate of 2%. In addition, the proposed method can incorporate user feedback and flag new suspicious websites and thus can be effective against zero-day attacks.
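As a rough illustration of classifying pages by design attributes, the following sketch trains a generic classifier on a handful of hypothetical structural and reputation features. The feature names, toy data, and choice of a random forest are illustrative assumptions only, not the paper's attribute set or learning algorithm.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Each row: hypothetical design attributes extracted from a web page.
# [num_external_scripts, num_iframes, num_redirects, domain_age_days, has_ssl]
X = [
    [12, 3, 4, 20, 0],
    [2, 0, 0, 3650, 1],
    [9, 5, 2, 45, 0],
    [1, 0, 1, 2000, 1],
]
y = [1, 0, 1, 0]  # 1 = malicious, 0 = benign

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print(clf.predict(X_test))
```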
Most statistical process control (SPC) methods are not suitable for monitoring nonlinear and state-dependent processes. This article introduces the context-based SPC (CSPC) methodology for state-dependent data generated by a finite-memory source. The key idea of the CSPC is to monitor the statistical attributes of a process by comparing two context trees at any monitoring period of time. The first is a reference tree that represents the "in control" reference behavior of the process; the second is a monitored tree, generated periodically from a sample of sequenced observations, that represents the behavior of the process at that period. The Kullback-Leibler (KL) statistic is used to measure the relative "distance" between these two trees, and an analytic distribution of this statistic is derived. Monitoring the KL statistic indicates whether there has been any significant change in the process that requires intervention. An example of buffer-level monitoring in a production system demonstrates the viability of the new method with respect to conventional methods.
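A minimal sketch of the monitoring idea follows: a KL "distance" between the conditional symbol distributions of a reference window and a monitored window, estimated per context. The fixed context length, toy sequences, and smoothing constant are illustrative assumptions; the paper works with full context trees and a derived analytic distribution for the statistic.

```python
import math
from collections import Counter

def context_distribution(sequence, context_len=1):
    """Empirical P(next symbol | context) estimated from a symbol sequence."""
    counts = {}
    for i in range(len(sequence) - context_len):
        ctx = tuple(sequence[i:i + context_len])
        counts.setdefault(ctx, Counter())[sequence[i + context_len]] += 1
    return {ctx: {s: c / sum(cnt.values()) for s, c in cnt.items()}
            for ctx, cnt in counts.items()}

def kl_statistic(reference, monitored, eps=1e-9):
    """Sum of per-context KL divergences D(monitored || reference)."""
    total = 0.0
    for ctx, q in monitored.items():
        p = reference.get(ctx, {})
        for symbol, q_s in q.items():
            total += q_s * math.log(q_s / max(p.get(symbol, 0.0), eps))
    return total

ref = context_distribution("aabababbaabab")   # "in control" reference sample
mon = context_distribution("bbbbabbbbbbab")   # current monitored sample
print(kl_statistic(ref, mon))
```

A large value of the statistic relative to a control limit would signal that the monitored behavior has drifted from the reference behavior.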
In certain types of processes, verification of the quality of the output units is possible only after the entire batch has been processed. We develop a model that prescribes which units should be inspected and how the units that were not inspected should be disposed of, in order to minimize the expected sum of inspection costs and disposition error costs, for processes that are subject to random failure and recovery. The model is based on a dynamic programming algorithm that has a low computational complexity. The study also includes a sensitivity analysis under a variety of cost and probability scenarios, supplemented by an analysis of the smallest batch that requires inspection, the expected number of inspections, and the performance of an easy-to-implement heuristic.
IEEE Transactions on Computational Social Systems, Apr 1, 2022
In this article, we evaluate, for the first time, the potential of a scheduled seeding strategy for influence maximization in a real-world setting. We first propose methods for analyzing historical data to quantify the infection probability of a node with a given set of properties at a given time and to assess the potential of a given seeding strategy to infect nodes. Then, we examine the potential of a scheduled seeding strategy by analyzing a real-world large-scale dataset containing both the network topology and the nodes' infection times. Specifically, we use the proposed methods to demonstrate the existence of two important effects in our dataset: a complex contagion effect and a diminishing social influence effect. As shown in a recent study, the scheduled seeding approach is expected to benefit greatly from the existence of these two effects. Finally, we compare a number of benchmark seeding strategies to a scheduled seeding strategy that ranks nodes based on a combination of the number of infectious friends (NIF) they have and the time that has passed since they became infectious. Results of our analyses show that for a seeding budget of 1%, the scheduled seeding strategy yields a convergence rate that is 14% better than a seeding strategy based solely on node degrees, and 215% better than a random seeding strategy, which is often used in practice.
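The ranking idea can be sketched as follows: each candidate node is scored by its number of infectious friends, discounted by the time elapsed since those friends became infectious. The exponential decay, its rate, and the toy graph are illustrative assumptions, not the paper's exact scoring function.

```python
import math

def scheduled_seeding_rank(graph, infection_time, now, decay=0.1):
    """graph: {node: set(neighbors)}; infection_time: {node: time infected}."""
    scores = {}
    for node, neighbors in graph.items():
        if node in infection_time:        # already infected, not a seed candidate
            continue
        score = 0.0
        for nb in neighbors:
            if nb in infection_time:      # infectious friend, weighted by recency
                score += math.exp(-decay * (now - infection_time[nb]))
        scores[node] = score
    return sorted(scores, key=scores.get, reverse=True)

graph = {"a": {"b", "c"}, "b": {"a", "c", "d"}, "c": {"a", "b"}, "d": {"b"}}
infection_time = {"a": 1.0, "c": 4.0}
print(scheduled_seeding_rank(graph, infection_time, now=5.0))
```

Given a seeding budget, the top-ranked nodes at each scheduling step would be selected as seeds.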
C-B4 Monitoring is a new method for automatically monitoring and analyzing complex processes. At the heart of the C-B4 Monitoring concept is the analysis of data patterns by the C-B4 network. Once the network is constructed, it captures all the significant dynamics and dependencies in the data. This information can be used by various applications. This presentation focuses on the integration between SAS® Forecast Server and C-B4's pattern-recognition server, with respect to demand sensing and forecast control. Reference will be given to industrial, telecom, and retail organizations.
Applied Stochastic Models in Business and Industry, 2015
In recent years, with the emergence of big data and online Internet applications, the ability to classify huge amounts of objects in a short time has become extremely important. Such a challenge can be addressed by constructing decision trees (DTs) with a low expected number of tests (ENT). We address this challenge by proposing the 'save favorable general optimal testing algorithm' (SF-GOTA), which guarantees, unlike conventional look-ahead DT algorithms, the construction of DTs with monotonically non-increasing ENT. The proposed algorithm has a lower complexity than conventional look-ahead algorithms and can utilize parallel processing to reduce execution time when needed. Several numerical studies exemplify how the proposed SF-GOTA generates efficient DTs faster than standard look-ahead algorithms, while converging to a DT with a minimum ENT.
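For concreteness, the quantity being minimized can be sketched as below: the expected number of tests (ENT) of a decision tree is the depth of each object's leaf weighted by that object's probability. The nested-tuple tree encoding and the toy probabilities are illustrative assumptions, not the algorithm's internal representation.

```python
def expected_number_of_tests(tree, prob, depth=0):
    """tree: either an object label (leaf) or (test, {outcome: subtree})."""
    if not isinstance(tree, tuple):          # leaf: an identified object
        return prob.get(tree, 0.0) * depth
    _test, branches = tree
    return sum(expected_number_of_tests(sub, prob, depth + 1)
               for sub in branches.values())

tree = ("t1", {0: "obj_a", 1: ("t2", {0: "obj_b", 1: "obj_c"})})
prob = {"obj_a": 0.5, "obj_b": 0.3, "obj_c": 0.2}
print(expected_number_of_tests(tree, prob))   # 0.5*1 + 0.3*2 + 0.2*2 = 1.5
```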
Proceedings of the 12th International Conference on Agents and Artificial Intelligence, 2020
The paper addresses the problem of probabilistic search and detection of multiple targets by a group of mobile robots that are equipped with a variety of sensors and communicate with each other at different levels. The goal is to define the trajectories of the robots in the group such that the targets are chased in minimal time. The suggested solution model follows the occupancy grid approach, and sensor fusion is implemented using a general Bayesian scheme with varying sensor sensitivity. The created control algorithm was verified in three settings with different levels of communication and information sharing between the robots and different levels of sensor sensitivity. The suggested algorithms were implemented in a software simulation to analyze and compare the different policies.
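A minimal sketch of the Bayesian occupancy-grid update underlying such a scheme follows: each robot's reading of a cell updates the posterior probability that a target occupies that cell, and readings from several robots are fused by sequential updates. The grid size, prior, and sensitivity/false-alarm values are illustrative assumptions.

```python
import numpy as np

def bayes_update(grid, cell, detection, p_detect=0.9, p_false_alarm=0.1):
    """Update the occupancy probability of one observed cell."""
    prior = grid[cell]
    if detection:
        likelihood_occ, likelihood_free = p_detect, p_false_alarm
    else:
        likelihood_occ, likelihood_free = 1 - p_detect, 1 - p_false_alarm
    posterior = likelihood_occ * prior / (
        likelihood_occ * prior + likelihood_free * (1 - prior))
    grid[cell] = posterior
    return grid

# Two robots fuse their readings of the same cell by sequential updates.
grid = np.full((4, 4), 0.25)                  # uniform prior over occupancy
grid = bayes_update(grid, (2, 3), detection=True, p_detect=0.9)
grid = bayes_update(grid, (2, 3), detection=True, p_detect=0.7)   # less sensitive sensor
print(grid[2, 3])
```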
A Risk-Scoring Feedback Model for Webpages and Web Users Based on Browsing Behavior
ACM Transactions on Intelligent Systems and Technology, 2017
It has been claimed that many security breaches are often caused by vulnerable (naïve) employees within the organization [Ponemon Institute LLC 2015a]. Thus, the weakest link in security is often not the technology itself but rather the people who use it [Schneier 2003]. In this article, we propose a machine learning scheme for detecting risky webpages and risky browsing behavior performed by naïve users in the organization. The scheme analyzes the interaction between two modules: one represents naïve users, while the other represents risky webpages. It implements a feedback loop between these modules such that if a webpage is exposed to a lot of traffic from risky users, its "risk score" increases, while in a similar manner, as a user is exposed to risky webpages (with a high "risk score"), his own "risk score" increases. The proposed scheme is tested on a real-world dataset of HTTP logs provided by a large American toolbar company. The results suggest that a feedback learning p...
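The feedback loop described above can be sketched as a simple mutual-reinforcement iteration over a bipartite browsing graph, in the spirit of HITS-style score propagation: a user's risk grows with the risk of the pages they visit, and a page's risk grows with the risk of its visitors. The seed labeling, blending weights, and iteration count are illustrative assumptions, not the paper's learning scheme.

```python
def risk_feedback(visits, risky_seed_pages, iterations=20):
    """visits: list of (user, page) events; risky_seed_pages: known-bad pages."""
    users = {u for u, _ in visits}
    pages = {p for _, p in visits}
    user_risk = {u: 0.0 for u in users}
    page_risk = {p: (1.0 if p in risky_seed_pages else 0.1) for p in pages}
    for _ in range(iterations):
        # User risk: average risk of the pages the user visited.
        for u in users:
            visited = [p for v, p in visits if v == u]
            user_risk[u] = sum(page_risk[p] for p in visited) / len(visited)
        # Page risk: blend any seed label with the average risk of its visitors.
        for p in pages:
            visitors = [u for u, q in visits if q == p]
            avg = sum(user_risk[u] for u in visitors) / len(visitors)
            seed = 1.0 if p in risky_seed_pages else 0.0
            page_risk[p] = 0.5 * seed + 0.5 * avg
    return user_risk, page_risk

visits = [("alice", "news.example"), ("bob", "shady.example"),
          ("bob", "news.example"), ("carol", "shady.example")]
print(risk_feedback(visits, risky_seed_pages={"shady.example"}))
```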
Contact mixing plays a key role in the spread of COVID-19. Thus, mobility restrictions of varying degrees, up to and including nationwide lockdowns, have been implemented in over 200 countries. To appropriately target the timing, location, and severity of measures intended to encourage social distancing at a country level, it is essential to predict when and where outbreaks will occur and how widespread they will be. We analyze aggregated, anonymized health data and cell phone mobility data from Israel. We develop predictive models for daily new cases and the test positivity rate over the next 7 days for different geographic regions in Israel. We evaluate model goodness of fit using root mean squared error (RMSE). We use these predictions in a five-tier categorization scheme to predict the severity of COVID-19 in each region over the next week. We measure magnitude accuracy (MA), the extent to which the correct severity tier is predicted. Models using mobility data outperformed models that did not use mobility data, reducing RMSE by 17.3% when predicting new cases and by 10.2% when predicting the test positivity rate. The best set of predictors for new cases consisted of the 1-day lag of the past 7-day average of new cases, along with a measure of internal movement within a region. The best set of predictors for the test positivity rate consisted of the 3-day lag of the past 7-day average test positivity rate, along with the same measure of internal movement. Using these predictors, RMSE was 4.812 cases per 100,000 people when predicting new cases and 0.79% when predicting the test positivity rate. MA in predicting new cases was 0.775, and accuracy of prediction to within one tier was 1.0. MA in predicting the test positivity rate was 0.820, and accuracy to within one tier was 0.998.
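The two evaluation measures used above can be sketched briefly: RMSE over the point predictions, and magnitude accuracy (MA) as the share of predictions whose severity tier matches the observed tier. The tier cut-offs and toy numbers below are illustrative assumptions, not the study's actual thresholds or results.

```python
import math

def rmse(pred, actual):
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(pred))

def tier(value, cutoffs=(10, 50, 100, 200)):      # five illustrative severity tiers
    return sum(value >= c for c in cutoffs)

def magnitude_accuracy(pred, actual):
    return sum(tier(p) == tier(a) for p, a in zip(pred, actual)) / len(pred)

pred_cases   = [12.0, 48.0, 150.0, 7.0]    # predicted new cases per 100k, per region
actual_cases = [15.0, 55.0, 140.0, 5.0]
print(rmse(pred_cases, actual_cases), magnitude_accuracy(pred_cases, actual_cases))
```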
We consider the use of a wireless body area network (WBAN) for remote health monitoring applications. A partially observable Markov decision process (POMDP) is used to describe the information flow and behavior of the WBAN. We then discuss a sensor activation policy used to optimize the tradeoff between power consumption and the probability of misclassifying a patient's health state. To determine the underlying health state transition probabilities, by which a patient's health state evolves, we develop a learning algorithm that uses the data collected from a group of patients, each being monitored by a WBAN. Finally, a numerical examination demonstrates the applicability of such a system, which applies the learning process and the sensor activation policy simultaneously.
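A minimal sketch of the belief update at the core of such a POMDP follows: given the current belief over health states, a transition matrix, and the likelihood of an activated sensor's reading, the posterior belief is computed by Bayes' rule. All matrices and probabilities here are illustrative assumptions, not values learned in the paper.

```python
import numpy as np

def belief_update(belief, transition, observation_likelihood):
    """belief: P(state); transition[s, s'] = P(s'|s); observation_likelihood[s'] = P(obs|s')."""
    predicted = transition.T @ belief                  # predict next-state belief
    posterior = observation_likelihood * predicted     # weight by sensor evidence
    return posterior / posterior.sum()

belief = np.array([0.9, 0.1])                 # P(healthy), P(deteriorating)
transition = np.array([[0.95, 0.05],
                       [0.10, 0.90]])
obs_likelihood = np.array([0.2, 0.8])         # abnormal reading is likelier when ill
print(belief_update(belief, transition, obs_likelihood))
```

A sensor activation policy would then decide, given the current belief, whether the expected reduction in misclassification risk justifies the power cost of activating a sensor.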
ACM Transactions on Knowledge Discovery From Data, Aug 20, 2019
The immense stream of data from mobile devices in recent years enables one to learn more about human behavior and to provide mobile phone users with personalized services. In this work, we identify clusters of users who share similar mobility behavioral patterns. We analyze trajectories of semantic locations to find users who have a similar mobility "lifestyle," even when they live in different areas. For this task, we propose a new grouping scheme called Lifestyle-Based Clustering (LBC). We represent the mobility movement of each user by a Markov model and calculate the Jensen-Shannon distances among pairs of users. The pairwise distances are represented by a similarity matrix, which is used for the clustering. To validate the unsupervised clustering task, we develop an entropy-based clustering measure, namely, an index that measures the homogeneity of mobility patterns within clusters of users. The analysis is validated on a real-world dataset that contains the location movements of 50,000 cellular phone users analyzed over a two-month period.
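A sketch of the pairwise distance underlying LBC follows: each user's mobility is summarized by a Markov transition matrix over semantic locations, and two users are compared via the Jensen-Shannon distance between corresponding transition rows. The toy matrices and the simple row-averaging are illustrative assumptions about how the per-row distances are combined.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def markov_js_distance(P, Q):
    """Average JS distance between matching rows of two transition matrices."""
    return np.mean([jensenshannon(p, q, base=2) for p, q in zip(P, Q)])

# Rows/columns: home, work, leisure (toy semantic locations).
user_a = np.array([[0.1, 0.8, 0.1],
                   [0.7, 0.2, 0.1],
                   [0.5, 0.2, 0.3]])
user_b = np.array([[0.2, 0.7, 0.1],
                   [0.6, 0.3, 0.1],
                   [0.4, 0.3, 0.3]])
print(markov_js_distance(user_a, user_b))
```

The resulting pairwise distance matrix over all users can then feed a standard clustering algorithm, and the entropy-based homogeneity index would assess the quality of the obtained clusters.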