2018 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), 2018
With the rapid growth in urban transit networks in recent years, detecting service disruptions in a timely manner is a problem of increased interest to service providers. Transit agencies are seeking to move beyond traditional customer questionnaires and manual service inspections to leveraging open source indicators like social media for detecting emerging transit events. In this paper, we leverage Twitter data for early detection of metro service disruptions. Inspired by the multi-task learning framework, we propose the Metro Disruption Detection Model, which captures the semantic similarity between transit lines in Twitter space. We propose novel constraints on feature semantic similarity exploiting prior knowledge about the spatial connectivity and shared tracks of the metro network. An algorithm based on the alternating direction method of multipliers (ADMM) framework is developed to solve the proposed model. We run extensive experiments and comparisons to other models with real world Twitter data and transit disruption records from the Washington Metropolitan Area Transit Authority (WMATA) to justify the efficacy of our model.
2020 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM)
Open and accessible public utilities such as mass public transit systems are some of the vexing venues that are vulnerable to several criminal acts due to the large volumes of commuters. Existing forms of threat or event detection for the rail-based transit systems are either not working in real-time or do not provide complete coverage. In this paper, we present RISECURE, an open-source system that uses real-time social media mining to aid in the early detection of such possible events within a rail-based/metro system. The system leverages dynamic query expansion to keep track of any new emerging information about any particular incident. The Real Time Incident panel of the proposed system provides a comprehensible representation of the evolution of threatening transit events, which are further shown in the storyline modal for each respective station. The alert notification module of the system is capable of monitoring threats to the rail-based/metro systems in real-time. We demonstrate the system by including case studies involving incidents occurring within the Washington DC Metropolitan Area Transit Authority (WMATA) metro system to justify the effectiveness of our approach.
Proceedings of the AAAI Conference on Artificial Intelligence
Airports are a prime target for terrorist organizations, drug traffickers, smugglers, and other nefarious groups. Traditional forms of security assessment are not real-time and often do not exist for each airport and port of entry. Thus, homeland security professionals must rely on measures of attractiveness of an airport as a target for attacks. We present an open source indicators approach, using news and social media, to conduct relative threat assessment, i.e., estimating if one airport is under greater threat than another. The three ingredients of our approach are a dynamic query expansion algorithm for tracking emerging threat-related chatter, news-Twitter reciprocity modeling for capturing interactions between social and traditional media, and a ranking scheme to provide an ordered assessment of airport threats. Case studies based on actual aviation incidents are presented.
2017 IEEE International Conference on Big Data (Big Data), 2017
In the era of information overload, people are struggling to make sense of complex story events in massive social media data. Most existing approaches are designed to address event extraction in news reports, documents and abstracts, but such approaches are not suitable for Twitter data streams due to their unstructured language, short-length messages, and heterogeneous features; few existing approaches generate a story by considering both the shared topics throughout the story and the smooth connection between successive nodes simultaneously. In this paper, a novel Twitter stoRy generation framework via shAred subspaCe and tEmporal Smoothness called TRACES is proposed. Given a query of an ongoing event, a novel multi-task clustering method integrated with shared subspace and temporal smoothness (STMTC) is proposed to generate the event stories. Extensive experimental evaluations of data sets for different events demonstrate the effectiveness of this new approach.
2017 IEEE International Conference on Big Data (Big Data), 2017
In today's era of information overload, people are struggling to detect the evolution of hot topics from massive news media and microblogs such as Twitter. Reports from mainstream news agencies and discussions from microblogs could complement each other to form a complete picture of major events. Existing work has generally focused on a single source, seldom attempting to combine multiple sources to track the evolution of topics: emerging, evolving and fading phrases as this would require a considerably more sophisticated model. This paper proposes a novel story discovery model that integrates evolutionary topics in news and Twitter data sources using an incremental algorithm by 1) discovering complementary information from news and microblogs that provides a more complete view of major events; 2) modeling emerging, evolving and fading topics and features throughout ongoing events; and 3) creating a scalable algorithm that is capable of handling massive data from news and social...
Deriving event storylines is an effective summarization method to succinctly organize extensive information, which can significantly alleviate the pain of information overload. The critical challenge is the lack of a widely recognized definition of a storyline metric. Prior studies have developed various approaches based on different assumptions about users' interests. These works can extract interesting patterns, but their assumptions do not guarantee that the derived patterns will match users' preferences. In addition, their exclusive reliance on a single-modality source misses cross-modality information. This paper proposes a method, multimodal imitation learning via Generative Adversarial Networks (MIL-GAN), to directly model users' interests as reflected by various data. In particular, the proposed model addresses the critical challenge by imitating users' demonstrated storylines. Our proposed model is designed to learn the reward patterns given user-provided storyline...
2016 IEEE International Conference on Big Data (Big Data), 2016
Connecting the dots between diverse entities such as people and organizations is a vital task for forming hypotheses and uncovering latent relationships among complex and large datasets. Most existing approaches are designed to address the relationship of entities in news reports, documents and abstracts, but such approaches are not suitable for Twitter data streams due to their unstructured languages, short-length messages, heterogeneous features and massive size. The sheer size of Twitter data requires more efficient algorithms to connect the dots within a short period of time. We present a system that automatically constructs stories by connecting entities in Twitter datasets. An entity similarity model is designed that combines both traditional entity-related features and social network attributes and a novel story generation algorithm applied on the similarity model is proposed to cope with the massive Twitter datasets. Extensive experimental evaluations were conducted to demon...
Proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, 2019
Critical incident stages identification and reasonable prediction of traffic incident duration are essential in traffic incident management. In this paper, we propose a traffic incident duration prediction model that simultaneously predicts the impact of the traffic incidents and identifies the critical groups of temporal features via a multi-task learning framework. First, we formulate a sparsity optimization problem that extracts low-level temporal features based on traffic speed readings and then generalizes higher level features as phases of traffic incidents. Second, we propose novel constraints on feature similarity exploiting prior knowledge about the spatial connectivity of the road network to predict the incident duration. The proposed problem is challenging to solve due to the orthogonality constraints, non-convex objective, and non-smooth penalties. We develop an algorithm based on the alternating direction method of multipliers (ADMM) framework to solve the proposed formulation. Extensive experiments and comparisons to other models on real-world traffic data and traffic incident records justify the efficacy of our model.
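As a rough illustration of the ADMM splitting idea mentioned in this abstract, the sketch below applies it to a toy problem, not to the paper's formulation: minimize 0.5*||x - a||^2 + lam*||z||_1 subject to x = z. The function names, the penalty parameter `rho`, and the iteration count are assumptions chosen for the example. The known closed-form answer is the soft-thresholding of `a` by `lam`, which makes the result easy to check.

```python
def soft_threshold(v, t):
    """Shrink v toward zero by t (proximal operator of t * |v|)."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def admm_l1_denoise(a, lam, rho=1.0, iters=100):
    """Toy ADMM for: min 0.5*||x - a||^2 + lam*||z||_1  s.t.  x = z.

    Illustrative only; rho and iters are arbitrary example settings.
    """
    n = len(a)
    x = [0.0] * n
    z = [0.0] * n
    u = [0.0] * n  # scaled dual variable
    for _ in range(iters):
        # x-update: minimize the smooth quadratic term plus the penalty.
        x = [(a[i] + rho * (z[i] - u[i])) / (1.0 + rho) for i in range(n)]
        # z-update: proximal step for the non-smooth L1 term.
        z = [soft_threshold(x[i] + u[i], lam / rho) for i in range(n)]
        # dual update: accumulate the constraint violation x - z.
        u = [u[i] + x[i] - z[i] for i in range(n)]
    return z


result = admm_l1_denoise([3.0, -0.5, 1.5], lam=1.0)
```

The split into a smooth x-update and a non-smooth z-update is the general pattern; a multi-task model with orthogonality constraints would need additional, problem-specific update steps that are not shown here.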
Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, 2016
Storyline detection aims to connect seemingly irrelevant single documents into meaningful chains, which provides opportunities for understanding how events evolve over time and what triggers such evolutions. Most previous work generated the storylines through unsupervised methods that can hardly reveal underlying factors driving the evolution process. This paper introduces a Bayesian model to generate storylines from massive documents and infer the corresponding hidden relations and topics. In addition, our model is the first attempt that utilizes Twitter data as human input to "supervise" the generation of storylines. Through extensive experiments, we demonstrate our proposed model can achieve significant improvement over baseline methods and can be used to discover interesting patterns for real world cases.
Proceedings of the 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, 2019
Cybersecurity event detection is a crucial problem for mitigating effects on various aspects of society. Social media has become a notable source of indicators for detection of diverse events. Though previous social media based strategies for cybersecurity event detection focus on mining certain event-related words, the dynamic and evolving nature of online discourse limits the performance of these approaches. Further, because these are typically unsupervised or weakly supervised learning strategies, they do not perform well in an environment of biased samples, noisy context, and informal language which is routine for online, user-generated content. This paper takes a supervised learning approach by proposing a novel multi-task learning based model. Our model can handle diverse structures in feature space by learning models for different types of potential high-profile targets simultaneously. For parameter optimization, we develop an efficient algorithm based on the alternating direction method of multipliers. Through extensive experiments on a real world Twitter dataset, we demonstrate that our approach consistently outperforms existing methods at encoding and identifying cybersecurity incidents.
ACM Transactions on Knowledge Discovery from Data, 2020
Probabilistic topic models, which can discover hidden patterns in documents, have been extensively studied. However, rather than learning from a single document collection, numerous real-world applications demand a comprehensive understanding of the relationships among various document sets. To address such needs, this article proposes a new model that can identify the common and discriminative aspects of multiple datasets. Specifically, our proposed method is a Bayesian approach that represents each document as a combination of common topics (shared across all document sets) and distinctive topics (distributions over words that are exclusive to a particular dataset). Through extensive experiments, we demonstrate the effectiveness of our method compared with state-of-the-art models. The proposed model can be useful for “comparative thinking” analysis in real-world document collections.
The exponential growth of the urban data generated by urban sensors, government reports, and crowd-sourcing services endorses the rapid development of urban computing and spatial data mining technologies. Easier accessibility to such enormous urban data may be a double-edged sword. On the one hand, urban data can be applied to solve a wide range of practical issues such as urban safety analysis and urban event detection. On the other hand, ethical issues such as biased or polluted urban data, problematic algorithms, and unprotected privacy may cause moral disaster not only for the research fields but also for society. This paper seeks to identify ethical vulnerabilities from three primary research directions of urban computing: urban safety analysis, urban transportation analysis, and social media analysis for urban events. Visions for future improvements from the perspective of ethics are addressed.
Advancements in communication infrastructures and low access barriers to communication sinks (e.g., personal mobile devices) have dramatically increased the size and reach of open source data such as those observed in social media: Twitter feeds, user blogs, Flickr images, and others. In several cases, the data have been implicitly or explicitly encoded with spatial and temporal attributes manifested in a variety of forms such as place names in Tweets and GPS coordinates in Flickr. Exploiting the open source data in conjunction with their spatiotemporal contexts can enhance our understanding of the physical environment, societal condition, and the dynamic and complex relationships between them. For example, in the context of disaster response, Twitter feeds and Flickr imagery can provide a rich and valuable avenue for monitoring the spatial distribution of affected areas and population sentiments to positively impact relief efforts such as those following Hurricane Sandy. Also, during the Arab Spring, the geographic evolution of population attitudes as observed in various social media can provide effective indicators of demonstrations and protests. These examples underline the importance of geo-social media in bringing awareness, insights, and decisions to impact these major events. This special issue includes five articles delving into the following themes of geo-social media data: understanding complex patterns and relationships between humans and their environment,
There is a glaring need to improve the tracking of brand perception, which is ill-served due to time-consuming techniques [1]. Online surveys take pre-set questions from companies and present them to users. Offline surveys handpick representative users and ask them detailed questions about products. Responses are then carefully analyzed, making the process time consuming and expensive. The cumbersome nature of traditional survey techniques also precludes companies from taking advantage of new trends or rapidly rectifying negative developments in perception. This work presents DERIV, a novel framework to track user perception of a brand in near real time using open data such as tweets. Current techniques that measure brand perception rely on the sentiment of users. This approach is limited as most opinions from customers have little or no sentiment attached to them. For instance, the phrase 'Electric Car Z goes 300 miles in a single charge' shows positive sentiment towards a brand. However, sentiment analysis techniques will frequently identify this statement as neutral. Measuring sentiment from each customer tweet or social media post also does not convey what is being said about a brand across sources and across elements. Instead of using raw social media posts, we employ storylines (see the next paragraph for an example of what a storyline looks like) which are entities (people, organizations, things) linked by edges represented by the observed relationships between the entities. These relationships are normally the verbs
ACM Transactions on Spatial Algorithms and Systems, 2016
Event forecasting from social media data streams has many applications. Existing approaches focus on forecasting temporal events (such as elections and sports) but as yet cannot forecast spatiotemporal events such as civil unrest and influenza outbreaks, which are much more challenging. To achieve spatiotemporal event forecasting, spatial features that evolve with time and their underlying correlations need to be considered and characterized. In this article, we propose novel batch and online approaches for spatiotemporal event forecasting in social media such as Twitter. Our models characterize the underlying development of future events by simultaneously modeling the structural contexts and their spatiotemporal burstiness based on different strategies. Both batch and online-based inference algorithms are developed to optimize the model parameters. Utilizing the trained model, the alignment likelihood of tweet sequences is calculated by dynamic programming. Extensive experimental e...
ISPRS International Journal of Geo-Information, 2016
In massive Twitter datasets, tweets deriving from different domains, e.g., civil unrest, can be extracted to constitute spatio-temporal Twitter events for spatio-temporal distribution pattern detection. Existing algorithms generally employ scan statistics to detect spatio-temporal hotspots from Twitter events and do not consider the spatio-temporal evolving process of Twitter events. In this paper, a framework is proposed to discover evolving domain related spatio-temporal patterns from Twitter data. Given a target domain, a dynamic query expansion is employed to extract related tweets to form spatio-temporal Twitter events. The new spatial clustering approach proposed here is based on the use of multi-level constrained Delaunay triangulation to capture the spatial distribution patterns of Twitter events. An additional spatio-temporal clustering process is then performed to reveal spatio-temporal clusters and outliers that are evolving into spatial distribution patterns. Extensive experiments on Twitter datasets related to an outbreak of civil unrest in Mexico demonstrate the effectiveness and practicability of the new method. The proposed method will be helpful to accurately predict the spatio-temporal evolution process of Twitter events, which belongs to a deeper geographical analysis of spatio-temporal Big Data.
Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016
EMBERS is an anticipatory intelligence system forecasting population-level events in multiple countries of Latin America. A deployed system from 2012, EMBERS has been generating alerts 24x7 by ingesting a broad range of data sources including news, blogs, tweets, machine coded events, currency rates, and food prices. In this paper, we describe our experiences operating EMBERS continuously for nearly 4 years, with specific attention to the discoveries it has enabled, correct as well as missed forecasts, and lessons learnt from participating in a forecasting tournament including our perspectives on the limits of forecasting and ethical considerations.
Proceedings of the 2nd International Conference on Geographical Information Systems Theory, Applications and Management, 2016
Event analysis in social media is challenging due to the endless amount of information generated daily. While current research has put a strong focus on detecting events, there is no clear guidance on how those storylines should be processed such that they would make sense to a human analyst. In this paper, we present DISTL, an event processing platform which takes as input a set of storylines (a sequence of entities and their relationships) and processes them as follows: (1) uses different algorithms (LDA, SVM, information gain, rule sets) to identify events with different themes and allocates storylines to them; and (2) combines the events with location and time to narrow down to the ones that are meaningful in a specific scenario. The output comprises sets of events in different categories. DISTL uses in-memory distributed processing that scales to high data volumes and categorizes generated storylines in near real-time. It uses Big Data tools, such as Hadoop and Spark, which have been shown to be highly efficient in handling millions of tweets concurrently.
2015 IEEE International Conference on Big Data (Big Data), 2015
Twitter has become a popular social sensor. It is socially significant to surveil the tweet content under crucial themes such as "disease" and "civil unrest". However, this creates two challenges: 1) how to characterize the theme pattern, given Twitter's heterogeneity, dynamics, and unstructured language; and 2) how to model the theme consistently across multiple Twitter functions such as hashtags, replying, and friendships. In this paper, we propose a dynamic query expansion (DQE) model for theme tracking in Twitter. Specifically, DQE characterizes the theme consistency among heterogeneous entities (e.g., terms, tweets, and users) through semantic and social relationships, including co-occurrence, replying, authorship, and friendship. The proposed new optimization algorithm estimates the weight of each relationship by minimizing the Kullback-Leibler divergence. To demonstrate the effectiveness and scalability of DQE, we conducted extensive experiments to track the theme "civil unrest" across 8 Latin American countries.
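For readers unfamiliar with the quantity being minimized here, the following minimal sketch computes the Kullback-Leibler divergence between two discrete distributions. It shows only the divergence itself; how DQE builds the distributions and optimizes the relationship weights is not described in this abstract, so the function name, the list-based inputs, and the `eps` guard are assumptions made for illustration.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions of equal length.

    Terms with p_i == 0 contribute nothing; eps guards against q_i == 0.
    (eps is an illustrative safeguard, not part of the definition.)
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i > 0:
            total += p_i * math.log(p_i / max(q_i, eps))
    return total


same = kl_divergence([0.5, 0.5], [0.5, 0.5])
skewed = kl_divergence([1.0, 0.0], [0.5, 0.5])
```

The divergence is zero when the two distributions match and grows as they diverge, which is why it can serve as a fitting objective.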
Proceedings of the 2006 SIAM International Conference on Data Mining, 2006
Spatial outliers are the spatial objects with distinct features from their surrounding neighbors. Detection of spatial outliers helps reveal valuable information from large spatial data sets. In many real applications, spatial objects cannot be simply abstracted as isolated points. They have different boundaries, sizes, volumes, and locations. These spatial properties affect the impact of a spatial object on its neighbors and should be taken into consideration. In this paper, we propose two spatial outlier detection methods which integrate the impact of spatial properties into the outlierness measurement. Experimental results on a real data set demonstrate the effectiveness of the proposed algorithms.
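The general idea of scoring an object against its neighbors can be sketched as follows. This is a generic neighborhood-difference baseline, not either of the two methods proposed in the paper: each object's attribute value is compared with a weighted average of its neighbors' values, and the differences are standardized. The optional `weights` mapping is a hypothetical stand-in for the influence of spatial properties such as size or shared boundary length.

```python
import statistics


def spatial_outlier_scores(values, neighbors, weights=None):
    """Standardized difference between each object and its neighborhood.

    values:    attribute value per object
    neighbors: neighbors[i] is a non-empty list of indices adjacent to i
    weights:   optional {(i, j): w} influence of neighbor j on object i
               (hypothetical; defaults to equal influence)
    """
    diffs = []
    for i, value in enumerate(values):
        w = [1.0 if weights is None else weights.get((i, j), 1.0)
             for j in neighbors[i]]
        avg = sum(wj * values[j] for wj, j in zip(w, neighbors[i])) / sum(w)
        diffs.append(value - avg)
    mean = statistics.fmean(diffs)
    std = statistics.pstdev(diffs)
    if std == 0:
        return [0.0] * len(values)  # no object differs from its neighborhood
    return [abs(d - mean) / std for d in diffs]


# Six objects on a ring; object 3 has a value unlike its neighbors.
ring_values = [10, 10, 10, 50, 10, 10]
ring_neighbors = [[(i - 1) % 6, (i + 1) % 6] for i in range(6)]
scores = spatial_outlier_scores(ring_values, ring_neighbors)
```

Objects whose score exceeds a chosen threshold would be flagged as candidate outliers; the threshold itself is application-dependent.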
Papers by Chang-Tien Lu