Detecting fake news as early as possible has attracted growing attention due to its fast-spreading nature and the significant harm it can cause. As demonstrated in recent studies, the propagation pattern of fake news on social media differs from that of real news, and a number of works have extracted different types of features from the propagation pattern for detection. However, a major limitation of this approach is that the propagation network is not fully available in the early stages, and may take a long time to complete. As a result, existing network-based fake news detection methods yield low accuracy during the early stages of propagation. To bridge the research gap, in this work we: (1) propose a novel network embedding algorithm, based on the investigation of a wide range of features obtained from the propagation network, which are not well studied in previous work; and (2) design an autoencoder-based neural architecture to predict the embedding of the complete propagation...
Market Basket Analysis (MBA) is a popular technique to identify associations between products, which is crucial for business decision making. Previous studies typically adopt conventional frequent itemset mining algorithms to perform MBA. However, they generally fail to uncover rarely occurring associations among the products at their most granular level. Also, they have limited ability to capture temporal dynamics in associations between products. Hence, we propose OMBA, a novel representation learning technique for Online Market Basket Analysis. OMBA jointly learns representations for products and users such that they preserve the temporal dynamics of product-to-product and user-to-product associations. Subsequently, OMBA proposes a scalable yet effective online method to generate products' associations using their representations. Our extensive experiments on three real-world datasets show that OMBA outperforms state-of-the-art methods by as much as 21%, while emphasizing rar...
Many learning tasks involve multi-modal data streams, where continuous data from different modes convey a comprehensive description of objects. A major challenge in this context is how to efficiently interpret multi-modal information in complex environments. This has motivated numerous studies on learning unsupervised representations from multi-modal data streams. These studies aim to understand higher-level contextual information (e.g., a Twitter message) by jointly learning embeddings for the lower-level semantic units in different modalities (e.g., text, user, and location of a Twitter message). However, these methods directly associate each low-level semantic unit with a continuous embedding vector, which results in high memory requirements. Hence, deploying and continuously learning such models on low-memory devices (e.g., mobile devices) becomes a problem. To address this problem, we present METEOR, a novel MEmory and Time Efficient Online Representation learning technique,...
Proceedings of the 29th ACM International Conference on Information & Knowledge Management, 2020
Advances in Knowledge Discovery and Data Mining, 2020
Linking job seekers with relevant jobs requires matching based on not only skills, but also personality types. Although the Holland Code, also known as RIASEC, has frequently been used to group people by their suitability for six different categories of occupations, the RIASEC category labels of individual jobs are often not found in job posts. This is attributed to the significant manual effort required to assign RIASEC labels to job posts. To cope with assigning RIASEC labels to a massive number of jobs, we propose JPLink, a machine learning approach using the text content in job titles and job descriptions. JPLink exploits domain knowledge available in an occupation-specific knowledge base known as O*NET to improve feature representation of job posts. To incorporate the relative ranking of the RIASEC labels of each job, JPLink proposes a listwise loss function inspired by learning to rank. Both our quantitative and qualitative evaluations show that JPLink outperforms conventional baseli...
With the rapid evolution of social media, fake news has become a significant social problem, which cannot be addressed in a timely manner using manual investigation. This has motivated numerous studies on automating fake news detection. Most studies explore supervised training models with different modalities (e.g., text, images, and propagation networks) of news records to identify fake news. However, the performance of such techniques generally drops if news records come from different domains (e.g., politics, entertainment), especially for domains that are unseen or rarely seen during training. As motivation, we empirically show that news records from different domains have significantly different word usage and propagation patterns. Furthermore, due to the sheer volume of unlabelled news records, it is challenging to select news records for manual labelling so that the domain coverage of the labelled dataset is maximized. Hence, this work: (1) proposes a novel framework th...
Personal values have a significant influence on individuals' behaviors, preferences, and decision making. It is therefore no surprise that the personal values of a person could influence his or her social media content and activities. Instead of getting users to complete a personal value questionnaire, researchers have looked into a non-intrusive and highly scalable approach to predict personal values using user-generated social media data. Nevertheless, geographical differences in word usage and profile information are issues to be addressed when designing such prediction models. In this work, we focus on analyzing Singapore users' personal values, and developing effective models to predict their personal values using their Facebook data. These models leverage word categories in Linguistic Inquiry and Word Count (LIWC) and correlations among personal values. The LIWC word categories are adapted to non-English word use in Singapore. We incorporate the correlations among person...
Understanding Multilingual Communities through Analysis of Code-switching Behaviors in Social Media Discussions
Currently, the enormous span of social media usage, while providing valuable resources for linguistic behavior analysis, makes tracking and understanding these multilingual discussions a challenging task. We have undertaken a multidisciplinary comprehensive study of multilingual discussions via the development of specialized data collection techniques that discover and track multilingual users of social media, and their associated discussions, within a defined geographical region. To facilitate automatic analysis of large numbers of discussions, we generated a machine learning model based on ground truth data obtained from Amazon Turk. Our approach goes beyond analyzing social media posts in isolation, by analyzing them in the context of the discussion in which they appear. We show a selection of example discussions found using our approach, which reveals a number of interesting socio-linguistic interactions in the communities that we sampled, in support of approach as a ...
Recent years have witnessed the significant damage caused by various types of fake news. Although considerable effort has been applied to address this issue and much progress has been made on detecting fake news, most existing approaches mainly rely on the textual content and/or social context, while knowledge-level information (entities extracted from the news content and the relations between them) is much less explored. Within the limited work on knowledge-based fake news detection, an external knowledge graph is often required, which may introduce additional problems: it is quite common for entities and relations, especially with respect to new concepts, to be missing in existing knowledge graphs, and both entity prediction and link prediction are open research questions themselves. Therefore, in this work, we investigate knowledge-based fake news detection that does not require any external knowledge graph. Specifically, our contributions include: (1) transforming the problem of ...
Propagation2Vec: Embedding partial propagation networks for explainable fake news early detection
Information Processing & Management
Many recent studies have demonstrated that the propagation patterns of news on social media can facilitate the detection of fake news. Most of these studies rely on the complete propagation networks to build their models, but these networks are not fully available in the early stages and may take a long time to complete. Hence, relying on the complete propagation network is not ideal for fake news early detection. However, detecting fake news as early as possible is important due to its fast-spreading nature and the significant harm it can cause. In addition, most existing propagation network-based fake news detection techniques are not explicitly designed to jointly emphasise informative cascades and nodes in the propagation networks to detect fake news. To bridge these research gaps, this work proposes Propagation2Vec, a novel fake news early detection technique, which assigns varying levels of importance to the nodes and cascades in propagation networks, and reconstructs the knowledge of complete propagation networks based on their partial propagation networks at an early detection deadline. Our experiments show that our model can achieve state-of-the-art performance while only having access to the early-stage propagation networks. Furthermore, we devise general explanations for the underlying logic of Propagation2Vec based on its attention weights assigned to different nodes and cascades, which improves the applicability of our approach and facilitates future research on propagation network-based fake news detection.
2019 IEEE International Conference on Big Data (Big Data), Dec 1, 2019
Building spatiotemporal activity models for people's activities in urban spaces is important for understanding the ever-increasing complexity of urban dynamics. With the emergence of Geo-Tagged Social Media (GTSM) records, previous studies demonstrate the potential of GTSM records for spatiotemporal activity modeling. State-of-the-art methods for this task embed different modalities (location, time, and text) of GTSM records into a single embedding space. However, they ignore Non-GeoTagged Social Media (NGTSM) records, which generally account for the majority of posts (e.g., more than 95% in Twitter), and could represent a great source of information to alleviate the sparsity of GTSM records. Furthermore, in the current spatiotemporal embedding techniques, less focus has been given to the users, who exhibit spatially motivated behaviors. To bridge this research gap, this work proposes USTAR, a novel online learning method for User-guided SpatioTemporal Activity Representation, which (1) embeds locations, time, and text along with users into the same embedding space to capture their correlations; (2) uses a novel collaborative filtering approach based on two different empirically studied user behaviors to incorporate both NGTSM and GTSM records in learning; and (3) introduces a novel sampling technique to learn spatiotemporal representations in an online fashion to accommodate recent information into the embedding space, while avoiding overfitting to recent records and frequently appearing units in social media streams. Our results show that USTAR substantially improves the state-of-the-art for region retrieval and keyword retrieval, and demonstrate its potential to be applied to other downstream applications such as local event detection.
Proceedings of the 13th International Conference on Computational Semantics - Short Papers
Word embedding learning, a technique in Natural Language Processing (NLP) that maps words into vector space representations, is one of the most popular research directions in modern NLP by virtue of its potential to boost the performance of many NLP downstream tasks. Nevertheless, most of the underlying word embedding methods such as word2vec and GloVe fail to produce high-quality representations if the text corpus is small and sparse. This paper proposes a method to generate effective word embeddings from limited data. Empirically, we show that the proposed model outperforms existing works for the classical word similarity task and for a domain-specific application.
Proceedings of The 12th International Workshop on Semantic Evaluation
This paper describes the SemEval 2018 shared task on semantic extraction from cybersecurity reports, which is introduced for the first time as a shared task on SemEval. This task comprises four SubTasks done incrementally to predict the characteristics of a specific malware using cybersecurity reports. To the best of our knowledge, we introduce the world's largest publicly available dataset of annotated malware reports in this task. The task received a total of 18 submissions from 9 participating teams.
Papers by Amila Silva