When a recommendation algorithm considers only users’ preferences and interests, it leads to a severe long-tail problem over items. The unfair exposure of recommended items caused by this problem has therefore attracted widespread attention in recent years. For the first time, we reveal that there is a more serious unfair exposure problem in session-based recommender systems (SRSs), which learn the short-term and dynamic preferences of users from anonymous sessions. Considering that in SRSs recommendations are provided multiple times and item exposures are accumulated over interactions in a session, we define new metrics both for the fairness of item exposure and for recommendation quality among sessions. Moreover, we design a dynamic Fairness-Assurance STrategy for sEssion-based Recommender systems (FASTER). FASTER is a post-processing strategy that tries to keep a balance between item exposure fairness and recommendation quality. I...
Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering
Searching and reusing code snippets from open-source software repositories based on natural-language queries can greatly improve programming productivity. Recently, deep-learning-based approaches have become increasingly popular for code search. Despite substantial progress in training accurate models of code search, the robustness of these models has received little attention so far. In this paper, we aim to study and understand the security and robustness of code search models by answering the following questions: Can we inject backdoors into deep-learning-based code search models? If so, can we detect poisoned data and remove these backdoors? This work studies and develops a series of backdoor attacks on deep-learning-based models for code search through data poisoning. We first show that existing models are vulnerable to data-poisoning-based backdoor attacks. We then introduce a simple yet effective attack on neural code search models by poisoning their corresponding training dataset. Moreover, we demonstrate that attacks can also influence the ranking of the code search results by adding a few specially-crafted source code files to the training corpus. We show that this type of backdoor attack is effective for several representative deep-learning-based code search systems, and can successfully manipulate the
Proceedings of the 44th International Conference on Software Engineering
Recently, many pre-trained language models for source code have been proposed to model the context of code and serve as a basis for downstream code intelligence tasks such as code completion, code search, and code summarization. These models leverage masked pre-training and the Transformer architecture and have achieved promising results. However, there is still little progress regarding the interpretability of existing pre-trained code models. It is not clear why these models work and what feature correlations they can capture. In this paper, we conduct a thorough structural analysis aiming to provide an interpretation of pre-trained language models for source code (e.g., CodeBERT and GraphCodeBERT) from three distinctive perspectives: (1) attention analysis, (2) probing on the word embedding, and (3) syntax tree induction. Through comprehensive analysis, this paper reveals several insightful findings that may inspire future studies: (1) Attention aligns strongly with the syntax structure of code. (2) Pre-trained language models of code can preserve the syntax structure of code in the intermediate representations of each Transformer layer. (3) The pre-trained models of code have the ability to induce syntax trees of code. These findings suggest that it may be helpful to incorporate the syntax structure of code into the process of pre-training for better code representations. CCS CONCEPTS • Software and its engineering → Reusability.
Nowadays, huge amounts of text are being generated for social networking purposes on the Web. Keyword extraction from such texts, like microblog posts, benefits many applications such as advertising, search, and content filtering. Unlike traditional web pages, a microblog post usually has some special social feature like a hashtag that is topical in nature and generated by users. Extracting keywords related to hashtags can reflect the intents of users and thus provide a better understanding of post content. In this paper, we propose a novel unsupervised keyword extraction approach for microblog posts by treating hashtags as topical indicators. Our approach consists of two hashtag-enhanced algorithms. One is a topic model algorithm that infers topic distributions biased to hashtags on a collection of microblog posts. The words are ranked by their average topic probabilities. Our topic model algorithm can not only find the topics of a collection, but also extract hashtag-related keywords. The other is a random-walk-based algorithm. It first builds a word-post weighted graph by taking into account the posts themselves. Then, a hashtag-biased random walk is applied on this graph, which guides the algorithm to extract keywords according to hashtag topics. Last, the final ranking score of a word is determined by the stationary probability after a number of iterations. We evaluate our proposed approach on a collection of real Chinese microblog posts. Experiments show that our approach is more effective in terms of precision than traditional approaches that ignore hashtags. The combination of the two algorithms performs even better than each individual algorithm.
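The hashtag-biased random walk in the abstract above can be illustrated with a generic random walk with restart. This is only a minimal sketch, not the authors' algorithm: it uses a plain word-word graph rather than the paper's word-post weighted graph, and the function name `biased_random_walk`, the `damping` value, and the toy graph are all hypothetical choices made for illustration.

```python
def biased_random_walk(graph, bias, damping=0.85, iterations=50):
    """Random walk with restart on a weighted graph (illustrative sketch).

    graph: dict mapping node -> dict of neighbor -> positive edge weight;
           every neighbor must also appear as a key of `graph`.
    bias:  dict mapping node -> restart weight (assumed here to be higher
           for words related to a hashtag); at least one weight must be > 0.
    Returns a dict of node -> approximate stationary probability.
    """
    nodes = list(graph)
    total_bias = sum(bias.get(n, 0.0) for n in nodes)
    restart = {n: bias.get(n, 0.0) / total_bias for n in nodes}
    score = dict(restart)
    for _ in range(iterations):
        # Restart mass goes to the biased (hashtag-related) nodes.
        nxt = {n: (1.0 - damping) * restart[n] for n in nodes}
        for src in nodes:
            out_weight = sum(graph[src].values())
            if out_weight == 0:
                # Dangling node: send its mass back through the restart vector.
                for n in nodes:
                    nxt[n] += damping * score[src] * restart[n]
                continue
            for dst, weight in graph[src].items():
                nxt[dst] += damping * score[src] * weight / out_weight
        score = nxt
    return score


# Toy example: "worldcup" plays the role of the hashtag word.
graph = {
    "worldcup": {"goal": 2.0, "match": 1.0},
    "goal": {"worldcup": 2.0, "match": 1.0},
    "match": {"worldcup": 1.0, "goal": 1.0, "weather": 1.0},
    "weather": {"match": 1.0},
}
scores = biased_random_walk(graph, bias={"worldcup": 1.0})
ranked = sorted(scores, key=scores.get, reverse=True)
```

In this toy graph the words closest to the hashtag word rank highest and the loosely connected word ranks last, which is the behaviour the abstract attributes to the hashtag bias.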
Code summarization (aka comment generation) provides a high-level natural language description of the function performed by code, which can benefit software maintenance, code categorization, and retrieval. To the best of our knowledge, the state-of-the-art approaches follow an encoder-decoder framework which encodes source code into a hidden space and later decodes it into a natural language space. Such approaches suffer from the following drawbacks: (a) they mainly represent code as a sequence of tokens while ignoring code hierarchy; (b) most of the encoders only input simple features (e.g., tokens) while ignoring the features that can help capture the correlations between comments and code; (c) the decoders are typically trained to predict subsequent words by maximizing the likelihood of subsequent ground truth words, while in the real world they are expected to generate the entire word sequence from scratch. As a result, such drawbacks lead to inferior and inconsistent comment generation accuracy. To address the above limitations, this paper presents a new code summarization approach using a hierarchical attention network that incorporates multiple code features, including type-augmented abstract syntax trees and program control flows. Such features, along with plain code sequences, are injected into a deep reinforcement learning (DRL) framework (e.g., actor-critic network) for comment generation. Our approach assigns weights (pays "attention") to tokens and statements when constructing the code representation to reflect the hierarchical code structure under different contexts regarding code features (e.g., control flows and abstract syntax trees). 
Our reinforcement learning mechanism further strengthens the prediction results through the actor network and the critic network, where the actor network provides the confidence of predicting subsequent words based on the current state, and the critic network computes the reward values of all the possible extensions of the current state to provide global guidance for explorations. Eventually, we employ an advantage reward to train both networks and conduct a set of experiments on a real-world dataset. The experimental results demonstrate that our approach outperforms the baselines by around 22% to 45% in BLEU-1 and outperforms the state-of-the-art approaches by around 5% to 60% in terms of S-BLEU and C-BLEU.
Recommender systems are important approaches for dealing with the information overload problem in the big data era, and various kinds of auxiliary information, including time and sequential information, can help to improve the performance of retrieval and recommendation tasks. However, it is still a challenging problem to fully exploit such information to achieve high-quality recommendation results and improve users' experience. In this work, we present a novel sequential recommendation model named Multivariate Hawkes Process Embedding with attention (MHPE-a), which combines a temporal point process with an attention mechanism to predict the items that the target user may interact with according to her/his historical records. Specifically, the proposed approach MHPE-a can model users' sequential patterns in their temporal interaction sequences accurately with a multivariate Hawkes process. Then, we perform accurate sequential recommendation to satisfy target users' real-time requirements based on their preferences obtained with MHPE-a from their historical records. In particular, an attention mechanism is used to leverage users' long/short-term preferences adaptively to achieve accurate sequential recommendation. Extensive experiments are conducted on two real-world datasets (lastfm and gowalla), and the results show that MHPE-a achieves better performance than state-of-the-art baselines.
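To make the multivariate Hawkes process mentioned above more concrete, the sketch below computes the conditional intensity of one item given a user's past interactions. It is an illustration under stated assumptions rather than the MHPE-a model itself: the exponential decay kernel, the `decay` parameter, and the lookup tables `base_rate` and `excitation` are hypothetical, and the abstract does not say how MHPE-a parameterizes them (it learns embeddings and attention weights instead).

```python
import math


def hawkes_intensity(target, t, history, base_rate, excitation, decay=1.0):
    """Conditional intensity of item `target` at time `t` (illustrative sketch).

    history:    list of (event_time, item) pairs for one user.
    base_rate:  dict item -> background rate mu (assumed given).
    excitation: dict (source_item, target_item) -> alpha, how strongly a past
                interaction with source_item raises the rate of target_item.
    decay:      rate of the assumed exponential kernel exp(-decay * dt).
    """
    rate = base_rate.get(target, 0.0)
    for event_time, item in history:
        if event_time < t:  # only strictly earlier events contribute
            alpha = excitation.get((item, target), 0.0)
            rate += alpha * math.exp(-decay * (t - event_time))
    return rate


# Toy usage: two past interactions raise the intensity of item "c".
history = [(0.0, "a"), (1.0, "b")]
base_rate = {"c": 0.1}
excitation = {("a", "c"): 0.5, ("b", "c"): 0.2}
intensity_c = hawkes_intensity("c", 2.0, history, base_rate, excitation)
```

Ranking candidate items by such an intensity at the current time gives a simple next-item recommender; the more recent interaction contributes more per unit of excitation because its kernel has decayed less.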
Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval
Modern recommender systems operate in a fully server-based fashion. To cater to millions of users, frequent model maintenance and high-speed processing of concurrent user requests are required, which comes at the cost of a huge carbon footprint. Meanwhile, users need to upload their behavior data, even including the immediate environmental context, to the server, raising public concern about privacy. On-device recommender systems circumvent these two issues with cost-conscious settings and local inference. However, due to limited memory and computing resources, on-device recommender systems are confronted with two fundamental challenges: (1) how to reduce the size of regular models to fit edge devices? (2) how to retain the original capacity? Previous research mostly adopts tensor decomposition techniques to compress regular recommendation models with low compression rates so as to avoid drastic performance degradation. In this paper, we explore ultra-compact models for next-item recommendation by loosening the constraint of dimensionality consistency in tensor decomposition. To compensate for the capacity loss caused by compression, we develop a self-supervised knowledge distillation framework which enables the compressed model (student) to distill the essential information lying in the raw data, and improves the long-tail item recommendation through an embedding-recombination strategy with the original model (teacher). Extensive experiments on two benchmarks demonstrate that, with 30x size reduction, the compressed model incurs almost no accuracy loss, and even outperforms its uncompressed counterpart.
We explore the semantic-rich structured information derived from the knowledge graphs (KGs) associated with the user-item interactions and aim to reason out the motivations behind each successful purchase behavior. Existing works on KG-based explainable recommendation focus purely on path reasoning based on current user-item interactions, which generally leaves them incapable of conjecturing users' subsequent preferences. Considering this, we attempt to model KG-based explainable recommendation in sequential settings. Specifically, we propose a novel architecture called Reinforced Sequential Learning with Gated Recurrent Unit (RSL-GRU), which is composed of a Reinforced Path Reasoning Network (RPRN) component and a GRU component. RSL-GRU takes users' sequential behaviors and their associated KGs in chronological order as input and outputs potential top-N items for each user with appropriate reasoning paths from a global perspective. Our RPRN features a remarkable path reasoning capacity, which is regulated by a user-conditioned derivatively action pruning strategy, a soft reward strategy based on an improved multi-hop scoring function, and a policy-guided sequential path reasoning algorithm. Experimental results on four of Amazon's large-scale datasets show that our method achieves excellent results compared with several state-of-the-art alternatives. The datasets are linked at https://nijianmo.github.io/amazon/index.html.
Active Learning (AL) is a learning task that requires learners to interactively query the labels of sampled unlabeled instances to minimize the training outputs with human supervision. In theoretical study, learners approximate the version space, which covers all possible classification hypotheses, by a bounded convex body and try to shrink its volume into a half-space by a given cut size. However, only the hypersphere with finite VC dimensions has obtained formal approximation guarantees that hold when the classes of Euclidean space are separable with a margin. In this paper, we approximate the version space by a structured hypersphere that covers most of the hypotheses, and then divide the available AL sampling approaches into two kinds of strategies: Outer Volume Sampling and Inner Volume Sampling. The outer volume is represented by a circumscribed hypersphere which excludes any outlier (non-promising) hypothesis from the version space globally. The inner volume is represented by many inscribed hyperspheres, which cover all feasible hypotheses within the outer volume. After providing provable guarantees for the performance of AL in version space, we aggregate the two kinds of volumes to eliminate their sampling biases by finding the optimal inscribed hyperspheres in the enclosing space of the outer volume. To touch the version space from Euclidean space, we propose a theoretical bridge called Volume-based Model that increases the "sampling target-independent". In non-linear feature space, spanned by a kernel, we use sequential optimization to globally optimize the original space to a sparse space by halving the size of the kernel space. Then, the EM (Expectation Maximization) model, which returns the local center, helps us to find a local representation. 
To describe this process, we propose an easy-to-implement algorithm called Volume-based AL (VAL). Empirical evaluation on a varied set of structured clustering and unstructured handwritten digit data sets demonstrates that employing our proposed model can accelerate the decline of the prediction error rate with fewer samples than the other algorithms.
A huge amount of data is generated every second on social media. Event and topic detection must address both scalability and accuracy challenges when using enormous and noisy data collections from social media. Documents describing the same event and story have a similar set of collocated keywords that can be used to identify the event time and its description. In this work, we propose a novel graph-based approach, called the Enhanced Heartbeat Graph (EHG), which not only detects events at an early stage but also suppresses event-related topics in the upcoming text stream in order to highlight other micro details. We have compared the proposed approach with ten state-of-the-art approaches for event detection. Experimental results on real-world data (i.e., Football Association Challenge Cup Final, Super Tuesday, and the US Election 2012) show considerable improvement in most cases, while computational complexity remains very attractive.
IEEE Transactions on Neural Systems and Rehabilitation Engineering, 2019
Accurate classification of Electroencephalogram (EEG) signals plays an important role in diagnosing different types of mental activities. One of the most important challenges associated with the classification of EEG signals is how to design an efficient classifier with strong generalization capability. Aiming to improve the classification performance, in this paper we propose a novel multiclass Support Matrix Machine (M-SMM) from the perspective of maximizing the inter-class margins. The objective function is a combination of a binary hinge loss that works on C matrices and a spectral elastic net penalty as the regularization term. This regularization term is a combination of the Frobenius and nuclear norms, which promotes structural sparsity and shares similar sparsity patterns across multiple predictors. It also maximizes the inter-class margins, which helps deal with complex, high-dimensional, noisy data. Extensive experimental results, supported by theoretical analysis and statistical tests, show the effectiveness of the M-SMM for solving the problem of classifying EEG signals associated with motor imagery in Brain-computer Interface (BCI) applications.
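One way to read the objective described above, hinge loss over C weight matrices plus a Frobenius-and-nuclear-norm penalty, is the one-vs-rest style sketch below. This is an assumed form for illustration only and may differ from the exact M-SMM formulation: the trade-off weights \tau and \lambda, the labels y_{ic} \in \{-1, +1\}, the matrix-valued inputs X_i, and the biases b_c are all notation introduced here, not taken from the abstract.

```latex
\min_{\{W_c,\, b_c\}_{c=1}^{C}} \;
\sum_{c=1}^{C} \left[
  \tfrac{1}{2}\,\lVert W_c \rVert_F^2
  + \tau\,\lVert W_c \rVert_*
  + \lambda \sum_{i=1}^{n} \max\!\bigl(0,\; 1 - y_{ic}\,(\langle W_c, X_i \rangle + b_c)\bigr)
\right]
```

Here \langle W_c, X_i \rangle = \operatorname{tr}(W_c^{\top} X_i), the Frobenius term plays the role of the usual SVM margin regularizer, and the nuclear norm \lVert W_c \rVert_* (the sum of singular values) encourages low-rank weight matrices.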
ACM Transactions on Asian and Low-Resource Language Information Processing
In this paper, aiming at a Chinese keyword-based book search service, we propose, from a technological perspective, to carefully modify a user query sequence to obscure the user's query topics and thus protect the user's topic privacy on the untrusted server, without compromising the accuracy of each book search. Firstly, we propose a client-based framework for the privacy protection of book search, and then a privacy model that formulates the constraints, in terms of accuracy, efficiency, and security, which the cover queries generated from a user query sequence should meet. Secondly, we present a modification algorithm for a user query sequence, based on some heuristic strategies, which can quickly generate a cover query sequence meeting the privacy model by replacing, deleting, and adding keywords for each user query. Finally, both theoretical analysis and experimental evaluation demonstrate the effectiveness of the proposed approach, i.e., which can improve the security of use...
Multivariate time series classification is a critical problem in data mining with broad applications. It requires harnessing the inter-relationship of multiple variables and various ranges of temporal dependencies to assign the correct classification label to the time series. Multivariate time series may come from a wide range of sources and be used in various scenarios, which poses a temporal representation learning challenge for the classifier. We propose a novel convolutional neural network architecture called Attentional Gated Res2Net for multivariate time series classification. Our model uses hierarchical residual-like connections to achieve multi-scale receptive fields and capture multi-granular temporal information. The gating mechanism enables the model to consider the relations between the feature maps extracted by receptive fields of multiple sizes for information fusion. Further, we propose two types of attention modules, channel-wise attention and block-wise attention, to bett...
Increasingly developed online platforms, e.g., Yelp and Amazon, generate a large number of online reviews every moment. Consumers have gradually developed the habit of reading previous reviews before deciding to buy or choose a product. Online reviews play a vital part in determining consumers’ purchase choices in e-commerce, yet many online reviews are intentionally created to confuse or mislead potential consumers. Moreover, driven by product reputations and merchants’ profits, more and more spam reviews are inserted into online platforms. Such reviews can be positive, negative, or neutral, but they have common features: misleading consumers or damaging reputations. In the past decade, many researchers have studied the detection of spam reviews using statistical or deep learning methods with various datasets. In view of that, this article first introduces the task of online spam review detection and gives a common definition of spam reviews. Then, we comprehen...
Web Information Systems Engineering – WISE 2016, 2016
Music recommendation has gained substantial attention in recent times. As one of the most important context features, user emotion has great potential to improve recommendations, but this has not yet been sufficiently explored due to the difficulty of emotion acquisition and incorporation. This paper proposes a graph-based emotion-aware music recommendation approach (GEMRec) by simultaneously taking a user's music listening history and emotion into consideration. The proposed approach models the relations between user, music, and emotion as a three-element tuple (user, music, emotion), upon which an Emotion Aware Graph (EAG) is built, and then a relevance propagation algorithm based on random walk is devised to rank the relevance of music items for recommendation. Evaluation experiments are conducted based on a real dataset collected from a Chinese microblog service in comparison to baselines. The results show that the emotional context from a user's microblogs contributes to improving the performance of music recommendation in terms of hit rate, precision, recall, and F1 score.
Causal inference is capable of estimating the treatment effect (i.e., the causal effect of treatment on the outcome) to benefit decision making in various domains. One fundamental challenge in this research is the treatment assignment bias in observational data. To increase the validity of observational studies on causal inference, representation-based methods, as the state of the art, have demonstrated superior performance in treatment effect estimation. Most representation-based methods assume all observed covariates are pre-treatment (i.e., not affected by the treatment) and learn a balanced representation from these observed covariates for estimating the treatment effect. Unfortunately, this assumption is often too strict a requirement in practice, as some covariates are changed by doing an intervention on treatment (i.e., post-treatment). By contrast, the balanced representation learned from unchanged covariates thus biases the treatment effect estimation. In light of th...
Citation-based research performance reporting is contentious. The methods used to categorize research and researchers are misleading and somewhat arbitrary. This paper compares cohorts of social science categorized citation data and ultimately shows that assumptions of comparability are spurious. A subject area comparison using research field distributions and networks between a 'reference author', bibliographically coupled data, keyword-obtained data, social science data and highly cited social science author data shows very dissimilar field foci with one dataset very much being medically focused. This leads to the question whether subject area classifications should continue to be used as the basis for the plethora of rankings and lists that use such groupings. It is suggested that bibliographic coupling and dynamic topic classifiers would better inform citation data comparisons.
Proceedings of the 26th International Conference on World Wide Web Companion - WWW '17 Companion, 2017
Answer selection is an important task in question answering (QA) from the Web. To address the intrinsic difficulty in encoding sentences with semantic meanings, we introduce a general framework, i.e., Lexical Semantic Feature based Skip Convolution Neural Network (LSF-SCNN), with several optimization strategies. The intuitive idea is that granular representations with more semantic features of sentences are deliberately designed and estimated to capture the similarity between question-answer sentence pairs. The experimental results demonstrate the effectiveness of the proposed strategies, and our model outperforms the state-of-the-art ones by up to 3.5% on the metrics of MAP and MRR.
Location-based services (LBS) have become an important part of people's daily life. However, while providing great convenience for mobile users, LBS raise serious personal privacy problems, i.e., location privacy and query privacy. Existing privacy methods for LBS generally take into consideration only location privacy or query privacy, without considering the problem of protecting both of them simultaneously. In this paper, we propose to construct a group of dummy query sequences to cover up the query locations and query attributes of mobile users and thus protect users' privacy in LBS. First, we present a client-based framework for user privacy protection in LBS, which requires not only no change to the existing LBS algorithm on the server-side, but also no compromise
Papers by Guandong Xu