Collective classification for spam filtering
2012, Logic Journal of IGPL
https://doi.org/10.1093/JIGPAL/JZS030…
2 pages
Abstract
Spam has become a major issue in computer security because it is a channel for threats such as computer viruses, worms and phishing. Many solutions use machine-learning algorithms trained on statistical representations of the terms that usually appear in e-mails. Still, these methods require a training step with labelled data. When labelled training instances are scarce, the progress of filtering systems slows down, which gives spammers an advantage. Currently, many approaches direct their efforts towards Semi-Supervised Learning (SSL). SSL lies halfway between supervised and unsupervised learning: in addition to unlabelled data, it receives some supervision information, such as the target labels for some of the examples. Collective Classification for Text Classification is an interesting method for optimising the classification of partially-labelled data. We therefore propose here, for the first time, Collective Classification algorithms for spam filtering to cope with the volume of unclassified e-mails that are sent every day.
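The semi-supervised setting described in the abstract can be illustrated with a small sketch. The code below uses self-training with a multinomial Naive Bayes classifier, a simple SSL technique swapped in for illustration; it is not the Collective Classification algorithms the paper proposes. The toy e-mails, the function names (`train_nb`, `predict_proba`, `self_train`) and the 0.75 confidence threshold are all assumptions made for this example.

```python
import math
from collections import Counter


def train_nb(docs, labels):
    """Fit a multinomial Naive Bayes model (word counts per class)."""
    classes = sorted(set(labels))
    vocab = {w for d in docs for w in d.split()}
    prior = {c: math.log(labels.count(c) / len(labels)) for c in classes}
    counts = {c: Counter() for c in classes}
    for d, y in zip(docs, labels):
        counts[y].update(d.split())
    return classes, vocab, prior, counts


def predict_proba(model, doc):
    """Return class probabilities for one document (Laplace smoothing)."""
    classes, vocab, prior, counts = model
    scores = {}
    for c in classes:
        total = sum(counts[c].values()) + len(vocab)
        score = prior[c]
        for w in doc.split():
            if w in vocab:  # words never seen in training are ignored
                score += math.log((counts[c][w] + 1) / total)
        scores[c] = score
    top = max(scores.values())
    exp = {c: math.exp(s - top) for c, s in scores.items()}
    norm = sum(exp.values())
    return {c: v / norm for c, v in exp.items()}


def self_train(labelled, unlabelled, threshold=0.75, max_rounds=10):
    """Self-training: repeatedly add confidently predicted unlabelled
    documents to the training set, then refit the model."""
    docs = [d for d, _ in labelled]
    labels = [y for _, y in labelled]
    pool = list(unlabelled)
    for _ in range(max_rounds):
        model = train_nb(docs, labels)
        remaining = []
        for d in pool:
            proba = predict_proba(model, d)
            best = max(proba, key=proba.get)
            if proba[best] >= threshold:
                docs.append(d)       # accept the pseudo-label
                labels.append(best)
            else:
                remaining.append(d)  # still too uncertain
        if len(remaining) == len(pool):
            break                    # no progress in this round
        pool = remaining
    return train_nb(docs, labels)


# Hypothetical toy data: two labelled e-mails, four unlabelled ones.
labelled = [("win free prize now", "spam"),
            ("meeting agenda for monday", "ham")]
unlabelled = ["free prize claim now", "monday meeting notes",
              "claim your free money", "project notes for review"]

model = self_train(labelled, unlabelled)
for text in ["claim free money", "project review notes"]:
    proba = predict_proba(model, text)
    print(text, "->", max(proba, key=proba.get))
```

With only two labelled messages, words such as "claim" or "notes" are unknown to the initial model; they become informative only after the confidently classified unlabelled messages are folded back into the training set, which is the basic benefit SSL aims for.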
Related papers
Artificial Intelligence Review, 2008
Email spam is one of the major problems of today's Internet, bringing financial damage to companies and annoying individual users. Among the approaches developed to stop spam, filtering is an important and popular one. In this paper we give an overview of the state of the art of machine learning applications for spam filtering, and of the ways of evaluation and comparison of different filtering methods. We also provide a brief description of other branches of anti-spam protection and discuss the use of various approaches in commercial and noncommercial anti-spam software solutions.
Product | Website address
Symantec Mail Security for SMTP | http://www.symantec.com/enterprise/products/overview.jsp?pvid=845_1
MailCleaner | http://www.mailcleaner.net/
SpamAssassin | http://spamassassin.apache.org/
Bogofilter
2000
We investigate the performance of two machine learning algorithms in the context of anti-spam filtering. The increasing volume of unsolicited bulk e-mail (spam) has generated a need for reliable anti-spam filters. Filters of this type have so far been based mostly on keyword patterns that are constructed by hand and perform poorly. The Naive Bayesian classifier has recently been suggested as an effective method to automatically construct anti-spam filters with superior performance. We thoroughly investigate the performance of the Naive Bayesian filter on a publicly available corpus, contributing towards standard benchmarks. At the same time, we compare the performance of the Naive Bayesian filter to an alternative memory-based learning approach, after introducing suitable cost-sensitive evaluation measures. Both methods achieve very accurate spam filtering, clearly outperforming the keyword-based filter of a widely used e-mail reader.
Advances in Intelligent Systems and Computing, 2016
Emails have become an easy and fast tool of communication among people. As a result, filtering unsolicited/spam emails has become a very important challenge. Recently there has been some research work in text mining that combines text clustering with classification to improve classification performance. In this paper, we investigate the effect of combining text clustering using the K-means algorithm with various supervised classification mechanisms on the performance of classifying emails as spam or non-spam. Clustering and classification are combined by adding extra features from the clustering step to the feature space used for classification. Our results show that combining K-means clustering with supervised classification in this way does not always improve classification performance. Moreover, in the cases where clustering does improve a classifier, the gain in accuracy is very small and does not justify the additional time needed to build a learning model that combines both mechanisms. The experiments were carried out on the Enron-Spam datasets.
2020
The huge and still increasing volume of unsolicited bulk e-mail (spam) is the major reason for developing anti-spam filters. Machine learning provides an effective approach to filtering spam automatically with a high success rate. In this paper we survey some of the most popular machine learning algorithms (Naïve Bayes, k-NN, SVMs and ANN) and their applicability to the problem of spam e-mail classification. We describe the algorithms and compare their performance on the UCI spam-base dataset.
Keywords: Spam, E-mail classification, Machine learning algorithms, k-NN, SVM, Naïve Bayes, ANN.
Artificial Intelligence Review
A huge amount of data is generated daily, leading to big data challenges. One of them is related to text mining, especially text classification. To perform this task we usually need a large set of labeled data, which can be expensive, time-consuming, or difficult to obtain. In this scenario, semi-supervised learning (SSL), the branch of machine learning concerned with using labeled and unlabeled data, has expanded in volume and scope. Since no recent survey exists to overview how SSL has been used in text classification, we aim to fill this gap and present an up-to-date review of SSL for text classification. We retrieve 1794 works from the last 5 years from IEEE Xplore, ACM Digital Library, Science Direct, and Springer. Then, 157 articles were selected to be included in this review. We present the application domain, datasets, and languages employed in the works, as well as the text representations and machine learning algorithms. We also summarize and organize the works following a recent taxonomy of SSL. We analyze the percentage of labeled data used, the evaluation metrics, and the results obtained. Lastly, we present some limitations and future trends in the area. We aim to provide researchers and practitioners with an outline of the area as well as useful information for their current research.
arXiv (Cornell University), 2016
We present a comprehensive review of the most effective content-based e-mail spam filtering techniques. We focus primarily on Machine Learning-based spam filters and their variants, and survey the relevant ideas, efforts, effectiveness, and current progress. The initial exposition of the background examines the basics of e-mail spam filtering, the evolving nature of spam, spammers playing cat-and-mouse with e-mail service providers (ESPs), and the Machine Learning front in fighting spam. We conclude by measuring the impact of Machine Learning-based filters and exploring the promising offshoots of the latest developments.
IEEE Transactions on Neural Networks, 1999
We study the use of support vector machines (SVM's) in classifying e-mail as spam or nonspam by comparing them to three other classification algorithms: Ripper, Rocchio, and boosting decision trees. These four algorithms were tested on two different data sets: one data set where the number of features was constrained to the 1000 best features and another data set where the dimensionality was over 7000. SVM's performed best when using binary features. For both data sets, boosting trees and SVM's had acceptable test performance in terms of accuracy and speed. However, SVM's required significantly less training time.
International Journal for Research in Applied Science and Engineering Technology (IJRASET), 2022
Email is one of the most popular modes of communication we have today. Billions of emails are sent every day, but not every one of them is relevant or important. The irrelevant and unwanted emails are termed email spam. These spam emails are sent for many different purposes, ranging from advertisement to data theft. Filtering them is essential to keep email working smoothly. Machine Learning algorithms are used extensively in the classification of spam emails. This paper presents a performance evaluation of selected supervised Machine Learning algorithms, namely Naive Bayes Classifier, Support Vector Machine, Random Forest, and XG-Boost, for spam email classification on a combination of three different datasets. For feature extraction, both Bag of Words and TF-IDF models were used separately, and the performance of the two approaches was compared. The results showed that SVM performed better than all the other algorithms when trained with TF-IDF feature vectors. The performance metrics used were accuracy, precision, recall, and f1-score, along with the ROC curve.
This paper describes an e-mail spam filter based on local SVM, that is, an SVM classifier trained only on a neighborhood of the message to be classified rather than on the whole training data available. Two problems are stated and solved. First, the selection of the right neighborhood size is shown to be critical; our solution is based on the estimation of the a-posteriori probability of the correct decision, and the resulting algorithm is called highest probability SVM nearest neighbor (HP-SVM-NN). The second problem is the application of the algorithm in practice, and we propose a practical filter architecture based on HP-SVM-NN. Extensive testing is performed on the SpamAssassin corpus and the TREC 2005 Spam Track corpus, showing that HP-SVM-NN outperforms pure SVM and is applicable in practice. Finally, we explore the locality properties of the two corpora using Sammon's projection.
2017
The increasing volume of unsolicited mass e-mail (otherwise called spam) has generated a need for reliable anti-spam filters. Using a classifier based on machine learning techniques to automatically filter out spam e-mail has drawn many researchers' attention. In this paper, we review some of the relevant ideas and carry out a set of systematic experiments on e-mail categorization, conducted with four machine learning algorithms applied to different parts of the e-mail. Experimental results reveal that the header of an e-mail provides very useful data for all the machine learning algorithms considered for detecting spam e-mail.
