Collective classification for spam filtering

Igor Santos

doi:10.1093/JIGPAL/JZS030

Outline

Artificial Intelligence

Collective classification for spam filtering

Igor Santos

2012, Logic Journal of IGPL

https://doi.org/10.1093/JIGPAL/JZS030

visibility

…

description

2 pages

link

1 file

Sign up for access to the world's latest research

checkGet notified about relevant papers

checkSave papers to use in your research

checkJoin the discussion with peers

checkTrack your impact

Abstract

Spam has become a major issue in computer security because it is a channel for threats such as computer viruses, worms and phishing. Many solutions feature machine-learning algorithms trained using statistical representations of the terms that usually appear in the e-mails. Still, these methods require a training step with labelled data. Dealing with the situation where the availability of labelled training instances is limited slows down the progress of filtering systems and offers advantages to spammers. Currently, many approaches direct their efforts into Semi-Supervised Learning (SSL). SSL is a halfway method between supervised and unsupervised learning, which, in addition to unlabelled data, receives some supervision information such as the association of the targets with some of the examples. Collective Classification for Text Classification poses as an interesting method for optimising the classification of partially-labelled data. In this way, we propose here, for the first time, Collective Classification algorithms for spam filtering to overcome the amount of unclassified e-mails that are sent every day.

Anton Bryl

Artificial Intelligence Review, 2008

Email spam is one of the major problems of the today's Internet, bringing financial damage to companies and annoying individual users. Among the approaches developed to stop spam, filtering is an important and popular one. In this paper we give an overview of the state of the art of machine learning applications for spam filtering, and of the ways of evaluation and comparison of different filtering methods. We also provide a brief description of other branches of anti-spam protection and discuss the use of various approaches in commercial and noncommercial anti-spam software solutions. Product Website address Symantec Mail Security for SMTP http://www.symantec.com/enterprise/products/ overview.jsp?pvid=845_1 MailCleaner http://www.mailcleaner.net/ SpamAssassin http://spamassassin.apache.org/ Bogofilter

downloadDownload free PDF View PDFchevron_right

Learning to filter spam e-mail: A comparison of a naive bayesian and a memory-based approach

Costas Spyropoulos

2000

We investigate the performance of two machine learning algorithms in the context of antispam filtering. The increasing volume of unsolicited bulk e-mail (spam) has generated a need for reliable anti-spam filters. Filters of this type have so far been based mostly on keyword patterns that are constructed by hand and perform poorly. The Naive Bayesian classifier has recently been suggested as an effective method to construct automatically anti-spam filters with superior performance. We investigate thoroughly the performance of the Naive Bayesian filter on a publicly available corpus, contributing towards standard benchmarks. At the same time, we compare the performance of the Naive Bayesian filter to an alternative memorybased learning approach, after introducing suitable cost-sensitive evaluation measures. Both methods achieve very accurate spam filtering, outperforming clearly the keyword-based filter of a widely used e-mail reader.

downloadDownload free PDF View PDFchevron_right

Investigating the Effect of Combining Text Clustering with Classification on Improving Spam Email Detection

Doaa Hassan

Advances in Intelligent Systems and Computing, 2016

Nowadays emails have been an easy and fast tool of communication among people. As a result, filtering unsolicited/spam emails has become a very important challenge to achieve. Recently there has been some research work in text mining that combines text clustering with classification to improve the classification performance. In this paper, we investigate the effect of combining text clustering using K-means algorithm with various supervised classification mechanisms on improving the performance of classification of emails into spam or non-spam. The conjunction of clustering and classification mechanisms is carried out by adding extra features from the clustering step to the feature space used for classification. Our results show that combining K-means clustering with supervised classification by this methodology does not always improve the classification performance. Moreover, for the cases that the classifiers performance is improved by clustering, we found that the performance of classifiers in terms of accuracy is slightly increased with a very small amount that does not meet the increase in the time taken for building a learning model that combines both mechanisms. The result of our experiment has been shown using the Enron-Spam datasets.

downloadDownload free PDF View PDFchevron_right

Survey of machine learning methods for spam e-mail classification

Varsha Jenni

2020

The humongous volume of unsolicited bulk e-mail (spam) which is further increasing, is the major cause for developing antispam protection filters. Machine learning provides a very optimized approach to automatically filter spams at a very successful rate. Here, in this paper we survey some of the most popular machine learning algorithms (Naïve Bayes, k-NN, SVMs and ANN) and their applicability to the problem of spam e-mail classification. Descriptions of the algorithms are presented, and the comparison of their performance on the UCI spam-base dataset is presented. Keywords⸻ Spam, E-mail classification, Machine learning algorithms, k-NN, SVM, Naïve Bayes, ANN.

downloadDownload free PDF View PDFchevron_right

A review of semi-supervised learning for text classification

José Marcio Duarte

Artificial Intelligence Review

A huge amount of data is generated daily leading to big data challenges. One of them is related to text mining, especially text classification. To perform this task we usually need a large set of labeled data that can be expensive, time-consuming, or difficult to be obtained. Considering this scenario semi-supervised learning (SSL), the branch of machine learning concerned with using labeled and unlabeled data has expanded in volume and scope. Since no recent survey exists to overview how SSL has been used in text classification, we aim to fill this gap and present an up-to-date review of SSL for text classification. We retrieve 1794 works from the last 5 years from IEEE Xplore, ACM Digital Library, Science Direct, and Springer. Then, 157 articles were selected to be included in this review. We present the application domain, datasets, and languages employed in the works. The text representations and machine learning algorithms. We also summarize and organize the works following a recent taxonomy of SSL. We analyze the percentage of labeled data used, the evaluation metrics, and obtained results. Lastly, we present some limitations and future trends in the area. We aim to provide researchers and practitioners with an outline of the area as well as useful information for their current research.

downloadDownload free PDF View PDFchevron_right

Machine Learning for E-mail Spam Filtering: Review,Techniques and Trends

Shyamanta Hazarika

arXiv (Cornell University), 2016

We present a comprehensive review of the most effective content-based e-mail spam filtering techniques. We focus primarily on Machine Learning-based spam filters and their variants, and report on a broad review ranging from surveying the relevant ideas, efforts, effectiveness, and the current progress. The initial exposition of the background examines the basics of e-mail spam filtering, the evolving nature of spam, spammers playing cat-and-mouse with e-mail service providers (ESPs), and the Machine Learning front in fighting spam. We conclude by measuring the impact of Machine Learning-based filters and explore the promising offshoots of latest developments.

downloadDownload free PDF View PDFchevron_right

Support vector machines for spam categorization

Harris Drucker

IEEE Transactions on Neural Networks, 1999

We study the use of support vector machines (SVM's) in classifying e-mail as spam or nonspam by comparing it to three other classification algorithms: Ripper, Rocchio, and boosting decision trees. These four algorithms were tested on two different data sets: one data set where the number of features were constrained to the 1000 best features and another data set where the dimensionality was over 7000. SVM's performed best when using binary features. For both data sets, boosting trees and SVM's had acceptable test performance in terms of accuracy and speed. However, SVM's had significantly less training time.

downloadDownload free PDF View PDFchevron_right

Performance Evaluation of Machine Learning Algorithms on Textual Datasets for Spam Email Classification

IJRASET Publication

International Journal for Research in Applied Science and Engineering Technology (IJRASET), 2022

Email is one of the most popular modes of communication we have today. Billions of emails are sent every day in our world but not every one of them is relevant or of importance. The irrelevant and unwanted emails are termed email spam. These spam emails are sent with many different targets that range from advertisement to data theft. Filtering these spam emails is very essential in order to keep the email space fluent in its functioning. Machine Learning algorithms are being extensively used in the classification of spam emails. This paper showcases the performance evaluation of some selected supervised Machine Learning algorithms namely Naive Bayes Classifier, Support Vector Machine, Random Forest, & XG-Boost for spam email classification on a combination of three different datasets. For feature extraction, both Bag of Words & TF-IDF models were used separately and performance with both of these approaches was also compared. The results showed that SVM performed better than all the other algorithms when trained with TF-IDF feature vectors. The performance metrics used were accuracy, precision, recall, and f1-score, along with the ROC curve.

downloadDownload free PDF View PDFchevron_right

E-MAIL SPAM FILTERING WITH LOCAL SVM CLASSIFIERS

Anton Bryl

This paper describes an e-mail spam filter based on local SVM, namely on the SVM classifier trained only on a neighborhood of the message to be classified, and not on the whole training data available. Two problems are stated and solved. First, the selection of the right size of neighborhood is shown to be critical; our solution is based on the estimation of the a-posteriori probability of the correct decision, and the resulting algorithm is called highest probability SVM nearest neighbor (HP-SVM-NN). The second problem is the application of the algorithm in practice, and we propose a practical filter architecture based on HP-SVM-NN. Extensive testing is performed on SpamAssassin corpus and TREC 2005 Spam Track corpus, showing that HP-SVM-NN outperforms pure SVM and is applicable in practice. Finally, we explore the locality properties of the two corpora using Sammon's projection.

downloadDownload free PDF View PDFchevron_right

Spam E-Mail Characterization: An Experimental Performance Comparison Of Machine Learning

Sabbir Ahmad

2017

The increasing volume of unsolicited mass e-mail (otherwise called spam) has generated a need for reliable against spam filters. Utilizing a classifier based on machine learning techniques to naturally filter out spam e-mail has drawn many researchers' attention. In this paper, we review some of relevant ideas and do a set of systematic experiments on e-mail categorization, which has been conducted with four machine learning calculations applied to different parts of e-mail. Experimental results reveal that the header of e-mail provides very useful data for all the machine learning calculations considered to detect spam e-mail.

downloadDownload free PDF View PDFchevron_right

Loading Preview

Sorry, preview is currently unavailable. You can download the paper by clicking the button above.

abhinav Pathak

2010

Spam filtering has traditionally relied on extracting spam signatures via supervised learning, i.e., using emails explicitly manually labeled as spam or ham. Such supervised learning is labor-intensive and costly, more importantly cannot adapt to new spamming behavior quickly enough. The fundamental reason for needing labeled training corpus is that the learning, e.g., the process of extracting signatures, is carried out by examining individual emails. In this paper, we study the feasibility of unsupervised learning-based spam filtering that can more effectively identify new spamming behavior. Our study is motivated by three key observations of today's Internet spam: (1) the vast majority of emails are spam, (2) a spam email should always belong to some campaign, (3) spam from the same campaign are generated from some template that obfuscates some parts of the spam, e.g., sensitive terms, leaving other parts unchanged.

downloadDownload free PDF View PDFchevron_right

A case for unsupervised-learning-based spam filtering

abhinav Pathak

Sigmetrics Performance Evaluation Review, 2010

downloadDownload free PDF View PDFchevron_right

Learning to classify e-mail

James Clark

Information Sciences, 2007

In this paper we study supervised and semi-supervised classification of e-mails. We consider two tasks: filing e-mails into folders and spam e-mail filtering. Firstly, in a supervised learning setting, we investigate the use of random forest for automatic e-mail filing into folders and spam e-mail filtering. We show that random forest is a good choice for these tasks as it runs fast on large and high dimensional databases, is easy to tune and is highly accurate, outperforming popular algorithms such as decision trees, support vector machines and naïve Bayes. We introduce a new accurate feature selector with linear time complexity. Secondly, we examine the applicability of the semi-supervised co-training paradigm for spam e-mail filtering by employing random forests, support vector machines, decision tree and naïve Bayes as base classifiers. The study shows that a classifier trained on a small set of labelled examples can be successfully boosted using unlabelled examples to accuracy rate of only 5% lower than a classifier trained on all labelled examples. We investigate the performance of co-training with one natural feature split and show that in the domain of spam e-mail filtering it can be as competitive as co-training with two natural feature splits.

downloadDownload free PDF View PDFchevron_right

Stacking classifiers for anti-spam filtering of e-mail

Costas Spyropoulos

2001

We evaluate empirically a scheme for combining classifiers, known as stacked generalization, in the context of anti-spam filtering, a novel cost-sensitive application of text categorization. Unsolicited commercial email, or "spam", floods mailboxes, causing frustration, wasting bandwidth, and exposing minors to unsuitable content. Using a public corpus, we show that stacking can improve the efficiency of automatically induced anti-spam filters, and that such filters can be used in reallife applications.

downloadDownload free PDF View PDFchevron_right

An Empirical Performance Comparison of Machine Learning Methods for Spam E-Mail Categorization

Chih-Chin Lai

2004

The increasing volume of unsolicited bulk e-mail (also known as spam) has generated a need for reliable anti-spam filters. Using a classifier based on machine learning techniques to automatically filter out spam e-mail has drawn many researchers' attention. In this paper, we review some of relevant ideas and do a set of systematic experiments on e-mail categorization, which has been conducted with four machine learning algorithms applied to different parts of e-mail. Experimental results reveal that the header of e-mail provides very useful information for all the machine learning algorithms considered to detect spam e-mail.

downloadDownload free PDF View PDFchevron_right

Classifying spam emails using agglomerative hierarchical clustering and a topic-based approach

Rocío Alaiz-rodríguez, Francisco Janez-Martino

2023

Spam emails are unsolicited, annoying and sometimes harmful messages which may contain malware, phishing or hoaxes. Unlike most studies that address the design of efficient antispam filters, we approach the spam email problem from a different and novel perspective. Focusing on the needs of cybersecurity units, we follow a topic-based approach for addressing the classification of spam email into multiple categories. We propose SPEMC-15K-E and SPEMC-15K-S, two novel datasets with approximately 15K emails each in English and Spanish, respectively, and we label them using agglomerative hierarchical clustering into 11 classes. We evaluate 16 pipelines, combining four text representation techniques-Term Frequency-Inverse Document Frequency (TF-IDF), Bag of Words, Word2Vec and BERT-and four classifiers: Support Vector Machine, Näive Bayes, Random Forest and Logistic Regression. Experimental results show that the highest performance is achieved with TF-IDF and LR for the English dataset, with a F1 score of 0.953 and an accuracy of 94.6%, and while for the Spanish dataset, TF-IDF with NB yields a F1 score of 0.945 and 98.5% accuracy. Regarding the processing time, TF-IDF with LR leads to the fastest classification, processing an English and Spanish spam email in 2ms and 2.2ms on average, respectively.

downloadDownload free PDF View PDFchevron_right

MACHINE LEARNING METHODS FOR SPAM E-MAIL CLASSIFICATION

iael Fuks

The increasing volume of unsolicited bulk e-mail (also known as spam) has generated a need for reliable anti-spam filters. Machine learning techniques now days used to automatically filter the spam e-mail in a very successful rate. In this paper we review some of the most popular machine learning methods (Bayesian classification, k-NN, ANNs, SVMs, Artificial immune system and Rough sets) and of their applicability to the problem of spam Email classification. Descriptions of the algorithms are presented, and the comparison of their performance on the SpamAssassin spam corpus is presented.

downloadDownload free PDF View PDFchevron_right

Effectiveness and Limitations of Statistical Spam Filters

M Tariq Banday

2009

Spam is not only clogging the Internet traffic by consuming a hefty amount of network bandwidth but it is also a source for e-mail born viruses, spyware, adware and Trojan Horses. It is also used to carry out denial of service, directory harvesting and phishing attacks that directly cause financial losses. Further, the contents of spam are often offensive and contain adult oriented and fraudulent materials which are objectionable to recipients. Several anti-spam procedures are currently employed to distinguish spam from legitimate e-mails; however spammers and phishers employ dynamic spam structures to obfuscate email content to circumvent these procedures. Apart from other technological procedures various adaptive learning filters have been developed that have an ability to allow an algorithm to constantly learn what sort of e-mail's or e-mail content a recipient would typically process and what to see in normal course of its business. These filters are based on complex statistical techniques that classify future e-mails based on the word content of accepted e-mails. The statistical techniques employed in these filters separate an incoming e-mail into tokens and assign a probability value to each token. The probability of each token are collectively used to calculate the overall spam probability and accordingly the incoming e-mail is scored as spam, probably spam or legitimate e-mail. In this paper we discuss the techniques involved in the design of the famous statistical spam filters that include Naïve Bayes, Term Frequency-Inverse Document Frequency, K-Nearest Neighbor, Support Vector Machine, and Bayes Additive Regression Tree. We compare these techniques with each other in terms of accuracy, recall, precision, etc. Further, we discuss the effectiveness and limitations of statistical filters in filtering out various types of spam from legitimate e-mails.

downloadDownload free PDF View PDFchevron_right

A on Spam Filtering Classification: A Majority Voting like Approach

Lauri Lovén

2017

Despite the improvement in filtering tools and informatics security, spam still cause substantial damage to public and private organizations. In this paper, we present a majority-voting based approach in order to identify spam messages. A new methodology for building majority voting classifier is presented and tested. The results using SpamAssassin dataset indicates non-negligible improvement over state of art, which paves the way for further development and applications.

downloadDownload free PDF View PDFchevron_right

Learning to detect spam messages

Marcelo Luis Errecalde

2005

The problem of unwanted e-mails (or spam messages) has been increasing for years. Different methods have been proposed in order to deal with this problem wich includes blacklists of known spammers, handcrafted rules and machine learning techniques.

downloadDownload free PDF View PDFchevron_right

Collective classification for spam filtering

Sign up for access to the world's latest research

Abstract

Related papers

Related papers

Related topics