York University at TREC 2005: SPAM Track

Aijun An

2005, … of Text Retrieval Conference: SPAM Track, …

Abstract

We propose a variant of the k-nearest neighbor classification method, called instance-weighted k-nearest neighbor method, for adaptive spam filtering. The method assigns two weights, distance weight and correctness weight, to a training instance, and makes use of the two weights when classifying a new email. The correctness weight is also used in the maintenance of the training data to make the training data more adaptive to the changes of spam characteristics. We submitted 4 spam filters to the Spam Track. Two of the filters are purely based on the instance-weighted kNN method. The two other filters combine the kNN method with other spam filtering and classification techniques. We report the official results of our submissions on the Spam Track evaluation data sets.
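The abstract does not give the weighting formulas, so the following is only a minimal sketch of how an instance-weighted kNN filter of this kind might be organized. Everything specific in it is an assumption made for illustration: the inverse-distance form of the distance weight, the multiplicative `reward` and `penalty` updates of the correctness weight, the `min_correctness` pruning threshold, and all class and function names.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Instance:
    features: Dict[str, float]  # sparse term weights of one training email
    label: str                  # "spam" or "ham"
    correctness: float = 1.0    # correctness weight (assumed to start at 1.0)


def distance(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Euclidean distance between two sparse feature vectors."""
    keys = a.keys() | b.keys()
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))


class InstanceWeightedKNN:
    """Hypothetical sketch; the update rules are assumptions, not the paper's."""

    def __init__(self, k=3, reward=1.1, penalty=0.5, min_correctness=0.2):
        self.k = k
        self.reward = reward                    # assumed multiplicative boost
        self.penalty = penalty                  # assumed multiplicative decay
        self.min_correctness = min_correctness  # assumed pruning threshold
        self.instances: List[Instance] = []

    def _neighbors(self, features):
        # The k stored instances closest to the new email.
        ranked = sorted(self.instances,
                        key=lambda inst: distance(inst.features, features))
        return ranked[:self.k]

    def classify(self, features):
        # Each neighbor votes with distance weight * correctness weight.
        votes = {"spam": 0.0, "ham": 0.0}
        for inst in self._neighbors(features):
            d_weight = 1.0 / (1.0 + distance(inst.features, features))
            votes[inst.label] += d_weight * inst.correctness
        return "spam" if votes["spam"] > votes["ham"] else "ham"

    def learn(self, features, true_label):
        # Reward neighbors whose label matched the feedback, penalize the rest.
        for inst in self._neighbors(features):
            if inst.label == true_label:
                inst.correctness *= self.reward
            else:
                inst.correctness *= self.penalty
        # Maintenance: drop instances that have become unreliable, then
        # store the newly labeled email as a training instance.
        self.instances = [i for i in self.instances
                          if i.correctness >= self.min_correctness]
        self.instances.append(Instance(dict(features), true_label))


if __name__ == "__main__":
    f = InstanceWeightedKNN(k=1)
    f.learn({"free": 1.0, "pills": 1.0}, "spam")
    f.learn({"meeting": 1.0, "agenda": 1.0}, "ham")
    print(f.classify({"free": 1.0, "pills": 1.0, "now": 1.0}))
```

In this sketch the correctness weight plays both roles the abstract mentions: it scales each neighbor's vote at classification time, and it drives pruning so that instances that repeatedly disagree with user feedback leave the training data.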

Qianhe Ouyang

Highlights in Science, Engineering and Technology

E-mail spam filtering has recently become a critical concern in network security, and multiple machine learning techniques have been applied to this classification problem. With the emergence of machine learning frameworks, most of the tasks have been changed by effective machine learning algorithms with satisfying performance and high speed. However, the underlying performance of different algorithms under given circumstances still lacks an intuitive demonstration. Hence, this study focuses on the performance of two widely used algorithms (KNN and Naive Bayes) in terms of metrics including accuracy and running time, comparing the unique advantages of each algorithm when classifying emails. The paper feeds thousands of spam samples to the two algorithms and analyzes both results, indicating that the KNN classifier performs better when determining the spam messages while the opposite is true for the Naive Bayes classifier. Thus, designers...

Comparison of Algorithms on Machine Learning For Spam Email Classification

Erni Seniwati

IJISTECH (International Journal of Information System and Technology), 2021

The rapid development of email use and the convenience it provides make email the most frequently used means of communication. Along with this development, many parties abuse email as a means of advertising, phishing, and sending other unimportant messages. Such messages are called spam email. One way to overcome the problem of spam is content-based filtering. In early studies on spam email classification, Naïve Bayes is the most commonly used method. Therefore, in this study the researchers add the Random Forest and K-Nearest Neighbor (KNN) methods to compare which methods have better accuracy in classifying spam emails. Based on the results of the trial, the Naïve Bayes classification algorithm achieved an accuracy of 83.5%, Random Forest 83.5%, and KNN 82.75%.

E-Mail Spam Filtering

IJRASET Publication

IJRASET, 2021

E-mail is the most common method of communication because of its accessibility, the rapid exchange of messages, and its low cost of distribution. It is also one of the most secure media for online communication and for transferring data or messages over the Internet. With its growing popularity, the quantity of unsolicited data has also increased rapidly. Spam causes traffic issues and bottlenecks that limit memory, bandwidth, power, and computing speed. To filter such data, different approaches exist that automatically detect and remove these unwanted messages. There are several email spam filtering techniques, such as knowledge-based techniques, clustering techniques, learning-based techniques, heuristic processes, and so on. This paper surveys various existing email spam filtering systems based on Machine Learning Techniques (MLT) such as Naive Bayes, SVM, K-Nearest Neighbor, Bayes Additive Regression, KNN Tree, and rules. We give the classification, evaluation, and comparison of some email spam filtering systems and summarize the accuracy rates of various existing approaches.

Spam detection filter using KNN algorithm and resampling

C. Vidrighin

Proceedings of the 2010 IEEE 6th International Conference on Intelligent Computer Communication and Processing, 2010

Spamming has become a time consuming and expensive problem for which several new directions have been investigated lately. This paper presents a new approach for a spam detection filter. The solution developed is an offline application that uses the k-Nearest Neighbor (kNN) algorithm and a pre-classified email data set for the learning process.

Survey of machine learning methods for spam e-mail classification

Varsha Jenni

2020

The huge and still increasing volume of unsolicited bulk e-mail (spam) is the major reason for developing anti-spam protection filters. Machine learning provides an effective approach to automatically filtering spam at a very successful rate. In this paper we survey some of the most popular machine learning algorithms (Naïve Bayes, k-NN, SVMs and ANN) and their applicability to the problem of spam e-mail classification. Descriptions of the algorithms are presented, along with a comparison of their performance on the UCI Spambase dataset. Keywords: Spam, E-mail classification, Machine learning algorithms, k-NN, SVM, Naïve Bayes, ANN.

Learning-Based Spam Filters: the Influence of the Temporal Distribution of Training Data

Anton Bryl

2006

The great number and variety of learning-based spam filters proposed in recent years create the need for complex and many-sided evaluation of them, taking the features of the spam phenomenon into account. This paper is dedicated to the analysis of the dependence of filter performance on the temporal distribution of training data; the cause of this dependence is the changeability of email. Such analysis provides additional information about filter quality and may also be useful for organizing more effective training of the filter. The naïve Bayes filter is chosen for evaluation in this paper.

E-MAIL SPAM FILTERING WITH LOCAL SVM CLASSIFIERS

Anton Bryl

This paper describes an e-mail spam filter based on local SVM, namely on the SVM classifier trained only on a neighborhood of the message to be classified, and not on the whole training data available. Two problems are stated and solved. First, the selection of the right size of neighborhood is shown to be critical; our solution is based on the estimation of the a-posteriori probability of the correct decision, and the resulting algorithm is called highest probability SVM nearest neighbor (HP-SVM-NN). The second problem is the application of the algorithm in practice, and we propose a practical filter architecture based on HP-SVM-NN. Extensive testing is performed on SpamAssassin corpus and TREC 2005 Spam Track corpus, showing that HP-SVM-NN outperforms pure SVM and is applicable in practice. Finally, we explore the locality properties of the two corpora using Sammon's projection.

Learning to Detect Spam: Naive-Euclidean Approach

Qiangfu Zhao

A method is proposed for learning to classify spam and non-spam emails. It combines the strategy of the Best Stepwise Feature Selection with a classifier of Euclidean nearest-neighbor. Each text email is first transformed into a vector of D-dimensional Euclidean space. Emails were divided into training and test sets in the manner of 10-fold cross-validation. Three experiments were performed, and their elapsed CPU times and accuracies reported. The proposed spam detection learner was found to be extremely fast in recognition and with good error rates. It could be used as a baseline learning agent, in terms of CPU time and accuracy, against which other learning agents can be measured.

Comparative Study of Gaussian and Nearest Mean Classifiers for Filtering Spam E-mails

Upasna Attri

The development of data-mining applications such as classification and clustering has shown the need for machine learning algorithms to be applied to large-scale data. The article gives an overview of some of the most popular machine learning methods (Gaussian and Nearest Mean) and of their applicability to the problem of spam e-mail filtering. The aim of this paper is to compare and investigate the effectiveness of classifiers for filtering spam e-mails using different matrices. Since spam is becoming increasingly difficult to detect, these automated techniques will help save a lot of the time and resources required to handle email messages.

Spam Filtering - a Comparative Study of the Performance of Different Classifiers for Effective Filtering

Dhananjay Tyagi

International Journal of Research in Engineering and Technology, 2016

Electronic mail is used daily by billions of people to interact and communicate around the world and is a critical application for many businesses. Over the last couple of decades unsolicited bulk email has become a headache for the email user. A staggering amount of spam is streaming into users' mailboxes daily. Spam is not only irritating for most email users but also overtaxes the IT infrastructure of businesses and costs billions of dollars in wasted productivity. The need for effective spam filtering techniques is increasing. Machine learning algorithms can be used with current spam filtering schemes for increased efficiency. This paper presents a comparative study of the performance of different machine learning algorithms that can be used to classify a mail as spam or ham.

Related papers

Learning to detect spam messages

Maria Teresa Taranilla

The problem of unwanted e-mails (or spam messages) has been increasing for years. Different methods have been proposed to deal with this problem, including blacklists of known spammers, handcrafted rules, and machine learning techniques. In this paper we investigate the performance of the k Nearest Neighbours (k-NN) method in spam detection tasks. To this end, a number of different document codifications were tested. Moreover, we study how vocabulary size reduction affects this task. In the experimental design, different k values were considered and results were analyzed with respect to a public mailing list and personal e-mail collections. The experiments showed that results with public mailing lists tend to be very optimistic and should not be considered representative of those expected with personal user accounts.

A survey of learning-based techniques of email spam filtering

Anton Bryl

Artificial Intelligence Review, 2008

Email spam is one of the major problems of today's Internet, bringing financial damage to companies and annoying individual users. Among the approaches developed to stop spam, filtering is an important and popular one. In this paper we give an overview of the state of the art of machine learning applications for spam filtering, and of the ways of evaluation and comparison of different filtering methods. We also provide a brief description of other branches of anti-spam protection and discuss the use of various approaches in commercial and noncommercial anti-spam software solutions.

An Adaptive Classification approach to filter spam E-mail using Vector Space Model

Publisher ijmra.us UGC Approved

The majority of previous data mining studies have concentrated on structured data, such as relational, transactional, and data warehouse data. In practice, however, an important share of the available information is stored in text databases, which consist of large collections of documents from various sources, such as news articles, research papers, e-books, digital libraries, e-mails, and Web pages. Moreover, this information keeps growing and already amounts to terabytes. Among the many services of the Internet, e-mail is very useful and widely used, and spam is an issue closely attached to it. Among the various approaches developed to stop spam emails, filtering is an important and popular one. In this paper, to categorize incoming email as spam or non-spam, we apply the KNNC classification method with the Vector Space Model in an adaptive manner to obtain better accuracy. We used two datasets, a personal dataset and the Ling Spam Corpus (Lemm dataset), and applied KNNC classification to them. We obtained nearly 95% precision on spam and 86.6% precision on non-spam, with 83% accuracy on the personal dataset and 80% on the Lemm dataset using the adaptive approach. Reviewing these results and related work, we propose that the adaptive approach using the Vector Space Model in the KNNC classification method efficiently provides better accuracy for filtering spam mail on both smaller and larger datasets.

A COMPARATIVE STUDY OF CLASSIFIERS FOR FILTERING SPAM EMAILS

Dinesh Kumar

ijcset.net

This paper presents a comparison between Gaussian classifiers and Nearest Neighbor classifiers for filtering spam emails. The results are in the form of traces of probability of error and time taken for classification, both with respect to the number of emails. Since spam emails are becoming increasingly difficult to filter, these automated techniques will help save a lot of the time and resources required to handle them.

Adaptive filtering of spam

Jalal almhana

Proceedings. Second Annual Conference on Communication Networks and Services Research, 2004

In this paper, we present a new spam filter which acts as an additional layer in the spam filtering process. This filter is based on what we call a representative vocabulary. Spam e-mails are divided into categories, each of which is represented by a set of tokens that form a Representative Text (RT). Tokens are strings of characters (words, sentences, or sometimes meaningless strings of characters). This RT is used to compute a resemblance ratio with incoming e-mails. With this ratio we decide whether the incoming e-mail is spam. This filter was implemented and integrated into the Spamihilator software. Some interesting experimental results will be presented.

Highest Probability SVM Nearest Neighbor Classifier for Spam Filtering

Anton Bryl

In this paper we evaluate the performance of the highest probability SVM nearest neighbor classifier, which is a combination of the SVM and k-NN classifiers, on a corpus of email messages. To classify a sample the algorithm performs the following actions: for each k in a predefined set {k_1, ..., k_N} it trains an SVM model on the k nearest labelled samples and uses this model to classify the given sample, then fits a sigmoid approximation of the probabilistic output for the SVM model and computes the probabilities of the positive and the negative answers; then it selects the one of the 2 × N resulting answers that has the highest probability. The experimental evaluation shows that this algorithm is able to achieve higher accuracy than the pure SVM classifier, at least in the case of equal error costs.

SPAM FILTERING – A COMPARATIVE STUDY OF THE PERFORMANCE OF DIFFERENT CLASSIFIERS FOR EFFECTIVE FILTERING.docx

eSAT Journals

Handling Dimensionality Reduction in Spam E-Mail Classification

Osamah Ibrahim Khalaf

Currently, the Internet e-mail infrastructure has become very significant and is widely used for communication between end users, for e-commerce, and for academic research because it is rapid, inexpensive, and highly active. This e-mail infrastructure is used for daily work. Sometimes we receive many undesirable e-mails from unknown sources. These unwanted e-mails are identified as Spam E-mails. Distinguishing Ham from Spam E-mail is a main goal, and a variety of classification algorithms have been implemented for it. The complexity of a classifier is substantially reduced if the number of features in the Spam E-mail data set is reduced. In this paper, we present some of the most common data mining algorithms, J48, Support Vector Machine (SVM), and Naive Bayes, for the Spam E-mail classification problem. The standard Spambase dataset is used. Enhancing Spam E-mail classification is the objective of our study. An experimental study is carried out to build a classifier on the standard Spam E-mail dataset, which includes Ham and Spam E-mail messages. Rough Set Theory (RST) and Symmetric Uncertainty (SU) methods are used to reduce the dimensionality of the Spam E-mail data. The feature subsets obtained by RST and SU are employed to train and test the different classifiers. A comparison of the results obtained with the reduced feature set and with the original data set is presented. The results show that the classifiers with the reduced features outperform the existing systems.

A survey of machine learning techniques for Spam filtering

Irfanul Alam

A Comparative Study on Different Email Spam Filtering Techniques

IAEME Publication

IAEME PUBLICATION, 2016

An ideal spam filter, that is, one which filters out any type of spam at any time, is difficult to achieve. Most anti-spam solutions fail to filter new types of spam. This is because whenever researchers introduce new filtering mechanisms, spammers deploy new spamming techniques that can bypass those filters. In this scenario, what is possible is to update the current filtering mechanisms and minimize the security breaches. This study focuses on analyzing different data-mining-based classification algorithms that can be used for spam filtering and on finding out how effective they are.
