York University at TREC 2005: SPAM Track
2005, … of Text Retrieval Conference: SPAM Track, …
Abstract
We propose a variant of the k-nearest neighbor classification method, called instance-weighted k-nearest neighbor method, for adaptive spam filtering. The method assigns two weights, distance weight and correctness weight, to a training instance, and makes use of the two weights when classifying a new email. The correctness weight is also used in the maintenance of the training data to make the training data more adaptive to the changes of spam characteristics. We submitted 4 spam filters to the Spam Track. Two of the filters are purely based on the instance-weighted kNN method. The two other filters combine the kNN method with other spam filtering and classification techniques. We report the official results of our submissions on the Spam Track evaluation data sets.
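The abstract does not give the weighting formulas, so the following is only a minimal sketch of how an instance-weighted kNN filter of this kind could be organized, not the authors' implementation. Everything concrete in it is an assumption made for illustration: the class and method names, cosine similarity as the distance weight, a correctness weight that starts at 1.0 and moves by a fixed step, and the pruning threshold used for maintaining the training data.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class Instance:
    features: Counter          # bag-of-words term counts
    label: str                 # "spam" or "ham"
    correctness: float = 1.0   # assumed initial correctness weight


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity of two term-count vectors (0.0 if either is empty)."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class InstanceWeightedKNN:
    """Hypothetical filter; step and min_correctness are illustrative values."""

    def __init__(self, k=3, step=0.5, min_correctness=0.25):
        self.k = k
        self.step = step
        self.min_correctness = min_correctness
        self.instances = []

    def _neighbors(self, features):
        # Keep the k most similar stored instances that share at least one term.
        scored = [(cosine(features, inst.features), inst) for inst in self.instances]
        scored = [pair for pair in scored if pair[0] > 0.0]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[: self.k]

    def classify(self, text):
        # Each neighbor votes with (distance weight) * (correctness weight).
        features = Counter(text.lower().split())
        votes = Counter()
        for similarity, inst in self._neighbors(features):
            votes[inst.label] += similarity * inst.correctness
        return max(votes, key=votes.get) if votes else "ham"

    def train(self, text, true_label):
        # Reward neighbors whose label matches the true label, penalize the
        # others, drop instances whose correctness fell too low, then store
        # the new email as a training instance.
        features = Counter(text.lower().split())
        for _, inst in self._neighbors(features):
            inst.correctness += self.step if inst.label == true_label else -self.step
        self.instances = [i for i in self.instances
                          if i.correctness >= self.min_correctness]
        self.instances.append(Instance(features, true_label))


spam_filter = InstanceWeightedKNN(k=3)
spam_filter.train("cheap pills buy now", "spam")
spam_filter.train("meeting agenda for monday", "ham")
spam_filter.train("buy cheap watches now", "spam")
print(spam_filter.classify("cheap pills now"))  # prints "spam"
```

In this sketch the pruning step in train is what lets the stored data follow changes in spam: an instance that keeps voting for the wrong class loses correctness weight and is eventually discarded.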
Related papers
Highlights in Science, Engineering and Technology
E-mail spam filtering has recently become a critical concern in network security, and multiple machine learning techniques have been applied to this classification problem. With the emergence of machine learning frameworks, most such tasks are now handled by effective machine learning algorithms with satisfactory performance and high speed. However, the performance of different algorithms under given circumstances still lacks an intuitive demonstration. Hence, this study focuses on the performance of two widely used algorithms (KNN and Naive Bayes), measured by accuracy and running time, and compares the distinct advantages of each algorithm when classifying emails. The paper feeds thousands of spam data samples to the two algorithms and analyzes the results of each, indicating that the KNN classifier performs better when determining the spam messages while the opposite is true for the Naive Bayes classifier. Thus, designers...
IJISTECH (International Journal of Information System and Technology), 2021
The rapid growth of email use and the convenience it provides have made email the most frequently used means of communication. Along with this growth, many parties abuse email for advertising, phishing, and sending other unwanted messages. Such messages are called spam email. One way to address the problem of spam email is filtering based on the content of the email. In early studies on the classification of spam emails, Naïve Bayes was the most commonly used method. Therefore, in this study the researchers add the Random Forest and K-Nearest Neighbor (KNN) methods for comparison, in order to find which method classifies spam emails most accurately. In the trial, the Naïve Bayes classification algorithm achieved an accuracy of 83.5%, Random Forest 83.5%, and KNN 82.75%.
IJRASET, 2021
E-mail is the most common method of communication because of its accessibility, rapid message exchange, and low cost of distribution. E-mail is one of the most secure media for online communication and for transferring data or messages over the Internet. As its popularity has grown, the quantity of unsolicited data has also increased rapidly. Spam causes traffic issues and bottlenecks that limit memory, bandwidth, power, and computing speed. Various data-filtering approaches exist that automatically detect and remove these unwanted messages. There are several email spam filtering techniques, such as knowledge-based techniques, clustering techniques, learning-based techniques, heuristic processes, and so on. This paper presents a survey of existing email spam filtering systems based on Machine Learning Techniques (MLT) such as Naive Bayes, SVM, K-Nearest Neighbor, Bayes Additive Regression, KNN Tree, and rules. We give the classification, evaluation, and comparison of some email spam filtering systems and summarize the accuracy rates of the existing approaches.
Proceedings of the 2010 IEEE 6th International Conference on Intelligent Computer Communication and Processing, 2010
Spamming has become a time-consuming and expensive problem for which several new directions have been investigated lately. This paper presents a new approach for a spam detection filter. The solution developed is an offline application that uses the k-Nearest Neighbor (kNN) algorithm and a pre-classified email data set for the learning process.
2020
The huge and still growing volume of unsolicited bulk e-mail (spam) is the major reason for developing anti-spam filters. Machine learning provides an effective approach to filtering spam automatically with a high success rate. In this paper we survey some of the most popular machine learning algorithms (Naïve Bayes, k-NN, SVMs and ANN) and their applicability to the problem of spam e-mail classification. Descriptions of the algorithms are presented, along with a comparison of their performance on the UCI Spambase dataset. Keywords: Spam, E-mail classification, Machine learning algorithms, k-NN, SVM, Naïve Bayes, ANN.
2006
The great number and variety of learning-based spam filters proposed in recent years create a need for comprehensive, many-sided evaluation of them that takes the characteristics of spam into account. This paper analyzes how filter performance depends on the temporal distribution of the training data; the cause of this dependence is the changeability of email. Such analysis provides additional information about filter quality and may also be useful for organizing more effective training of the filter. The naïve Bayes filter is chosen for evaluation in this paper.
This paper describes an e-mail spam filter based on local SVM, that is, an SVM classifier trained only on a neighborhood of the message to be classified rather than on the whole available training data. Two problems are stated and solved. First, the selection of the right neighborhood size is shown to be critical; our solution is based on estimating the a-posteriori probability of the correct decision, and the resulting algorithm is called highest probability SVM nearest neighbor (HP-SVM-NN). The second problem is the application of the algorithm in practice, and we propose a practical filter architecture based on HP-SVM-NN. Extensive testing is performed on the SpamAssassin corpus and the TREC 2005 Spam Track corpus, showing that HP-SVM-NN outperforms pure SVM and is applicable in practice. Finally, we explore the locality properties of the two corpora using Sammon's projection.
A method is proposed for learning to classify spam and non-spam emails. It combines the strategy of the Best Stepwise Feature Selection with a classifier of Euclidean nearest-neighbor. Each text email is first transformed into a vector of D-dimensional Euclidean space. Emails were divided into training and test sets in the manner of 10-fold cross-validation. Three experiments were performed, and their elapsed CPU times and accuracies reported. The proposed spam detection learner was found to be extremely fast in recognition and with good error rates. It could be used as a baseline learning agent, in terms of CPU time and accuracy, against which other learning agents can be measured.
The development of data-mining applications such as classification and clustering has shown the need for machine learning algorithms that can be applied to large-scale data. The article gives an overview of some of the most popular machine learning methods (Gaussian and Nearest Mean) and of their applicability to the problem of spam e-mail filtering. The aim of this paper is to compare and investigate the effectiveness of classifiers for filtering spam e-mails using different matrices. Since spam is becoming increasingly difficult to detect, these automated techniques will help save much of the time and resources required to handle email messages.
International Journal of Research in Engineering and Technology, 2016
Electronic mail is used daily by billions of people to interact and communicate around the world and is a critical application for many businesses. Over the last couple of decades unsolicited bulk email has become a headache for email users. A staggering amount of spam streams into users' mailboxes daily. Spam is not only irritating for most email users but also overtaxes the IT infrastructure of businesses and costs billions of dollars in wasted productivity. The need for effective spam filtering techniques is increasing. Machine learning algorithms can be used with current spam filtering schemes for increased efficiency. This paper presents a comparative study of the performance of different machine learning algorithms that can be used to classify a mail as spam or ham.
