York University at TREC 2005: SPAM Track

2005, … of Text Retrieval Conference: SPAM Track, …

Abstract

We propose a variant of the k-nearest neighbor (kNN) classification method, called the instance-weighted kNN method, for adaptive spam filtering. The method assigns two weights, a distance weight and a correctness weight, to each training instance and uses both weights when classifying a new email. The correctness weight is also used to maintain the training data, so that the training data adapt to changes in spam characteristics. We submitted four spam filters to the Spam Track. Two of the filters are based purely on the instance-weighted kNN method. The other two combine the kNN method with other spam filtering and classification techniques. We report the official results of our submissions on the Spam Track evaluation data sets.
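The abstract does not give the weight formulas, so the following is only a minimal sketch of how an instance-weighted kNN filter of this general kind might look, not the authors' implementation. Everything specific in it is an assumption made for illustration: cosine similarity as the distance weight, a fixed additive `step` for updating the correctness weight, the `min_correctness` pruning threshold, and the names `Instance` and `InstanceWeightedKNN`.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class Instance:
    features: dict             # term -> weight (hypothetical sparse representation)
    label: str                 # "spam" or "ham"
    correctness: float = 1.0   # assumed initial correctness weight


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(v * b.get(t, 0.0) for t, v in a.items())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class InstanceWeightedKNN:
    def __init__(self, k=3, step=0.5, min_correctness=0.25):
        self.k = k
        self.step = step                        # assumed update size
        self.min_correctness = min_correctness  # assumed pruning threshold
        self.instances = []

    def _neighbors(self, features):
        # Distance weight (assumption): cosine similarity to the new email.
        scored = [(cosine(features, inst.features), inst) for inst in self.instances]
        scored = [pair for pair in scored if pair[0] > 0.0]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[: self.k]

    def classify(self, features):
        # Each neighbor votes with distance weight * correctness weight.
        votes = {"spam": 0.0, "ham": 0.0}
        for similarity, inst in self._neighbors(features):
            votes[inst.label] += similarity * inst.correctness
        return "spam" if votes["spam"] > votes["ham"] else "ham"

    def train(self, features, true_label):
        # Reward neighbors that agree with the true label, penalize the rest.
        for _, inst in self._neighbors(features):
            if inst.label == true_label:
                inst.correctness += self.step
            else:
                inst.correctness -= self.step
        # Maintenance: drop instances whose correctness weight fell too low.
        self.instances = [
            inst for inst in self.instances
            if inst.correctness >= self.min_correctness
        ]
        self.instances.append(Instance(dict(features), true_label))
```

In this sketch, calling `train` with the true label after each classified message both updates the correctness weights of the neighbors involved and prunes instances that have repeatedly voted for the wrong class, which is one plausible way for the training data to track drifting spam characteristics.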
