Abstract
This paper addresses the security of electronic mail using machine learning algorithms. Spam emails are unwanted messages, usually commercial, sent to a large number of recipients. In this work, we propose an algorithm for detecting spam messages with machine learning methods. The algorithm accepts as input text email messages labelled as benign ("ham") or malicious (spam) and produces a text file in CSV format. This file is then used to train ten machine learning techniques to classify incoming emails as ham or spam. The following techniques were tested: Support Vector Machines, k-Nearest Neighbour, Naïve Bayes, Neural Networks, Recurrent Neural Networks, AdaBoost, Random Forest, Gradient Boosting, Logistic Regression and Decision Trees. Testing was performed using two popular datasets, as well as a publicly available CSV file. Our algorithm is written in Python and produces satisfactory accuracy compared with state-of-the-art implementations. In addition, the proposed system generates three output files: a CSV file with the IP addresses of the mail servers from which the spam emails originated, a map of their geolocation, and a CSV file with statistics about the countries of origin. These files can be used to update existing organisational filters and blacklists used in other spam filters.
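The preprocessing step described in the abstract (labelled email messages in, a CSV file out, plus a list of originating IP addresses for spam) could look roughly like the sketch below. It is an illustration under stated assumptions, not the authors' implementation: the column names, the two sample messages, the helper names and the rule of reading the earliest `Received` header are all hypothetical, and the abstract does not say how the IP addresses are actually extracted. Only the Python standard library is used; classifier training and geolocation would follow as separate steps and are not shown.

```python
import csv
import io
import ipaddress
import re
from email import message_from_string

# Hypothetical sample messages. A real run would read message files
# from separate ham and spam folders instead.
SAMPLES = [
    ("ham", "Received: from mail.example.org ([192.0.2.10])\n"
            "Subject: Meeting\n\nSee you at noon.\n"),
    ("spam", "Received: from unknown ([203.0.113.7])\n"
             "Subject: WIN NOW\n\nClaim your prize today!\n"),
]

# Assumption: the sending server's IPv4 address appears in square brackets.
IPV4_IN_BRACKETS = re.compile(r"\[(\d{1,3}(?:\.\d{1,3}){3})\]")


def origin_ip(msg):
    """Return the IPv4 address from the earliest Received header, or None."""
    # Each relay prepends its own Received header, so the last one in the
    # message is the earliest hop. Headers can be forged; treat as a hint.
    for header in reversed(msg.get_all("Received", [])):
        match = IPV4_IN_BRACKETS.search(header)
        if match is None:
            continue
        try:
            return str(ipaddress.ip_address(match.group(1)))
        except ValueError:
            continue  # looked like an address but was not a valid one
    return None


def to_rows(samples):
    """Turn (label, raw message) pairs into dictionaries, one per CSV row."""
    rows = []
    for label, raw in samples:
        msg = message_from_string(raw)
        payload = msg.get_payload()
        body = payload if isinstance(payload, str) else ""  # skip multipart
        rows.append({
            "label": label,
            "subject": msg.get("Subject", ""),
            "text": " ".join(body.split()),  # collapse whitespace and newlines
            "ip": origin_ip(msg) or "",
        })
    return rows


def write_csv(rows, handle):
    """Write the rows to an open text handle in CSV format."""
    writer = csv.DictWriter(handle, fieldnames=["label", "subject", "text", "ip"])
    writer.writeheader()
    writer.writerows(rows)


rows = to_rows(SAMPLES)
buffer = io.StringIO()  # stands in for the output CSV file
write_csv(rows, buffer)

# Addresses seen in spam only, as a starting point for a blacklist file.
spam_ips = [row["ip"] for row in rows if row["label"] == "spam" and row["ip"]]
print(buffer.getvalue())
print(spam_ips)
```

The resulting CSV holds one labelled row per message, which is the kind of table a text classifier can be trained on after vectorising the `text` column. The `spam_ips` list corresponds to the first of the three output files mentioned above; mapping those addresses to countries would need an external geolocation database.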