SPAM EMAIL DETECTION USING MACHINE LEARNING INTEGRATED IN CLOUD

https://doi.org/10.1109/ICNWC57852.2023.10127237

Abstract

In this project, we focus on electronic mail, one of the most important means of communication among information professionals. Its importance and utility continue to grow as its use spreads among the general population, and it has made both private and professional communication more flexible and convenient. The increased use of email has brought a rise in spam alongside legitimate messages. Spam is email sent to a large number of recipients without their knowledge or consent. Email spam is a widespread problem that currently frustrates millions of casual and professional internet users. The purpose of this study is to provide a hybrid machine learning approach for identifying spam in email. The proposed hybrid techniques apply bagging and boosting to machine learning-based Decision Tree, multinomial Naive Bayes, KNN, Random Forest, and SVM classifiers. Bagging combines weak classifiers in parallel to boost classification accuracy, and it reduces the variance of the misclassification rate. Boosting, by contrast, links the classifiers in series to construct a strong classifier out of two or more relatively weak ones, which can improve classification results by reducing bias and variance. Detecting spam in email involves selecting datasets, pre-processing them, extracting and selecting features, and classifying the data. The experiments in this study use data from the Ling-Spam Corpus and the CSDMC2010 Spam Corpus. The Ling-Spam Corpus is split into four directories according to whether a stop-word list and a lemmatiser have been applied: bare, lemm, lemm_stop, and stop. Pre-processing consists of converting strings to word vectors (tokenization), stemming words, and removing stop words.
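The contrast between bagging (weak classifiers trained in parallel on bootstrap samples, combined by vote) and boosting (weak classifiers trained in series on reweighted data) can be illustrated with a minimal sketch. It rests on assumptions that are not taken from the paper: the weak learner is a one-feature decision stump, the boosting variant is AdaBoost, and the toy data, vocabulary, and all function names are invented for illustration.

```python
import math
import random

# Toy data (invented): binary word-presence vectors for the hypothetical
# vocabulary [free, winner, meeting, report]; +1 = spam, -1 = legitimate.
X = [
    [1, 1, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0], [1, 1, 0, 1],
    [0, 0, 1, 1], [0, 0, 1, 0], [0, 0, 0, 1], [1, 0, 1, 1],
]
y = [1, 1, 1, 1, -1, -1, -1, -1]


def stump_predict(stump, row):
    """A stump (feature, polarity) predicts `polarity` when the word is present."""
    feature, polarity = stump
    return polarity if row[feature] == 1 else -polarity


def fit_stump(X, y, weights):
    """Return (stump, weighted_error) for the stump with the lowest weighted error."""
    best = None
    for feature in range(len(X[0])):
        for polarity in (1, -1):
            stump = (feature, polarity)
            err = sum(w for row, label, w in zip(X, y, weights)
                      if stump_predict(stump, row) != label)
            if best is None or err < best[1]:
                best = (stump, err)
    return best


def bagging(X, y, n_estimators=15, seed=0):
    """Train each stump independently on its own bootstrap sample."""
    rng = random.Random(seed)
    n = len(X)
    stumps = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        stump, _ = fit_stump([X[i] for i in idx], [y[i] for i in idx],
                             [1.0 / n] * n)
        stumps.append(stump)
    return stumps


def bagging_predict(stumps, row):
    """Combine the stumps by unweighted majority vote."""
    votes = sum(stump_predict(s, row) for s in stumps)
    return 1 if votes >= 0 else -1


def adaboost(X, y, n_rounds=3):
    """Train stumps in series; each round upweights the previous mistakes."""
    n = len(X)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(n_rounds):
        stump, err = fit_stump(X, y, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0) and division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        weights = [w * math.exp(-alpha * label * stump_predict(stump, row))
                   for row, label, w in zip(X, y, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def adaboost_predict(ensemble, row):
    """Combine the stumps by an alpha-weighted vote."""
    score = sum(alpha * stump_predict(s, row) for alpha, s in ensemble)
    return 1 if score >= 0 else -1


single_stump, single_err = fit_stump(X, y, [1.0 / len(X)] * len(X))
bagged = bagging(X, y)
bag_preds = [bagging_predict(bagged, row) for row in X]
boosted = adaboost(X, y)
boost_preds = [adaboost_predict(boosted, row) for row in X]
```

On this toy set the best single stump misclassifies one of the eight messages, while three rounds of boosting classify all eight training rows correctly. The bagged vote depends on the random bootstrap samples, so its predictions are not stated here.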
Since the Ling-Spam Corpus is already organised according to the stop-word list and the lemmatiser, only the CSDMC2010 Spam Corpus undergoes the stemming and stop-word removal processes. Features are extracted and selected from the pre-processed data. The feature selection procedure in this work makes use of a correlation-based approach.
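The pre-processing and feature selection steps can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' pipeline: the stop-word list is a tiny stand-in, `crude_stem` is a hypothetical suffix stripper used in place of a real stemmer, the four example messages are invented, and the selection step is a plain filter that ranks each word by the absolute Pearson correlation between its presence and the spam label. The abstract only says "correlation-based", so the paper's actual criterion may differ.

```python
import math
import re

# Tiny illustrative stop-word list (assumption); real lists are much longer.
STOP_WORDS = {"a", "an", "and", "the", "is", "to", "of", "in", "for", "you"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def crude_stem(word):
    """Strip a few common suffixes; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Tokenize, drop stop words, then stem the remaining tokens."""
    return [crude_stem(t) for t in tokenize(text) if t not in STOP_WORDS]


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (v - mean_y) for x, v in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((v - mean_y) ** 2 for v in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


# Invented example messages: label 1 = spam, 0 = legitimate.
emails = [
    ("Win a free prize now", 1),
    ("Free prize waiting, claim now", 1),
    ("Meeting agenda for Monday", 0),
    ("Monday meeting report attached", 0),
]

docs = [set(preprocess(text)) for text, _ in emails]
labels = [label for _, label in emails]

# Binary word vectors: one column per vocabulary word, one row per message.
vocab = sorted({word for doc in docs for word in doc})
columns = {w: [1 if w in doc else 0 for doc in docs] for w in vocab}

# Score each word by its correlation with the label and keep the top k.
scores = {w: pearson(columns[w], labels) for w in vocab}
k = 5
selected = sorted(vocab, key=lambda w: (-abs(scores[w]), w))[:k]
```

Ranking by absolute value keeps words that signal legitimate mail (negative correlation) as well as words that signal spam, since both help a downstream classifier.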
