Spam Email Detection Using Machine Learning Integrated In Cloud

Richa Kumari Karn; V. Elizabeth Jesi; Shabnam Mohamed Aslam

doi:10.1109/ICNWC57852.2023.10127237

Joyece Jane

Abstract

In this project, we focus on electronic mail, one of the most important means of communication among information professionals. Its importance and utility continue to grow as its use spreads among the general public. It has made communication more flexible and convenient in both private and professional settings. The increased use of email has brought a rise in spam alongside legitimate messages. Spam is email sent to a large number of recipients without their consent. Millions of internet users, both casual and professional, are frustrated by the widespread problem of email spam. The purpose of this study is to provide a hybrid machine learning approach for identifying spam in email. The proposed hybrid techniques apply bagging and boosting to machine learning classifiers: Multinomial Naive Bayes, Decision Tree, KNN, Random Forest, and SVM. Bagging combines weak classifiers in parallel to improve classification accuracy, and it reduces the variance of the misclassification error. Boosting, by contrast, links classifiers in series to construct a robust classifier out of two or more relatively weak ones, and it can improve classification results by reducing bias and variance. Detecting spam in email involves choosing datasets, pre-processing them, extracting and selecting features, and classifying the data. In this study, the experiments use data from the Ling-Spam Corpus and the CSDMC2010 Spam Corpus. The Ling-Spam Corpus is split into four directories according to whether a stop-word list and a lemmatiser have been applied: bare, lemm, lemm_stop, and stop. Pre-processing consists of converting strings to word vectors (tokenization), stemming words, and removing stop words.
Since the Ling-Spam Corpus is already organised according to the stop-word list and the lemmatiser, only the CSDMC2010 Spam Corpus undergoes stemming and stop-word removal. Features are then extracted and selected from the pre-processed data. Feature selection in this work uses a correlation-based approach.
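
The pre-processing steps named in the abstract (tokenization into word vectors, stop-word removal, stemming) can be illustrated with a minimal sketch. This is not the authors' pipeline: the stop-word list, the `crude_stem` suffix stripper, and all function names are assumptions made for illustration, and a real system would use a full stop-word list and a proper stemmer.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "for", "you"}


def crude_stem(word):
    """Strip a few common English suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters of the word remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Tokenize on letters, drop stop words, then stem the remaining tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


def to_word_vector(tokens, vocabulary):
    """Map a token list to a term-count vector over a fixed vocabulary."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


# Two made-up messages, used only to exercise the functions.
emails = [
    "Winning prizes is easy: claim the prize now",
    "The meeting is scheduled for Monday",
]
processed = [preprocess(e) for e in emails]
vocabulary = sorted({t for doc in processed for t in doc})
vectors = [to_word_vector(doc, vocabulary) for doc in processed]

for doc, vec in zip(processed, vectors):
    print(doc, vec)
```

Each message becomes a vector of term counts over a shared vocabulary, which is the form a classifier such as Multinomial Naive Bayes expects as input.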

Figures (10)

data patterns [4]. Data mining is not a stand-alone strategy; rather, it is an umbrella term for a number of related processes, such as those used in database management, AI, statistics, and machine learning. Its main goal is to analyse unprocessed data and produce large amounts of structured information. Figure 1 depicts a schematic of the data mining and knowledge discovery process [5]. The fundamental components of the process are raw data, target data, pre-processed data, transformed data, patterns and rules, and knowledge. Obtaining these elements requires a number of steps: selection, cleaning, pre-processing, feature processing, algorithmic processing, interpretation, and evaluation.

Fig 1: Flowchart of the Knowledge Acquisition and Data Mining Procedure

Email spam is not a new phenomenon; in 1978, Gary Thuerk manually sent the first spam message on ARPANET (ARPANET, 2006) to 400 people to draw attention to the launch of his company's DECSYSTEM computer products [6]. The rising incidence of spamming is cause for concern: according to Talos Intelligence data from December 2019, 84.85% of all global email traffic was spam [7]. In addition to wasting recipients' time, effort, and bandwidth, email spam can lead them to malicious websites that could compromise their computer systems through phishing or malware. Phishing for sensitive information through spam emails can facilitate financial scams and terrorist activity.

Numerous techniques and filters exist for identifying spam emails. Because feature selection is difficult and an unsuitable classification strategy is often chosen, existing methods fail to achieve the necessary performance. This study modifies the bagging and boosting ensemble methods to identify spam in electronic messages. SVM and Multinomial Naive Bayes algorithms are combined through bagging and boosting.
Ensemble methods can compensate for the weaknesses of the individual classifiers. The performance of the algorithms used to detect email spam is enhanced by the bagging and …
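
To make the bagging idea concrete, the following sketch trains several Multinomial Naive Bayes models on bootstrap resamples of a toy corpus and combines them by majority vote. It is an illustration under stated assumptions, not the authors' implementation: the paper combines SVM and MNB, whereas this sketch uses MNB alone as the base learner, and the function names, the toy messages, and the choice of 15 models are all hypothetical.

```python
import math
import random
from collections import Counter


def train_mnb(docs, labels):
    """Fit a multinomial Naive Bayes model with Laplace (add-one) smoothing."""
    vocab = {t for d in docs for t in d}
    model = {"vocab": vocab, "prior": {}, "counts": {}, "total": {}}
    for c in set(labels):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        model["prior"][c] = math.log(len(class_docs) / len(docs))
        model["counts"][c] = Counter(t for d in class_docs for t in d)
        model["total"][c] = sum(model["counts"][c].values())
    return model


def predict_mnb(model, doc):
    """Return the class with the highest log posterior for a token list."""
    v = len(model["vocab"])

    def score(c):
        s = model["prior"][c]
        for t in doc:
            if t in model["vocab"]:  # tokens unseen in training are skipped
                s += math.log((model["counts"][c][t] + 1) / (model["total"][c] + v))
        return s

    return max(sorted(model["prior"]), key=score)


def train_bagging(docs, labels, n_models=15, seed=0):
    """Train each base model on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    n = len(docs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(train_mnb([docs[i] for i in idx], [labels[i] for i in idx]))
    return models


def predict_bagging(models, doc):
    """Combine the base models' predictions by majority vote."""
    votes = Counter(predict_mnb(m, doc) for m in models)
    return votes.most_common(1)[0][0]


# Made-up, already tokenized messages; not drawn from either corpus.
docs = [
    ["free", "prize", "claim", "now"],
    ["free", "prize", "money", "win"],
    ["win", "free", "prize", "cash"],
    ["claim", "free", "cash", "prize"],
    ["meeting", "report", "monday", "agenda"],
    ["project", "meeting", "report", "review"],
    ["report", "meeting", "notes", "team"],
    ["team", "meeting", "agenda", "report"],
]
labels = ["spam"] * 4 + ["ham"] * 4

models = train_bagging(docs, labels)
print(predict_bagging(models, ["free", "prize", "now"]))
print(predict_bagging(models, ["meeting", "report", "team"]))
```

Because every base model sees a different resample, their individual errors are partly independent, and the vote averages them out; this is the variance reduction attributed to bagging. Boosting would instead train the models in sequence, reweighting the examples that earlier models misclassified.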

3.1. THE DECISION TREE

Decision trees have emerged as a popular machine

Table 4: FN, FP, TN, and TP values assessed by machine learning-based classifiers for the Stop category of the Ling Spam Corpus

Table 5: FN, FP, TN, and TP values assessed by machine learning-based classifiers for the CSDMC2010 Spam Corpus

Table 3: FN, FP, TN, and TP values assessed by machine learning-based classifiers for the Lemm-Stop category of the Ling Spam Corpus

Tables 1 through 5 show that, compared to the MNB method, the SVM has greater true values (TP and TN) and lower false values (FN and FP). Additional evaluation in terms of performance metrics is needed to identify the optimal algorithm. The evaluated values are then used to compute Precision, Recall, Accuracy, F-Measure, FNR, FPR, and TNR. Table 6 presents the results of testing the machine learning classifiers SVM and MNB on the bare category of the Ling Spam Corpus. The SVM outperforms the MNB method in terms of precision, recall, F-measure, accuracy, and TNR, while its FNR and FPR are lower. The lower an algorithm's false-measure values (FNR and FPR), the greater its efficacy.
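
The performance metrics listed above follow directly from the four confusion-matrix counts. The sketch below uses the standard definitions; the function name `spam_metrics` and the example counts are assumptions for illustration and are not taken from the paper's tables.

```python
def spam_metrics(tp, fp, tn, fn):
    """Compute standard metrics from confusion-matrix counts.

    Spam is treated as the positive class. The function assumes every
    denominator is non-zero and does not guard against division by zero.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # also called the true positive rate
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "f_measure": 2 * precision * recall / (precision + recall),
        "fnr": fn / (fn + tp),  # false negative rate = 1 - recall
        "fpr": fp / (fp + tn),  # false positive rate = 1 - TNR
        "tnr": tn / (tn + fp),  # true negative rate (specificity)
    }


# Made-up counts for illustration only.
m = spam_metrics(tp=90, fp=5, tn=95, fn=10)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

Since FNR = 1 - recall and FPR = 1 - TNR, a classifier with higher recall and TNR necessarily has lower FNR and FPR, which is why the false-measure values move in the opposite direction from the other metrics.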

References (22)

S. A. A. Ghaleb et al., "Feature Selection by Multiobjective Optimization: Application to Spam Detection System by Neural Networks and Grasshopper Optimization Algorithm," in IEEE Access, vol. 10, pp. 98475-98489, 2022.
A. Karim, S. Azam, B. Shanmugam, K. Kannoorpatti and M. Alazab, "A Comprehensive Survey for Intelligent Spam Email Detection," in IEEE Access, vol. 7, pp. 168261-168295, 2019.
A. Karim, S. Azam, B. Shanmugam and K. Kannoorpatti, "Efficient Clustering of Emails Into Spam and Ham: The Foundational Study of a Comprehensive Unsupervised Framework," in IEEE Access, vol. 8, pp. 154759-154788, 2020.
A. Karim, S. Azam, B. Shanmugam and K. Kannoorpatti, "An Unsupervised Approach for Content-Based Clustering of Emails Into Spam and Ham Through Multiangular Feature Formulation," in IEEE Access, vol. 9, pp. 135186-135209, 2021.
G. Al-Rawashdeh, R. Mamat and N. Hafhizah Binti Abd Rahim, "Hybrid Water Cycle Optimization Algorithm With Simulated Annealing for Spam E-mail Detection," in IEEE Access, vol. 7, pp. 143721-143734, 2019.
S. Gibson, B. Issac, L. Zhang and S. M. Jacob, "Detecting Spam Email With Machine Learning Optimized With Bio-Inspired Metaheuristic Algorithms," in IEEE Access, vol. 8, pp. 187914-187932, 2020.
S. Maroofi, M. Korczyński, A. Hölzel and A. Duda, "Adoption of Email Anti-Spoofing Schemes: A Large Scale Analysis," in IEEE Transactions on Network and Service Management, vol. 18, no. 3, pp. 3184-3196, Sept. 2021.
S. A. A. Ghaleb, M. Mohamad, S. A. Fadzli and W. A. H. M. Ghanem, "Training Neural Networks by Enhance Grasshopper Optimization Algorithm for Spam Detection System," in IEEE Access, vol. 9, pp. 116768-116813, 2021.
Z. -Y. Zhao and P. Zeng, "Efficient All-or-Nothing Public Key Encryption With Authenticated Equality Test," in IEEE Access, vol. 9, pp. 94099-94108, 2021.
M. Hijji and G. Alam, "A Multivocal Literature Review on Growing Social Engineering Based Cyber-Attacks/Threats During the COVID-19 Pandemic: Challenges and Prospective Solutions," in IEEE Access, vol. 9, pp. 7152-7169, 2021.
M. Raza, N. D. Jayasinghe and M. M. A. Muslam, "A Comprehensive Review on Email Spam Classification using Machine Learning Algorithms," 2021 International Conference on Information Networking (ICOIN), Jeju Island, Korea (South), 2021, pp. 327-332.
R. Wongwatkit, M. Raktham and T. Phawananthaphuti, "Intelligent Blacklist Security System for Protecting Spammer in Corporate Email Solution: A Case of Corporate Email Service Provider in Thailand," 2021 23rd International Conference on Advanced Communication Technology (ICACT), PyeongChang, Korea (South), 2021, pp. 387-391.
C. Bansal and B. Sidhu, "Machine Learning based Hybrid Approach for Email Spam Detection," 2021 9th International Conference on Reliability, Infocom Technologies and Optimization (Trends and Future Directions) (ICRITO), Noida, India, 2021, pp. 1-4.
A. Sumithra, A. Ashifa, S. Harini and N. Kumaresan, "Probability-based Naïve Bayes Algorithm for Email Spam Classification," 2022 International Conference on Computer Communication and Informatics (ICCCI), Coimbatore, India, 2022, pp. 1-5.
S. Kaddoura, O. Alfandi and N. Dahmani, "A Spam Email Detection Mechanism for English Language Text Emails Using Deep Learning Approach," 2020 IEEE 29th International Conference on Enabling Technologies: Infrastructure for Collaborative Enterprises (WETICE), Bayonne, France, 2020, pp. 193-198.
N. Gavrilovic and V. Ciric, "Design and Evaluation of Proof of Work Based Anti-Spam Solution," 2020 Zooming Innovation in Consumer Technologies Conference (ZINC), Novi Sad, Serbia, 2020, pp. 286-289.
N. A. Farahisya and F. A. Bachtiar, "Spam Email Detection with Affect Intensities using Recurrent Neural Network Algorithm," 2022 2nd International Conference on Information Technology and Education (ICIT&E), Malang, Indonesia, 2022, pp. 206-211.
H. Dinendra, C. Rajapakse and P. P. G. Dinesh Asanka, "Personalized Classification of Non-Spam Emails Using Machine Learning Techniques," 2022 International Research Conference on Smart Computing and Systems Engineering (SCSE), Colombo, Sri Lanka, 2022, pp. 171-177.
S. K. Ganiev and S. J. Khamidov, "Artificial Intelligence-Based Methods For Filtering Spam Messages In Email Services," 2021 International Conference on Information Science and Communications Technologies (ICISCT), Tashkent, Uzbekistan, 2021, pp. 1-4.
A. B, D. P. M, K. M, M. N. Joseph and D. R, "Spam Detection System Using Supervised ML," 2021 International Conference on System, Computation, Automation and Networking (ICSCAN), 2021, pp. 1-5, doi: 10.1109/ICSCAN53069.2021.9526421.
Rundong Yang, Kangfeng Zheng, Bin Wu, Chunhua Wu, and Xiujuan Wang. Phishing website detection based on deep convolutional neural network and random forest ensemble learning. Sensors, 21(24):8281, 2021.
Rayan, A. (2022). Analysis of e-Mail Spam Detection Using a Novel Machine Learning-Based Hybrid Bagging Technique. Computational Intelligence and Neuroscience, 2022.
