E-Mail Spam Detection using Machine Learning and Deep Learning
2020, International Journal for Research in Applied Science and Engineering Technology
https://doi.org/10.22214/IJRASET.2020.6159…
7 pages
Related papers
International Journal of Scientific Research in Computer Science, Engineering and Information Technology, 2023
This comprehensive review delves into the realm of email spam classification, scrutinizing the efficacy of various machine learning methods employed in the ongoing battle against unwanted email communication. The paper synthesizes a wide array of research findings, methodologies, and performance metrics to provide a holistic perspective on the evolving landscape of spam detection. Emphasizing the pivotal role of machine learning in addressing the dynamic nature of spam, the review explores the strengths and limitations of popular algorithms such as Naive Bayes, Support Vector Machines, and neural networks. Additionally, it examines feature engineering, dataset characteristics, and evolving threats, offering insights into the challenges and opportunities within the field. With a focus on recent advancements and emerging trends, this review aims to guide researchers, practitioners, and developers in the ongoing pursuit of robust and adaptive email spam classification systems.
Email Spam Detection Using Hierarchical Attention Hybrid Deep Learning Method, 2022
Email is one of the most widely used ways to communicate, with millions of people and businesses relying on it daily to share knowledge and information. Nevertheless, the rise in email users has led to a dramatic increase in spam emails in recent years. Processing and managing emails properly is becoming increasingly difficult for individuals and companies. This article proposes a novel technique for email spam detection that is based on a combination of convolutional neural networks, gated recurrent units, and attention mechanisms. During system training, the network selectively focuses on the necessary parts of the email text. The use of convolution layers to extract more meaningful, abstract, and generalizable features by hierarchical representation is the major contribution of this study. Additionally, this contribution incorporates cross-dataset evaluation, which yields performance results that are more independent of the model's training dataset. According to cross-dataset evaluation results, the proposed technique advances the results of the present attention-based techniques by utilizing temporal convolutions, which provide more flexible receptive field sizes. The suggested technique's findings are compared to those of state-of-the-art models and show that our approach outperforms them.
Advances in Engineering and Intelligence Systems, 2025
Spam emails constitute a significant percentage of email traffic and are considered a cybersecurity threat, often leading to phishing attacks, malware infections, and financial fraud. These emails, sent in bulk for commercial and malicious purposes, can bypass traditional spam filters, necessitating the development of high-accuracy models for effective detection. A major challenge in spam filtering is reducing false positives, which can lead to legitimate emails being incorrectly classified as spam, impacting users' email communication. In this study, deep learning (DL) and natural language processing (NLP) methods were employed to develop a spam detection model. Five DL-based models—Dense, CNN, LSTM, CNN-LSTM, and BERT—were evaluated. Data preprocessing included stemming, lemmatization, and text vectorization using Word2Vec to enhance feature extraction. The models were trained on a real dataset, and their accuracy was assessed using multiple evaluation indices. The findings demonstrated that, among the tested models, BERT achieved the highest accuracy (99.33%), outperforming all other approaches in spam detection. Its ability to understand contextual relationships and mitigate false positives makes it highly suitable for real-world applications. Given its computational demands, future research should focus on optimizing BERT for real-time deployment through model compression and parallel execution. Additionally, further testing on larger and more diverse datasets and implementing multilingual spam filtering capabilities will enhance its practical utility.
International Journal of Intelligent Systems and Applications in Engineering, 2020
In this study, we provide an alternative solution to the spam and legitimate email classification problem. Different deep learning architectures are applied to two feature selection methods, Mutual Information (MI) and Weighted Mutual Information (WMI). Firstly, the WMI and MI feature selection methods are applied to reduce the number of selected terms. Secondly, the feature vectors are constructed using the bag-of-words (BoW) model. Finally, the performance of the system is analyzed using Artificial Neural Network (ANN), Long Short-Term Memory (LSTM) and Bidirectional Long Short-Term Memory (BILSTM) models. After experimental simulations, we observed that the detection results of WMI and MI are competitive with each other when compared by accuracy rate for an agglutinative language, namely Turkish. The experimental scores show that the LSTM and BILSTM give 100% accuracy scores when combined with MI or WMI, for spam and legitimate emails. However, for a particular cross-validation, the performance of WMI is higher than that of MI features in terms of e-mail grouping. Given the high detection scores, WMI and MI combined with deep learning architectures appear to be more robust for spam email detection.
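The first two steps of the pipeline described above (MI-based term selection, then bag-of-words vectors) can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' implementation: the function names `mutual_information`, `top_terms`, and `bow_vector` are hypothetical, term presence is treated as a binary variable, and the weighted variant (WMI) is not shown because the abstract does not define its weighting.

```python
import math
from collections import Counter


def mutual_information(docs, labels):
    """Return {term: MI in bits} between binary term presence and class label.

    docs is a list of token lists; labels is a parallel list of class names.
    """
    n = len(docs)
    class_counts = Counter(labels)
    joint = Counter()     # (term, label) -> number of documents containing term
    doc_freq = Counter()  # term -> number of documents containing term
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            joint[(term, label)] += 1
            doc_freq[term] += 1

    scores = {}
    for term, df in doc_freq.items():
        mi = 0.0
        for label, n_c in class_counts.items():
            n_tc = joint[(term, label)]
            # Two cells per class: term present, then term absent.
            for n_cell, n_t in ((n_tc, df), (n_c - n_tc, n - df)):
                if n_cell > 0:  # empty cells contribute nothing
                    mi += (n_cell / n) * math.log2(n_cell * n / (n_t * n_c))
        scores[term] = mi
    return scores


def top_terms(scores, k):
    """Keep the k highest-scoring terms (ties broken alphabetically)."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


def bow_vector(tokens, vocabulary):
    """Count how often each selected term occurs in one document."""
    counts = Counter(tokens)
    return [counts[t] for t in vocabulary]
```

The resulting vectors would then be fed to a classifier such as the ANN or LSTM models named in the abstract; that stage is omitted here.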
Journal of intelligent learning systems and applications, 2022
Spam emails pose a threat to individuals. The proliferation of spam emails daily has rendered traditional machine learning and deep learning methods for screening them ineffective and inefficient. In our research, we employ deep neural networks like RNN, LSTM, and GRU, incorporating attention mechanisms such as Bahdanau, scaled dot product (SDP), and Luong scaled dot product self-attention for spam email filtering. We evaluate our approach on various datasets, including TREC spam, Enron spam emails, SMS spam collections, and the Ling spam dataset, which constitutes a substantial custom dataset. All these datasets are publicly available. For the Enron dataset, we attain an accuracy of 99.97% using LSTM with SDP self-attention. Our custom dataset exhibits the highest accuracy of 99.01% when employing GRU with SDP self-attention. The SMS spam collection dataset yields a peak accuracy of 99.61% with LSTM and SDP attention. Using the GRU (Gated Recurrent Unit) alongside the Luong and SDP attention mechanisms, we reach a peak accuracy of 99.89% on the Ling spam dataset. For the TREC spam dataset, the most accurate results are achieved using Luong attention LSTM, with an accuracy rate of 99.01%. Our performance analyses consistently indicate that employing the scaled dot product attention mechanism in conjunction with gated recurrent neural networks (GRU) delivers the most effective results. In summary, our research underscores the efficacy of employing advanced deep learning techniques and attention mechanisms for spam email filtering, with remarkable accuracy across multiple datasets. This approach presents a promising solution to the ever-growing problem of spam emails.
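For readers unfamiliar with the scaled dot product (SDP) attention mentioned above, the following is a minimal sketch of the standard formulation, softmax(QK^T / sqrt(d_k))V, written with plain Python lists. It is not the authors' code: the function names are hypothetical, there are no learned projection matrices, and in a real model the queries, keys, and values would be derived from the RNN, LSTM, or GRU hidden states.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Return (outputs, weights) for lists of equal-length vectors.

    Each query is scored against every key, the scores are scaled by
    sqrt(d_k) and normalized with softmax, and the output is the
    weighted sum of the value vectors.
    """
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        w = softmax(scores)
        weights.append(w)
        outputs.append([
            sum(wj * v[i] for wj, v in zip(w, values))
            for i in range(len(values[0]))
        ])
    return outputs, weights
```

For self-attention, the same sequence of hidden-state vectors would be passed as `queries`, `keys`, and `values`.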
International Journal for Research in Applied Science & Engineering Technology (IJRASET), 2023
Spam, usually referred to as unsolicited commercial or bulk e-mail, has recently become a major issue on the internet. Time, storage, and transmission bandwidth are all wasted by spam. Spam email has been a growing issue for years. Nowadays, automatic email filtering appears to be the most successful strategy for preventing spam. Only several years ago, most spam could be reliably dealt with by blocking e-mails coming from certain addresses or filtering out messages with certain subject lines. Spammers then started employing a number of cunning strategies to get past filtering techniques, such as using random sender addresses and/or adding random characters to the beginning or end of the message subject line. Machine learning techniques are nowadays used to filter spam e-mail automatically with a very high success rate. Machine learning is a subfield of the broad field of artificial intelligence that aims to make machines able to learn like humans. Understanding, observing, and providing knowledge about a statistical occurrence are all terms used here. In the first place, data collection and representation are typically problem-specific (i.e., for email messages), and in the second place, e-mail feature selection and feature reduction aim to lower the dimensionality (i.e., the number of features). Finally, the e-mail classification phase of the process finds the actual mapping between training set and testing set. The machine learning approach includes many algorithms that can be used in e-mail filtering, such as Naïve Bayes, K-nearest neighbour, and Support Vector Machine classifiers. In conclusion, we try to summarize the performance results of a few machine learning methods in terms of spam precision and accuracy.
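As an illustration of the classification phase described above, here is a small multinomial Naïve Bayes filter with Laplace smoothing, written with the Python standard library only. It is a sketch under stated assumptions rather than the paper's implementation: the class name `NaiveBayesSpamFilter`, the smoothing parameter `alpha`, and the choice to ignore words unseen in training are all hypothetical.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesSpamFilter:
    """Multinomial Naive Bayes over token lists with Laplace smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                       # smoothing strength (assumed)
        self.word_counts = defaultdict(Counter)  # label -> token counts
        self.class_counts = Counter()            # label -> number of documents
        self.vocab = set()

    def fit(self, docs, labels):
        for tokens, label in zip(docs, labels):
            self.class_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def log_posteriors(self, tokens):
        """Unnormalized log P(label | tokens) for every known label."""
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        result = {}
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + self.alpha * vocab_size
            logp = math.log(n_docs / total_docs)  # class prior
            for tok in tokens:
                if tok in self.vocab:  # skip words never seen in training
                    logp += math.log((counts[tok] + self.alpha) / denom)
            result[label] = logp
        return result

    def predict(self, tokens):
        scores = self.log_posteriors(tokens)
        return max(scores, key=scores.get)
```

A typical use would be `NaiveBayesSpamFilter().fit(train_docs, train_labels).predict(tokens)`, where each document is a list of lowercase tokens.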
2020
Here we present an inclusive review of recent and successful content-based e-mail spam filtering techniques. Our focus is mainly on machine learning-based spam filters and the variants inspired by them. We report on relevant ideas, techniques, major efforts, and the state-of-the-art in the field. The initial interpretation of the prior work covers the basics of e-mail spam filtering and feature engineering. We conclude by studying techniques, methods, and evaluation benchmarks, exploring the promising offshoots of the latest developments, and suggesting lines of future investigation. Keywords: SVM Classifier, Spam Email Classification, Data Mining, Data Science, Machine Learning.
IEEE Xplore, 2025
The increasing importance of SMS spam has made it crucial to detect spam messages in several languages in order to provide accurate information. In this paper, several deep learning techniques, such as recurrent neural networks (RNN), long short-term memory (LSTM), bidirectional LSTM (Bi-LSTM), and gated recurrent units (GRU), are used for multilingual spam identification. The texts in the dataset were classified as "spam" or "ham". The dataset consists of spam and ham messages in four languages: English, Hindi, French and German. The text data is preprocessed using tokenization, padding to standardize input length, and label encoding to improve model performance. Next, using the deep learning models, spam messages are identified throughout the multilingual dataset. Additionally, the models were applied by combining all text messages from the four languages into a single column for joint analysis. Each algorithm is evaluated based on its ability to capture linguistic patterns in the text and accurately differentiate between spam and non-spam messages. The outcomes highlighted the deep learning models' capacity to identify long-term dependencies in the text data for multilingual spam detection. This work contributes to advancing spam detection techniques across languages using deep learning.
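The preprocessing steps named above (tokenization, padding to a fixed length, and label encoding) can be sketched in a few lines of standard-library Python. This is only an illustration: the helper names, the whitespace tokenizer, and the `<pad>`/`<unk>` conventions are assumptions, and the paper does not say which tokenizer or library it used.

```python
def build_vocab(texts):
    """Map each lowercase whitespace token to an integer id.

    Ids 0 and 1 are reserved for padding and unknown tokens (assumed convention).
    """
    vocab = {"<pad>": 0, "<unk>": 1}
    for text in texts:
        for token in text.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(text, vocab, max_len):
    """Tokenize, map to ids, then truncate or right-pad to max_len."""
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in text.lower().split()]
    ids = ids[:max_len]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))


def encode_labels(labels):
    """Map class names such as "ham" and "spam" to integers in sorted order."""
    index = {name: i for i, name in enumerate(sorted(set(labels)))}
    return [index[label] for label in labels], index
```

Because the vocabulary is built from whatever texts are passed in, messages from all four languages can share one vocabulary, which matches the "single column" setting the abstract describes.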
Expert Systems with Applications, 2009
In this paper, we present a comprehensive review of recent developments in the application of machine learning algorithms to Spam filtering, focusing on both textual- and image-based approaches. Instead of considering Spam filtering as a standard classification problem, we highlight the importance of considering specific characteristics of the problem, especially concept drift, in designing new filters. Two particularly important aspects not widely recognized in the literature are discussed: the difficulties in updating a classifier based on the bag-of-words representation and a major difference between two early naive Bayes models. Overall, we conclude that while important advancements have been made in the last years, several aspects remain to be explored, especially under more realistic evaluation settings.
IRJET, 2021
Spam emails are unsolicited commercial emails or deceptive emails sent to a specific person or a company [5]. Spam can be detected through natural language processing and machine learning methodologies. Machine learning methods are commonly used in spam filtering. These methods use machine learning classifiers to classify emails as either ham (valid messages) or spam (unwanted messages). The proposed work showcases differentiating features of the content of documents [4]. A lot of work has been performed in the area of spam filtering, but it is limited to some domains. Research on spam email detection focuses either on natural language processing methodologies [25] applied to single machine learning algorithms or on one natural language processing technique [22] applied to multiple machine learning algorithms [2]. In this project, a modeling pipeline is developed to review the machine learning methodologies.

References (10)
- Abduelbaset M. Goweder, Tarik Rashed, Ali S. Elbekaie, and Husien A. Alhammi, "An Anti-Spam System Using Artificial Neural Networks And Genetic Algorithms" (A Neural Model In Anti Spam).
- Er. Seema Rani, Er. Sugandha Sharma, "Survey on E-mail Spam Detection Using NLP", International Journal of Advanced Research in Computer Science and Software Engineering, India, Volume 4, Issue 5, May 2014.
- Masurah Mohamad, Khairulliza Ahmad Salleh, "Independent Feature Selection as Spam-Filtering Technique: An Evaluation of Neural Network", Malaysia.
- El-Sayed M. El-Alfy, "Learning Methods For Spam Filtering", College of Computer Sciences and Engineering King Fahd University of Petroleum and Minerals, Saudi Arabia.
- Upasna Attri & Harpreet Kaur, "Comparative Study of Gaussian and Nearest Mean Classifiers for Filtering Spam E-mails", Global Journal of Computer Science and Technology Network, Web & Security, USA, Volume 12 Issue 11 Version June 2012.
- Alia Taha Sabri, Adel Hamdan Mohammads, Bassam Al-Shargabi, Maher Abu Hamdeh, "Developing New Continuous Learning Approach for Spam Detection using Artificial Neural Network (CLA_ANN)", European Journal of Scientific Research, ISSN 1450-216X Vol.42 No.3 (2010), pp.511-521.
- Enrique Puertas Sanz, José María Gómez Hidalgo, José Carlos Cortizo Pérez, "Email Spam Filtering", Universidad Europea de Madrid Villaviciosa de Odón, 28670 Madrid, SPAIN.
- Ravinder Kamboj, "A rule based approach for spam detection", Computer Science and Engineering Department, Thapar University, India, July 2010.
- Vandana Jaswal, Nidhi Sood, "Spam Detection System Using Hidden Markov Model", International Journal of Advanced Research in Computer Science and Software Engineering, India, Volume 3, Issue 7, July 2013.
- Sahil Puri, Dishant Gosain, Mehak Ahuja, Ishita Kathuria, Nishtha Jatana, "Comparison and Analysis of Spam Detection.