Academia.edu

Spam Filtering

923 papers
1,962 followers
About this topic
Spam filtering is the process of identifying and blocking unsolicited or unwanted electronic messages, typically in email, using algorithms and heuristics to classify content as either legitimate or spam. This technique aims to enhance user experience and security by reducing the volume of irrelevant or harmful communications.

Key research themes

1. How do machine learning techniques address the evolving challenges of email spam filtering?

This theme explores the application and advancement of various machine learning (ML) algorithms in email spam filtering, focusing on handling concept drift, feature extraction, ensemble learning, and hybrid models to improve detection accuracy and adaptability under realistic scenarios where spam characteristics continuously evolve.

Key finding: This comprehensive review highlights that while traditional ML approaches such as Naive Bayes remain foundational, evolving challenges like concept drift and the obfuscation of spam texts necessitate adaptive filters. It...
Key finding: Demonstrates the effectiveness of ensemble learning strategies (bagging and boosting) applied to classifiers including multinomial Decision Trees, Naive Bayes, KNN, Random Forest, and SVM for spam detection. The study finds...
Key finding: Through empirical comparison on the Spambase dataset, the study shows that Naive Bayes outperforms Support Vector Machines (SVM) and K-Nearest Neighbors (KNN) classifiers in email spam detection accuracy. This reinforces...
Key finding: Provides a multi-model evaluation (including Random Forest, AdaBoost, Decision Tree, SVM, and Naive Bayes) using balanced datasets and multiple metrics beyond accuracy. The Random Forest model attains the highest accuracy...
Key finding: Proposes a spam detection system leveraging Naive Bayes classifiers integrated with tokenization and stop word filtering via scikit-learn. Emphasis is on the adaptability of ML techniques to changing spam tactics and the...
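
The Naive Bayes approach with tokenization and stop-word filtering mentioned in these findings can be illustrated with a minimal sketch. This is not the code of any listed paper and it deliberately does not use scikit-learn: it is a hand-rolled multinomial Naive Bayes with Laplace smoothing, written against the Python standard library only so that it stays self-contained. The class name `NaiveBayesSpamFilter`, the `tokenize` helper, the tiny `STOP_WORDS` set, and the four training messages are all hypothetical and chosen purely for illustration.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"a", "an", "the", "to", "of", "and", "is", "in",
              "for", "you", "your", "at", "on"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z']+", text.lower())
            if t not in STOP_WORDS]


class NaiveBayesSpamFilter:
    """Multinomial Naive Bayes over word counts with Laplace smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing strength (1.0 = add-one smoothing)
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = Counter()
        self.vocab = set()

    def fit(self, texts, labels):
        """Count tokens per class; labels must be 'spam' or 'ham'."""
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.doc_counts[label] += 1
            self.vocab.update(tokens)
        return self

    def _log_score(self, tokens, label):
        # log P(label) + sum over tokens of log P(token | label)
        total_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[label] / total_docs)
        total_words = sum(self.word_counts[label].values())
        denom = total_words + self.alpha * len(self.vocab)
        for tok in tokens:
            if tok in self.vocab:  # tokens never seen in training are skipped
                count = self.word_counts[label][tok]
                score += math.log((count + self.alpha) / denom)
        return score

    def predict(self, text):
        """Return the class with the higher log-posterior score."""
        tokens = tokenize(text)
        return max(("spam", "ham"),
                   key=lambda label: self._log_score(tokens, label))


# Toy training data, invented for this example.
train_texts = [
    "Win a free prize now",
    "Free money, claim your prize",
    "Meeting agenda for Monday",
    "Lunch at noon on Monday?",
]
train_labels = ["spam", "spam", "ham", "ham"]

model = NaiveBayesSpamFilter().fit(train_texts, train_labels)
print(model.predict("claim your free prize"))
print(model.predict("agenda for the Monday meeting"))
```

Because the model is just a set of counters, calling `fit` again on newly labeled messages updates it incrementally, which is one simple way a filter of this kind could follow changing spam vocabulary. A production system would more plausibly rely on a maintained library and a far larger corpus.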

2. What are the roles and limitations of pre-acceptance filtering techniques in combating spam at the SMTP server level?

This research area investigates the application of pre-acceptance filtering mechanisms, such as blacklisting, whitelisting, and sender behavior profiling before accepting emails at the SMTP protocol handshake stage, aiming to reduce server load and increase early detection of spam. It also assesses the potential and practical limitations of these techniques in handling diverse spam sources.

Key finding: Empirical analysis over millions of emails shows that well-constructed blacklists can filter up to 86% of spam by identifying offending IP blocks and individual senders during pre-acceptance SMTP interactions. However, a...
Key finding: Introduces a reactive spam filtering system leveraging reporter reputation to enable earlier spam campaign detection. The method prioritizes feedback from trustworthy users to identify spamming quickly before widespread...
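
The blacklisting and whitelisting checks described in this theme can be sketched as a decision made before the message body is ever accepted. The following is a minimal illustration under stated assumptions, not the method of any listed paper: the class name `PreAcceptanceFilter`, the `decide` method, and the rule that a whitelisted sender overrides a blocked IP range are all hypothetical design choices, and the addresses come from the IP ranges and domains reserved for documentation. It uses only Python's standard `ipaddress` module.

```python
import ipaddress


class PreAcceptanceFilter:
    """Decide whether to accept a message using only connection-level data.

    Inputs are the connecting client's IP address and the envelope sender,
    both of which are known before the message body is transferred.
    """

    def __init__(self, blocked_networks=(), blocked_senders=(),
                 allowed_senders=()):
        # Blocked IP blocks are stored as network objects, e.g. 192.0.2.0/24.
        self.blocked_networks = [ipaddress.ip_network(n)
                                 for n in blocked_networks]
        self.blocked_senders = {s.lower() for s in blocked_senders}
        self.allowed_senders = {s.lower() for s in allowed_senders}

    def decide(self, client_ip, mail_from):
        """Return 'accept' or 'reject' for one incoming connection."""
        sender = mail_from.strip().lower()
        # Assumption: the whitelist wins over every blacklist entry.
        if sender in self.allowed_senders:
            return "accept"
        addr = ipaddress.ip_address(client_ip)
        if any(addr in net for net in self.blocked_networks):
            return "reject"
        if sender in self.blocked_senders:
            return "reject"
        return "accept"


# Example policy built from documentation-only addresses and domains.
policy = PreAcceptanceFilter(
    blocked_networks=["192.0.2.0/24"],
    blocked_senders=["bulk@spam.example"],
    allowed_senders=["partner@trusted.example"],
)

print(policy.decide("192.0.2.55", "anyone@unknown.example"))
print(policy.decide("192.0.2.55", "partner@trusted.example"))
print(policy.decide("198.51.100.7", "bulk@spam.example"))
print(policy.decide("198.51.100.7", "alice@mail.example"))
```

Checking only the IP address and envelope sender keeps the per-connection cost low, which matches the stated goal of reducing server load, but it also means such a filter sees nothing of the message content and cannot catch spam from sources it has not already listed.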

3. How can stylometric and content-based features alongside machine learning improve detection of sophisticated and AI-generated spam and phishing emails?

This theme focuses on detecting advanced unsolicited emails, including AI-generated phishing attempts, by extracting linguistic and stylometric features, employing interpretable machine learning models, and analyzing email content beyond traditional signature-based approaches to counteract increasingly sophisticated cyber threats.

Key finding: This work evaluates major email providers' abilities to block GPT-4o generated phishing emails, revealing vulnerabilities especially in Gmail and Outlook. Applying 60 stylometric features to classifiers identified XGBoost as...
Key finding: Investigates spam detection in user-generated comments by employing natural language processing (NLP) techniques, such as broken text flow and topic detection, combined with machine learning classifiers. The approach...
Key finding: While not directly about spam filtering, this paper discusses AI's broader societal impacts, emphasizing the emerging challenges and opportunities in cultural settings due to AI's integration. It underlines the importance of...

All papers in Spam Filtering

Spam has become a major issue in computer security because it is a channel for threats such as computer viruses, worms and phishing. Many solutions feature machine-learning algorithms trained using statistical representations of the terms...
This paper presents the use of corpus linguistics techniques on supposedly "clean" corpora and identifies potential pitfalls. Our work relates to the task of filtering sensitive content, in which data security is strategically important...
In this paper we describe an approach to information assurance in which we can prevent breach of confidentiality. Specifically, we examine aspects of the propagation of confidential information via email. Email provides one simple...
The growing popularity of YouTube video-sharing platforms requires organizations to analyze viewer comments for public opinion assessment and content development. A web application powered by machine learning techniques analyzes sentiment...
Electronic mail (email) has established a significant place in the lives of information users. Email is a major mode of information sharing because it is a faster and more effective means of communication. Email plays its...
Access control in multi-tenant cloud environments faces significant challenges due to encrypted communications, protocol diversity, and dynamic tenant behavior. Traditional access control methods, such as static rule-based and...
In the modern digital era, the demand for highly available and resilient systems is constantly increasing, especially in cloud environments and data centers that provide critical services. Xen virtualization is one of the most...
The exponential growth of mobile communication has intensified the threat of SMS spam, compromising user security and trust in messaging platforms. This study addresses this challenge by designing and deploying a robust spam detection...
Spam comes in a variety of content types, such as text, images, embedded HTML, and MIME attachments, and the volume of spam mail sent per day is massive. To handle this high volume, high velocity, and large variety of spam, a scalable spam...
Autism Spectrum Disorder (ASD) is a neurological condition that affects the mental, social and physical state of a person. A person of any age group can be affected by it. It is very difficult to identify, if a person is the...
An intelligent system that uses Natural Language Processing (NLP) and Machine Learning (ML) to automate resume classification is presented in this paper. Key resume features, such as education, skills, and job titles, are extracted and...
In the modern era, mobile phones have become ubiquitous, and Short Message Service (SMS) has grown to become a multi-million-dollar service due to the widespread adoption of mobile devices and the millions of people who use SMS daily....
With the growth of networking, the use of email has also increased. Due to the rapid growth of the internet, communication depends largely on electronic mail for both commercial and business purposes. According to today's...
This paper describes our approach towards the ECML/PKDD Discovery Challenge 2010. The challenge consists of three tasks: (1) a Web genre and facet classification task for English hosts, (2) an English quality task, and (3) a multilingual...
The widespread use of email as a primary communication medium has led to an increase in spam messages, which pose significant threats to privacy, productivity, and cybersecurity. Spam emails, often disguised as legitimate messages, can...
Phishing remains one of the most prevalent and evolving cybersecurity threats, exploiting human vulnerabilities through deceptive digital communication. This study proposes a dynamic, Windows-specific phishing detection model leveraging...
The expressive power of regular expressions has been often exploited in network intrusion detection systems, virus scanners, and spam filtering applications. However, the flexible pattern matching functionality of regular expressions in...
Classifier performance optimization in machine learning can be stated as a multi-objective optimization problem. In this context, recent works have shown the utility of simple evolutionary multi-objective algorithms (NSGA-II, SPEA2) to...
Question answering systems are among the most widely used learning mechanisms in the student community. This paper focuses mainly on paragraph-based question answering. Datasets such as the Stanford Question-Answering Dataset...
There is a critical need for organizations to share data within and across infospheres and form coalitions so that analysts could examine the data, mine the data, and make effective decisions. Each organization could share information...
The advanced architecture of Large Language Models (LLMs) has revolutionised natural language processing, enabling the creation of text that convincingly mimics legitimate human communication, including phishing emails. As AI-generated...
In this paper, we present a new spam filter which acts as an additional layer in the spam filtering process. This filter is based on what we call a representative vocabulary. Spam e-mails are divided into categories in which each category...
This paper outlines how various machine learning techniques may be applied to processing the data from DEA services to find out whether people use these services for legitimate or non-legitimate purposes.
The attribute-oriented induction (AOI) method is a useful tool for data capable of extracting generalized knowledge from relational data and the user's background knowledge. However, a potential weakness of AOI is that it...
During the past thirty years, the world of computing has evolved from large centralised computing centres to an increasingly distributed computing environment, where computation and communication capabilities are being embedded in...
Combatting email spam has remained a very daunting task. Despite the over 99% accuracy in most non-image-based spam email detection, studies on image-based spam hardly attain such a high level of accuracy as new email spamming techniques...
Digital materials can be protected from failures by replicating them at multiple autonomous, distributed sites. A Peer-to-peer Information Preservation and Exchange (PIPE) network is a good way to build a distributed replication system. A...
This paper presents EmailValet, a system that learns users' email-reading preferences on email-capable wireless platforms, specifically on two-way pagers with small "qwerty" keyboards and an 8-line 30-character display. In use by the...
This paper explores the fundamental principles of internal linking and PageRank optimization. It provides a detailed overview of how internal links influence a website’s SEO, helping distribute link equity and improving user navigation....
Malicious websites are those that are created to harm visitors or exploit their information for illegal purposes. These websites are commonly utilized in attacks, such as phishing, malware distribution, and scams. Clicking on a malicious...
In virtual machine environments each application is often run in its own virtual machine (VM), isolating it from other applications running on the same physical machine. Contention for memory, disk space, and network bandwidth among...
Email phishing is a manipulative technique aimed at compromising information security and user privacy. To overcome the limitations of traditional detection methods, such as blacklists, this research proposes a phishing detection model...
In this paper, we examine the application of various grouping techniques to help improve the efficiency and reduce the costs involved in an electronic discovery process. Specifically, we create coherent groups of email documents which...
In recent times, the problem of Unsolicited Bulk Email (UBE), commonly known as spam email, has grown at a tremendous rate. We present a survey-based analysis of classifications of UBE in various research works....
Email has become a fast and cheap means of online communication. The main threat to email is Unsolicited Bulk Email (UBE), commonly called spam email. The current work aims at identification of unigrams in more than 2700 UBE that...
In this paper, we introduce a statistical rule-based method to create rules for SpamAssassin to detect spam in different languages. The theoretical framework for generating and maintaining multilingual rules is also illustrated. The...
Advanced machine learning and natural language techniques enable attackers to launch sophisticated and targeted social engineering based attacks. To counter the active attacker issue, researchers have since resorted to proactive methods...
Spam is one of the most prominent and relevant topics that needs to be understood in the current scenario. Everyone, from small children to the elderly, uses email every day all around the world. The...
In this paper we describe an online/incremental linear binary classifier based on an interesting approach to estimate the Fisher subspace. The proposed method makes it possible to deal with datasets having high cardinality, being dynamically...
Search engine spam is a web page or a portion of a web page which has been created with the intention of increasing its ranking in search engines. Web spamming refers to actions intended to mislead search engines and give some pages...
Search engines have made progress lately, and the number of web pages increases every day. Search engines are the most common search systems for meeting the needs of users in searching the...
Email spam is a kind of electronic spam and has become one of the more difficult problems among all internet challenges. Spam mails are mostly sent for commercial purposes, and some of them may contain malware links that lead to phishing...
The series "Lecture Notes in Networks and Systems" publishes the latest developments in Networks and Systems, quickly, informally and with high quality. Original research reported in proceedings and post-proceedings represents the core of...
Relationship between sleep quality and sociodemographic factors. This work is licensed under a Creative Commons license (CC-BY-NC-SA).
This study proposes a new method that utilizes the correlation structure between the number of words in the mail and the Bayesian score. Spam mails usually do not have a stable style and features. Spammers who send such mails go on...
The use of email has grown exponentially over the past decade, making it one of the most widely used forms of electronic communication. Recently, spam emails have become a major issue for email users. A spammer is someone who sends out...