Spam Filtering

923 papers
1,962 followers
About this topic
Spam filtering is the process of identifying and blocking unsolicited or unwanted electronic messages, typically in email, using algorithms and heuristics to classify content as either legitimate or spam. This technique aims to enhance user experience and security by reducing the volume of irrelevant or harmful communications.

Key research themes

1. How do machine learning techniques address the evolving challenges of email spam filtering?

This theme explores the application and advancement of various machine learning (ML) algorithms in email spam filtering, focusing on handling concept drift, feature extraction, ensemble learning, and hybrid models to improve detection accuracy and adaptability under realistic scenarios where spam characteristics continuously evolve.

Key finding: This comprehensive review highlights that while traditional ML approaches such as Naive Bayes remain foundational, evolving challenges like concept drift and the obfuscation of spam texts necessitate adaptive filters. It...
Key finding: Demonstrates the effectiveness of ensemble learning strategies (bagging and boosting) applied to classifiers including multinomial Decision Trees, Naive Bayes, KNN, Random Forest, and SVM for spam detection. The study finds...
Key finding: Through empirical comparison on the Spambase dataset, the study shows that Naive Bayes outperforms Support Vector Machines (SVM) and K-Nearest Neighbors (KNN) classifiers in email spam detection accuracy. This reinforces...
Key finding: Provides a multi-model evaluation (including Random Forest, AdaBoost, Decision Tree, SVM, and Naive Bayes) using balanced datasets and multiple metrics beyond accuracy. The Random Forest model attains the highest accuracy...
Key finding: Proposes a spam detection system leveraging Naive Bayes classifiers integrated with tokenization and stop word filtering via scikit-learn. Emphasis is on the adaptability of ML techniques to changing spam tactics and the...
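
As a rough illustration of the Naive Bayes approach that recurs in the findings above, the sketch below combines tokenization, stop-word filtering, and a multinomial Naive Bayes classifier with add-one smoothing. It is not taken from any of the listed papers and uses only the Python standard library rather than scikit-learn; the stop-word list, the class name NaiveBayesSpamFilter, and the four training messages are all hypothetical and far too small for real use.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (illustration only).
STOP_WORDS = {"the", "a", "an", "to", "of", "and", "is", "for", "you", "your", "at", "on"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z']+", text.lower()) if t not in STOP_WORDS]


class NaiveBayesSpamFilter:
    """Multinomial Naive Bayes over word counts with additive smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing constant (1.0 = Laplace smoothing)
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = Counter()
        self.vocab = set()

    def fit(self, texts, labels):
        # Assumes every label is "spam" or "ham" and both occur at least once.
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.doc_counts[label] += 1
            self.vocab.update(tokens)
        return self

    def log_scores(self, text):
        # Words never seen in training are ignored.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, counts in self.word_counts.items():
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(self.doc_counts[label] / total_docs)  # class prior
            for t in tokens:
                score += math.log((counts[t] + self.alpha) / denom)
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Toy training set (hypothetical messages).
train_texts = [
    "win a free prize now",
    "free money click now",
    "meeting agenda for monday",
    "lunch at noon tomorrow",
]
train_labels = ["spam", "spam", "ham", "ham"]
model = NaiveBayesSpamFilter().fit(train_texts, train_labels)
```

Calling model.predict on a new message returns either "spam" or "ham". Working in log space avoids numeric underflow when a message contains many tokens, and the smoothing constant keeps a word seen in only one class from driving the other class's probability to zero.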

2. What are the roles and limitations of pre-acceptance filtering techniques in combating spam at the SMTP server level?

This research area investigates the application of pre-acceptance filtering mechanisms, such as blacklisting, whitelisting, and sender behavior profiling before accepting emails at the SMTP protocol handshake stage, aiming to reduce server load and increase early detection of spam. It also assesses the potential and practical limitations of these techniques in handling diverse spam sources.

Key finding: Empirical analysis over millions of emails shows that well-constructed blacklists can filter up to 86% of spam by identifying offending IP blocks and individual senders during pre-acceptance SMTP interactions. However, a...
Key finding: Introduces a reactive spam filtering system leveraging reporter reputation to enable earlier spam campaign detection. The method prioritizes feedback from trustworthy users to identify spamming quickly before widespread...
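
The blacklisting and whitelisting idea described in this theme can be sketched as a check on the connecting client's IP address before any message body is accepted. This is a minimal sketch under stated assumptions, not the method of either paper: the class name SenderPolicy, its decide method, and the sample networks (taken from the reserved documentation address ranges) are hypothetical, and a real SMTP server would combine such a check with many other signals.

```python
import ipaddress


class SenderPolicy:
    """Pre-acceptance decision based on blocked networks and allowed addresses."""

    def __init__(self, blocked_networks, allowed_addresses=()):
        # Blocked entries are CIDR blocks; allowed entries are single addresses.
        self.blocked = [ipaddress.ip_network(n) for n in blocked_networks]
        self.allowed = {ipaddress.ip_address(a) for a in allowed_addresses}

    def decide(self, client_ip):
        addr = ipaddress.ip_address(client_ip)
        # An explicit whitelist entry overrides a blocked network.
        if addr in self.allowed:
            return "accept"
        if any(addr in net for net in self.blocked):
            return "reject"
        # Unknown senders are accepted here; content filtering would follow.
        return "accept"


# Hypothetical policy: block one /24, but whitelist a single host inside it.
policy = SenderPolicy(["192.0.2.0/24"], ["192.0.2.10"])
```

With this policy, policy.decide("192.0.2.55") rejects the connection while policy.decide("192.0.2.10") accepts it. Checking the whitelist first is a design choice that lets an operator exempt a known-good sender inside an otherwise blocked address block.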

3. How can stylometric and content-based features alongside machine learning improve detection of sophisticated and AI-generated spam and phishing emails?

This theme focuses on detecting advanced unsolicited emails, including AI-generated phishing attempts, by extracting linguistic and stylometric features, employing interpretable machine learning models, and analyzing email content beyond traditional signature-based approaches to counteract increasingly sophisticated cyber threats.

Key finding: This work evaluates major email providers' abilities to block GPT-4o generated phishing emails, revealing vulnerabilities especially in Gmail and Outlook. Applying 60 stylometric features to classifiers identified XGBoost as...
Key finding: Investigates spam detection in user-generated comments by employing natural language processing (NLP) techniques, such as broken text flow and topic detection, combined with machine learning classifiers. The approach...
Key finding: While not directly about spam filtering, this paper discusses AI's broader societal impacts, emphasizing the emerging challenges and opportunities in cultural settings due to AI's integration. It underlines the importance of...

All papers in Spam Filtering

The growing problem of unsolicited bulk e-mail, also known as "spam", has generated a need for reliable anti-spam e-mail filters. Filters of this type have so far been based mostly on manually constructed keyword patterns. An alternative...
Naive Bayes is very popular in commercial and open-source anti-spam e-mail filters. There are, however, several forms of Naive Bayes, something the anti-spam literature does not always acknowledge. We discuss five different versions of...
In this paper, we present a comprehensive review of recent developments in the application of machine learning algorithms to Spam filtering, focusing on both textual- and image-based approaches. Instead of considering Spam filtering as a...
Email spam is one of the major problems of today's Internet, bringing financial damage to companies and annoying individual users. Among the approaches developed to stop spam, filtering is an important and popular one. In this paper...
Satire is an attractive subject in deception detection research: it is a type of deception that intentionally incorporates cues revealing its own deceptiveness. Whereas other types of fabrications aim to instill a false sense of truth in...
by David Carmel and 1 more
We present an approach for detecting link spam common in blog comments by comparing the language models used in the blog post, the comment, and pages linked by the comments. In contrast to other link spam filtering approaches, our method...
The rapid growth of Twitter has triggered a dramatic increase in spam volume and sophistication. The abuse of certain Twitter components such as "hashtags", "mentions", and shortened URLs enables spammers to operate efficiently. These...
In this paper we present Botlab, a platform that continually monitors and analyzes the behavior of spam-oriented botnets. Botlab gathers multiple real-time streams of information about botnets taken from distinct perspectives. By combining...
In recent years, we have witnessed a dramatic increase in the volume of spam email. Other related forms of spam are increasingly proving to be a significant problem, especially spam on Instant Messaging services (the so-called...
In this paper we present a neural network based system for automated e-mail filing into folders and antispam filtering. The experiments show that it is more accurate than several other techniques. We also investigate the effects of...
Consider the following problem: given sets of unlabeled observations, each set with known label proportions, predict the labels of another set of observations, also with known label proportions. This problem appears in areas like...
This paper introduces a novel method, UDmap, to identify dynamically assigned IP addresses and analyze their dynamics pattern. UDmap is fully automatic, and relies only on application-level server logs that are already available today. We...
In recent years anti-spam filters have become necessary tools for Internet service providers to face up to the continuously growing spam phenomenon. Current server-side anti-spam filters are made up of several modules aimed at detecting...
In this paper we propose a novel, passive approach for detecting and tracking malicious flux service networks. Our detection system is based on passive analysis of recursive DNS (RDNS) traffic traces collected from multiple large...
Recent work in P2P overlay networks allows for decentralized object location and routing (DOLR) across networks based on unique IDs. In this paper, we propose an extension to DOLR systems to publish objects using generic feature vectors...
We consider the problem of content-based spam filtering for short text messages that arise in three contexts: mobile (SMS) communication, blog comments, and email summary information such as might be displayed by a low-bandwidth client....
The upsurge in the volume of unwanted emails called spam has created an intense need for the development of more dependable and robust antispam filters. Machine learning methods have recently been used to successfully detect and filter...
There are a few key benefits of a case-based approach to spam filtering. First, the many different sub-types of spam suggest that a local learner, such as Case-Based Reasoning (CBR), will perform well. Second, the lazy approach to learning...
If communication involves some transaction cost to both sender and recipient, what policy ensures that correct messages (those with positive social surplus) get sent? Filters block messages that harm recipients but benefit senders by...
A new trend in email spam is the emergence of image spam. Although current anti-spam technologies are quite successful in filtering text-based spam emails, the new image spams are substantially more difficult to detect, as they employ a...
The subject of this research is the development of the architecture of an expert system for a distributed content aggregation system, the main purpose of which is the categorization of aggregated data. The author examines the advantages and...
A great number of machine learning techniques have been applied to problems where data is collected over an extended period of time. However, the disadvantage with many real-world applications is that the distribution underlying the data...
Short messaging service (SMS) spam refers to unsolicited or undesired messages received on mobile phones. These SMS spam messages constitute a veritable nuisance to mobile subscribers. This marketing practice also worries...
Unsolicited e-mail is sent to recipients without their consent and is a method typically used by people with malicious or promotional aims. Because e-mail is easy to use and inexpensive...
Producing estimates of classification confidence is surprisingly difficult. One might expect that classifiers that can produce numeric classification scores (e.g. k-Nearest Neighbour or Naive Bayes) could readily produce confidence...
In this paper we show an instance-based reasoning e-mail filtering model that outperforms classical machine learning techniques and other successful lazy learner approaches in the domain of anti-spam filtering. The architecture of the...
Mobile spam is an increasing threat that may be addressed using filtering systems like those employed against email spam. We believe that email filtering techniques require some adaptation to reach good levels of performance on SMS spam,...
In this research, we propose a methodology for advert value calculation in CPM, CPC and CPA networks. Accurately estimating this value increases the income of these three network types by selecting the most profitable advert. By increasing...
Using statistical machine learning for making security decisions introduces new vulnerabilities in large scale systems. We show how an adversary can exploit statistical machine learning, as used in the SpamBayes spam filter, to render it...
Due to the increased use of Short Message Service (SMS) over mobile phones in developing countries, there has been a burst of spam SMSes. Content-based machine learning approaches were effective in filtering email spams. Researchers have...
Because of the changing nature of spam, a spam filtering system that uses machine learning will need to be dynamic. This suggests that a case-based (memory-based) approach may work well. Case-Based Reasoning (CBR) is a lazy approach to...
In this paper we investigate how much various classes of Web spam features, some requiring very high computational effort, add to the classification accuracy. We realize that advances in machine learning, an area that has received less...
Spam has become a major issue in computer security because it is a channel for threats such as computer viruses, worms and phishing. More than 85% of received e-mails are spam. Historical approaches to combat these messages including...
In their arms race against developers of spam filters, spammers have recently introduced the image spam trick to make the analysis of emails' body text ineffective. It consists in embedding the spam message into an attached image, which...
The increasing volume of unsolicited bulk e-mail (also known as spam) has generated a need for reliable anti-spam filters. Using a classifier based on machine learning techniques to automatically filter out spam e-mail has drawn many...
In this paper we analyse the strengths and weaknesses of the most commonly used feature selection methods in text categorization when they are applied to the spam problem domain. Several experiments with different feature selection methods and...
In many security applications a pattern recognition system faces an adversarial classification problem, in which an intelligent, adaptive adversary modifies patterns to evade the classifier. Several strategies have been recently proposed...
Spam filtering is a text categorization task that has special features that make it interesting and difficult. First, the task has been performed traditionally using heuristics from the domain. Second, a cost model is required to avoid...
Content-based spam filtering is a binary text categorization problem. To improve the performance of spam filtering, feature selection, as an important and indispensable means of text categorization, also plays an important role in...
We seek to redefine spam and the role of the spam filter in the context of Social Networking Services (SNS). SNS, such as MySpace and Facebook, are increasing in popularity. They enable and encourage users to communicate with previously...
Spammers are constantly creating sophisticated new weapons in their arms race with anti-spam technology, the latest of which is image-based spam. The newest image-based spam uses simple image processing technologies to vary the content of...
In this paper we present a Markov Random Field model based approach to filter spam. Our approach examines the importance of the neighborhood relationship (MRF cliques) among words in an email message for the purpose of spam...
Because of the volume of spam email and its evolving nature, any deployed Machine Learning-based spam filtering system will need to have procedures for case-base maintenance. Key to this will be procedures to edit the case-base to remove...
Unsolicited bulk email (a.k.a. spam) is a major problem on the Internet. To counter spam, several techniques, ranging from spam filters to mail protocol extensions like hashcash, have been proposed. In this paper we investigate the...
The explosive growth of unsolicited emails has prompted the development of numerous spam filter techniques. Bayesian spam filters are superior to static keyword-based spam filters in that they can continuously evolve to tackle new spam by...
We address the problem of recognizing the so-called image spam, which consists in embedding the spam message into attached images to defeat techniques based on the analysis of e-mails' body text, and in using content obscuring techniques...