Spam Filtering

923 papers

1,962 followers

About this topic

Spam filtering is the process of identifying and blocking unsolicited or unwanted electronic messages, typically in email, using algorithms and heuristics to classify content as either legitimate or spam. This technique aims to enhance user experience and security by reducing the volume of irrelevant or harmful communications.

Key research themes

1. How do machine learning techniques address the evolving challenges of email spam filtering?

This theme explores the application and advancement of various machine learning (ML) algorithms in email spam filtering, focusing on handling concept drift, feature extraction, ensemble learning, and hybrid models to improve detection accuracy and adaptability under realistic scenarios where spam characteristics continuously evolve.

A review of machine learning approaches to Spam filtering

by Rajat Singh

2016

Key finding: This comprehensive review highlights that while traditional ML approaches such as Naive Bayes remain foundational, evolving challenges like concept drift and the obfuscation of spam texts necessitate adaptive filters. It...

SPAM EMAIL DETECTION USING MACHINE LEARNING INTEGRATED IN CLOUD

by Joyece Jane

2023

Key finding: Demonstrates the effectiveness of ensemble learning strategies (bagging and boosting) applied to classifiers including multinomial Decision Trees, Naive Bayes, KNN, Random Forest, and SVM for spam detection. The study finds...

Comparison of Three Machine Learning Models for the Detection of Emails Spam

by Raed Alkaied

2024, Research Square (Research Square)

Key finding: Through empirical comparison on the Spambase dataset, the study shows that Naive Bayes outperforms Support Vector Machines (SVM) and K-Nearest Neighbors (KNN) classifiers in email spam detection accuracy. This reinforces...

Evaluation of Supervised Learning Models for Automatic Spam Email Detection

by Tsehay Assegie

2024, Research Square (Research Square)

Key finding: Provides a multi-model evaluation (including Random Forest, AdaBoost, Decision Tree, SVM, and Naive Bayes) using balanced datasets and multiple metrics beyond accuracy. The Random Forest model attains the highest accuracy...

ML Approaches to Detect Email Spam Anamoly

by Joyece Jane

2023, various

Key finding: Proposes a spam detection system leveraging Naive Bayes classifiers integrated with tokenization and stop word filtering via scikit-learn. Emphasis is on the adaptability of ML techniques to changing spam tactics and the...

2. What are the roles and limitations of pre-acceptance filtering techniques in combating spam at the SMTP server level?

This research area investigates the application of pre-acceptance filtering mechanisms, such as blacklisting, whitelisting, and sender behavior profiling before accepting emails at the SMTP protocol handshake stage, aiming to reduce server load and increase early detection of spam. It also assesses the potential and practical limitations of these techniques in handling diverse spam sources.

On the Effectiveness of Pre-Acceptance Spam Filtering

by Zhuoqing Mao

2023

Key finding: Empirical analysis over millions of emails shows that well-constructed blacklists can filter up to 86% of spam by identifying offending IP blocks and individual senders during pre-acceptance SMTP interactions. However, a...

Trusting Spam Reporters: A Reporter-Based Reputation System for Email Filtering

by Andrei Schrenck

2016

Key finding: Introduces a reactive spam filtering system leveraging reporter reputation to enable earlier spam campaign detection. The method prioritizes feedback from trustworthy users to identify spamming quickly before widespread...

3. How can stylometric and content-based features alongside machine learning improve detection of sophisticated and AI-generated spam and phishing emails?

This theme focuses on detecting advanced unsolicited emails, including AI-generated phishing attempts, by extracting linguistic and stylometric features, employing interpretable machine learning models, and analyzing email content beyond traditional signature-based approaches to counteract increasingly sophisticated cyber threats.

Evaluating spam filters and Stylometric Detection of AI-generated phishing emails

by Paolo Modesti

2025, Expert Systems With Applications

Key finding: This work evaluates major email providers' abilities to block GPT-4o generated phishing emails, revealing vulnerabilities especially in Gmail and Outlook. Applying 60 stylometric features to classifiers identified XGBoost as...

Survey of Spam Comments Identification using NLP Techniques

by Vishal Borate

2024, International Journal of Research and Analytical Reviews (IJRAR)

Key finding: Investigates spam detection in user-generated comments by employing natural language processing (NLP) techniques, such as broken text flow and topic detection, combined with machine learning classifiers. The approach...

Artificial Intelligence and Its Impact on Punjabi Culture

by Devinder Pal Singh

2023, Punjab Dey Rang. Lahore. Pakistan. 17(3). 5-10. July- Sept.

Key finding: While not directly about spam filtering, this paper discusses AI's broader societal impacts, emphasizing the emerging challenges and opportunities in cultural settings due to AI's integration. It underlines the importance of...

All papers in Spam Filtering

An experimental comparison of naive Bayesian and keyword-based anti-spam filtering with personal e-mail messages

by John G. Koutsias

2000, Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR '00

The growing problem of unsolicited bulk e-mail, also known as "spam", has generated a need for reliable anti-spam e-mail filters. Filters of this type have so far been based mostly on manually constructed keyword patterns. An alternative...

Spam filtering with naive bayes-which naive bayes

by Vangelis Metsis

2006, Third conference on email …

Naive Bayes is very popular in commercial and open-source anti-spam e-mail filters. There are, however, several forms of Naive Bayes, something the anti-spam literature does not always acknowledge. We discuss five different versions of...

A review of machine learning approaches to Spam filtering

by Walmir Caminhas

2009, Expert Systems with Applications

In this paper, we present a comprehensive review of recent developments in the application of machine learning algorithms to Spam filtering, focusing on both textual- and image-based approaches. Instead of considering Spam filtering as a...

A survey of learning-based techniques of email spam filtering

by Anton Bryl

2008, Artificial Intelligence Review

Email spam is one of the major problems of today's Internet, bringing financial damage to companies and annoying individual users. Among the approaches developed to stop spam, filtering is an important and popular one. In this paper...

Figure 1: What to analyze? Message structure from the point of view of feature selection.

Table 1: Measures of feature relevance used for ordering features. Each measure applies to a feature. M is the set of all training messages, C_spam and C_leg are the labels of the spam class and the legitimate mail class respectively, f_i is a binary feature (for example “the word free is present in the message”), and ¬f_i is the negation of the feature f_i (for example “the word free is NOT present in the message”). All the probabilities are estimated with frequencies.

Figure 2: Graphical Comparison of Spam Filtering Algorithms in the Literature. An arrow from method A to method B with references on it means that A is outperformed by B according to the given article(s). An arrow is put only if there is an explicit claim on the relative performance of the two methods in the article. For references to the articles see table 8.

Table 2: Spam Filtering Algorithms. The following abbreviations are used: B - body, H - header, W - whole message.

Table 3: Methods used in some software anti-spam solutions. The meanings of the column titles are explained in Section 4. The addresses of websites are given in Table 4.

Table 5: Measures of filtering performance. Following Androutsopoulos et al. [4], n_{L→L} and n_{S→S} are the numbers of legitimate and spam messages classified correctly, n_{L→S} and n_{S→L} are the numbers of legitimate and spam messages misclassified, and λ is the relative cost of the two types of errors.

Table 7: Description of Public Data. ‘YES’ in the ‘Encrypted’ field means that tokens in the messages are encrypted to address personal privacy, or (in Spambase) only some extracted features of the messages are present in the corpus.

Fake News or Truth? Using Satirical Cues to Detect Potentially Misleading News

by Victoria L Rubin and

Satire is an attractive subject in deception detection research: it is a type of deception that intentionally incorporates cues revealing its own deceptiveness. Whereas other types of fabrications aim to instill a false sense of truth in...

Figure 1: Ermida's (2012) model of satirical news.

Figure 2: Two news articles about Hillary Clinton: from The Onion and The New York Times. ...sive’ Khapra beetle intercepted at Pearson”. See Figure 2 for the pairing of the articles about Hillary Clinton in the Elections topic.

Figure 3: News satire detection pipeline for distinguishing satirical from legitimate news.

Table 1: Sample News Topicality. 5 Canadian and 5 American satirical and legitimate article pairs were collected on 12 topics across 4 domains. In this study we collected and analyzed a dataset of 360 news articles as a wide-ranging and diverse data sample, representative of the scope of US and Canadian national newspapers. The dataset was collected in 2 sets. The first set was collected from 2 satirical news sites (The Onion and The Beaverton) and 2 legitimate news sources (The Toronto Star and The New York Times) in 2015. The 240 articles were aggregated by a 2 x 2 x 4 x 3 design (US/Canadian; satirical/legitimate online news; varying across 4 domains (civics, science, business, and “soft” news) with 3 distinct topics within each of the 4 domains (see Table 1).

... used Sklearn.svm.SVC (Support Vector Classification) for supervised training with a linear kernel algorithm, which is suitable for 2 class training data. Our model was trained on 270 and tested on a set of 90 news articles, with equal proportions of satirical and legitimate news. Table 2 presents the measures of precision, recall, and F-score with associated 10-fold cross validation confidence results for our satire detection model. The F-score was maximized in the case when Grammar, Punctuation and Absurdity features were used. Precision was highest when Punctuation and Grammar were included. Absurdity showed the highest recall performance.

Blocking blog spam with language model disagreement

by David Carmel and

2005, … of the first international workshop on …

We present an approach for detecting link spam common in blog comments by comparing the language models used in the blog post, the comment, and pages linked by the comments. In contrast to other link spam filtering approaches, our method...

Twitter spammer detection using data stream clustering

by Brian Dickinson

The rapid growth of Twitter has triggered a dramatic increase in spam volume and sophistication. The abuse of certain Twitter components such as "hashtags", "mentions", and shortened URLs enables spammers to operate efficiently. These...

Figure 2: StreamKM++ Spam Detection Performance Metrics by Epsilon Value

Figure 6: ROC curve representing spam detection using the StreamKM++ algorithm.

Studying Spamming Botnets Using Botlab

by john john

2009

In this paper we present Botlab, a platform that continually monitors and analyzes the behavior of spam-oriented botnets. Botlab gathers multiple real-time streams of information about botnets taken from distinct perspectives. By combining...

A case-based technique for tracking concept drift in spam filtering

by Padraig Cunningham

2005, Knowledge-Based Systems

Content based SMS spam filtering

by Enrique Puertas

2006, Proceedings of the …

In recent years, we have witnessed a dramatic increase in the volume of spam email. Other related forms of spam are increasingly emerging as an important problem, especially spam on Instant Messaging services (the so called...

Figure 2. The ROC curves and Convex Hull, Spanish database.

Table 1. Slope ranges for various settings, English database. It is noteworthy that for 100 attributes, SVM makes one FP, capturing 89% of spam messages. The optimal situation, represented by the SVM-001-200 classifier, allows detecting over 90% of spam messages, in a quite safe-for-the-user environment.

Table 2. Slope ranges for various settings, Spanish database. Let us examine the slope corresponding to the actual distribution of classes and costs in the collection. Given that P(+) = 0.146 and P(-) = 0.853, and for a cost ratio of one, the slope value is S = 5.81. The most accurate (and appropriate) classifier for these conditions is SVM-i050-IG0 (Support Vector Machines trained with all attributes with IG over zero, in a cost ratio of 1/50 - a false

A neural network based approach to automated e-mail classification

by James Clark

2003, Web Intelligence, 2003. WI 2003. …

In this paper we present a neural network based system for automated e-mail filing into folders and antispam filtering. The experiments show that it is more accurate than several other techniques. We also investigate the effects of...

3.2 Spam filtering. The size of the three corpora used is shown in Table 2. The first two are publicly available [1,2], the third one is a subset of the U5 corpus. PU1 is encrypted for privacy reasons and contains personal and spam messages. LingSpam contains e-mails sent to the Linguist mailing list mixed with spam e-mails. There are four versions of PU1 and LingSpam depending on whether stemming and

It can also be noticed that the e-mails of U2 and U4 were harder to classify than those of U1, U3 and U5. This is due to the different classification styles. While U1, U3 and U5 (like most users) categorized e-mails based on the topic and sender, U2 does this totally based on the action performed (e.g. Read&Keep, ToActOn) while U4 uses all strategies - based on the topic, sender, action and also when e-mails needed to be acted upon (e.g. ThisWeek). Thus, some mailboxes of U2 and U4 contain e-mails grouped by action and time, which complicates learning. The classifier cannot determine the priority of an e-mail ... 4.1.2. Overall performance. Table 3 shows that the simpler feature selector (V) was more effective than IG.

Table 4. Performance on spam filtering for lemm corpora

Table 5. Portability across corpora. In the second experiment the training and testing were on the same data but the feature selection was based on the other data set. The results are considerably better and indicate that even if the feature selection is not perfect, NN is able to recover by training and achieve good performance. Thus, training based on the e-mail collection of the user seems to be more important than

Table 6. Typical confusion matrices

Estimating labels from label proportions

by quoc viet

2008, Journal of Machine Learning Research

Consider the following problem: given sets of unlabeled observations, each set with known label proportions, predict the labels of another set of observations, also with known label proportions. This problem appears in areas like...

How dynamic are IP addresses

by Fang Yu

2007

This paper introduces a novel method, UDmap, to identify dynamically assigned IP addresses and analyze their dynamics pattern. UDmap is fully automatic, and relies only on application-level server logs that are already available today. We...

Figure 2: Algorithmic overview of dynamic IP block identification.

Figure 3: (a) Section of a user-IP matrix (with 1000 users and 500 IPs) from a large matrix (5483 x 2432). A ’*’ denotes 1 and zero otherwise. (b) Normalized usage-entropy vs. normalized sample usage-entropy for the 500 IP addresses shown in (a).

Figure 6: Distribution of the three categories of IPs in the address space.

Figure 7: UDmap IP statistics computed with three different metrics on per-IP basis

Figure 8: The distribution of dispersion factors across UDmap IP blocks

Figure 10: Distribution of email server IPs.

Figure 11: (a) Number of days an IP was used as a mail server to send emails. (b) Spam ratio per session. We compare the identified dynamic email servers (UDmap IP + Dynablock IP) with the likely static servers (All - UDmap IP - Dynablock IP).

Table 7: Spam sent from UDmap IPs and Dynablock IPs.

Table 8: Top 10 ASes that sent most spam.

Spam filtering based on the analysis of text information embedded into images

by Ignazio Pillai

2006

In recent years anti-spam filters have become necessary tools for Internet service providers to face up to the continuously growing spam phenomenon. Current server-side anti-spam filters are made up of several modules aimed at detecting...

... indexing, and corresponds to the feature extraction phase of pattern recognition systems. It consists of representing a document as a fixed-length feature vector, in which each feature (usually a real number) is associated to a term of the vocabulary. Terms usually correspond to individual words, or to phrases found in training documents. Indexing is usually preceded by the removal of punctuation and of stop words, and by stemming, with the aim of discarding non-discriminant terms and to reduce the vocabulary size (and thus the computational complexity). The simplest feature extraction techniques are based on the bag-of-words approach, namely only the number of term occurrences in a document is taken into account, discarding their position. Widely used features are the occurrence of the corresponding terms in a document (boolean values), the number of occurrences (integer values), or their frequencies relative to document length (real values). The number of occurrences both in the indexed document and in all training documents is taken into account in the tf-idf (term-frequency inverted-document-frequency) kind of features (Sebastiani, 2002). Statistical classifiers can then be applied to the feature-vector representation of documents. The main text categorisation techniques analysed so far for the specific task of spam filtering are based on the Naive Bayes text classifier (McCallum & Nigam, 1998), and are named “Bayesian filters” in this context (Sahami et al., 1998; Graham, 2002). It is worth noting that such techniques are currently used in several client-side spam filters. The use of support vector machine (SVM) classifiers has also been investigated (Drucker et al., 1999; Zhang et al., 2004), given their state-of-the-art performance on text categorisation tasks.
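The indexing step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop-word list, the regular-expression tokenizer and the particular tf-idf weighting (term count times log of N over document frequency) are hypothetical choices, not the ones used by the paper, and stemming is omitted.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "to", "and", "of", "is", "you"}


def tokenize(text):
    """Lowercase the text, drop punctuation and stop words (no stemming)."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS]


def index_documents(docs):
    """Return the vocabulary and, per document, four bag-of-words vectors."""
    token_lists = [tokenize(d) for d in docs]
    vocab = sorted({t for tokens in token_lists for t in tokens})
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(t for tokens in token_lists for t in set(tokens))
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        length = len(tokens) or 1  # avoid division by zero for empty documents
        vectors.append({
            "boolean": [1 if counts[t] else 0 for t in vocab],
            "count": [counts[t] for t in vocab],
            "frequency": [counts[t] / length for t in vocab],
            # One common tf-idf variant (an assumption, not the paper's formula).
            "tfidf": [counts[t] * math.log(n_docs / df[t]) for t in vocab],
        })
    return vocab, vectors
```

Word positions are discarded in all four representations, which is what makes this a bag-of-words model; a term that occurs in every document gets a tf-idf weight of zero under this variant.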

Figure 5: High-level scheme of the approach proposed in this work to implement a spam filter based on both the text in the subject and body fields of e-mails, and the text embedded into attached images. The traditional document processing steps (tokenization, indexing and classification) are extended by including in the tokenization phase the plain text extraction by OCR from attached images, besides plain text extraction from the subject and body fields. These two kinds of text can then be handled in several ways in the indexing and classification phases.

summarized in Cormack & Lynam (2006) seem to show that, for values of FP below 2%, lower FN values than the ones in Figure 6 can be attained by other spam filters based on text classification methods proposed in literature. These results suggest that our spam filter does not provide the best performance, but that it provides good performance and can be used to investigate whether the performance of a given filter on spam e-mails with attached images can be improved by also taking into account the text information embedded into images.

as for Figure 6 can be made in this case also, except for the fact that in this data set the use of text automatically extracted from images allowed the improvement of categorisation accuracy also for lower FN values.

Table 5: Fraction of misclassified test set spam e-mails among the ones containing attached images in the SpamArchive data set, for three different values of the maximum allowed FP value and for all the different numbers of features, when the term-frequency kind of features was used. Reported values are averaged across the ten runs of the experiments, and refer to three indexing methods T, T+I, and Ia. Standard deviation is reported between brackets.

Table 6: Comparison between the fraction of correctly and wrongly classified e-mails among the ones containing attached images in the SpamArchive data set, attained by using at classification phase only the text in the subject and body fields (T), and by using both the text in the subject and body fields and that automatically extracted from images (T+I). These results refer to the term-frequency kind of features, and to three different values of the maximum allowed FP value, and are averaged over the four numbers of features considered and over the ten runs of the experiments. Standard deviation is reported between brackets.

Table 7: The same comparison as in Table 6, but referred to the case in which only the text automatically extracted from images is used at classification phase (Ia).

Detecting Malicious Flux Service Networks through Passive Analysis of Recursive DNS Traces

by Wenke Le

2009, 2009 Annual Computer Security Applications Conference

In this paper we propose a novel, passive approach for detecting and tracking malicious flux service networks. Our detection system is based on passive analysis of recursive DNS (RDNS) traffic traces collected from multiple large...

Figure 1: Overview of our detection system.

DNS queries and related responses are represented by our system. Let q^(d) be a DNS query performed by a user at time t_i to resolve the set of IP addresses owned by domain name d. We formally define the information in the query and its related response as a tuple q^(d) = (t_i, T, P^(d)), where T is the time-to-live (TTL) of the DNS response, and P^(d) is the set of resolved IPs returned by the RDNS server. Also, let prefix(P, 16) be the set of distinct /16 network prefixes extracted from P^(d) [...] residing in one or few different networks. We use the function prefix(P, 16) to estimate the number of different networks in which the resolved IPs reside, and the ratio p (rule F1-c) allows us to identify queries to domains that are very unlikely to be part of a malicious flux service.
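As a rough illustration of the prefix(P, 16) function mentioned above, the following sketch uses Python's standard `ipaddress` module to collect the distinct /16 networks covering a set of resolved IPv4 addresses. The function name mirrors the text, but the implementation and the sample addresses (taken from documentation ranges) are assumptions made for the example, not the authors' code.

```python
import ipaddress


def prefix(ips, length=16):
    """Return the set of distinct /length IPv4 networks that contain the given IPs."""
    # strict=False masks the host bits, so "192.0.2.10/16" becomes 192.0.0.0/16.
    return {ipaddress.ip_network(f"{ip}/{length}", strict=False) for ip in ips}


# Hypothetical resolved-IP set for one DNS response.
resolved_ips = ["192.0.2.10", "192.0.2.200", "198.51.100.7", "203.0.113.5"]
distinct_networks = prefix(resolved_ips, 16)
```

The size of the returned set, `len(distinct_networks)`, is then a simple estimate of how many different networks the resolved IPs are spread across.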

Figure 2: Cluster Analysis, Sensor 1.

... just one. We can also think of the second factor as a sort of “confidence” on the first one. The parameter γ is chosen a priori, and is only used to shift the sigmoid towards the right with respect to the x-axis. We set γ = 3 in our experiments so that if min(|R^(α)|, |R^(β)|) = 3 the weight factor will be equal to 0.5. As the minimum number of resolved IPs grows, the sigmoidal weight tends to its asymptotic value of 1.
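The behaviour of the sigmoidal weight factor described above can be reproduced with a logistic function shifted to the right by the a-priori parameter. The exact functional form below is an assumption: it is simply a standard sigmoid chosen so that it matches the two properties stated in the text (a weight of 0.5 when the minimum number of resolved IPs equals the parameter, and an asymptote of 1 as that minimum grows).

```python
import math


def sigmoid_weight(n_resolved_a, n_resolved_b, gamma=3.0):
    """Confidence weight based on the smaller of two resolved-IP set sizes.

    Assumed form: 1 / (1 + exp(gamma - m)), with m = min of the two sizes.
    """
    m = min(n_resolved_a, n_resolved_b)
    return 1.0 / (1.0 + math.exp(gamma - m))
```

With the default shift of 3, two domains whose smaller resolved-IP set has exactly three addresses get a weight of 0.5, while pairs with many resolved IPs approach a weight of 1.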

Figure 3: Detection of domains in spam emails.

Approximate Object Location and Spam Filtering on Peer-to-Peer Systems

by Feng Zhou

2003

Recent work in P2P overlay networks allows for decentralized object location and routing (DOLR) across networks based on unique IDs. In this paper, we propose an extension to DOLR systems to publish objects using generic feature vectors...

Spam filtering for short messages

by Enrique Puertas

2007, Proceedings of the …

We consider the problem of content-based spam filtering for short text messages that arise in three contexts: mobile (SMS) communication, blog comments, and email summary information such as might be displayed by a low-bandwidth client. ...

Machine learning for email spam filtering: review, approaches and open research problems

by Emmanuel Gbenga Dada and

2019, Heliyon

The upsurge in the volume of unwanted emails called spam has created an intense need for the development of more dependable and robust antispam filters. Machine learning methods have recently been used to successfully detect and filter...

Fig. 1. The volume of spam emails 4th quarter 2016 to 1st quarter 2018. Many researchers and academicians have proposed different email spam classification techniques which have been successfully used to classify data into groups. These methods include probabilistic, decision

Fig. 2. Pictorial Representation of the Structure of this paper. There is a rapid increase in the interest being shown by the global research community on email spam filtering. In this section, we present similar reviews that have been presented in the literature in this domain. This method is followed so as to articulate the issues that are yet to be addressed and to highlight the differences with our current review. Lueg [17] presented a brief survey to explore the gaps in whether information filtering and information retrieval technology can be applied to postulate Email spam detection in a logical, theoretically grounded manner, in order to facilitate the introduction of spam filtering technique that could ... The rest of this paper is organized as follows: Section 2 gives a

Fig. 3. Email server spam filtering architecture.

Fig. 4. Architecture of neural network (NN) Classifier.

Fig. 5. Rough Set (RS) email filtering process workflow from user mailbox.

Fig. 6. Decision Tree Algorithm for email spam filtering. ... (emails contain both spam and ham) of the dataset is reduced. The dataset can be tested using the decision tree algorithm after the tree is created from the training email dataset. The email dataset being tested undergoes some processing in the tree using some predefined rules until it gets to a leaf node. The label in the leaf node is then assigned to the tested data. Below in Fig. 6 is a theoretical tree that illustrates how the decision tree algorithm carries out its spam filtering operation. F represents the features or words in the email message. V depicts the values or word frequencies of some words contained in the email message. C depicts the labels, which are either spam or ham.

Table 1. Summary of previous reviews in email spam filtering.

Table 2. Publicly available email spam corpus.

... on the detection of spam messages solely. In a real world environment where there is nothing like zero probability of wrongly categorizing a ham message, it is required that a compromise be reached between the two kinds of errors, depending on the predisposition of the user and the performance indicators used. The formulae for calculating the classification accuracy and classification error are depicted in Eqs. (1) and (2) below. Spam filters with a drastically reduced FPR and FNR are said to have a better performance. These standard characteristics (FNR and FPR) represent the efficiency of filters that directly aim at the classification decision borderline without generating the probability estimate. On the other hand, the efficiency of filters that explicitly estimate the group conditional probabilities and then execute classification based on estimated probabilities can be represented by a curve called the ROC (Receiver Operating Characteristics) curve. The ROC curve is a graphical plot that demonstrates the analytical capability of a spam filter as its bias level is modified [48]. The ROC curve is generated by plotting the true positive rate (TPR) against the false positive rate (FPR) at different threshold settings [49]. The true positive rate is referred to as sensitivity, recall or probability of detection [49] in machine learning. The false-positive rate is referred to as the fall-out or likelihood of false alarm. This is computed by subtracting the value of the specificity from 1 (i.e. 1 - specificity). ROC testing is an outstanding standard of performance measure in spam filtering [48]. When the ROC curve of a spam filter closely sits on top of another, such a filter can be classified as a filter with superior performance in all implementation setups [20]. The two metrics imported from the field of information retrieval, ‘recall’ and ‘precision’, are respectively utilised for obtaining the efficiency and characteristic of spam filters [50].
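To make the ROC description above concrete, here is a minimal sketch that sweeps a decision threshold over classifier scores and records one (FPR, TPR) point per threshold, plus precision and recall at a single threshold. The function names and the convention that a higher score means "more spam-like" are assumptions for the example; it also assumes both classes are present in the labels.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points, one per distinct threshold, starting at (0, 0).

    scores: higher means more spam-like; labels: 1 for spam, 0 for ham.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        # TPR is recall (sensitivity); FPR equals 1 - specificity.
        points.append((fp / negatives, tp / positives))
    return points


def precision_recall(scores, labels, threshold):
    """Precision and recall when messages scoring >= threshold are flagged as spam."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return tp / (tp + fp), tp / (tp + fn)
```

Lowering the threshold moves along the curve from (0, 0) towards (1, 1): more spam is caught, at the price of more ham being flagged.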

Table 3. Levels of cost sensitivity of model.

Algorithm 1 kNN Algorithm for Spam Email Classification. In [58], the steps involved in a simple kNN algorithm for filtering spam mails are described in the algorithm below. Here Neighbours(d) returns the k nearest neighbours of d, Closest(d, t) returns the closest elements of t in d, and testClass(S) returns the class label of S. A simple kNN algorithm for spam email classification is in the algorithm below:
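The algorithm listing itself is not reproduced on this page, so the following is only a generic kNN sketch in the same spirit: find the k training messages closest to the query and return the majority label. Euclidean distance over numeric feature vectors and the default k of 3 are assumptions, and `knn_classify` is a hypothetical name rather than one of the functions named in [58].

```python
import math
from collections import Counter


def knn_classify(query, training, k=3):
    """Classify a feature vector by majority vote among its k nearest neighbours.

    training: list of (feature_vector, label) pairs.
    """
    # Sort training examples by Euclidean distance to the query and keep k.
    neighbours = sorted(training, key=lambda item: math.dist(query, item[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]
```

Because all work happens at query time, adding newly labelled messages to `training` is enough to update the filter; no retraining step is needed.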

Algorithm 2 Naive Bayes Classification Algorithm for Email Spam Classification. The message is classified as spam if the total spamminess product S[M] is greater than the hamminess product H[M]. The above description in [63] is used in the Naive Bayes classification algorithm for email spam classification depicted below:
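The decision rule above (spam if the spamminess product exceeds the hamminess product) can be sketched as follows. Several details are assumptions rather than the method of [63]: each product is taken to be the class prior times the per-token likelihoods, add-one (Laplace) smoothing is used, tokens unseen in training are ignored, and the comparison is done on sums of logarithms, which orders the two products identically while avoiding numeric underflow.

```python
import math
from collections import Counter


def train(messages):
    """Build a model from (tokens, label) pairs, with label 'spam' or 'ham'."""
    counts = {"spam": Counter(), "ham": Counter()}
    n_msgs = Counter()
    for tokens, label in messages:
        counts[label].update(tokens)
        n_msgs[label] += 1
    vocab = set(counts["spam"]) | set(counts["ham"])
    return counts, n_msgs, vocab


def log_score(tokens, label, model):
    """Log of (prior x product of smoothed token likelihoods) for one class."""
    counts, n_msgs, vocab = model
    total = sum(counts[label].values())
    score = math.log(n_msgs[label] / sum(n_msgs.values()))
    for t in tokens:
        if t in vocab:  # tokens never seen in training are skipped
            score += math.log((counts[label][t] + 1) / (total + len(vocab)))
    return score


def classify(tokens, model):
    """Return 'spam' only if the spam score strictly exceeds the ham score."""
    spamminess = log_score(tokens, "spam", model)
    hamminess = log_score(tokens, "ham", model)
    return "spam" if spamminess > hamminess else "ham"
```

Ties are resolved in favour of ham here, which is a conservative choice given that misclassifying legitimate mail is usually the costlier error.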

Algorithm 5 Email spam classification algorithm using Rough Set

Algorithm 6 Support Vector Machine (SVM) algorithm

Algorithm 7 Decision Tree algorithm for Spam Filtering. By partitioning the email dataset in relation to least entropy, the resultant email dataset has the highest information gain and so impurity ... The decision tree algorithm for classifying email messages using the entropy algorithm is presented below:
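The entropy-based split selection mentioned above can be illustrated with a short sketch that computes the Shannon entropy of a label set and the information gain of splitting on one feature. The dictionary-based row format and the helper names are assumptions for the example; this shows only the split criterion, not a full tree-building procedure.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, labels, feature):
    """Entropy reduction obtained by partitioning the data on one feature.

    rows: list of dicts mapping feature name to value.
    """
    n = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[feature], []).append(label)
    # Weighted average entropy of the partitions after the split.
    remainder = sum(len(part) / n * entropy(part) for part in partitions.values())
    return entropy(labels) - remainder


def best_feature(rows, labels):
    """Feature whose split leaves the least entropy (highest information gain)."""
    return max(rows[0].keys(), key=lambda f: information_gain(rows, labels, f))
```

A tree learner would apply `best_feature` recursively to each partition until the labels in a node are pure (entropy zero) or no features remain.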

Algorithm 8: AdaBoost algorithm for email spam classification (adapted from [127])

Boosting is centered on the theory of combining several weak hypotheses; a very good example is the AdaBoost system. The objective of boosting is to obtain a very accurate classification rule by amalgamating several weak rules, or weak hypotheses, each of which may be only moderately accurate. A learner is trained in every phase of the classification process, and the result of each phase is used to weight the data for the upcoming phases [87]. AdaBoost, proposed by [88], is the most popular boosting algorithm. It can produce a good output even when the performance of the individual weak learners is unsatisfactory. Boosting is now applied in classification, regression, face recognition and other fields. Boosting algorithms that use confidence-rated predictions are being applied to the spam filtering problem, and the literature has shown that they can produce classification results better than those of Bayesian and decision tree approaches [87]. AdaBoost has become a widely accepted machine learning algorithm because of its strong performance on classification problems. Some statisticians believe that AdaBoost is related to likelihood maximisation in logistic regression [89]. According to Rob Schapire, the widespread use of AdaBoost stems from the advantages the approach has over other learning algorithms: it is fast; the algorithm is straightforward and easy to program; it has no parameters to tune (except the number of rounds T), which makes it less cumbersome; it is adaptable and combines well with any learning algorithm; and it requires no prior knowledge about the weak learner. It is verifiably efficient, provided it can always locate rough rules of thumb. The algorithm is also versatile: it can be used with textual, numeric or discrete data, and it has been extended to learning problems beyond binary classification. The AdaBoost algorithm for detecting spam email is shown in Algorithm 8 below:
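As a rough illustration of the boosting idea described above (and not a reproduction of the paper's Algorithm 8), the sketch below implements the standard discrete AdaBoost update with one-feature decision stumps as weak learners. The three binary word features and the labelling rule (spam when at least two trigger words occur) are hypothetical choices made so that no single stump is perfect while the combination is.

```python
import math
from itertools import product


def stump_predict(x, j, polarity):
    """Weak rule: predict `polarity` if feature j is present, else the opposite."""
    return polarity if x[j] else -polarity


def train_adaboost(X, y, rounds):
    """X: binary feature vectors, y: labels in {-1, +1}, rounds: T."""
    n = len(X)
    weights = [1.0 / n] * n
    ensemble = []  # list of (alpha, feature index, polarity)
    for _ in range(rounds):
        # Pick the stump with the lowest weighted error.
        best = None
        for j in range(len(X[0])):
            for polarity in (1, -1):
                err = sum(w for w, x, t in zip(weights, X, y)
                          if stump_predict(x, j, polarity) != t)
                if best is None or err < best[0]:
                    best = (err, j, polarity)
        err, j, polarity = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, j, polarity))
        # Increase the weight of misclassified examples, then renormalise.
        weights = [w * math.exp(-alpha * t * stump_predict(x, j, polarity))
                   for w, x, t in zip(weights, X, y)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(a * stump_predict(x, j, p) for a, j, p in ensemble)
    return 1 if score >= 0 else -1


# Hypothetical toy set: features = presence of three trigger words;
# an email is spam (+1) when at least two of them occur.
X = [list(bits) for bits in product([0, 1], repeat=3)]
y = [1 if sum(x) >= 2 else -1 for x in X]


def accuracy(ensemble):
    return sum(predict(ensemble, x) == t for x, t in zip(X, y)) / len(X)


print(accuracy(train_adaboost(X, y, rounds=1)))
print(accuracy(train_adaboost(X, y, rounds=3)))
```

On this toy set a single stump misclassifies two of the eight emails, while three rounds of boosting classify all of them correctly, which is the sense in which several weak rules are combined into an accurate one.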

Algorithm 9 Random Forests Algorithm for Email Classification

Algorithm 10 Convolutional Neural Networks for Email Classification

Table 4: Summary of published papers that attempted spam filtering using Machine Learning techniques.


A case-based approach to spam filtering that can track concept drift

by Padraig Cunningham

2003, The ICCBR

There are a few key benefits of a case-based approach to spam filtering. First, the many different sub-types of spam suggest that a local learner, such as Case-Based Reasoning (CBR) will perform well. Second, the lazy approach to learning... more


An Economic Response to Unsolicited Communication

by Marshall Van Alstyne and

2000, Advances in Economic Analysis & Policy

If communication involves some transactions cost to both sender and recipient, what policy ensures that correct messages (those with positive social surplus) get sent? Filters block messages that harm recipients but benefit senders by... more


Filtering Image Spam with Near-Duplicate Detection

by Zhe Wang

2007

A new trend in email spam is the emergence of image spam. Although current anti-spam technologies are quite successful in filtering text-based spam emails, the new image spams are substantially more difficult to detect, as they employ a... more


Hybrid categorical expert system for use in content aggregation

by Denis Kiryanov

2021, Hybrid categorical expert system for use in content aggregation

The subject of this research is the development of the architecture of expert system for distributed content aggregation system, the main purpose of which is the categorization of aggregated data. The author examines the advantages and... more

interface. The general architecture of an expert system is shown in Figure 1 Lt, p./3] The high-level architecture of an expert system which is shown in Figure 1 can be explained

The high-level architecture of the proposed system is shown in Figure 2.

Figure 2: Architecture of the expert system for aggregated content categorization

The Pre-processor's architecture is shown in Figure 3. As Figure 3 shows, the Pre-processor module consists of separate applications that perform HTML markup removal, stop words removal, stemming [103], lemmatization, lowercasing, punctuation marks removal, and keyword extraction using the term frequency-inverse document frequency (TF-IDF) algorithm [104].
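A minimal sketch of such a pre-processing chain is given here, assuming a simple regex-based tag stripper, a tiny hand-written stop-word list, and one common TF-IDF variant (relative term frequency times log(N/df)). Stemming and lemmatization are left out, and none of the names below come from the paper.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the paper's list).
STOP_WORDS = {"the", "a", "an", "and", "or", "to", "of", "is", "in", "for"}


def preprocess(html):
    """HTML markup removal, lowercasing, punctuation and stop-word removal."""
    text = re.sub(r"<[^>]+>", " ", html)   # strip HTML tags
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)   # drop punctuation marks
    return [tok for tok in text.split() if tok not in STOP_WORDS]


def tfidf(docs):
    """Per-document TF-IDF scores; docs is a list of token lists."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({t: (c / len(doc)) * math.log(n / df[t])
                       for t, c in counts.items()})
    return scores


def top_keywords(score, k):
    """The k highest-scoring terms, ties broken alphabetically."""
    ranked = sorted(score.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


raw = [
    "<p>Win a FREE prize now!</p>",
    "<div>The meeting is moved to Monday.</div>",
    "<b>Free</b> tickets for the meeting",
]
docs = [preprocess(r) for r in raw]
scores = tfidf(docs)
print(docs[0])
print(top_keywords(scores[1], 3))
```

Terms that occur in fewer documents (such as `prize`) score higher than terms shared across documents (such as `free`), which is why TF-IDF is a reasonable basis for keyword extraction.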

To identify spamming behaviors, the proposed approach is to organize news, comments, blogs, and other aggregated content according to their keywords, tags, date of creation, information about the author, external links, descriptions of images, etc., and to represent it in vector form for further use by the backpropagation neural-network architecture described in [108].


Applying lazy learning algorithms to tackle concept drift in spam filtering

by Fernando Díaz

2007, Expert Systems with …

A great amount of machine learning techniques have been applied to problems where data is collected over an extended period of time. However, the disadvantage with many real-world applications is that the distribution underlying the data... more


A Review on Mobile SMS Spam Filtering Techniques

by Shafi'i Muhammad ABDULHAMID

Short messaging service (SMS) spam refers to unsolicited or undesired messages received on mobile phones. These SMS spam messages constitute a veritable nuisance to mobile subscribers. This marketing practice also worries... more

study. Data extraction was done on the sorted papers and subsequently tabulated and Figure 2 is created. The articles that were returned and identified from the online databases amounted to a total of 1,923. Figure 1: Online academic database used for searching the literature

Figure 2: Search procedure and study selection diagram

3.2. Performance Metrics

In order to evaluate the accuracy of the mobile SMS spam filtering techniques, certain performance evaluation metrics were applied to the selected papers. The following parameters were found to be prominent:

Figure 3: Performance Metric

normal arbitrary prediction; and -1 signifies an inverse prediction [28].

Figure 4: Architecture of the SMS spam transmission line

based on chi-square (CHI2) and information gain (IG) methods, where the number of selected features ranges from 1 to 100% of the entire BoW features. The experimental results and analysis on the relevant test sets show that the mixture of BoW and SFs (instead of BoW features alone) allows for more effective and precise classification performance on both test sets. It is also found that the efficiency of the feature selection processes varies in each language.

Table 1 presents the summary of the review papers. The first, second, third, fourth, fifth and sixth columns represent serial number (S/N), references, the method or techniques proposed by the researchers, description of the data set used, method or technique used for evaluation of the proposed methods or techniques, and major findings or contribution of the study, respectively. Mathew and Issac [37] compared the variety of intelligent Bayesian classifiers with other classifier techniques for mobile spam filtering in mobile SMS. The WEKA does not read strings and therefore all strings are converted into data in Figure 5: Taxonomy of mobile SMS spamming techniques

Table 2: Spam SMS research datasets It provides easy access to credible sources of datasets for the benefit of the researchers. The table contains four columns representing the serial number, name of the dataset, URL address and reference respectively. The different types of the mobile SMS spam datasets are described as follows:


İstenmeyen E-postaların Tespiti için Kullanılan Yöntemlerin İncelenmesi Review of the Methods Used for the Detection of Spam

by ersin enes eryılmaz

2020

Spam e-mails are messages sent to recipients without their consent, generally by people with malicious or promotional purposes. Because e-mail is easy to use and cheap, it is actively used by people or groups who want to spread propaganda, advertise, or carry out phishing. To achieve their goals, they send unnecessary and unwanted messages to e-mail accounts they do not know. In this study, the methods in the literature for filtering spam e-mails are examined under two main headings: non-artificial intelligence-based and artificial intelligence-based. Non-artificial intelligence-based methods give effective results in detecting spam, but the literature contains techniques that can bypass them. Systems that use artificial intelligence-based machine learning algorithms to detect spam have increased in popularity, and research has gained momentum in this direction. Deep learning methods in particular have come to be preferred for spam detection because of their high performance. In the literature, spam detection methods that use classical machine learning algorithms such as Bayes, Support Vector Machine, Artificial Neural Network, Random Forest, Multilayer Perceptron, and K-Nearest Neighbour are seen to achieve high performance. It has been demonstrated on different datasets that deep learning-based spam detection methods using Long Short Term Memory and Convolutional Neural Network algorithms increase the performance rates further. In addition, open problems in spam detection systems and the current state of such studies for Turkish are examined, and various suggestions are made.


Generating Estimates of Classification Confidence for a Case-Based Spam Filter

by Padraig Cunningham

2005, Case-Based Reasoning …

Producing estimates of classification confidence is surprisingly difficult. One might expect that classifiers that can produce numeric classification scores (e.g. k-Nearest Neighbour or Naive Bayes) could readily produce confidence... more


SpamHunting: An instance-based reasoning system for spam labelling and filtering

by Fernando Díaz

2007, Decision Support …

In this paper we show an instance-based reasoning e-mail filtering model that outperforms classical machine learning techniques and other successful lazy learners approaches in the domain of anti-spam filtering. The architecture of the... more


Feature engineering for mobile (SMS) spam filtering

by Enrique Puertas

2007, Proceedings of the 30th …

Mobile spam is an increasing threat that may be addressed using filtering systems like those employed against email spam. We believe that email filtering techniques require some adaptation to reach good levels of performance on SMS spam,... more


A methodology based on Deep Learning for advert value calculation in CPM, CPC and CPA networks

by Dafne Rosso Pelayo, PhD and

In this research, we propose a methodology for advert value calculation in CPM, CPC and CPA networks. Accurately estimating this value increases the three previous networks’ incomes by selecting the most profitable advert. By increasing... more


Misleading Learners: Co-opting Your Spam Filter

by Benjamin Rubinstein

2009, Machine Learning in Cyber Trust

Using statistical machine learning for making security decisions introduces new vulnerabilities in large scale systems. We show how an adversary can exploit statistical machine learning, as used in the SpamBayes spam filter, to render it... more


SMSAssassin: Crowdsourcing Driven Mobile-based System for SMS Spam Filtering

by Kuldeep Yadav

ACM HotMobile 2011

Due to increase in use of Short Message Service (SMS) over mobile phones in developing countries, there has been a burst of spam SMSes. Content-based machine learning approaches were effective in filtering email spams. Researchers have... more

Figure 1: Tag cloud generated from the spam and ham that we collected. Left: Shows the tag cloud for spam SMSes; we see the occurrence of words like get, free, noida, apply, bhk. Right: Shows the tag cloud for ham SMSes; most of the words here are regional and not English.

Figure 2: System Architecture of SMSAssassin.

In the same way, the SenderBlacklist and GlobalSenderBlacklist lists are used to detect spam based on senders' addresses (phone numbers).

Figure 3: Snapshots of the SMSAssassin mobile application running on a Nokia 5800 phone in the PyS60 environment. The application has two tabs: Inbox and Spam. The user is able to report any misclassified SMS as spam or ham.


An Assessment of Case-Based Reasoning for Spam Filtering

by Padraig Cunningham

2005, Artificial Intelligence Review

Because of the changing nature of spam, a spam filtering system that uses machine learning will need to be dynamic. This suggests that a case-based (memory-based) approach may work well. Case-Based Reasoning (CBR) is a lazy approach to... more


Web spam classification

by András Garzó and

2011, Proceedings of the 2011 Joint WICOW/AIRWeb Workshop on Web Quality - WebQuality '11

In this paper we investigate how much various classes of Web spam features, some requiring very high computational effort, add to the classification accuracy. We realize that advances in machine learning, an area that has received less... more


Enhanced Topic-based Vector Space Model for Semantics-aware Spam Filtering

by Carlos Laorden

2012, Expert Systems With Applications

Spam has become a major issue in computer security because it is a channel for threats such as computer viruses, worms and phishing. More than 85% of received e-mails are spam. Historical approaches to combat these messages including... more


A survey and experimental evaluation of image spam filtering techniques

by Battista Biggio and

2011, Pattern Recognition Letters

In their arms race against developers of spam filters, spammers have recently introduced the image spam trick to make the analysis of emails' body text ineffective. It consists in embedding the spam message into an attached image, which... more

Figure 1: Examples of real spam images taken from the authors’ mailboxes, and publicly available: clean (top) and obfuscated (middle, bottom) images.

Figure 2: Average ROC curves attained on the three data sets by the five considered techniques for image spam detection.

Figure 3: Examples of legitimate images which were correctly classified only by Image Cerberus.

Figure 4: Average ROC curves obtained on the three data sets by combining the three image classification techniques and the two OCR-based plug-ins. The ROC curves of the individual techniques are also reported, for an easier comparison.

Table 2: Number of images in the three data sets used in our experiments.


An Empirical Performance Comparison of Machine Learning Methods for Spam E-Mail Categorization

by Chih-Chin Lai

2004

The increasing volume of unsolicited bulk e-mail (also known as spam) has generated a need for reliable anti-spam filters. Using a classifier based on machine learning techniques to automatically filter out spam e-mail has drawn many... more


A Comparative Performance Study of Feature Selection Methods for the Anti-spam Filtering Domain

by Fernando Díaz

2006, Industrial Conference on Data Mining

In this paper we analyse the strengths and weaknesses of the mainly used feature selection methods in text categorization when they are applied to the spam problem domain. Several experiments with different feature selection methods and... more


Adversarial Pattern Classification Using Multiple Classifiers and Randomisation

by Battista Biggio

2008, Lecture Notes in Computer Science

In many security applications a pattern recognition system faces an adversarial classification problem, in which an intelligent, adaptive adversary modifies patterns to evade the classifier. Several strategies have been recently proposed... more


Combining text and heuristics for cost-sensitive spam filtering

by Enrique Puertas

2000, Proceedings of the 2nd …

Spam filtering is a text categorization task that shows especial features that make it interesting and difficult. First, the task has been performed traditionally using heuristics from the domain. Second, a cost model is required to avoid... more


A new feature selection algorithm based on binomial hypothesis testing for spam filtering

by Spam Test

2011, Knowledge Based Systems

Content-based spam filtering is a binary text categorization problem. To improve the performance of the spam filtering, feature selection, as an important and indispensable means of text categorization, also plays an important role in... more


Is Britney Spears Spam?

by Aaron Zinman

2007, Fourth Conference on Email and Anti-Spam, …

We seek to redefine spam and the role of the spam filter in the context of Social Networking Services (SNS). SNS, such as MySpace and Facebook, are increasing in popularity. They enable and encourage users to communicate with previously... more


Image spam hunter

by Amir Choudhary

2008

Spammers are constantly creating sophisticated new weapons in their arms race with anti-spam technology, the latest of which is image-based spam. The newest image-based spam uses simple image processing technologies to vary the content of... more

Fig. 1. Sample spam images: image size changes and rotation (1st row), artifacts in the background and images with icons (2nd row).

In this paper, we propose a learning-based prototype system, Image Spam Hunter, as shown in Fig. 2, to differentiate spam images from normal image attachments. We first cluster the collected disordered spam images into groups based on image similarity measurement on global color and gradient orientation histograms [6]. The training dataset is chosen from the clustered groups. We then build a probabilistic boosting tree (PBT) [7] based on the training dataset to distinguish image spams from good emails with image attachments. Image Spam Hunter learns to distinguish spam from ham images without need for performing OCR on the image, and is robust in the face of the kinds of random variation that exist in current spam images. The proposed method achieves 0.86% false positive rates versus 89.44% true positive rates in 5-fold cross-validation.

Fig. 3. Color histograms comparison between natural images and spam images in 32 x 32 2D normalized RG plane.

We consider two cues, color and gradient orientation histograms, as the features for classification. The observation is that most spam images are converted from text spams, although they may contain some icons and artifacts. Thus, the range of color components in a typical spam image is quite limited compared with a natural scene. As shown in Fig. 3, the color histograms of natural scenes tend to be continuous, while the color histograms of artificial spam images tend to have some isolated peaks. Another observation is that the distribution of gradient orientation may reveal the characteristics of texts. Fig. 4 illustrates the comparison of 1D histograms of gradient orientation of spam and natural images. The distributions of gradient orientation for natural images appear more uniform and noisy than those of spam images. Gradient orientation histograms are particularly effective for dealing with gray-level images.
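The two cues can be sketched roughly as follows. The excerpt does not give the authors' exact binning, gradient operator, or normalisation, so the central-difference gradient, the magnitude weighting, the 8 orientation bins, and the function names below are all assumptions; only the 32 x 32 normalized RG plane comes from the text.

```python
import math


def rg_histogram(pixels, bins=32):
    """2D histogram over normalized (r, g) chromaticity.

    pixels: iterable of (R, G, B) tuples; r = R/(R+G+B), g = G/(R+G+B).
    """
    hist = [[0.0] * bins for _ in range(bins)]
    counted = 0
    for R, G, B in pixels:
        s = R + G + B
        if s == 0:
            continue  # pure black: chromaticity is undefined, skip it
        i = min(int(R / s * bins), bins - 1)
        j = min(int(G / s * bins), bins - 1)
        hist[i][j] += 1
        counted += 1
    if counted:
        hist = [[v / counted for v in row] for row in hist]
    return hist


def gradient_orientation_histogram(gray, bins=8):
    """1D histogram of gradient directions, weighted by gradient magnitude.

    gray: 2D list of intensities; border pixels are ignored.
    """
    hist = [0.0] * bins
    for y in range(1, len(gray) - 1):
        for x in range(1, len(gray[0]) - 1):
            gx = gray[y][x + 1] - gray[y][x - 1]
            gy = gray[y + 1][x] - gray[y - 1][x]
            mag = math.hypot(gx, gy)
            if mag == 0:
                continue
            angle = math.atan2(gy, gx) % (2 * math.pi)
            idx = min(int(angle / (2 * math.pi) * bins), bins - 1)
            hist[idx] += mag
    total = sum(hist)
    return [v / total for v in hist] if total else hist


# Hypothetical text-like image: white background with a little red text.
pixels = [(255, 255, 255)] * 90 + [(255, 0, 0)] * 10
color_hist = rg_histogram(pixels)
occupied = sum(1 for row in color_hist for v in row if v > 0)
print(occupied)

# Hypothetical gray image containing one vertical edge.
edge = [[0, 0, 255, 255] for _ in range(4)]
print(gradient_orientation_histogram(edge))
```

The text-like toy image occupies only two histogram cells, matching the observation that artificial spam images produce isolated peaks, whereas a natural scene would spread its mass over many cells.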

Fig. 4. Gradient orientation histograms comparison between natural images and spam images.

Table 1. Comparison of 5-fold cross-validation performance of different D dimensional vectors (6 = 0).

Fig. 6. ROC curves for 64D feature vectors using PBT and SVM classifiers, respectively.

performance gain over the SVM for this task. It achieves 89.44% detection rate at the FP rate of 0.86%, while SVM only achieves approximately 80% detection rate at the same FP rate, since the PBT tries to solve this extremely hard classification problem gradually. This preliminary result seems quite positive and acceptable for real email systems. Our approach tests one image within 0.4s on average on a Pentium 3G desktop.


Spam Filtering using a Markov Random Field Model with Variable Weighting Schemas

by Christian Siefkes

2004, Fourth IEEE International Conference on Data Mining (ICDM'04)

In this paper we present a Markov Random Field model based approach to filter spam. Our approach examines the importance of the neighborhood relationship (MRF cliques) among words in an email message for the purpose of spam... more


An Analysis of Case-Base Editing In a Spam Filtering System

by Padraig Cunningham

2004, Advances in Case-Based Reasoning

Because of the volume of spam email and its evolving nature, any deployed Machine Learning-based spam filtering system will need to have procedures for case-base maintenance. Key to this will be procedures to edit the case-base to remove... more


Spam Filter Analysis

by Flávio Garcia

2004, Computing Research Repository

Unsolicited bulk email (aka. spam) is a major problem on the Internet. To counter spam, several techniques, ranging from spam filters to mail protocol extensions like hashcash, have been proposed. In this paper we investigate the... more


Leveraging Social Networks For Effective Spam Filtering

by Subiya Nadar

The explosive growth of unsolicited emails has prompted the development of numerous spam filter techniques. Bayesian spam filters are superior to static keyword-based spam filters in that they can continuously evolve to tackle new spam by... more


Image Spam Filtering Using Visual Information

by Battista Biggio and

2007, 14th International Conference on Image Analysis and Processing (ICIAP 2007)

We address the problem of recognizing the so-called image spam, which consists in embedding the spam message into attached images to defeat techniques based on the analysis of e-mails' body text, and in using content obscuring techniques... more

