Academia.eduAcademia.edu

Publicly available email spam corpus.  Table 2  on the detection of spam messages solely. In a real world environment where there is nothing like zero probability of wrongly categorizing < ham message, it is required that a compromise be reached between the two kinds of errors, depending on the predisposition of user and the performance indicators used. The formulae for calculating the classifi cation accuracy and classification error are depicted in Eqs. (1) and (2. below:  Spam filters with a drastically reduced FPR and FNR are said to have a better performance. These standard characteristics (FNR and FPR) rep- resents the efficiency of filters that directly aim at the classification de- cision borderline devoid of generating the probability estimate. On the other hand, the efficiency of filters that explicitly estimate the group conditional probabilities and then execute classification based on esti- mated probabilities can be represented by a curve called ROC (Receiver Operating Characteristics) curve. ROC curve, is a graphical plot that demonstrates the analytical capability of a spam filter as its bias level is modified [48]. The ROC curve is generated by plotting the true positive rate (TPR) against the false positive rate (FPR) at different threshold settings [49]. The true positive rate is referred to as sensitivity, recall or probability of detection [49] in machine learning. The false-positive rate is referred to as the squabble or likelihood of false alarm. This is computed by subtracting the value of the specificity from 1 (ie. 1 - specificity). ROC testing are an outstanding standard of performance measure in spam filtering [48]. When the ROC curve of a spam filter closely sits on top of another, such filter can be classified a filter with superior performance in all implementation setups [20]. The two metrics imported from the field of information retrieval ‘recall’ and ‘precision’ are respectively utilised for obtaining the efficiency and characteristic of spam filters [50].

Table 2 Publicly available email spam corpus. Table 2 on the detection of spam messages solely. In a real world environment where there is nothing like zero probability of wrongly categorizing < ham message, it is required that a compromise be reached between the two kinds of errors, depending on the predisposition of user and the performance indicators used. The formulae for calculating the classifi cation accuracy and classification error are depicted in Eqs. (1) and (2. below: Spam filters with a drastically reduced FPR and FNR are said to have a better performance. These standard characteristics (FNR and FPR) rep- resents the efficiency of filters that directly aim at the classification de- cision borderline devoid of generating the probability estimate. On the other hand, the efficiency of filters that explicitly estimate the group conditional probabilities and then execute classification based on esti- mated probabilities can be represented by a curve called ROC (Receiver Operating Characteristics) curve. ROC curve, is a graphical plot that demonstrates the analytical capability of a spam filter as its bias level is modified [48]. The ROC curve is generated by plotting the true positive rate (TPR) against the false positive rate (FPR) at different threshold settings [49]. The true positive rate is referred to as sensitivity, recall or probability of detection [49] in machine learning. The false-positive rate is referred to as the squabble or likelihood of false alarm. This is computed by subtracting the value of the specificity from 1 (ie. 1 - specificity). ROC testing are an outstanding standard of performance measure in spam filtering [48]. When the ROC curve of a spam filter closely sits on top of another, such filter can be classified a filter with superior performance in all implementation setups [20]. The two metrics imported from the field of information retrieval ‘recall’ and ‘precision’ are respectively utilised for obtaining the efficiency and characteristic of spam filters [50].