SMS Spam Detection Using Machine Learning: An Experimental Study
2025, International Journal of Emerging Trends in Engineering Research
https://doi.org/10.30534/IJETER/2025/011372025
Abstract
The exponential growth of mobile communication has intensified the threat of SMS spam, compromising user security and trust in messaging platforms. This study addresses this challenge by designing and deploying a robust spam detection system using machine learning. We analyze a publicly available SMS dataset through rigorous pre-processing, including text normalization, tokenization, and feature engineering, followed by TF-IDF vectorization. A comparative evaluation of 11 classifiers, spanning probabilistic models, ensemble methods, and linear classifiers, reveals that ensemble techniques outperform traditional algorithms. The Extra Trees Classifier and XGBoost achieve state-of-the-art results, with 97.9% accuracy and 97.5% precision, demonstrating their efficacy in distinguishing spam from legitimate messages. To bridge the gap between research and practical application, we develop an interactive Streamlit web application that enables real-time spam classification with a user-friendly interface. This work underscores the potential of ensemble learning for text classification tasks and provides a scalable framework for combating SMS spam in real-world scenarios.
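As a rough illustration of the TF-IDF vectorization step mentioned in the abstract, the sketch below computes the textbook weighting tf(t, d) · log(N / df(t)) over already-tokenized messages. The abstract does not say which TF-IDF variant or library settings were used, so the formula choice, the function name `tfidf`, and the toy messages are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document.

    docs is a list of token lists. Weight = (term count / doc length)
    * log(N / document frequency). This is the plain textbook variant,
    assumed here; library implementations often add smoothing and
    normalization on top of it.
    """
    n_docs = len(docs)

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Toy, made-up messages (not taken from the study's dataset).
toy_docs = [["free", "prize", "call"], ["call", "me", "later"]]
toy_weights = tfidf(toy_docs)
```

In this toy example, "call" appears in both messages and so receives a weight of zero, while "free" and "prize" keep a positive weight in the first message. That is the property that makes TF-IDF useful for highlighting spam-specific vocabulary.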
FAQs
What machine learning algorithms performed best in SMS spam detection?
The study reveals that Extra Trees Classifier and XGBoost achieved the highest accuracy of 97.87%.
How was the effectiveness of the algorithms evaluated in the study?
The performance was rigorously assessed using metrics like accuracy, precision, recall, and F1-score.
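For readers unfamiliar with these four metrics, the following minimal sketch computes them from true and predicted labels, treating "spam" as the positive class. The function name `classification_metrics`, the label strings, and the toy labels are assumptions for illustration; the study's own evaluation code is not shown in this summary.

```python
def classification_metrics(y_true, y_pred, positive="spam"):
    """Compute accuracy, precision, recall, and F1 for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    # Guard against division by zero when no positives are predicted/present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels, made up for illustration only.
y_true = ["spam", "spam", "ham", "ham", "ham"]
y_pred = ["spam", "ham", "ham", "ham", "spam"]
metrics = classification_metrics(y_true, y_pred)
```

Precision matters particularly for spam filtering because a false positive means a legitimate message is flagged as spam, which is usually the costlier mistake for the user.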
What challenges exist with the current SMS spam detection dataset?
The dataset is limited to English messages, and imbalances in spam types may affect performance.
What preprocessing steps were taken in the SMS data handling?
Preprocessing included lowercasing, tokenization, stopword removal, and feature extraction for improved model performance.
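The preprocessing steps listed above could be sketched roughly as follows. This is a simplified stand-in using only the Python standard library: the regular-expression tokenizer, the tiny `STOPWORDS` set, and the two length-based features are assumptions for illustration, since the exact tokenizer, stopword list, and engineered features used in the study are not stated here.

```python
import re
from typing import Dict, List

# Hypothetical, deliberately tiny stopword list; a real system would
# likely use a fuller list, such as the one distributed with NLTK.
STOPWORDS = {"a", "an", "the", "to", "is", "you", "your",
             "for", "and", "of", "in", "on"}


def preprocess(message: str) -> List[str]:
    """Lowercase, tokenize on alphanumeric runs, and drop stopwords."""
    lowered = message.lower()
    tokens = re.findall(r"[a-z0-9]+", lowered)
    return [tok for tok in tokens if tok not in STOPWORDS]


def length_features(message: str) -> Dict[str, int]:
    """Simple length-based features (assumed, not taken from the study)."""
    return {
        "num_chars": len(message),
        "num_words": len(message.split()),
    }


# Made-up example message.
example = "WINNER! You have won a FREE prize, call 0800 now"
tokens = preprocess(example)
features = length_features(example)
```

The resulting token lists would then be passed to a vectorizer such as TF-IDF, while the length features could be appended as extra columns if they prove helpful.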
How does the developed web application enhance user interaction?
The interactive web application allows users to classify SMS messages in real-time, demonstrating practical applicability.