An Improved Framework for Content-based Spamdexing Detection
Abstract
Spamdexing is one of the biggest threats facing modern Search Engines (SEs). Spammers now use a wide range of content-generation techniques, and they use content spam to fill Search Engine Result Pages (SERPs) with low-quality web pages. Spam web pages generally give users insufficient, irrelevant, and improper results. Many researchers in academia and industry are working on identifying spam web pages. However, no single method has yet been developed that efficiently identifies all spam web pages. We believe that tackling content spam requires improved methods. This article is an attempt in that direction: it proposes a framework for identifying spam web pages. The framework uses Stop words, Keywords Density, Spam Keywords Database, Part of Speech (POS) ratio, and Copied Content algorithms. The WEBSPAM-UK2006 and WEBSPAM-UK2007 datasets were used to conduct the experiments and obtain threshold values. A promising F-measure of 77.38% illustrates the effectiveness and applicability of the proposed method.
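As a rough illustration of two of the content signals named in the abstract, the sketch below computes a stop-word ratio and a keyword density for a page's text, along with the standard F-measure formula. It is a minimal sketch under stated assumptions, not the authors' implementation: the stop-word list, the tokenizer, and the function names `stop_word_ratio`, `keyword_density`, and `f_measure` are all hypothetical, and the abstract does not state the threshold values derived from the datasets.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the list used in the paper).
STOP_WORDS = frozenset(
    {"a", "an", "and", "are", "for", "in", "is", "of", "the", "to", "with"}
)


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def stop_word_ratio(text):
    """Fraction of tokens that are stop words (0.0 for empty text)."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return sum(t in STOP_WORDS for t in tokens) / len(tokens)


def keyword_density(text, top_n=1):
    """Return (keyword, density) pairs for the top_n most frequent
    non-stop-word tokens, where density = count / total token count."""
    tokens = tokenize(text)
    if not tokens:
        return []
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    return [(word, count / len(tokens)) for word, count in counts.most_common(top_n)]


def f_measure(precision, recall):
    """Harmonic mean of precision and recall (the F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Natural prose tends to contain a substantial share of stop words, while keyword-stuffed text tends to show a low stop-word ratio and a high density for a few terms. A detector built on these signals would compare each value against a threshold tuned on labelled data such as WEBSPAM-UK2006 and WEBSPAM-UK2007.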