Web spam classification
2011, Proceedings of the 2011 Joint WICOW/AIRWeb Workshop on Web Quality - WebQuality '11
https://doi.org/10.1145/1964114.1964121
Abstract
In this paper we investigate how much various classes of Web spam features, some requiring very high computational effort, add to classification accuracy. We find that advances in machine learning, an area that has received less attention in the adversarial IR community, yield more improvement than new features and result in low-cost yet accurate spam filters. Our original contributions are as follows: