Data Mining applied on Web Robots Detection: A Systematic Mapping
2021, Anais do 15. Congresso Brasileiro de Inteligência Computacional
https://doi.org/10.21528/CBIC2021-60
Abstract
Browsing the Internet is part of the daily routine of the world's population. The number of web pages is increasing, and so is the amount of published content (news, tutorials, images, videos) they provide. Search engines use web robots to index web content and offer better results to their users. However, web robots have also been used to exploit vulnerabilities in web pages. Monitoring and detecting web robot accesses is therefore important for keeping web servers as safe as possible. Data Mining methods have been applied to web server logs (used as the data source) in order to detect web robots. The main objective of this work was to find evidence of the definition or use of web robot detection through the analysis of server-side web logs using Data Mining methods. To that end, we conducted a systematic literature mapping of papers published between 2013 and 2020. In the systematic mapping, we analyzed 34 studies, which allowed us to better understand the area of web robots d...
FAQs
What are the most commonly used algorithms for detecting web robots?
The study identifies 33 machine learning algorithms, with SVM featured in 9 papers and Decision Trees in 8.
How does feature selection impact web robot detection performance?
The studies used feature sets of varying size, from 2 to 50 features, and this choice affected detection accuracy and efficacy.
What methodologies were employed in the systematic mapping of relevant studies?
The systematic mapping selected studies by applying inclusion/exclusion criteria, which yielded 34 relevant papers out of 2150.
What challenges exist in labeling sessions for web robot identification?
Manual labeling is impractical due to data volume, and heuristics may fail to detect all robotic behavior.
When did the significance of detecting web robots gain research attention?
Research on detecting web robots increased significantly after 2013, as their presence escalated.
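The FAQ answers above mention session labeling by heuristics and session-level feature sets. As a rough illustration only, the sketch below parses access-log lines, groups them into sessions, applies two simple labeling heuristics, and computes a few features. Everything specific here is an assumption rather than something taken from the mapped studies: the Apache "combined" log format, the 30-minute session timeout, the keyword list in BOT_AGENT_HINTS, and the four features are all hypothetical choices made for the example.

```python
import re
from datetime import datetime, timedelta

# Assumed input: Apache "combined" log format.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)
SESSION_TIMEOUT = timedelta(minutes=30)         # assumed idle threshold
BOT_AGENT_HINTS = ("bot", "crawler", "spider")  # hypothetical keyword list


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    m = LOG_RE.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    rec["ts"] = datetime.strptime(rec["ts"], "%d/%b/%Y:%H:%M:%S %z")
    return rec


def sessionize(records):
    """Group requests by (ip, agent); start a new session after a long idle gap."""
    sessions = []
    open_sessions = {}  # (ip, agent) -> session currently being extended
    for rec in sorted(records, key=lambda r: r["ts"]):
        key = (rec["ip"], rec["agent"])
        current = open_sessions.get(key)
        if current is None or rec["ts"] - current[-1]["ts"] > SESSION_TIMEOUT:
            current = []
            sessions.append(current)
            open_sessions[key] = current
        current.append(rec)
    return sessions


def heuristic_label(session):
    """Label a session 'robot' if it fetched robots.txt or self-identifies as a bot."""
    agent = session[0]["agent"].lower()
    if any(r["path"] == "/robots.txt" for r in session):
        return "robot"
    if any(hint in agent for hint in BOT_AGENT_HINTS):
        return "robot"
    # Deliberately not 'human': a robot that hides its identity passes both checks.
    return "unknown"


def features(session):
    """A few illustrative session features (not the set of any specific study)."""
    n = len(session)
    return {
        "requests": n,
        "head_ratio": sum(r["method"] == "HEAD" for r in session) / n,
        "error_ratio": sum(r["status"].startswith("4") for r in session) / n,
        "empty_referrer_ratio": sum(r["referrer"] in ("", "-") for r in session) / n,
    }
```

The "unknown" label reflects the limitation noted in the FAQ: heuristics of this kind can confirm some robots but cannot prove that the remaining sessions are human, so the unlabeled remainder is what a classifier or clustering method would have to handle.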
References (74)
- [16] Identification and characterization of crawlers through analysis of web logs, 2013
- [17] Access patterns for robots and humans in web archives, 2013
- [18] Detecting Impolite Crawler by Using Time Series Analysis, 2013
- [19] A comparison of web robot and human requests, 2013
- [20] Detecting anomalous Web server usage through mining access logs, 2013
- [21] Mining web logs to identify search engine behaviour at websites, 2013
- [22] An integrated approach to defence against degrading application-layer DDoS attacks, 2013
- [23] Detection and confirmation of web robot requests for cleaning the voluminous web log data, 2014
- [24] Analysis of Aggregated Bot and Human Traffic on E-Commerce Site, 2014
- [25] A Supplementary Method for Malicious Detection Based on HTTP-Activity Similarity Features, 2014
- [26] A density based clustering approach to distinguish between web robot and human requests to a web server, 2014
- [27] Lino - An Intelligent System for Detecting Malicious Web-Robots, 2015
- [28] Optimized Distributed Association Mining (ODAM) Algorithm for detecting Web Robots, 2015
- [29] A Comparative Analysis of Browsing Behavior of Human Visitors and Automatic Software Agents, 2015
- [30] Agglomerative approach for identification and elimination of web robots from web server logs to extract knowledge about actual visitors, 2015
- [31] HTTP-sCAN: Detecting HTTP-flooding attack by modeling multi-features of web browsing behavior from noisy web-logs, 2015
- [32] An integrated method for real time and offline web robot detection, 2016
- [33] HTTP Flooding Attack Detection by Modeling Features of Web Browsing behavior from Web Log, 2016
- [34] Website Navigation Behavior Analysis for Bot Detection, 2017
- [35] A study of different web-crawler behaviour, 2017
- [36] Analysis of Robot Detection Approaches for Ethical and Unethical Robots on Web Server Log, 2017
- [37] A soft computing approach for benign and malicious web robot detection, 2017
- [38] Bot or Not? A Case Study on Bot Recognition from Web Session Logs, 2018
- [39] User behavior analytics-based classification of application layer HTTP-GET flood attacks, 2018
- [40] Performance Evaluation of Large Data Clustering Techniques on Web Robot Session Data, 2018
- [41] Categorization Performance of Unsupervised Learning Techniques for Web Robots Sessions, 2018
- [42] Performance Evaluation of Density-Based Clustering Methods for Categorizing Web Robot Sessions, 2018
- [43] A System Framework for Efficiently Recognizing Web Crawlers, 2018
- [44] Towards a framework for detecting advanced Web bots, 2019
- [45] A Hybrid Approach for Recognizing Web Crawlers, 2019
- [14] An Overview of Web Robots Detection Techniques, 2020
- [46] Determination of User Navigational Patterns from Server Log Files using Hadoop Techniques, 2020
- [47] Bot recognition in a web store: an approach based on unsupervised learning, 2020
- [48] Identifying legitimate Web users and bots with different traffic profiles - an Information Bottleneck approach, 2020
REFERENCES
- N. Kandpal, R. Sinha, and M. Shekhawat, "A survey on web usage mining: Process, application and tools," Suresh Gyan Vihar University Journal of Engineering & Technology, vol. 3, no. 1, pp. 19-25, 2017.
- C. Bomhardt, W. Gaul, and L. Schmidt-Thieme, "Web robot detection - preprocessing web logfiles for robot detection," in New Developments in Classification and Data Analysis, H. Bock et al., Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2005, pp. 113-124.
- D. Doran, "Detection, classification, and workload analysis of web robots," Ph.D. dissertation, University of Connecticut, Storrs, CT, 2014, accessed: 2021-06-16. [Online]. Available: https://opencommons.uconn.edu/dissertations/348
- F. Menczer, G. Pant, P. Srinivasan, and M. E. Ruiz, "Evaluating topic-driven web crawlers," in Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, 2001, pp. 241-249.
- Y. Sun, I. Councill, and C. Giles, "The ethicality of web crawlers," in Proceedings of International Conference on Web Intelligence and Intelligent Agent Technology (IEEE/WIC/ACM), 2010, pp. 668-675.
- G. Chang, M. Healey, J. McHugh, and J. Wang, Mining the World Wide Web. The Information Retrieval Series. Boston: Springer, 2001.
- U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth, "From data mining to knowledge discovery in databases," AI magazine, vol. 17, no. 3, p. 37, 1996.
- J. Srivastava, R. Cooley, M. Deshpande, and P. Tan, "Web usage mining: discovery and applications of usage patterns from web data," ACM SIGKDD Explorations Newsletter, vol. 1, no. 2, pp. 12-23, 2000.
- M. Srivastava, A. Srivastava, and R. Garg, "Data preprocessing techniques in web usage mining: A literature review," in Proceedings of International Conference on Sustainable Computing in Science, Technology and Management (SUSCOM), 2019.
- Apache Software Foundation. Apache HTTP server version 2.4 - log files. Accessed: 2021-06-23. [Online]. Available: https://httpd.apache.org/docs/2.4/logs.html
- M. Srivastava, R. Garg, and P. Mishra, "Preprocessing techniques in web usage mining: A survey," International Journal of Computer Applications, vol. 97, no. 18, pp. 1-9, 2014.
- R. Rao and J. Arora, "A survey on methods used in web usage mining," International Research Journal of Engineering and Technology IRJET, vol. 4, no. 5, pp. 2627-2631, 2017.
- M. Mughal, "Data mining: Web data mining techniques, tools and algorithms: An overview," International Journal of Advanced Computer Science and Applications, vol. 9, no. 6, pp. 208-215, 2018.
- H. Chen, H. He, and A. Starr, "An overview of web robots detection techniques," in Proceedings of the International Conference on Cyber Security and Protection of Digital Services (Cyber Security 2020), jun 2020, pp. 1-6.
- K. Petersen, R. Feldt, S. Mujtaba, and M. Mattsson, "Systematic mapping studies in software engineering," in 12th International Conference on Evaluation and Assessment in Software Engineering (EASE) 12, 2008, pp. 1-10.
- N. Algiriyage, S. Jayasena, G. Dias, A. Perera, and K. Dayananda, "Identification and characterization of crawlers through analysis of web logs," in Proceedings of the 8th International Conference on Industrial and Information Systems, dec 2013, pp. 150-155.
- Y. A. AlNoamany, M. C. Weigle, and M. L. Nelson, "Access patterns for robots and humans in web archives," in Proceedings of the 13th ACM/IEEE-CS joint conference on Digital libraries - JCDL '13, 2013.
- Z. Chen and W. Feng, "Detecting impolite crawler by using time series analysis," in Proceedings of the 25th International Conference on Tools with Artificial Intelligence, nov 2013.
- D. Doran, K. Morillo, and S. S. Gokhale, "A comparison of web robot and human requests," in Proceedings of the 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, aug 2013.
- T. Gržinić, T. Kišasondi, and J. Šaban, "Detecting anomalous web server usage through mining access logs," Central European Conference on Information and Intelligent Systems, 2013.
- J. Jose and P. S. Lal, "Mining web logs to identify search engine behaviour at websites," Informatica, vol. 37, no. 2013, pp. 381-386, 2013.
- D. Stevanovic and N. Vlajic, "An integrated approach to defence against degrading application-layer DDoS attacks," in Proceedings of the 12th International Conference on Security and Management, 2013.
- T. H. Sardar and Z. Ansari, "Detection and confirmation of web robot requests for cleaning the voluminous web log data," in Proceedings of the International Conference on the IMpact of E-Technology on US (IMPETUS 2014), jan 2014.
- G. Suchacka, "Analysis of aggregated bot and human traffic on e-commerce site," in Proceedings of the 2014 Federated Conference on Computer Science and Information Systems, sep 2014.
- M. Tran and Y. Nakamura, "A supplementary method for malicious detection based on HTTP-activity similarity features," Journal of Communications, vol. 9, no. 12, 2014.
- M. Zabihi, M. V. Jahan, and J. Hamidzadeh, "A density based clustering approach to distinguish between web robot and human requests to a web server," The ISC International Journal of Information Security (ISeCure), vol. 6, no. 1, pp. 1-13, 2014.
- T. Gržinić, L. Mršić, and J. Šaban, "Lino - an intelligent system for detecting malicious web-robots," in Asian Conference on Intelligent Information and Database Systems. Springer International Publishing, 2015, pp. 559-568.
- A. D. Jagtap and V. Kadroli, "Optimized distributed association mining (ODAM) algorithm for detecting web robots," International Journal of Engineering and Computer Science (IJECS), vol. 4, no. 7, pp. 13196-13200, 2015.
- D. Sisodia, S. Verma, and O. Vyas, "A comparative analysis of browsing behavior of human visitors and automatic software agents," American Journal of Systems and Software, 2015.
- D. S. Sisodia, S. Verma, and O. P. Vyas, "Agglomerative approach for identification and elimination of web robots from web server logs to extract knowledge about actual visitors," Journal of Data Analysis and Information Processing, vol. 03, no. 01, pp. 1-10, 2015.
- J. Wang, M. Zhang, X. Yang, K. Long, and J. Xu, "HTTP-sCAN: Detecting HTTP-flooding attack by modeling multi-features of web browsing behavior from noisy web-logs," China Communications, vol. 12, no. 2, pp. 118-128, feb 2015.
- D. Doran and S. S. Gokhale, "An integrated method for real time and offline web robot detection," Expert Systems, vol. 33, no. 6, pp. 592-606, nov 2016.
- A. Verma and D. Xaxa, "HTTP flooding attack detection by modeling features of web browsing behavior from web log," International Journal of Innovations & Advancement in Computer Science (IJIACS), vol. 5, no. 6, pp. 154-159, 2016.
- R. Haidar and S. Elbassuoni, "Website navigation behavior analysis for bot detection," in Proceedings of the International Conference on Data Science and Advanced Analytics (DSAA 2017), oct 2017, pp. 60-68.
- A. Menshchikov, A. Komarova, Y. Gatchin, A. Korobeynikov, and N. Tishukova, "A study of different web-crawler behaviour," in Proceedings of the 20th Conference of Open Innovations Association (FRUCT), apr 2017, pp. 268-274.
- M. Srivastava, A. Kumar Srivastava, R. Garg, and P. K. Mishra, "Analysis of robot detection approaches for ethical and unethical robots on web server log," International Journal of Advanced Research in Computer Science (IJARCS), vol. 8, no. 5, pp. 1132-1134, may 2017.
- M. Zabihimayvan, R. Sadeghi, H. N. Rude, and D. Doran, "A soft computing approach for benign and malicious web robot detection," Expert Systems with Applications, vol. 87, pp. 129-140, nov 2017.
- S. Rovetta, A. Cabri, F. Masulli, and G. Suchacka, "Bot or not? A case study on bot recognition from web session logs," Quantifying and Processing Biomedical and Behavioral Signals, pp. 197-206, aug 2018.
- K. Singh, P. Singh, and K. Kumar, "User behavior analytics-based classification of application layer HTTP-GET flood attacks," Journal of Network and Computer Applications, vol. 112, pp. 97-114, jun 2018.
- D. S. Sisodia, R. Borkar, and H. Shrawgi, "Performance evaluation of large data clustering techniques on web robot session data," in Machine Intelligence and Signal Analysis. Springer, aug 2018, vol. 748, pp. 545-553.
- D. S. Sisodia, R. Khandelwal, and A. Anuragi, "Categorization performance of unsupervised learning techniques for web robots sessions," in Proceedings of the International Conference on Inventive Research in Computing Applications (ICIRCA 2018), jul 2018, pp. 1370-1374.
- D. S. Sisodia and N. Verma, "Performance evaluation of density-based clustering methods for categorizing web robot sessions," in Proceedings of the 2018 International Conference on Advanced Computation and Telecommunication (ICACAT), dec 2018, pp. 1-5.
- W. Zhu, J. Qin, R. Kong, H. Lin, and Z. He, "A system framework for efficiently recognizing web crawlers," in Proceedings of the 2018 IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computing, Scalable Computing & Communications, Cloud & Big Data Computing, Internet of People and Smart City Innovation (SmartWorld/SCALCOM/UIC/ATC/CBDCom/IOP/SCI), oct 2018, pp. 1130-1133.
- C. Iliou, T. Kostoulas, T. Tsikrika, V. Katos, S. Vrochidis, and Y. Kompatsiaris, "Towards a framework for detecting advanced web bots," in Proceedings of the 14th International Conference on Availability, Reliability and Security (ARES'19), aug 2019, pp. 1-10.
- W. Zhu, H. Gao, Z. He, J. Qin, and B. Han, "A hybrid approach for recognizing web crawlers," in Proceedings of the 14th International Conference on Wireless Algorithms, Systems, and Applications (WASA 2019), 2019, pp. 507-519.
- R. Patil and P. Trivedi, "Determination of user navigational patterns from server log files using Hadoop techniques," International Journal for Research in Applied Science & Engineering Technology (IJRASET), vol. 8, no. 6, pp. 1864-1870, jun 2020.
- S. Rovetta, G. Suchacka, and F. Masulli, "Bot recognition in a web store: an approach based on unsupervised learning," Journal of Network and Computer Applications, vol. 157, pp. 1-15, may 2020.
- G. Suchacka and J. Iwanski, "Identifying legitimate web users and bots with different traffic profiles - an information bottleneck approach," Knowledge-Based Systems, vol. 197, pp. 1-18, jun 2020.
- J. P. Barddal, H. M. Gomes, and F. Enembreck, "SFNClassifier: A scale-free social network method to handle concept drift," in Proceedings of the 29th Annual ACM Symposium on Applied Computing, 2014, pp. 786-791.
- I. Škrjanc, J. A. Iglesias, A. Sanchis, D. Leite, E. Lughofer, and F. Gomide, "Evolving fuzzy and neuro-fuzzy approaches in clustering, regression, identification, and classification: a survey," Information Sciences, vol. 490, pp. 344-368, 2019.
- C. Garcia, D. Leite, and I. Škrjanc, "Incremental missing-data imputation for evolving fuzzy granular prediction," IEEE Transactions on Fuzzy Systems, vol. 28, no. 10, pp. 2348-2362, 2019.