Site-Level Web Template Extraction Based on DOM Analysis

2016, Lecture Notes in Computer Science

https://doi.org/10.1007/978-3-319-41579-6_4

Abstract

Web templates are one of the main development resources for website engineers. They increase productivity by letting engineers plug content into already formatted and prepared pagelets. Templates are also useful for end users, because they provide uniformity and a common look and feel across all webpages. From the point of view of crawlers and indexers, however, templates are a significant problem, because they usually contain irrelevant information such as advertisements, menus, and banners. Processing and storing this information wastes resources (storage space, bandwidth, etc.). Templates have been measured to represent between 40% and 50% of the data on the Web. Identifying templates is therefore essential for indexing tasks. In this work we propose a novel method for automatic web template extraction based on similarity analysis between the DOM trees of a collection of webpages, which are detected using a hyperlink analysis. Our implementation and experiments demonstrate the usefulness of the technique.
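To make the two ingredients named in the abstract concrete (hyperlink analysis to gather candidate pages, then a comparison of their DOM trees), the following Python sketch may help. It is only an illustration under simplifying assumptions, not the authors' algorithm: the names `PageScanner`, `same_site_links`, and `template_candidate` are hypothetical, pages are assumed to be well-formed HTML, and "similarity" is reduced to text nodes that occur at the same DOM position with the same text in every page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Elements that never have a closing tag, so they must not be pushed on the path.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class PageScanner(HTMLParser):
    """Records the hyperlinks of a page and its text nodes keyed by DOM position."""

    def __init__(self):
        super().__init__()
        self.path = []         # steps to the current element, e.g. ["html[1]", "body[1]"]
        self.counters = [{}]   # per-depth counters used to number same-tag siblings
        self.text_nodes = set()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        if tag in VOID_TAGS:
            return
        counts = self.counters[-1]
        counts[tag] = counts.get(tag, 0) + 1
        self.path.append(f"{tag}[{counts[tag]}]")
        self.counters.append({})

    def handle_endtag(self, tag):
        # Assumption: well-formed input, so the end tag closes the innermost element.
        if self.path and self.path[-1].startswith(tag + "["):
            self.path.pop()
            self.counters.pop()

    def handle_data(self, data):
        text = " ".join(data.split())
        if text:
            self.text_nodes.add(("/".join(self.path), text))


def scan(html):
    scanner = PageScanner()
    scanner.feed(html)
    scanner.close()
    return scanner


def same_site_links(html, base_url):
    """Hyperlink analysis (simplified): absolute links that stay on the same host."""
    host = urlparse(base_url).netloc
    absolute = (urljoin(base_url, href) for href in scan(html).links)
    return sorted({url for url in absolute if urlparse(url).netloc == host})


def template_candidate(pages):
    """Text nodes found at the same DOM position, with the same text, in every page."""
    node_sets = [scan(html).text_nodes for html in pages]
    return set.intersection(*node_sets) if node_sets else set()
```

In this sketch, a crawler would call `same_site_links` on a key page to choose which pages to download, and then pass their HTML to `template_candidate`. Whatever survives the intersection (typically menus and banners) is a template candidate, while page-specific content drops out.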
