Web Template Extraction Based on Hyperlink Analysis

2015, Electronic Proceedings in Theoretical Computer Science

https://doi.org/10.4204/EPTCS.173.2

Abstract

Web templates are one of the main development resources for website engineers. Templates increase productivity by letting engineers plug content into already formatted and prepared pagelets. Templates are also useful for end users, because they give all the webpages of a site a uniform look and feel. From the point of view of crawlers and indexers, however, templates are a significant problem, because they usually contain irrelevant information such as advertisements, menus, and banners. Processing and storing this information is likely to waste resources (storage space, bandwidth, etc.). Templates have been measured to represent between 40% and 50% of the data on the Web. Identifying templates is therefore essential for indexing tasks. In this work we propose a novel method for automatic template extraction based on similarity analysis between the DOM trees of a collection of webpages, which are detected using menu information. Our implementation and experiments demonstrate the usefulness of the technique.
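To make the idea of comparing DOM trees concrete, the following is a minimal Python sketch and not the method proposed in the paper. It assumes the pages to compare have already been selected (for instance, by following menu links) and approximates a template as the part of the DOM that every page shares. The names `Node`, `TreeBuilder`, `common`, `extract_template`, and `labels` are hypothetical, and children are aligned purely by position, which is a much cruder matching than a real similarity analysis would use.

```python
from functools import reduce
from html.parser import HTMLParser

# Elements that never have a closing tag (assumption: a partial list suffices here).
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """A simplified DOM node: a label (tag name or text) plus children."""

    def __init__(self, label):
        self.label = label
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a simplified DOM tree, ignoring attributes."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag, tolerating sloppy markup.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].label == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = " ".join(data.split())
        if text:
            self.stack[-1].children.append(Node("#text:" + text))


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def common(a, b):
    """Return the top-down subtree shared by a and b, or None if the roots differ."""
    if a.label != b.label:
        return None
    shared = Node(a.label)
    # Hypothetical simplification: children are compared position by position.
    for child_a, child_b in zip(a.children, b.children):
        child = common(child_a, child_b)
        if child is not None:
            shared.children.append(child)
    return shared


def extract_template(pages):
    """Intersect the DOM trees of all pages; the result is a template candidate."""
    return reduce(common, [parse(page) for page in pages])


def labels(node):
    """Preorder list of node labels, convenient for inspecting a tree."""
    result = [node.label]
    for child in node.children:
        result.extend(labels(child))
    return result
```

Calling `extract_template` on two pages that share a menu but differ in their article text would keep the menu structure and its link text while dropping the text that differs. A more faithful implementation would need a tolerant node-matching strategy, since positional alignment breaks as soon as one page inserts an extra element before the shared content.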
