Web Template Extraction Based on Hyperlink Analysis

David Insa; Julian Aleixandre; Josep Silva; Salvador Tamarit

doi:10.4204/EPTCS.173.2

2015, Electronic Proceedings in Theoretical Computer Science

https://doi.org/10.4204/EPTCS.173.2

11 pages

Abstract

Web templates are one of the main development resources for website engineers. Templates allow them to increase productivity by plugging content into already formatted and prepared pagelets. For the final user, templates are also useful because they provide uniformity and a common look and feel for all webpages. However, from the point of view of crawlers and indexers, templates are an important problem, because they usually contain irrelevant information such as advertisements, menus, and banners. Processing and storing this information is likely to lead to a waste of resources (storage space, bandwidth, etc.). It has been measured that templates represent between 40% and 50% of data on the Web. Therefore, identifying templates is essential for indexing tasks. In this work we propose a novel method for automatic template extraction that is based on similarity analysis between the DOM trees of a collection of webpages that are detected using menu information. Our implementation and experiments demonstrate the usefulness of the technique.

Figures (5)

[...] nodes and $E$ is a set of edges between nodes in $N$ (see Figure 1). $root(T)$ denotes the root node of $T$. Given a node $n \in N$, $link(n)$ denotes the hyperlink of $n$ when $n$ is a node that represents a hyperlink (HTML label <a>). $parent(n)$ represents the node $n' \in N$ such that $(n', n) \in E$. Similarly, $children(n)$ represents the set $\{n' \in N \mid (n, n') \in E\}$. $subtree(n)$ denotes the subtree of $T$ whose root is $n \in N$. $path(n)$ is a non-empty sequence of nodes that represents a DOM path; it can be defined as $path(n) = n_0 n_1 \ldots n_m$ such that $\forall i,\ 0 \le i < m .\ n_i = parent(n_{i+1})$.

In order to identify the part of the DOM tree that is common in a set of webpages, our technique uses an algorithm that is based on the notion of mapping. A mapping establishes a correspondence between the nodes of two trees.
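The notation above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Node` class, its field names, and the choice to read the hyperlink from an `href` attribute are assumptions made for illustration, and `path` assumes that the sequence starts at the root and ends at the given node.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(eq=False)  # compare nodes by identity, not by content
class Node:
    """One node of a DOM tree T = (N, E); edges are kept as parent/children links."""
    label: str
    attrs: Dict[str, str] = field(default_factory=dict)
    parent: Optional["Node"] = field(default=None, repr=False)
    children: List["Node"] = field(default_factory=list, repr=False)

    def add_child(self, child: "Node") -> "Node":
        """Add the edge (self, child) and return the child."""
        child.parent = self
        self.children.append(child)
        return child


def link(n: Node) -> Optional[str]:
    """Hyperlink of n when n represents an <a> element, otherwise None."""
    return n.attrs.get("href") if n.label == "a" else None


def path(n: Node) -> List[Node]:
    """Sequence n_0 ... n_m with n_m = n and n_i = parent(n_{i+1})."""
    nodes = [n]
    while nodes[-1].parent is not None:
        nodes.append(nodes[-1].parent)
    nodes.reverse()
    return nodes


def subtree(n: Node) -> List[Node]:
    """All nodes of the subtree rooted at n, in preorder."""
    result = [n]
    for child in n.children:
        result.extend(subtree(child))
    return result


# Hypothetical three-node page: <html><body><a href="/news"></a></body></html>
html = Node("html")
body = html.add_child(Node("body"))
anchor = body.add_child(Node("a", {"href": "/news"}))
```

Here `path(anchor)` yields the nodes `html`, `body`, `a` in that order, and `link` is only defined (non-`None`) for the anchor node.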

Example 4.1 Consider the BBC website. Two of its webpages are shown in Figure 2. In this website all webpages share the same template, and this template has a main menu that is present in all webpages, and a submenu for each item in the main menu. The site map of the BBC website may be represented with the topology shown in Figure 3.

As in the definition, we left the algorithm parametric with respect to the equality function $=$. This is done on purpose: this relation is the only parameter that is subjective, so it is a good design decision to leave it open. For instance, a researcher can decide that two DOM nodes are equal if they have the same label and attributes. Another researcher can relax this restriction by ignoring some attributes (i.e., the template can be the same even if there are differences in colors, sizes, or even positions of elements; it usually depends on the particular use of the extracted template). Clearly, $=$ has a direct influence on the precision and recall of the technique: the more restrictive the relation, the higher the precision (and the lower the recall). Note also that the algorithm uses $n_1 \overset{\max}{=} n_2$ to indicate that $n_1$ and $n_2$ maximize function $=$.
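To make the role of the equality parameter concrete, the following Python sketch shows two interchangeable equality functions. It is only an illustration under stated assumptions: `DomNode`, `strict_equal`, `relaxed_equal`, and the set of ignored attributes are hypothetical names and choices, not part of the original algorithm.

```python
from typing import Callable, Dict, FrozenSet, NamedTuple


class DomNode(NamedTuple):
    """Hypothetical, simplified DOM node: a label plus its attributes."""
    label: str
    attrs: Dict[str, str]


# Type of the pluggable equality parameter.
Equality = Callable[[DomNode, DomNode], bool]


def strict_equal(a: DomNode, b: DomNode) -> bool:
    """Restrictive choice: same label and exactly the same attributes."""
    return a.label == b.label and a.attrs == b.attrs


def relaxed_equal(ignored: FrozenSet[str]) -> Equality:
    """Build an equality that ignores the given attribute names."""
    def keep(attrs: Dict[str, str]) -> Dict[str, str]:
        return {k: v for k, v in attrs.items() if k not in ignored}

    def equal(a: DomNode, b: DomNode) -> bool:
        return a.label == b.label and keep(a.attrs) == keep(b.attrs)

    return equal


# Two menu containers that differ only in presentation.
menu_a = DomNode("div", {"id": "menu", "style": "color:red"})
menu_b = DomNode("div", {"id": "menu", "style": "color:blue"})
ignore_style = relaxed_equal(frozenset({"style"}))
```

With these definitions, `strict_equal(menu_a, menu_b)` is `False` while `ignore_style(menu_a, menu_b)` is `True`, which mirrors the trade-off described above: the relaxed relation matches more nodes (favoring recall), the strict one matches fewer (favoring precision).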
