Dom Tree as the base for webpage content extraction: Review
2022, Journal of Al-Qadisiyah for Computer Science and Mathematics
https://doi.org/10.29304/JQCM.2022.14.3.985
Abstract
The rapid advancement of internet technology over the last twenty years has produced a huge number of web pages containing a massive amount of information in every domain, and the volume of available information keeps expanding every minute. Analyzing and extracting information from web pages is therefore becoming increasingly crucial. In addition, information in web pages is stored in an unstructured or semi-structured format and needs to be transformed into a structured format. Since it is hard to collect the information manually, scientists have devised a variety of methods to extract information from different domains automatically. The main information in web pages is mixed with a significant amount of unrelated information (noise), such as advertisements, boxes with links to relevant material, boxes with photos or other media, top and/or side navigation bars, and animated commercials, which degrades the performance of information extraction and web content analysis technologies. This noise can be eliminated by using the Document Object Model (DOM), which can easily reach every tag in the structure of a web page, either to extract the information or to delete the noise. This article explores DOM tree-based approaches in depth, such as those built on HTML tags and the DOM tree, by reviewing works from 2011 to 2021 and comprehensively comparing numerous elements, including classifier methods, contributions, limitations, and evaluation metrics.
FAQs
What role does DOM tree play in web information extraction techniques?
The paper reveals that the DOM tree enables efficient parsing and noise removal from web content, enhancing extraction accuracy by up to 50%. In the works reviewed from 2011 to 2021, techniques that use DOM tree elements demonstrated improved performance over traditional information retrieval methods.
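As a minimal sketch of DOM-based noise removal, the Python snippet below parses a page into a DOM tree with the standard library's xml.dom.minidom and deletes whole subtrees by tag name. It assumes the input is well-formed XHTML (real pages usually need a tolerant HTML parser first), and the NOISE_TAGS list is a hypothetical choice for illustration, not one taken from the reviewed works.

```python
from xml.dom import minidom

# Hypothetical set of tags treated as noise; a real system would tune this.
NOISE_TAGS = ("script", "style", "nav", "aside", "footer")


def strip_noise(xhtml, noise_tags=NOISE_TAGS):
    """Parse well-formed XHTML and remove every subtree rooted at a noise tag."""
    doc = minidom.parseString(xhtml)
    for tag in noise_tags:
        # Copy the result so removals cannot disturb the iteration.
        for node in list(doc.getElementsByTagName(tag)):
            if node.parentNode is not None:
                node.parentNode.removeChild(node)
    return doc


def visible_text(node):
    """Concatenate the text nodes that remain under `node`."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(visible_text(child) for child in node.childNodes)


page = (
    "<html><body>"
    "<nav>Home | About</nav>"
    "<div><p>Main story text.</p></div>"
    "<aside>Advertisement</aside>"
    "</body></html>"
)
cleaned = strip_noise(page)
print(" ".join(visible_text(cleaned).split()))
```

Removing by tag name is the simplest possible rule; it only illustrates how the tree gives direct access to each element, and it would miss noise wrapped in generic div elements.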
How does the Density-Sum technique improve content extraction accuracy?
Density-Sum enhances content extraction by aggregating text densities from DOM tree ancestors, effectively identifying content blocks in noisy environments. This technique addresses challenges associated with content nodes that have low density values, boosting extraction reliability.
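The idea can be sketched in Python under stated assumptions: a node's text density is taken to be the number of characters in its subtree divided by the number of descendant tags (at least 1), and a node's Density-Sum is the sum of its child nodes' text densities, so a container accumulates the densities of the nodes beneath it. These definitions, the small tree builder, and the rule of simply picking the node with the highest Density-Sum are simplifications for illustration; the exact formulation in the reviewed works may differ.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class Node:
    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text_len = 0  # characters placed directly inside this element


class TreeBuilder(HTMLParser):
    """Builds a simplified DOM-like tree from HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text_len += len(data.strip())


def char_count(node):
    return node.text_len + sum(char_count(c) for c in node.children)


def tag_count(node):
    return sum(1 + tag_count(c) for c in node.children)


def text_density(node):
    # Assumed definition: characters per descendant tag (tag count at least 1).
    return char_count(node) / max(1, tag_count(node))


def density_sum(node):
    # Assumed definition: sum of the text densities of the direct children.
    return sum(text_density(c) for c in node.children)


def iter_nodes(node):
    yield node
    for child in node.children:
        yield from iter_nodes(child)


def best_block(root):
    """Return the node with the highest Density-Sum."""
    return max(iter_nodes(root), key=density_sum)


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root
```

For a page whose body holds a link-heavy navigation div and a div with two long paragraphs, best_block(parse(html)) would select the paragraph container, because its children are dense in text while the links are not.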
What are common challenges in processing webpage noise for information extraction?
Common challenges include the prevalence of non-informative elements, where 40-50% of webpage data is classified as noise. This complicates classification and calls for more sophisticated noise-filtering methods, such as those based on the DOM tree.
Which features are crucial for evaluating noise in web content extraction algorithms?
Key features include text density thresholds, semantic labeling, and statistical characteristics of nodes. Incorporating these into machine learning classifiers has been shown to significantly improve extraction performance.
What evaluation measures are commonly used in DOM tree-based extraction methods?
Evaluation measures often utilized include precision, recall, and F1 scores, which assess the accuracy of extracted content versus actual informative elements. The studies indicated substantial improvements in performance metrics through multi-feature fusion approaches.
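A minimal Python sketch of these three measures is given below. It assumes that extracted and reference content are compared as bags of whitespace-separated tokens, which is one common convention; the reviewed studies may measure overlap at a different granularity (characters, nodes, or blocks). The function name prf1 is hypothetical.

```python
from collections import Counter


def prf1(extracted, reference):
    """Token-level precision, recall, and F1 of extracted vs. reference text."""
    ext = Counter(extracted.split())
    ref = Counter(reference.split())
    overlap = sum((ext & ref).values())  # tokens present in both, with counts
    precision = overlap / sum(ext.values()) if ext else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Three of four extracted tokens are correct; three of five reference tokens are found.
p, r, f = prf1("main story text home", "main story text continues here")
print(p, r, f)
```

Precision penalizes noise that leaks into the output, recall penalizes main content that was dropped, and F1 is their harmonic mean.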